In our Appeal to the “twenty-one” ten years ago, we exposed «a major flaw in the statistical analysis»5. The measurements provided by the three laboratories are completely heterogeneous for one of the cloths, sample number 1, the very sample that is supposed to come from the Holy Shroud, whereas the measurements for the three other cloths are very homogeneous. The Abbé de Nantes noticed this on the very first day: «I paused at figure 1 of this report in Nature (our figure 27, infra, p. 35). Two years later, knowing that my intuition was not mistaken and how all the proofs converged to unmask the hoaxers, I remain astounded at the audacity of Dr Tite in daring to exhibit this naked diagram of his crime before the inquisitive gaze of the scientific and believing worlds.6» Let us look at it again:
On comparing the ranges of the results provided by the three laboratories for each sample and displayed in groups that look like “fighter squadrons”, the Abbé de Nantes wrote: «The first manifestation of the truth is so glaring that I remember on the very first day almost mechanically marking it with a vertical double line: Oxford’s little aeroplane in squadron number 1, which is so curiously separate from the others, is found to be on the same vertical dropline as another of Oxford’s aeroplanes in squadron 4, spanning the identical range of dates. Whereas the two other small aeroplanes in group 4, those of Zurich and Arizona, are here, as they should be, in good flight formation with Oxford.1» The discrepancy between Arizona and Oxford is too marked not to arouse the most legitimate of suspicions: did these two laboratories really work on samples from the same cloth? The long statistical exposition devoted to the interpretation of the results in Nature has no other purpose than to attempt to answer this question... positively2.
Since then, we have continued to develop our critical examination of the most technical part of the report in Nature3, without receiving the slightest response from any of the “twenty-one” co-authors of this report, not even from Dr Tite or Mrs Leese, even though a question mark hangs over both of them and they stand accused of malpractice. Three experts, whose names we cannot reveal lest what happened to Timothy Linick happen also to them (supra, p. 34), have agreed to take up this investigation anew. I am but their reporter, and I am keen to express our gratitude for their remarkable work.
The fact is that, under a showy technical apparatus which aims to rule out any suspicion of its defects, the report in Nature reveals an astounding lack of rigour. We do not intend to explain the statistical approach in its entirety, nor to make an exhaustive critique of it. Others have made this their business4, and their criticisms are generally reliable on a technical level. We simply wish to highlight certain key elements in this report, ones which are moreover easy to grasp, and which will allow us to draw the necessary conclusions within the limits set by science, that is to say with a certain measurable probability.
In experimental science, no result is “exact”. We are limited by the imperfections of our machines, the readings on our computer screens and our numeration systems. A final result is never more than a means of obtaining other results, each individually subject to “error”. Let us take for example the measurement of a weight, frequently carried out in experiments with the aid of a weighing scale. A set of measurements are obtained which are distributed around a central value, the “mean”, and this distribution is itself regulated by a “law of probability” called normal distribution. This is characterised by two parameters: the mean m and the standard deviation, commonly designated by the Greek letter σ.
Generally speaking, the standard deviation represents the scatter of a distribution around its mean. A small standard deviation indicates that the results are closely grouped around m; a large standard deviation signifies an almost uniform distribution or one that consists of abnormal values widely separated from m. The standard deviation is often understood as the probable error of a measurement. Results are therefore presented in the form m ± σ. Where “normal distribution” holds true, we know that 68% of measurements must be between m - σ and m + σ, and 95% between m - 2σ and m + 2σ. Of course, different distributions will provide different intervals. So which do we choose?
In the present case, that of the measurements of radiocarbon age, it appears that the measurements are not grouped around their mean in a way that meets the requirements of normal distribution. That is why Jacques Évin aptly summarised the situation when he recognised that it was necessary to «fiddle the figures» in order to arrive at the mean (supra, p. 30). Spectrometers are recent machines and there is a lack of experience with them, and there is also perhaps a lack of statistical knowledge on the part of the experimenters. As Dr Tite told us, in a moment of sincerity: «I am not a statistician.1» When a physicist knows the law of probability, he is tempted to apply it without asking himself too many questions... This is one of the principal and most reasonable criticisms made by commentators against the report in Nature.
This remark is of importance, for the statistical procedures of Mrs Leese are entirely based on the preliminary hypothesis of normal distribution; that is why she blames the small number of results with which she was provided and with which she had to be content to arrive at her means2. The famous “margin” of one hundred and thirty years – 1260-1390! – (supra, fig. 19), with a confidence level of at least 95%, springs directly from this hypothesis.
The first table (below) in the Nature report presents results which are precisely of the type m ± σ. Each one of the results listed summarises multiple experiments. For example, the first result, indicated by the abbreviation A1.1b, is 591 ± 30. It is obtained by combining two results: 606 ± 41 and 574 ± 45, each of these summarising around twenty measurements3. The final mean, 591, which represents the radiocarbon age of this piece of cloth no 1 analysed by the laboratory of Arizona, is therefore a mean of the preceding measurements, but a “weighted” mean, which uses a procedure we shall come back to. As a result, the real age is probably between the values 591 - 30 and 591 + 30, and still more probably between 591 - 60 and 591 + 60. The standard deviation of 30 is estimated empirically, based either on previous experiments or on the dispersion of all the measurements obtained.
The laboratory of Arizona had the honesty to provide the details of certain of its calculations, whereas the statistical procedures of Oxford, and especially those of Zurich, remain in a thick fog. Arizona seems more optimistic than the others, in the sense that its standard deviations are generally smaller.
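The 68% and 95% figures quoted above follow directly from the normal law. The short sketch below, which assumes nothing beyond that law, computes the coverage of the intervals m ± σ and m ± 2σ and applies them to the result 591 ± 30 cited in the text; the function names are illustrative only.

```python
import math

def normal_coverage(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

def interval(m, sigma, k):
    """Bounds of the interval m - k*sigma .. m + k*sigma."""
    return (m - k * sigma, m + k * sigma)

# Result A1.1b as quoted in the text: 591 +/- 30 years BP.
m, sigma = 591, 30
for k in (1, 2):
    low, high = interval(m, sigma, k)
    print(f"{low} to {high} years BP: coverage {normal_coverage(k):.1%}")
```

These percentages only hold if the measurements really do follow a normal distribution, which is precisely the point in dispute.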
Table 1 Basic data (individual measurements)
| | Sample 1 | Sample 2 | Sample 3 | Sample 4 | Pretreatment and |
| Arizona | AA-3367 | AA-3368 | AA-3369 | AA-3370 | |
| | A1.1b* 591 ± 30 | A2.1b 922 ± 48 | A3.1b 1,838 ± 47 | A4.1b 724 ± 42 | |
| | A1.2b 690 ± 35 | A2.2a 986 ± 56 | A3.2a (1) 2,041 ± 43 | A4.2a 778 ± 88 | a, method a |
| | A1.3a 606 ± 41 | A2.3a (1) 829 ± 50 | A3.3a 1,960 ± 55 | A4.3a (1) 764 ± 45 | b, method b |
| | A1.4a 701 ± 33 | A2.4a (2) 996 ± 38 | A3.4a (2) 1,983 ± 37 | A4.4a (2) 602 ± 38 | ( ), same subsample |
| | | A2.5b 894 ± 37 | A3.5b 2,137 ± 46 | A4.5b 825 ± 44 | |
| δ13C (‰) | -25.0 | -23.0 | -23.6 | -25.0 | |
| Oxford | 2575 | 2574 | 2576 | 2589 | |
| | O1.1u 795 ± 65 | O2.1u 980 ± 55 | O3.1u 1,955 ± 70 | O4.2u 785 ± 50 | u, unbleached |
| | O1.2b 730 ± 45 | O2.1b 915 ± 55 | O3.1b 1,975 ± 55 | O4.2b (1) 710 ± 40 | b, bleached |
| | O1.1b 745 ± 55 | O2.2b† 925 ± 45 | O3.2b 1,990 ± 50 | O4.2b (2) 790 ± 45 | ( ), same pretreatment/run combination |
| δ13C‡ (‰) | -27.0 | -27.0 | -27.0 | -27.0 | |
| Zurich | ETH-3883 | ETH-3884 | ETH-3885§ | ETH-3882 | |
| | Z1.1u 733 ± 61 | Z2.1u 890 ± 59 | Z3.1u 1,984 ± 50 | Z4.1u 739 ± 63 | u, ultrasonic only |
| | Z1.1w 722 ± 56 | Z2.1w 1,036 ± 63 | Z3.2w 1,886 ± 48 | Z4.1w 676 ± 60 | w, weak |
| | Z1.1s 635 ± 57 | Z2.1s 923 ± 47 | Z3.2s 1,954 ± 50 | Z4.1s 760 ± 66 | s, strong |
| | Z1.2w 639 ± 45 | Z2.2w 980 ± 50 | | Z4.2w 646 ± 49 | |
| | Z1.2s 679 ± 51 | Z2.2s 904 ± 46 | | Z4.2s 660 ± 46 | |
| δ13C|| (‰) | -25.1 | -23.6 | -22.0 | -25.5 | |
In years BP, corrected for δ13C fractionation; errors at 1σ level; see text for pretreatment details.
* The identification code for each measurement shows, in order, the laboratory, sample, measurement run, pretreatment and any replication involved.
† One anomalous replicate (of 6) obtained for independent measurement O2.2b; if rejected, it reduces date by 40 yr; final date quoted actually reduced by 20 yr.
‡ Measured for samples 1 and 3; assumed for samples 2 and 4.
§ The loose weave of sample Z3.1 led to its disintegration during strong and weak chemical treatments. Z3.2 was centrifuged to avoid the same loss of material.
|| Average of separate determinations by AMS.
As table 1 contains such an excess of information, it is for the statistician to extract its essential characteristics. Four pieces of cloth analysed by three laboratories give a table of twelve m ± σ values, calculated in accordance with the “weighting” formulas proposed by Ward and Wilson, two authors who are referred to in the report of the “twenty-one”. The idea is to increase the weight of those values which are associated with smaller standard deviations, which is supposed to guarantee a higher degree of precision. For example, the three values obtained for sample 3 analysed by Zurich are 1,984 ± 50, 1,886 ± 48 and 1,954 ± 50. If we use the letter i to refer to measurements 1, 2 and 3, then the weight of each measurement i is as follows:
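The formula is not reproduced at this point. Assuming the usual inverse-variance weighting on which the Ward and Wilson procedure rests (an assumption, though one that agrees with the two properties stated in the text: weights summing to 1 and a weighted mean of 1,940), the weights a_i can be written:

```latex
a_i = \frac{1/\sigma_i^{2}}{\sum_{j=1}^{3} 1/\sigma_j^{2}}, \qquad i = 1, 2, 3
```

With σ1 = 50, σ2 = 48 and σ3 = 50 this gives a1 = a3 ≈ 0.324 and a2 ≈ 0.352: the measurement with the smallest standard deviation receives the largest weight.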
It can be seen that the sum of the weights is equal to 1. Where mi ± σi is the result of normal independent variables, the weighted mean of the three results is:
m = m1a1 + m2a2 + m3a3 = 1,940
And the weighted standard deviation is:
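The expression itself is missing here. Under the same assumption of inverse-variance weights, with independent normal errors, the standard deviation of the weighted mean would be the following (equivalently, σ² is the sum of the terms a_i² σ_i²):

```latex
\sigma = \left( \sum_{i=1}^{3} \frac{1}{\sigma_i^{2}} \right)^{-1/2}
       = \left( \frac{1}{50^{2}} + \frac{1}{48^{2}} + \frac{1}{50^{2}} \right)^{-1/2}
       \approx 28
```

On this reading, the Zurich result for sample 3 would be about 1,940 ± 28.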
We can proceed in the same way, using the same hypothesis of normal distribution, for the eleven other samples. We then obtain a table of the twelve following results:
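As a rough check, the twelve results can be recomputed from the individual measurements of table 1. The sketch below assumes inverse-variance weighting as the reading of Ward and Wilson's formulas; the function and variable names are illustrative and do not come from the Nature report.

```python
import math

# Individual measurements (years BP, 1-sigma errors) copied from table 1.
TABLE_1 = {
    ("Arizona", 1): [(591, 30), (690, 35), (606, 41), (701, 33)],
    ("Arizona", 2): [(922, 48), (986, 56), (829, 50), (996, 38), (894, 37)],
    ("Arizona", 3): [(1838, 47), (2041, 43), (1960, 55), (1983, 37), (2137, 46)],
    ("Arizona", 4): [(724, 42), (778, 88), (764, 45), (602, 38), (825, 44)],
    ("Oxford", 1): [(795, 65), (730, 45), (745, 55)],
    ("Oxford", 2): [(980, 55), (915, 55), (925, 45)],
    ("Oxford", 3): [(1955, 70), (1975, 55), (1990, 50)],
    ("Oxford", 4): [(785, 50), (710, 40), (790, 45)],
    ("Zurich", 1): [(733, 61), (722, 56), (635, 57), (639, 45), (679, 51)],
    ("Zurich", 2): [(890, 59), (1036, 63), (923, 47), (980, 50), (904, 46)],
    ("Zurich", 3): [(1984, 50), (1886, 48), (1954, 50)],
    ("Zurich", 4): [(739, 63), (676, 60), (760, 66), (646, 49), (660, 46)],
}

def weighted_result(measurements):
    """Inverse-variance weighted mean and its standard deviation (assumed formula)."""
    inv_var = [1.0 / s ** 2 for _, s in measurements]
    total = sum(inv_var)
    mean = sum(m * w for (m, _), w in zip(measurements, inv_var)) / total
    return mean, 1.0 / math.sqrt(total)

# One (mean, standard deviation) pair per laboratory and sample: twelve in all.
results = {key: weighted_result(vals) for key, vals in TABLE_1.items()}
for (lab, sample), (mean, sd) in sorted(results.items(), key=lambda kv: (kv[0][1], kv[0][0])):
    print(f"sample {sample}  {lab:8s} {mean:7.0f} ± {sd:.0f}")
```

Because each laboratory's individual errors are simply pooled, a laboratory that reports small individual errors, as Arizona does, ends up with a small combined standard deviation.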
We now compare these results with the twelve results below taken from table 2 of the Nature report. We observe that the means correspond closely, with differences of no more than two or three units. But when we come to the standard deviations, a surprise awaits us: although our figures are identical in certain cases, they are clearly much smaller for each of Arizona’s results.
Extract from Table 2 in the Nature report
Conclusion: “One” has made use of the formula of Ward and Wilson except where the standard deviations were too small; in these cases, they have been replaced by other values. What values? This is a mystery. Now we are going to see the importance that this fact takes on in the χ2 test: if we keep the values obtained by the formulas of Ward and Wilson, this test does not work at all...
This test constitutes a necessary check on the homogeneity of the results, sample by sample. For example, suppose the chemical treatment were to change the nature of certain samples and that, for this reason, the means provided by each laboratory differed excessively from each other. In this case, the final mean would scarcely be justified and, most importantly, one would not be able to apply a confidence range to it. This test is therefore a good precaution. Still, it needs to be applied in accordance with the rules.
Let us apply the χ2 test to sample 3, supposed to have come from the collection of the Department of Egyptian Antiquities at the British Museum, «associated with an early second century AD mummy of Cleopatra from Thebes»1. Table 2 provides the values of the three laboratories:
mA ± σA
The weighted mean of the three is:
m3 = 1977
The hypothesis to be tested, designated by the letter H, is as follows:
(H) These three values come from one and the same normal distribution, with a mean of m3.
If this hypothesis is true, the three values must appear to be well grouped around m3. So we calculate the value of χ2 which is a measure of the scatter in relation to m3:
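The expression is not reproduced here. In its standard form, and writing mO ± σO and mZ ± σZ for the Oxford and Zurich results by analogy with mA ± σA (these two symbols are assumptions), the statistic would read:

```latex
\chi^{2} = \frac{(m_A - m_3)^{2}}{\sigma_A^{2}}
         + \frac{(m_O - m_3)^{2}}{\sigma_O^{2}}
         + \frac{(m_Z - m_3)^{2}}{\sigma_Z^{2}}
```

With three laboratories and one mean estimated from them, this quantity is compared with the χ2 law with two degrees of freedom.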
A small value will confirm hypothesis (H). On the other hand, if χ2 is too large, we will be forced to reject the hypothesis. The law of probability regarding χ2 is well known. So we can calculate the probability of our being wrong if we reject the hypothesis: it is the significance level p. The figure generally used for the rejection value is 5.99, for which p = 5%. If we find χ2 greater than or equal to 5.99, we know that there is at least a 95% chance that (H) is false. It is customary in such situations to reject the hypothesis of homogeneity until further information is available. In any case, no statistician worthy of the name would continue his analysis by looking for a confidence range for the age of sample 3 in accordance with the law of normal distribution.
But the reader may be reassured: for sample 3, we find that χ2 = 2.5, well below the rejection value. Hypothesis (H) is accepted: the values obtained for sample 3 are homogeneous. Now that we know the distribution is normal, we may try to find the confidence range. The standard deviation σ3 works out as 15 if we use our table, or 20 if we use the table in the Nature report. In the first case, we can deduce a radiocarbon age between 1947 and 2007 with 95% confidence; in the second case, a radiocarbon age between 1937 and 2017 with 95% confidence. In either case, the result is impeccable from a statistical point of view. «Four stars», as Jacques Évin would say, and with no «fiddling of the figures»!
Now here is the table summarising our calculations for all four samples:
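The test can be sketched numerically from the individual measurements of table 1. The code below assumes inverse-variance weighting for each laboratory's result and the standard χ2 statistic; for two degrees of freedom the upper-tail probability has the closed form exp(-χ2/2), which is how the rejection value 5.99 corresponds to p = 5%. Only samples 1 and 3 are included, and all names are illustrative.

```python
import math

# Individual measurements (years BP, 1-sigma) for samples 1 and 3, from table 1.
SAMPLES = {
    1: {
        "Arizona": [(591, 30), (690, 35), (606, 41), (701, 33)],
        "Oxford": [(795, 65), (730, 45), (745, 55)],
        "Zurich": [(733, 61), (722, 56), (635, 57), (639, 45), (679, 51)],
    },
    3: {
        "Arizona": [(1838, 47), (2041, 43), (1960, 55), (1983, 37), (2137, 46)],
        "Oxford": [(1955, 70), (1975, 55), (1990, 50)],
        "Zurich": [(1984, 50), (1886, 48), (1954, 50)],
    },
}

def combine(pairs):
    """Inverse-variance weighted mean and standard deviation of (value, sigma) pairs."""
    inv_var = [1.0 / s ** 2 for _, s in pairs]
    total = sum(inv_var)
    mean = sum(v * w for (v, _), w in zip(pairs, inv_var)) / total
    return mean, 1.0 / math.sqrt(total)

def chi2_homogeneity(lab_results):
    """Scatter of the laboratory means about their weighted mean, in units of sigma."""
    overall, _ = combine(lab_results)
    return sum(((m - overall) / s) ** 2 for m, s in lab_results)

def p_value_2dof(chi2):
    """Upper-tail probability of the chi-square law with 2 degrees of freedom."""
    return math.exp(-chi2 / 2.0)

report = {}
for sample, labs in SAMPLES.items():
    lab_results = [combine(pairs) for pairs in labs.values()]
    chi2 = chi2_homogeneity(lab_results)
    report[sample] = (chi2, p_value_2dof(chi2))
    print(f"sample {sample}: chi2 = {chi2:.2f}, p = {report[sample][1]:.3f}")
```

Run as written, this should give a value well below 5.99 for sample 3 and one above it for sample 1; the exact decimals depend on rounding and on the weighting assumption.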
The χ2 test using our calculations
Compare this with table 2 in the Nature report:
The χ2 test extracted from Table 2 of the Nature report
Samples 2, 3 and 4 easily pass the test, in both tables. It is reasonable to accept the homogeneity of their results. One could almost regard the results for each of the three samples as having been provided by one spectrometer, one chemical treatment process and one man... consequently, by one sample!
The same cannot be said for sample 1, as it is clear from both tables that one would have to reject the hypothesis of homogeneity. This is very obvious from our results where the figure of 8.64 is a long way from 5.99; it is less obvious from the results in Nature, since the figure 6.4 is not so far from the rejection value. But all the same, it remains above it, and the hypothesis should therefore be rejected. It is here that we come face to face with the importance of the standard deviation calculations carried out earlier. By bumping up these figures, the value of χ2 is reduced and (H) becomes more acceptable.
The “twenty-one” co-authors of the Nature report find it very acceptable in any case, and they want to go even further still: even to the point of calculating a confidence range by appealing to Student’s distribution theory (cf. Eng CRC no 238, p. 21 and 24)! We will not follow them on to this terrain. A Student test is not theoretically admissible unless hypothesis (H) is well founded. This is far from being the case with sample 1. Therefore, the statistical analysis in the Nature report appears to be vitiated right from the start... However, they do give us the results provided by the spectrometers. How should we interpret these?
Let us leave the hypothesis of normal distribution and the standard deviations in table 1 of the Nature report which were obtained in rather murky circumstances. There still remain the means found in this same table 1. Let us summarise them:
Dates proposed by the three laboratories for the four samples
Let us try to verify whether this data is really homogeneous, avoiding any hypothesis based on the probabilist model (normal distribution or not). There exists a test for this purpose: the Wilcoxon-Mann-Whitney rank test, which is very simple to perform. Two series of measurements are involved, one of m results, ranging from x1 to xm, the other of n results. By convention m is less than or equal to n. It is a matter of verifying whether these two series form part of the same homogeneous set or whether, on the contrary, they measure two different quantities. The hypothesis (H) can be expressed as previously:
(H) These two series of measurements form one single homogeneous series of m + n values.
To test (H), we first arrange these m + n values in ascending order; then we calculate the rank of each value xi of the first series in the new series and the sum total of these ranks: W = rank (x1) + ... + rank (xm). If (H) is true, the ranks (xi) will be uniformly distributed between 1 and m + n; their mean W/m will be close to the mean of the m + n ranks, that is (m + n + 1)/2. Equivalently, W will be close to the value m(m + n + 1)/2.
Now, we know the theoretical distribution of W in hypothesis (H); and for each value of m and n, the tables allow us to find the two values Wmin and Wmax between which W must lie with a probability of 95%. These two values are symmetrical with respect to m(m + n + 1)/2. We calculate W. If W is between Wmin and Wmax, we can accept hypothesis (H). Otherwise we will reject it, for in that case the ranks (xi) do not have a uniform distribution and it is therefore at least 95% probable that the two series of measurements are not homogeneous.
The chief interest of this test clearly lies in the fact that it presupposes nothing about the distribution of the two series of numbers (in particular, the hypothesis of normal distribution). On the other hand, it does not tell us much if the gap [Wmin, Wmax] is wide, as in this case it would be difficult to reject (H). In other words, it is an optimistic test which favours homogeneity and which refuses to be used in any other way... Let us just add that there exists another test of the same type for comparing three series of measurements. But when the numbers in each series are small, it is preferable to proceed by two sample tests.
The twelve measurements for sample 1 (cf. column 1 in the table on page 38) are arranged in ascending order: 591, 606, 635, 639, 679, 690, 701, 722, 730, 733, 745, 795, then replaced by A, Z or O according to their laboratory of origin. This gives the series: A, A, Z, Z, Z, A, A, Z, O, Z, O, O. Let us test the hypothesis: (H) The values of Oxford are homogeneous with the combined values of Arizona and Zurich.
Here m = 3 and n = 9, so that Wmin and Wmax are symmetrical with respect to 3 × 13 / 2 = 19.5. Using a significance level of 0.05, the table gives Wmin = 10, therefore Wmax = 29, and Oxford's results occupy ranks 9, 11, 12. Therefore W = 32, well above Wmax. We can therefore reject (H) with a confidence level of 95%. We can even go one better: if a significance level of 0.025 is used, we get: Wmin = 8, therefore Wmax = 31, still smaller than 32.
In conclusion: With a probability factor of 97.5%, we must reject the hypothesis that Oxford and Arizona + Zurich measured samples possessing the same characteristics.
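The rank sum and the tabulated bounds can be checked by brute force, since under (H) every set of m ranks among m + n is equally likely. The sketch below takes the twelve sample 1 dates from table 1 and treats the significance level as applying to each tail separately, an assumption that reproduces the values of Wmin quoted above; the function names are illustrative.

```python
from itertools import combinations

# Sample 1 dates (years BP) from table 1, by laboratory.
ARIZONA = [591, 690, 606, 701]
OXFORD = [795, 730, 745]
ZURICH = [733, 722, 635, 639, 679]

def rank_sum(sample, others):
    """Sum of the ranks of `sample` in the pooled ascending series (assumes no ties)."""
    pooled = sorted(sample + others)
    return sum(pooled.index(x) + 1 for x in sample)

def critical_bounds(m, n, alpha):
    """Largest Wmin with P(W <= Wmin) <= alpha under (H), and the symmetric Wmax."""
    sums = [sum(c) for c in combinations(range(1, m + n + 1), m)]
    w_min = None
    for w in range(min(sums), max(sums) + 1):
        if sum(s <= w for s in sums) / len(sums) <= alpha:
            w_min = w
        else:
            break
    # Wmin and Wmax are symmetrical about m(m + n + 1)/2.
    w_max = m * (m + n + 1) - w_min if w_min is not None else None
    return w_min, w_max

w = rank_sum(OXFORD, ARIZONA + ZURICH)
print("W =", w)
for alpha in (0.05, 0.025):
    print(alpha, critical_bounds(len(OXFORD), len(ARIZONA + ZURICH), alpha))
```

Enumeration is practical here only because the series are tiny: there are just 220 ways of choosing 3 ranks out of 12.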
The thirteen measurements for sample 2 in the second column are arranged in ascending order, which gives the series: A, Z, A, Z, O, A, Z, O, O - Z, A, A, Z, where O - Z corresponds to the two 980 dates. It seems evident on first view that the mix is perfect. If, for example, we test the homogeneity of O against A + Z, we find that m = 3, n = 10,
The eleven measurements in the third column are arranged in ascending order: A, Z, Z, O, A, O, A, Z, O, A, A.
The thirteen measurements in the fourth column give the series: A, Z, Z, Z, O, A, Z, Z, A, A, O, O, A.
For each of these series, any of the tests A against O + Z, O against A + Z, or Z against A + O leads us to accept the hypothesis of homogeneity.
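The same rank tests can be run mechanically on the dates of table 1 for samples 2, 3 and 4. The sketch below replaces the printed tables with an exact two-sided permutation p-value obtained by enumeration, and gives tied dates (the two 980s of sample 2) their average rank; both choices are assumptions made for illustration, not the procedure used in the text.

```python
from itertools import combinations

# Radiocarbon dates (years BP) for samples 2, 3 and 4, from table 1.
DATES = {
    2: {"A": [922, 986, 829, 996, 894], "O": [980, 915, 925],
        "Z": [890, 1036, 923, 980, 904]},
    3: {"A": [1838, 2041, 1960, 1983, 2137], "O": [1955, 1975, 1990],
        "Z": [1984, 1886, 1954]},
    4: {"A": [724, 778, 764, 602, 825], "O": [785, 710, 790],
        "Z": [739, 676, 760, 646, 660]},
}

def midranks(values):
    """Ascending rank of each value; tied values share their average rank."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]

def rank_test(sample, others):
    """Rank sum W of `sample` and its exact two-sided permutation p-value."""
    ranks = midranks(sample + others)
    m = len(sample)
    centre = m * (len(ranks) + 1) / 2          # expected W under (H)
    w = sum(ranks[:m])
    sums = [sum(c) for c in combinations(ranks, m)]
    extreme = sum(abs(s - centre) >= abs(w - centre) for s in sums)
    return w, extreme / len(sums)

p_values = {}
for sample, labs in DATES.items():
    for lab, values in labs.items():
        others = [v for name, vals in labs.items() if name != lab for v in vals]
        w, p = rank_test(values, others)
        p_values[(sample, lab)] = p
        print(f"sample {sample}, {lab} against the rest: W = {w}, p = {p:.3f}")
```

A p-value above 0.05 in every line means that none of these nine comparisons gives grounds for rejecting homogeneity.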
The preceding tests prove that the measurements of cloth samples 2, 3 and 4 are homogeneous. The three laboratories therefore worked as “a single man”, despite their differences of chemical treatment and test protocol.
It is no less proven that the measurements of cloth sample 1 are heterogeneous. With a probability factor of 97.5%, it is as though the three laboratories had worked on two different samples. Several explanations for this are possible:
1. The different chemical treatments affected the measurements, but only in the case of sample 1. Unacceptable.
2. The equipment was calibrated in different ways. But this only affected sample 1. Ditto!
3. Under the title “sample 1”, the three laboratories received and analysed different samples. In the light of Brother Bruno Bonnet-Eymard’s checks on the weights and the dimensions, we are led to have suspicions about the identity of the different samples or the integrity, horresco referens! of those who handled them.