V.  FOUR CLOTHS AND THREE LABORATORIES:
STATISTICAL DISCREPANCIES

     In our Appeal to the “twenty-one” ten years ago, we exposed «a major flaw in the statistical analysis» (5). The measurements provided by the three laboratories are completely heterogeneous for one of the cloths, sample number 1, the very sample that is supposed to come from the Holy Shroud, whereas the measurements for the three other cloths are very homogeneous. The Abbé de Nantes noticed this on the very first day: «I paused at figure 1 of this report in Nature (our figure 27, infra, p. 35). Two years later, knowing that my intuition was not mistaken and how all the proofs converged to unmask the hoaxers, I remain astounded at the audacity of Dr Tite in daring to exhibit this naked diagram of his crime before the inquisitive gaze of the scientific and believing worlds.» (6)

     Let us look at it again:

Fig 27: First diagram in the report published by the review Nature, 16 February 1989, summarising all the results obtained by the three laboratories (A, Arizona; O, Oxford; Z, Zurich) in radiocarbon age, that is to say, in the number of years before the present era (1950), the conventional age directly measured by carbon 14, before any calibration and conversion into calendar age.
     Each horizontal line represents the range of results for a laboratory, identified by its initial.
     Squadron number 1 is the sample substituted for the Holy Shroud: the strip of cloth measuring 1 x 7. Curiously, it is the only one to present a certain conflict among the three laboratories, a divergence which contrasts with the magnificent harmony of the three other results provided by samples 2, 3 and 4, number 4 being the cope of Saint Louis d'Anjou.

     Our lines, added in red ink, underline the (all too) identical age of the supposed shroud and the cope of Saint Louis d'Anjou, both being the age demanded by Tite!

     On comparing the ranges of the results provided by the three laboratories for each sample and displayed in groups that look like “fighter squadrons”, the Abbé de Nantes wrote: «The first manifestation of the truth is so glaring that I remember on the very first day almost mechanically marking it with a vertical double line: Oxford’s little aeroplane in squadron number 1, which is so curiously separate from the others, is found to be on the same vertical dropline as another of Oxford’s aeroplanes in squadron 4, spanning the identical range of dates. Whereas the two other small aeroplanes in group 4, those of Zurich and Arizona, are here, as they should be, in good flight formation with Oxford.» (1)

     The discrepancy between Arizona and Oxford is too marked not to arouse the most legitimate of suspicions: did these two laboratories really work on samples from the same cloth? The long statistical exposition devoted to the interpretation of the results in Nature has no other purpose than to attempt to answer this question... positively (2).

     Since then, we have continued to develop our critical examination of the most technical part of the report in Nature (3), without receiving the slightest response from any of the “twenty-one” co-authors of this report, not even from Dr Tite or Mrs Leese, even though a question mark hangs over both of them and they stand accused of malpractice.

     Three experts, whose names we cannot reveal lest what happened to Timothy Linick happen also to them (supra, p. 34), have agreed to take up this investigation anew. I am but their reporter, and I am keen to express our gratitude for their remarkable work.

     The fact is that, under a showy technical apparatus which aims to rule out any suspicion of its defects, the report in Nature reveals an astounding lack of rigour. We do not intend to explain the statistical approach in its entirety, nor to make an exhaustive critique of it. Others have made this their business (4), and their criticisms are generally reliable on a technical level. We simply wish to highlight certain key elements in this report, ones which are moreover easy to grasp, and which will allow us to draw the necessary conclusions within the limits set by science, that is to say with a certain measurable probability.


MEASUREMENTS AND MEASUREMENT ERRORS

     In experimental science, no result is “exact”. We are limited by the imperfections of our machines, the readings on our computer screens and our numeration systems. A final result is never more than a means of obtaining other results, each individually subject to “error”.

     Let us take for example the measurement of a weight, frequently carried out in experiments with the aid of a weighing scale. A set of measurements is obtained which is distributed around a central value, the “mean”, and this distribution is itself regulated by a “law of probability” called normal distribution. This is characterised by two parameters: the mean m and the standard deviation, commonly designated by the Greek letter σ. Generally speaking, the standard deviation represents the scatter of a distribution around its mean. A small standard deviation indicates that the results are closely grouped around m; a large standard deviation signifies an almost uniform distribution or one that consists of abnormal values widely separated from m. The standard deviation is often understood as the probable error of a measurement. Results are therefore presented in the form m ± σ. Where “normal distribution” holds true, we know that about 68% of measurements must be between m - σ and m + σ, and about 95% between m - 2σ and m + 2σ.
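
     The 68% and 95% figures can be checked with a few lines of Python. This is only a sketch: the function name normal_coverage is ours, and the calculation assumes an exactly normal distribution, which is precisely the hypothesis under discussion.

```python
import math


def normal_coverage(k: float) -> float:
    """Probability that a normally distributed measurement lies
    within k standard deviations of its mean (m - k*sigma, m + k*sigma)."""
    return math.erf(k / math.sqrt(2.0))


for k in (1, 2):
    # Prints the share of measurements expected within m ± kσ
    print(f"within m ± {k}σ: {normal_coverage(k):.1%}")
```

     The two values come out slightly above 68% and 95%, which is why these percentages are usually quoted as round figures.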

     Of course, different distributions will provide different intervals. So which do we choose? In the present case, that of the measurements of radiocarbon age, it appears that the measurements are not grouped around their mean in a way that meets the requirements of normal distribution. That is why Jacques Évin aptly summarised the situation when he recognised that it was necessary to «fiddle the figures» in order to arrive at the mean (supra, p. 30). Spectrometers are recent machines and there is a lack of experience with them, and there is also perhaps a lack of statistical knowledge on the part of the experimenters. As Dr Tite told us, in a moment of sincerity: «I am not a statistician.» (1) When a physicist knows the law of probability, he is tempted to apply it without asking himself too many questions... This is one of the principal and most reasonable criticisms made by commentators against the report in Nature.

     This remark is of importance, for the statistical procedures of Mrs Leese are entirely based on the preliminary hypothesis of normal distribution; that is why she blames the small number of results with which she was provided and with which she had to be content to arrive at her means (2). The famous “margin” of one hundred and thirty years – 1260-1390! – (supra, fig. 19), with a confidence level of at least 95%, springs directly from this hypothesis.

     The first table (below) in the Nature report presents results which are precisely of the type m ± σ. Each one of the results listed summarises multiple experiments. For example, the first result, indicated by the abbreviation A1.1b, is 591 ± 30. It is obtained by combining two results: 606 ± 41 and 574 ± 45, each of these summarising around twenty measurements (3). The final mean, 591, which represents the radiocarbon age of this piece of cloth no 1 analysed by the laboratory of Arizona, is therefore a mean of the preceding measurements, but a “weighted” mean, which uses a procedure we shall come back to. As a result, the real age is probably between the values 591 - 30 and 591 + 30, and still more probably between 591 - 60 and 591 + 60.

     The standard deviation of 30 is estimated empirically, based either on previous experiments or on the dispersion of all the measurements obtained. The laboratory of Arizona had the honesty to provide the details of certain of its calculations, whereas the statistical procedures of Oxford, and especially those of Zurich, remain in a thick fog. Arizona seems more optimistic than the others, in the sense that its standard deviations are generally smaller.

 

Table 1   Basic data (individual measurements)

            Sample 1          Sample 2             Sample 3               Sample 4             Pretreatment and replication codes

Arizona     AA-3367           AA-3368              AA-3369                AA-3370
            A1.1b* 591 ± 30   A2.1b 922 ± 48       A3.1b 1,838 ± 47       A4.1b 724 ± 42
            A1.2b 690 ± 35    A2.2a 986 ± 56       A3.2a (1) 2,041 ± 43   A4.2a 778 ± 88       a, method a
            A1.3a 606 ± 41    A2.3a (1) 829 ± 50   A3.3a 1,960 ± 55       A4.3a (1) 764 ± 45   b, method b
            A1.4a 701 ± 33    A2.4a (2) 996 ± 38   A3.4a (2) 1,983 ± 37   A4.4a (2) 602 ± 38   ( ), same subsample
                              A2.5b 894 ± 37       A3.5b 2,137 ± 46       A4.5b 825 ± 44
δ13C (‰)    -25.0             -23.0                -23.6                  -25.0

Oxford      2575              2574                 2576                   2589
            O1.1u 795 ± 65    O2.1u 980 ± 55       O3.1u 1,955 ± 70       O4.2u 785 ± 50       u, unbleached
            O1.2b 730 ± 45    O2.1b 915 ± 55       O3.1b 1,975 ± 55       O4.2b (1) 710 ± 40   b, bleached
            O1.1b 745 ± 55    O2.2b† 925 ± 45      O3.2b 1,990 ± 50       O4.2b (2) 790 ± 45   ( ), same pretreatment/run combination
δ13C‡ (‰)   -27.0             -27.0                -27.0                  -27.0

Zurich      ETH-3883          ETH-3884             ETH-3885§              ETH-3882
            Z1.1u 733 ± 61    Z2.1u 890 ± 59       Z3.1u 1,984 ± 50       Z4.1u 739 ± 63       u, ultrasonic only
            Z1.1w 722 ± 56    Z2.1w 1,036 ± 63     Z3.2w 1,886 ± 48       Z4.1w 676 ± 60       w, weak
            Z1.1s 635 ± 57    Z2.1s 923 ± 47       Z3.2s 1,954 ± 50       Z4.1s 760 ± 66       s, strong
            Z1.2w 639 ± 45    Z2.2w 980 ± 50                              Z4.2w 646 ± 49
            Z1.2s 679 ± 51    Z2.2s 904 ± 46                              Z4.2s 660 ± 46
δ13C|| (‰)  -25.1             -23.6                -22.0                  -25.5

In years BP, corrected for δ13C fractionation; errors at 1σ level; see text for pretreatment details.
* The identification code for each measurement shows, in order, the laboratory, sample, measurement run, pretreatment and any replication involved.
† One anomalous replicate (of 6) obtained for independent measurement O2.2b; if rejected, it reduces date by 40 yr; final date quoted actually reduced by 20 yr.
‡ Measured for samples 1 and 3; assumed for samples 2 and 4.
§ The loose weave of sample Z3.1 led to its disintegration during strong and weak chemical treatments. Z3.2 was centrifuged to avoid the same loss of material.
|| Average of separate determinations by AMS.

     As table 1 contains such an excess of information, it is for the statistician to extract its essential characteristics. Four pieces of cloth passed to three laboratories leads to a table of twelve m ± σ values, calculated in accordance with the “weighting” formulas proposed by Ward and Wilson, two authors who are referred to in the report of the “twenty-one”. The idea is to increase the weight of those values which are associated with smaller standard deviations, which is supposed to guarantee a higher degree of precision.

     For example, the three values obtained for sample 3 analysed by Zurich are 1,984 ± 50, 1,886 ± 48 and 1,954 ± 50. If we use the letter i to refer to measurements 1, 2 and 3, then the weight of each measurement i is as follows:

ai  =  (1/σi²) / (1/σ1² + 1/σ2² + 1/σ3²),   which gives a1 = a3 ≈ 0.324 and a2 ≈ 0.352

     It can be seen that the sum of the weights is equal to 1. Where mi ± σi is the result of normal independent variables, the weighted mean of the three results is:

m  =  m1a1 + m2a2 + m3a3  =  1,940

     And the weighted standard deviation is:

σ  =  1 / √(1/σ1² + 1/σ2² + 1/σ3²)  ≈  28

     We can proceed in the same way, using the same hypothesis of normal distribution, for the eleven other samples. We then obtain a table of the twelve following results:

Sample      1           2           3            4
Arizona     646 ± 17    927 ± 20    1995 ± 20    723 ± 20
Oxford      749 ± 31    938 ± 29    1977 ± 33    756 ± 26
Zurich      676 ± 24    941 ± 23    1940 ± 28    685 ± 25
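
     The weighting can be sketched in a few lines of Python. We assume here that the weights are the usual inverse-variance weights (each weight proportional to 1/σ²); this is our reading of the procedure, and the function name combine is ours. Under that assumption the sketch gives back 1,940 ± 28 for Zurich's sample 3 and 646 ± 17 for Arizona's sample 1, using the individual results of table 1.

```python
import math


def combine(results):
    """Inverse-variance weighted mean and standard deviation
    of a list of (m, sigma) pairs (assumed weighting scheme)."""
    inv_var = [1.0 / sigma**2 for _, sigma in results]
    total = sum(inv_var)
    weights = [w / total for w in inv_var]  # the weights a_i; they sum to 1
    mean = sum(a * m for a, (m, _) in zip(weights, results))
    sigma = 1.0 / math.sqrt(total)
    return mean, sigma


# Individual results taken from table 1
zurich_3 = [(1984, 50), (1886, 48), (1954, 50)]
arizona_1 = [(591, 30), (690, 35), (606, 41), (701, 33)]

for label, data in (("Zurich, sample 3", zurich_3), ("Arizona, sample 1", arizona_1)):
    m, s = combine(data)
    print(f"{label}: {m:.0f} ± {s:.0f}")
```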

     We now compare these results with the twelve results below taken from table 2 of the Nature report. We observe that the means correspond closely, with differences of no more than two or three units. But when we come to the standard deviations, a surprise awaits us: although our figures are identical in certain cases, they are clearly much smaller for each of Arizona’s results.

Sample      1           2           3            4
Arizona     646 ± 31    927 ± 32    1995 ± 46    722 ± 43
Oxford      750 ± 30    940 ± 30    1980 ± 35    755 ± 30
Zurich      676 ± 24    941 ± 23    1940 ± 30    685 ± 34

Extract from Table 2 in the Nature report 

     Conclusion: “One” has made use of the formula of Ward and Wilson except where the standard deviations were too small; in these cases, they have been replaced by other values. What values? This is a mystery. Now we are going to see the importance that this fact takes on in the χ2 test: if we keep the values obtained by the formulas of Ward and Wilson, this test does not work at all...


THE χ2 TEST

     This test constitutes a necessary check on the homogeneity of the results, sample by sample. For example, suppose the chemical treatment were to change the nature of certain samples and that, for this reason, the means provided by each laboratory differed excessively from each other. In this case, the final mean would scarcely be justified and, most importantly, one would not be able to apply a confidence range to it. This test is therefore a good precaution. Still, it needs to be applied in accordance with the rules.


Sample 3.

     Let us apply the χ2 test to sample 3, supposed to have come from the collection of the Department of Egyptian Antiquities at the British Museum, «associated with an early second century AD mummy of Cleopatra from Thebes» (1). Table 2 provides the values of the three laboratories:

mA ± σA,   mO ± σO,   mZ ± σZ

     The weighted mean of the three is:

m3 = 1977

     The hypothesis to be tested, designated by the letter H, is as follows:

     (H) These three values come from one and the same normal distribution, with a mean of m3.

     If this hypothesis is true, the three values must appear to be well grouped around m3. So we calculate the value of χ2 which is a measure of the scatter in relation to m3:

χ2  =  (mA - m3)² / σA²  +  (mO - m3)² / σO²  +  (mZ - m3)² / σZ²

     A small value will confirm hypothesis (H). On the other hand, if χ2 is too large, we will be forced to reject the hypothesis. The law of probability regarding χ2 is well known. So we can calculate the probability of our being wrong if we reject the hypothesis: it is the significance level p.

     With three values, and therefore two degrees of freedom, the figure generally used for the rejection value is 5.99, for which p = 5%. If we find χ2 greater than or equal to 5.99, we know that there is at least a 95% chance that (H) is false. It is customary in such situations to reject the hypothesis of homogeneity until further information is available. In any case, no statistician worthy of the name would continue his analysis by looking for a confidence range for the age of sample 3 in accordance with the law of normal distribution.

     But the reader may be reassured: for sample 3, we find that χ2 = 2.5, well below the rejection value. Hypothesis (H) is accepted: the values obtained for sample 3 are homogeneous. Now that we know the distribution is normal, we may try to find the confidence range. The standard deviation σ3 works out as 15 if we use our table, or 20 if we use the table in the Nature report. In the first case, we can deduce a radiocarbon age between 1947 and 2007 with 95% confidence; in the second case, a radiocarbon age between 1937 and 2017 with 95% confidence. In either case, the result is impeccable from a statistical point of view. «Four stars», as Jacques Évin would say, and with no «fiddling of the figures»!

     Now here is the table summarising our calculations for all four samples:

Sample     1        2        3        4
mi         672      934      1977     720
χ2         8.64     0.24     2.5      4

The χ2 test using our calculations

     Compare this with table 2 in the Nature report:

Sample     1        2        3        4
mi         689      937      1964     724
χ2         6.4      0.1      1.3      2.4

The χ2 test extracted from Table 2 of the Nature report
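
     The reader who wishes to redo the test can use the following Python sketch. It assumes that the weighted mean is the inverse-variance mean and that χ2 is the sum of the squared deviations from that mean, each divided by σ²; the function names are ours. For two degrees of freedom the significance level has the simple closed form exp(-χ2/2). The input values are the rounded entries of the two tables of twelve results, so the output agrees with the χ2 rows above only to within rounding.

```python
import math

REJECTION_VALUE = 5.99  # 5% level for two degrees of freedom


def chi2_homogeneity(results):
    """Weighted mean and chi-square scatter of (m, sigma) pairs."""
    inv_var = [1.0 / sigma**2 for _, sigma in results]
    mean = sum(w * m for w, (m, _) in zip(inv_var, results)) / sum(inv_var)
    chi2 = sum((m - mean) ** 2 / sigma**2 for m, sigma in results)
    return mean, chi2


def p_value_2dof(chi2):
    """Upper-tail probability of chi-square with two degrees of freedom."""
    return math.exp(-chi2 / 2.0)


# Laboratory results (Arizona, Oxford, Zurich) for samples 1 and 3
ours = {1: [(646, 17), (749, 31), (676, 24)],
        3: [(1995, 20), (1977, 33), (1940, 28)]}
nature = {1: [(646, 31), (750, 30), (676, 24)],
          3: [(1995, 46), (1980, 35), (1940, 30)]}

for name, table in (("ours", ours), ("Nature", nature)):
    for sample, results in table.items():
        mean, chi2 = chi2_homogeneity(results)
        verdict = "reject (H)" if chi2 >= REJECTION_VALUE else "accept (H)"
        print(f"{name}, sample {sample}: m = {mean:.0f}, "
              f"chi2 = {chi2:.2f}, p = {p_value_2dof(chi2):.3f} -> {verdict}")
```

     With either set of standard deviations, sample 1 falls on the rejection side of 5.99 and sample 3 on the acceptance side.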

     Samples 2, 3 and 4 easily pass the test, in both tables. It is reasonable to accept the homogeneity of their results. One could almost regard the results for each of the three samples as having been provided by one spectrometer, one chemical treatment process and one man... consequently, by one sample!

     The same cannot be said for sample 1, as it is clear from both tables that one would have to reject the hypothesis of homogeneity. This is very obvious from our results where the figure of 8.64 is a long way from 5.99; it is less obvious from the results in Nature, since the figure 6.4 is not so far from the rejection value. But all the same, it remains above it, and the hypothesis should therefore be rejected.

     It is here that we come face to face with the importance of the standard deviation calculations carried out earlier. By bumping up these figures, the value of χ2 is reduced and (H) becomes more acceptable. The “twenty-one” co-authors of the Nature report find it very acceptable in any case, and they want to go even further still: even to the point of calculating a confidence range by appealing to Student’s distribution theory (cf. Eng CRC no 238, p. 21 and 24)! We will not follow them on to this terrain. A Student test is not theoretically admissible unless hypothesis (H) is well founded. This is far from being the case with sample 1.

     Therefore, the statistical analysis in the Nature report appears to be vitiated right from the start... However, they do give us the results provided by the spectrometers. How should we interpret these?


ANOTHER TEST

     Let us leave the hypothesis of normal distribution and the standard deviations in table 1 of the Nature report which were obtained in rather murky circumstances. There still remain the means found in this same table 1. Let us summarise them:

Sample      1        2        3        4
Arizona     591      922      1,838    724
            690      986      2,041    778
            606      829      1,960    764
            701      996      1,983    602
                     894      2,137    825
Oxford      795      980      1,955    785
            730      915      1,975    710
            745      925      1,990    790
Zurich      733      890      1,984    739
            722      1,036    1,886    676
            635      923      1,954    760
            639      980               646
            679      904               660

Dates proposed by the three laboratories for the four samples

     Let us try to verify whether this data is really homogeneous, avoiding any hypothesis based on the probabilist model (normal distribution or not). There exists a test for this purpose: the Wilcoxon-Mann-Whitney rank test, which is very simple to perform.

     Two series of measurements are involved, one of m results, ranging from x1 to xm, the other of n results. By convention m is less than or equal to n. It is a matter of verifying whether these two series form part of the same homogeneous set or whether, on the contrary, they measure two different quantities.

     The hypothesis (H) can be expressed as previously:

     (H) These two series of measurements form one single homogeneous series of m + n values.

     To test H, we first arrange these m + n values in ascending order; then we calculate the rank of each value xi in the new series and the sum total of these ranks:

W = rank (x1) + ... + rank (xm).

     If (H) is true, ranks (xi) will be uniformly distributed between 1 and m + n; their mean W/m will be close to the mean of m + n ranks, that is (m + n + 1) / 2.

Or else W will be close to the value m (m + n + 1) / 2.

Fig. 28: Location where the sample was taken on 21 April 1988, exposing the holland cloth on which the Holy Shroud was «sewn with tacking stitches» in 1534 by the Poor Clares of Chambéry (supra, p. 21).

     Now, we know the theoretical distribution of W in hypothesis (H); and for each value of m and n, the tables allow us to find the two values Wmin and Wmax between which W must lie with a probability of 95%. These two values are symmetrical with respect to the central value m (m + n + 1) / 2. Therefore the test proceeds as follows:

     We calculate W. If W is between Wmin and Wmax, we can accept hypothesis (H). Otherwise we will reject it, for in that case ranks (xi) do not have a uniform distribution and it is therefore at least 95% probable that the two series of measurements are not homogeneous.

     The chief interest of this test clearly lies in the fact that it presupposes nothing about the distribution of the two series of numbers (in particular, the hypothesis of normal distribution). On the other hand, it does not tell us much if the gap [Wmin, Wmax] is wide, as in this case it would be difficult to reject (H). In other words, it is an optimistic test which favours homogeneity and which refuses to be used in any other way... Let us just add that there exists another test of the same type for comparing three series of measurements. But when the numbers in each series are small, it is preferable to proceed by two-sample tests.


Application to sample 1.

     The twelve measurements for sample 1 (cf. column 1 in the table on page 38) are arranged in ascending order: 591, 606, 635, 639, 679, 690, 701, 722, 730, 733, 745, 795, then replaced by A, Z or O according to their laboratory of origin. This gives the series: 

A, A, Z, Z, Z, A, A, Z, O, Z, O, O.

     Let us test the hypothesis: (H) The values of Oxford are homogeneous with the combined values of Arizona and Zurich.

     Here m = 3 and n = 9. Using a significance level of 0.05, the table gives Wmin = 10, and the central value is m (m + n + 1) / 2 = 19.5, therefore Wmax = 29.

     Oxford's results occupy ranks 9, 11, 12. Therefore W = 32, well above Wmax. We can therefore reject (H) with a confidence level of 95%. We can even go one better: if a significance level of 0.025 is used, we get: Wmin = 8, therefore Wmax = 31, still smaller than 32.

     In conclusion: With a probability factor of 97.5%, we must reject the hypothesis that Oxford and Arizona + Zurich measured samples possessing the same characteristics.
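
     Since m and n are small, the tables can be bypassed altogether by listing every possible set of ranks. The Python sketch below does this for sample 1; the function names are ours, and the ranking assumes there are no tied values, which is the case for this sample. It gives W = 32 for Oxford and counts how many of the 220 ways of choosing 3 ranks out of 12 give a sum at least as large.

```python
from itertools import combinations

# Dates for sample 1, taken from the table of dates by laboratory
sample_1 = {
    "A": [591, 690, 606, 701],
    "O": [795, 730, 745],
    "Z": [733, 722, 635, 639, 679],
}


def rank_sum(lab, data):
    """Sum of the ranks of one laboratory's values in the pooled
    ascending series (assumes no tied values)."""
    pooled = sorted(v for values in data.values() for v in values)
    return sum(pooled.index(v) + 1 for v in data[lab])


def upper_tail(w, m, total):
    """Exact probability, under (H), that m ranks drawn at random
    from 1..total have a sum of w or more."""
    subsets = list(combinations(range(1, total + 1), m))
    return sum(1 for s in subsets if sum(s) >= w) / len(subsets)


w = rank_sum("O", sample_1)
p_one_sided = upper_tail(w, 3, 12)
print(f"W = {w}, one-sided tail probability = {p_one_sided:.4f}")
```

     Only two of the 220 possible sets of ranks reach a sum of 32 or more, so the one-sided probability is 2/220; doubling it for a two-sided test gives about 0.018, which is consistent with rejection at the 0.025 level.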


Application to sample 2.

     The thirteen measurements for sample 2 in the second column are arranged in ascending order, which gives the series:

A, Z, A, Z, O, A, Z, O, O - Z, A, A, Z.

where O - Z corresponds to the two 980 dates. It seems evident on first view that the mix is perfect. If, for example, we test the homogeneity of O against A + Z, we find that m = 3, n = 10, the central value m (m + n + 1) / 2 = 21 and, with a 0.05 significance level, Wmin = 10 and Wmax = 32. Now the ranks of O are 5, 8, and 9 or 10, which makes W = 22 or 23, very close to 21. The hypothesis of homogeneity is safe.
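
     The tie between the two 980 dates can also be handled by giving both the average of the ranks they share, here 9.5. This “midrank” convention is an assumption on our part, not something stated in the report, and the function name midranks is ours. A Python sketch under that assumption:

```python
def midranks(values):
    """Map each distinct value to its average rank in the pooled
    ascending series (tied values share the mean of their ranks)."""
    ordered = sorted(values)
    ranks = {}
    for v in set(ordered):
        first = ordered.index(v) + 1          # rank of the first occurrence
        last = first + ordered.count(v) - 1   # rank of the last occurrence
        ranks[v] = (first + last) / 2
    return ranks


# Dates for sample 2, taken from the table of dates by laboratory
sample_2 = {
    "A": [922, 986, 829, 996, 894],
    "O": [980, 915, 925],
    "Z": [890, 1036, 923, 980, 904],
}

pooled = [v for values in sample_2.values() for v in values]
ranks = midranks(pooled)
w_oxford = sum(ranks[v] for v in sample_2["O"])

m, n = 3, 10
central = m * (m + n + 1) / 2
print(f"W = {w_oxford}, central value = {central}")
```

     With this convention W falls between the two values 22 and 23 quoted above, and well inside the interval from Wmin = 10 to Wmax = 32.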


Application to samples 3 and 4.

     The eleven measurements in the third column are arranged in ascending order: A, Z, Z, O, A, O, A, Z, O, A, A.

     The thirteen measurements in the fourth column give the series: A, Z, Z, Z, O, A, Z, Z, A, A, O, O, A.

     For each of these series, any of the tests A against O + Z, O against A + Z, or Z against A + O leads us to accept the hypothesis of homogeneity.


CONCLUSION

     The preceding tests prove that the measurements of cloth samples 2, 3 and 4 are homogeneous. The three laboratories therefore worked as “a single man”, despite their differences of chemical treatment and test protocol.

Fig. 29: (from left to right), Hedges, Donahue, Hall, Damon and Wölfli in the canons’ stalls at Turin on 21 April 1988. They seem like condemned men, condemned not by the Inquisition, but by... statistics!

     It is no less proven that the measurements of cloth sample 1 are heterogeneous. With a probability factor of 97.5%, it is as though the three laboratories had worked on two different samples. Several explanations for this are possible:

     1. The different chemical treatments affected the measurements, but only in the case of sample 1. Unacceptable.

     2. The equipment was calibrated in different ways. But this only affected sample 1. Ditto!

     3. Under the title “sample 1”, the three laboratories received and analysed different samples. In the light of Brother Bruno Bonnet-Eymard’s checks on the weights and the dimensions, we are led to have suspicions about the identity of the different samples or the integrity, horresco referens! of those who handled them.


Page 34
(5) SS II, p. 156.  –  (6) Eng CRC no 238, p. 37.

Page 35
(1) Eng CRC no 238, p. 37.  –  (2) Cf. Eng CRC no 295, p. 24.  –  (3) Cf. Statistical analysis and its strange liberties, Eng CRC no 238, p. 13-28.  –  (4) R Van Haelst and R P Jouvenroux at the Rome symposium (1993).

Page 36
(1) Inquiry into the laboratories, in Eng CRC no 238, p. 19.  –  (2) Ibid.  –  (3) Unpublished results from Arizona, ibid., p. 22-23.

Page 37
(1) Eng CRC no 238, p. 16. The full text of the report in Nature can be found on pages 14, 16, 18 and 20 of this same edition.

