Not random enough

In the story about Research 2000, a polling firm was caught out when statisticians noticed its data were not random enough. There have been other, more famous cases in which a lack of randomness cast doubt on statistical results.

Gregor Mendel

Gregor Mendel (wiki), from experiments on peas, discovered the basic laws of genetic inheritance. (His 1865 paper in English). In 1936, R. A. Fisher, the Babe Ruth of statisticians, burrowed through Mendel's data, assessing the goodness-of-fit of the reported data to the genetic theory. For example, in one set of "bifactorial" experiments, 529 plants were classified according to the genotype of their seeds' form (A = round or a = wrinkled) and color (B = yellow or b = green). (Thus each plant had two form letters and two color letters.) The results (as in Fisher's Table I):
Observed (Theory)      AA          Aa          aa
BB                     38 (1)      60 (2)      28 (1)
Bb                     65 (2)     138 (4)      68 (2)
bb                     35 (1)      67 (2)      30 (1)
Theory says the observations should be in the ratio 1:2:1 across each row and down each column, so the interior of the table should follow the ratios shown in parentheses. Do the data fit the theory? A chi-squared test has observed value 2.8110 on 8 degrees of freedom, which yields a p-value of 0.9457, very large, meaning the data fit very well. Fisher looked at several other chi-squared tests. Here's a summary (from Fisher's Table V):
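Fisher's bifactorial chi-square is easy to reproduce. A minimal sketch (using scipy; the variable names are mine, not Fisher's):

```python
from scipy.stats import chisquare

# Observed counts from Fisher's Table I, row by row (BB, Bb, bb).
observed = [38, 60, 28,
            65, 138, 68,
            35, 67, 30]

# Theory: the nine cells should be in the ratios 1:2:1 by 1:2:1,
# i.e., proportions (1,2,1,2,4,2,1,2,1)/16 of the 529 plants.
ratios = [1, 2, 1, 2, 4, 2, 1, 2, 1]
expected = [529 * r / 16 for r in ratios]

# chisquare uses k - 1 = 8 degrees of freedom here, as Fisher did
# (the expected proportions are fully specified by the theory).
stat, p = chisquare(observed, expected)
print(round(stat, 4), round(p, 4))
```

This recovers the chi-square statistic of 2.8110 and p-value of 0.9457 quoted above.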
Experiment            Chi-square statistic   Degrees of freedom   p-value
3:1 ratio: seeds              0.2779                  2           0.8694
3:1 ratio: plants             1.8610                  5           0.8680
2:1 ratio: seeds              0.5983                  2           0.7414
2:1 ratio: plants             4.5750                  6           0.5994
Bifactorial                   2.8110                  8           0.9457
Gametic ratios                3.6730                 15           0.9986
Trifactorial                 15.3224                 26           0.9511
Plant variation              12.4870                 20           0.8983

Total                        41.6056                 84           0.99997
Note that in every experiment the data fit very well. In fact, the chance that the overall data would fit as well as they did is only about 3/100,000. The data fit a little too well for comfort. Fisher's observation brewed up quite a storm, one that still rages. Did Mendel cheat? My opinion is no, but he probably didn't report all his results, e.g., the ones that he didn't think fit well. (Note that every experiment fit better than the median fit, which suggests consistency in Mendel's approach.) Nowadays that behavior would be considered a no-no, but back then the standards for statistical experimentation had not yet been established.
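The 3/100,000 figure comes from adding the chi-square statistics and degrees of freedom across the independent experiments, then asking how often a chi-square on 84 degrees of freedom would come out as small as 41.6056. A quick check (scipy assumed):

```python
from scipy.stats import chi2

# Totals from Fisher's Table V: chi-square statistics and degrees
# of freedom add across independent experiments.
total_stat, total_df = 41.6056, 84

# Probability of a fit at least this good, i.e., a chi-square
# statistic at least this small.
p_too_good = chi2.cdf(total_stat, total_df)
print(p_too_good)  # about 3e-05
```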

An interesting paper, A Statistical Model to Explain the Mendel-Fisher Controversy by Ana M. Pires and João A. Branco (Statistical Science, 2010, Vol. 25, pp. 545-565), presents a model that possibly explains Mendel's results. Briefly, the idea is that "the data to be presented can be modeled by assuming that an experiment is repeated whenever its p-value is smaller than α, where 0 ≤ α ≤ 1 is a parameter fixed by the experimenter, and then only the one with the largest p-value is reported." Mendel's reported data fit this model very well, suggesting something close to this process could have been followed by Mendel, whether formally or not.
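The selection effect behind this model is easy to demonstrate by simulation. The sketch below is my own toy version, not Pires and Branco's fitted model: it repeats a fair 3:1-ratio binomial experiment whenever its p-value falls below α and reports the largest p-value seen, so the reported p-values pile up well above what an honest report of every run would show.

```python
import numpy as np
from scipy.stats import binomtest

rng = np.random.default_rng(0)

def reported_p(n=600, alpha=0.2, rng=rng):
    """Run a 3:1-ratio experiment of n plants; redo it while its
    p-value is below alpha, then report the largest p-value seen."""
    best = 0.0
    while True:
        x = int(rng.binomial(n, 0.75))          # true 3:1 ratio holds
        p = binomtest(x, n, 0.75).pvalue
        best = max(best, p)
        if p >= alpha:
            return best

ps = [reported_p() for _ in range(2000)]
print(np.mean(ps))  # well above the 0.5 an honest experimenter would average
```

Even with a true 3:1 ratio, every reported p-value is at least α, and the average reported p-value is pulled toward 1, which is the "too good to be true" pattern Fisher found.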


Cyril Burt

Another British scientist, the educational psychologist Cyril Burt, has had his research questioned because the data are not random enough (among other things). See The Cyril Burt Affair. A key element of Burt's theory rested on data he collected on identical twins reared separately (so they'd have the same genetics but different environments). The correlation coefficient between the IQs of the two individuals in 53 such pairs of twins was 0.771. Very high! The data for the 53 pairs were collected over time, and the correlation was updated as more data came in. The three stages yielded the following:
Year reported   # Twins so far   Correlation coefficient so far
1943                 15                    0.770
1955                 21                    0.771
1966                 53                    0.771
That is, the 15 are part of the 21, and the 21 are part of the 53. What's suspicious is that adding 6 pairs moved the correlation coefficient by only 0.001, and the subsequent 32 additional pairs didn't move the correlation at all! Did he just make up those extra twins, to reinforce his theory? The controversy still rages. But there still is a question of how unlikely it is to have such closely agreeing correlations in such a situation, without cheating. How about simulation? First, I simulated 53 bivariate normal observations from a population with correlation 0.771; calculated the correlation coefficient for the first 15 observations, first 21 observations, and complete set of 53 observations; then rounded the three values to three decimal places. I did this a million times. The first few sets of coefficients:
0.838 0.722 0.800
0.767 0.821 0.786
0.792 0.805 0.819
0.816 0.786 0.847
0.760 0.756 0.753
What is the chance the three are as close to each other as (0.770, 0.771, 0.771) are? Of the million simulated triples, 47 triples were exactly equal (e.g., 0.749, 0.749, 0.749), and 247 were just as close as the observed (e.g., 0.832, 0.833, 0.833 and 0.796, 0.797, 0.797). Thus the answer is estimated to be 294/1,000,000, i.e., about 3/10,000. The chance is very small, but not infinitesimal. The fit-too-well chances of 3/100,000 for Mendel and 3/10,000 for Burt were small enough to raise eyebrows, but not quite the slam-dunk that the Research 2000 data showed, chances like 10^-228.
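The simulation described above can be sketched as follows (numpy assumed; 10,000 replications rather than a million, to keep the run short):

```python
import numpy as np

rng = np.random.default_rng(1)
rho = 0.771
cov = [[1, rho], [rho, 1]]  # bivariate normal, correlation 0.771

def one_triple(rng):
    """Correlations for the first 15, first 21, and all 53 pairs,
    rounded to three decimals, for one simulated data set."""
    xy = rng.multivariate_normal([0, 0], cov, size=53)
    return tuple(round(np.corrcoef(xy[:n, 0], xy[:n, 1])[0, 1], 3)
                 for n in (15, 21, 53))

# Count triples whose three correlations land within 0.001 of each
# other, as Burt's (0.770, 0.771, 0.771) did.
reps = 10_000
close = sum(max(t) - min(t) <= 0.001
            for t in (one_triple(rng) for _ in range(reps)))
print(close / reps)
```

With only 10,000 replications the estimate is noisy (the expected count of "close" triples is about 3), which is why a million replications were used for the 294/1,000,000 figure in the text.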