On statistical methods to test if sampling in trials is genuinely random


“Man is an orderly animal. He finds it very hard to imitate the disorder of nature.” [1].

Humans are not good at identifying randomness: our minds naturally look for patterns, even when there are none. Furthermore, we are poor at creating random data. Famously, as a result of listener complaints, the first iPod ‘shuffle function’ had to be changed to make it less random, but appear more random to the human ear (see http://electronics.howstuffworks.com/ipod-shuffle2.htm).

Random sampling in research (e.g. by computer- rather than human-generated random numbers) importantly reduces the potential for bias. In this issue of Anaesthesia, Carlisle offers persuasive evidence that the sampling upon which the results of Fujii’s many published trials are based is so unlikely to have arisen by chance that it is appropriate to exclude those results from further scientific consideration [2]. The purpose of this commentary is to try to simplify Carlisle’s rigorous analysis so that readers might more easily follow his arguments.

Is a coin or set of dice fair?

“…it raises in a sharp and concrete way the question of what is meant by randomness, a question which, I believe, has not been fully worked out.” [1].

Carlisle was fundamentally interested in the ‘fairness’ of the sampling used in Fujii’s data. The mathematics evolved from the 16th century, from an interest in gambling where the fundamental question was: are the dice/coins/cards fair? Cardano formally investigated the statistics of gambling in his book Liber de Ludo Aleae, and the analysis was continued by Pascal and Fermat, in a famous correspondence that began when advising a mutual gambling friend [3].

Anaesthetists are quite used to statistical testing where, say, two groups are subjected to two different interventions (one of which may be control), and an outcome (e.g. blood pressure (BP)) is assessed using a t-test or nonparametric equivalent to generate a p value. Simply, this indicates the likelihood that the observed BP differences could have arisen from chance; i.e. p < 0.05 implies that the observed difference has a 5% probability or less of arising by chance (conventionally regarded as ‘significant’).

However, Carlisle was not interested in this sort of comparison. Rather than assess differences between Fujii’s test and control data, or between Fujii’s data and the results of other workers, Carlisle instead asked a more subtle question: if we confine our analysis solely to data within Fujii’s samples (and particularly the control samples), how likely is it that their reported distributions could have arisen by chance? (Separately within the paper he also asked this of other authors.) For example, were the relative proportions of males and females, the incidence of nausea/vomiting, etc., those that would be expected? To answer this entirely different question, Carlisle did not perform a statistical comparison of one experimental dataset versus another but rather a comparison of the experimental results (in the absence of any intervention) with those that would be expected by chance.

But how can we predict what chance can produce? We may be tempted to think that any pattern is possible but in fact, chance produces remarkably predictable outcomes in the long run. Carlisle used methods that parallel those described long ago by the biologist JBS Haldane (son of the Oxford physiologist JS Haldane) in two letters to Nature, describing his analysis of suspicious data [1, 4, 5]. Haldane, like Carlisle, drew back from an accusation of fraud, but likened the chance to a monkey typing out Hamlet by sheer luck. The p values found by Carlisle and Haldane are similar.

There are broadly two types of data in question: categorical, grouped into distinct types (e.g. male/female or headache/no headache); and continuous, having any value within a scale (e.g. BP, in mmHg).

We can try to understand the expected-by-chance distributions of categorical data by using much simpler analogies of tossing coins or throwing dice. For both, the results can only have fixed values of heads/tails or the numbers on the dice, but no value in between. Unsurprisingly, the probability of obtaining a certain value when throwing a single six-sided die is ∼16% (Fig. 1) but this is only the average expectation. The variance (V) of the observed proportion, i.e. the degree of departure from expected (or SD, which is √V), becomes smaller as the number of throws increases (Fig. 1). This is described by a mathematical function known as the binomial probability distribution (which applies to any case of independent events where there are only two possible outcomes; here, throwing a six on a die vs not throwing a six). If the number of throws is n and the probability of the event is p, then the mean rate (μ) of the event (in this case throwing a six) happening is given by:

$$\mu = np \qquad \text{(Equation 1)}$$

Figure 1.

 (a) Representation of the average expected probability (bars) of throwing the number (on x-axis) with a single die. The results of simulations throwing a single die 10, 50 and 250 times are plotted (for clarity) as lines, with the dotted line representing a slightly loaded die thrown 250 times. (b) Plot of the sum of squared errors against the number of dice throws, showing that with increasing throws, the result gets closer to the expected, becoming a trivial difference after ∼50 throws. The single dot (with arrow) shows the sum of squared errors for the loaded die (which has a similar sum of squared errors to the red line in panel (a), but after 250 throws rather than just 10).

In this case, μ is 16/100 throws, 32/200 throws, etc. The SD of this (which can be proved mathematically for the binomial distribution) is given by:

$$\mathrm{SD} = \sqrt{np(1 - p)} \qquad \text{(Equation 2)}$$

Therefore for 60 throws, the mean (SD) number of sixes should be ∼10 (3), for 100 throws it is ∼16 (4), for 1000 throws, it will be ∼160 (12), and so on. Readers should see that, because we now have a variance (or SD), we can use this to assess statistically the departures of any actual data from what is expected (I will not detail the calculations here). Thus if a friend offers a die that results in 30 sixes in 100 throws, we can use statistical testing (using the principles of variation above) to assess its fairness (the actual chance of this is p < 0.005; Fig. 1). Incidentally, another approach that can be applied to all the numbers thrown is to use the chi-squared test; this yields the same result.
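The arithmetic in this paragraph can be checked with a few lines of code. The following is only a minimal sketch in Python, not the calculation used by Carlisle: the function names `binomial_mean_sd` and `upper_tail` are invented for illustration, and the test shown is a one-sided exact binomial sum, which is just one of several reasonable ways of assessing the friend's die.

```python
from math import comb, sqrt


def binomial_mean_sd(n, p):
    """Mean and SD of the number of events in n independent trials."""
    return n * p, sqrt(n * p * (1 - p))


def upper_tail(n, p, k):
    """Exact probability of k or more events in n trials (binomial sum)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


if __name__ == "__main__":
    p_six = 1 / 6  # chance of a six with a fair die
    for n in (60, 100, 1000):
        mean, sd = binomial_mean_sd(n, p_six)
        print(f"{n:>4} throws: mean {mean:.1f}, SD {sd:.1f}")
    # The friend's die: 30 or more sixes in 100 throws of a fair die.
    print(f"P(30 or more sixes in 100) = {upper_tail(100, p_six, 30):.2e}")
```

The exact sum is used here rather than a normal approximation because, with only 100 throws and p well away from 0.5, the binomial distribution is noticeably skewed.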

When throwing two dice, the plot of possible totals now resembles something readily recognisable as a normal (Gaussian) distribution for continuous data, with 7 being the most likely total as it arises from the most combinations (Fig. 2). If our friend’s dice deviate from this overall pattern (Fig. 2), we know exactly how to calculate the probability of that result (this is the basis of the t-test or other tests using variances to compare datasets). Readers might compare the general forms in Fig. 2 with figures 2–10 and 13 in Carlisle’s paper [2].

Figure 2.

 (a) Simulation of the ideal sums of throwing two dice 100 times, resembling a normal (Gaussian) distribution. (b) The simulated sums of throwing two slightly loaded dice 100 times, where only the sums 4–8 appear.
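The expected pattern for two fair dice need not be simulated; it can be enumerated exactly from the 36 equally likely combinations. The sketch below (Python; the function name `two_dice_distribution` is an assumption made for illustration) counts the ways of reaching each total and prints a crude text histogram.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def two_dice_distribution():
    """Exact probability of each total when two fair dice are thrown."""
    ways = Counter(a + b for a, b in product(range(1, 7), repeat=2))
    return {total: Fraction(n, 36) for total, n in sorted(ways.items())}


if __name__ == "__main__":
    for total, prob in two_dice_distribution().items():
        bar = "#" * int(prob * 36)  # one mark per combination
        print(f"{total:>2} {bar:<6} {prob}")
```

A reported set of totals can then be compared against these exact proportions, for example with a chi-squared test.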

Similar considerations apply with coin throwing. With one coin, the ratio of a head (H) to a tail (T) is 1:1. With two coins, the ratio of HH, HT, TT is 1:2:1. With three coins, the ratio of 3H, 2H1T, 1H2T, 3T is 1:3:3:1, and so on. These ratios can be arranged to the pattern commonly known as Pascal’s triangle (Fig. 3), which is also obtained by a mathematical function known as the binomial expansion (a term used in Carlisle’s paper [2]). This is mathematically related to the binomial probability distribution described above. It is possible to expand any power of x + y, denoted (x + y)ⁿ, into an expression with a general form (Box 1).

Figure 3.

 Pascal’s triangle. By convention the 1st row (containing only 1) is called row zero.

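Box 1

$$(x + y)^2 = \mathbf{1}x^2 + \mathbf{2}xy + \mathbf{1}y^2 \qquad \text{(Equation 3)}$$

$$(x + y)^3 = \mathbf{1}x^3 + \mathbf{3}x^2y + \mathbf{3}xy^2 + \mathbf{1}y^3 \qquad \text{(Equation 4)}$$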

and so on. The coefficients (the bold numbers) form the numbers in Pascal’s triangle and are useful as short-cuts in probability calculations. For example, the answer to: ‘what is the chance of getting exactly 2 heads with 3 coin tosses?’ is obtained by looking at the 3rd row of the triangle, 2nd position along. The sum of the numbers in that row (indicating the total number of possible results; Fig. 3 and Equation 4) is 8, so that chance is 3/8, or 37.5%. To summarise: binomial probabilities can be described mathematically, in a manner linked to Pascal’s triangle, which is in turn a useful shortcut to the calculation of those probabilities.
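The link between Pascal's triangle and coin-tossing probabilities can be made concrete with a short sketch. This is illustrative Python only; `pascal_row` and `prob_heads` are names invented here, and rows are counted from zero, following the convention in Fig. 3.

```python
from fractions import Fraction
from math import comb


def pascal_row(n):
    """Row n of Pascal's triangle (row zero is [1])."""
    return [comb(n, k) for k in range(n + 1)]


def prob_heads(n_tosses, n_heads):
    """Chance of exactly n_heads heads in n_tosses tosses of a fair coin."""
    return Fraction(comb(n_tosses, n_heads), 2**n_tosses)


if __name__ == "__main__":
    for n in range(5):
        print(pascal_row(n))
    print("P(exactly 2 heads in 3 tosses) =", prob_heads(3, 2))
```

Each row sums to 2ⁿ, the total number of equally likely head/tail sequences, which is why dividing a coefficient by the row sum gives a probability.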

Superficially, there seems one limitation to applying these examples of coins and dice to real life: we know in advance the precise probability of their average outcomes. How can we know in advance how many headaches there should be in any group of people? The answer is that we don’t, but then we don’t need to. Instead, we can look for how symptoms like headaches (or other binomial factors like sex, etc) are distributed across randomly selected groups. If 100 women are randomly divided between two groups, we expect there to be 50 women in each group on average (but not precisely; SD = 5 by Equation 2). If the baseline incidence of headache is 10%, then in a group of 100 people there should be 10 headaches on average (but not exactly; SD = 3 by Equation 2). Therefore, the analyses do, in fact, resemble coin tossing. Reported distributions for such things can then be unusual in two ways. First, because they are more aberrant than expected (as in our friend’s single slightly-loaded die, Fig. 1) or second, because they are less variable than expected (as in the friend’s two slightly-loaded dice in Fig. 2). Either way, we can calculate (using the mathematics of binomial distribution) a p value for the difference between actual and expected distributions. In short: if we tried to fabricate a dataset, we would find it easy to approximate expected mean values, but very difficult to reproduce the expected variation in values, especially across a range of datasets and especially for binomial data.
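The two ways in which a reported distribution can be unusual (too aberrant, or too uniform) can be illustrated with a simple ratio of observed to expected variation. The sketch below is an assumption-laden illustration in Python, not Carlisle's method: `dispersion_ratio` is a hypothetical helper, and the group counts in the example are invented purely to show the calculation.

```python
from math import sqrt


def binomial_sd(n, p):
    """Expected SD of a binomial count (Equation 2)."""
    return sqrt(n * p * (1 - p))


def dispersion_ratio(counts, n, p):
    """Mean squared deviation of the counts from n*p, divided by the
    binomial variance n*p*(1-p). Values near 1 are what chance predicts;
    values far below 1 suggest too little variation, far above 1 too much."""
    expected = n * p
    observed = sum((c - expected) ** 2 for c in counts) / len(counts)
    return observed / (n * p * (1 - p))


if __name__ == "__main__":
    print("SD for 100 women split between two groups:", binomial_sd(100, 0.5))
    print("SD for headaches (10% of 100):", round(binomial_sd(100, 0.1), 2))
    # Invented counts of women in ten groups of 100 (illustration only).
    suspiciously_even = [50, 49, 51, 50, 50, 49, 51, 50, 50, 50]
    print("dispersion ratio:", dispersion_ratio(suspiciously_even, 100, 0.5))
```

A formal analysis would convert such a ratio into a p value; the ratio alone is shown here only to make the idea of 'less variable than expected' tangible.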

Carlisle also uses the notion of the ‘central limit theorem’ in his analysis of continuous data. Variously expressed, this has several important consequences for large datasets. First, the theorem states that when we take multiple samples from a population and measure a characteristic of interest, then a histogram of the sample means ever more closely resembles a normal distribution with an increasing number of samples, even if the histogram of the actual population is not normally distributed. This is a surprising but very fundamental and robustly proven principle of statistics (Fig. 4). Another aspect of the theorem is that as the number of samples increases, not only does the mean of all the samples ever more closely approximate the population mean, but its variability (whose SD is known technically as the standard error of the mean) becomes smaller in a precise way. All this is important because while any single random sample may differ greatly from another random sample, combining their means should follow the predictions of the central limit theorem. What Carlisle found for Fujii’s data is that even when the ‘less unusual’ data from trials were sequentially combined, the results became more, rather than less, deviant from expected distributions (see the dotted black line in Fig. 14 of Carlisle’s paper [2]).

Figure 4.

 Demonstration of central limit theorem. The underlying distribution of this characteristic (arbitrary units, x-axis) resembles a sine wave (a), where there are a very large number of data points (> 200 000). Repeated random sampling of 100 values 100 times from this population of points (b) and 100 values 10 000 times (c), and plotting the means of these sample values, yields a pattern ever closer to a normal distribution. Readers can check other distributions at: http://elonen.iki.fi/articles/centrallimit/index.en.html#demo.
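Readers who prefer to experiment can reproduce the spirit of Fig. 4 with a short simulation. The sketch below (Python) is not the code behind the figure: it assumes a skewed exponential population with mean 1 and SD 1 as a stand-in for the sine-wave-like population, and `sample_means` is a name invented for illustration.

```python
import random
from statistics import fmean, stdev


def sample_means(n_samples, sample_size, seed=0):
    """Means of repeated random samples from an exponential population
    (population mean 1, population SD 1; clearly not Gaussian)."""
    rng = random.Random(seed)
    return [
        fmean(rng.expovariate(1.0) for _ in range(sample_size))
        for _ in range(n_samples)
    ]


if __name__ == "__main__":
    for n_samples in (100, 10_000):
        means = sample_means(n_samples, sample_size=100)
        # Theory predicts a mean near 1 and an SD near 1/sqrt(100) = 0.1.
        print(f"{n_samples:>6} samples: mean of means {fmean(means):.3f}, "
              f"SD of means {stdev(means):.3f}")
```

Plotting a histogram of `means` should give a bell-shaped curve centred on the population mean, even though the population itself is strongly skewed.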

Does biological variation matter?

“In genetical work also, duplicates rarely agree unless they are faked.” [1].

One potential defence of an unusually-distributed dataset is that the vagaries of biology cause it to be so: patients can be odd or respond strangely. Yet even biology shows certain mathematically predictable patterns and statistical analysis can counter this ‘biological defence’ of unusual data in at least two ways.

The first rests upon an observation developed by GH Hardy. Many traits are strongly determined by genetic factors, and some are determined by ‘dominant’ alleles. It might be predicted that this would cause the population characteristic to gravitate towards the dominant trait over succeeding generations, resulting in ever narrower variations in human phenotype. But in contrast it is clear that the overall variation in many characteristics (e.g. height, weight) within a population remains constant (and often Gaussian) from one generation to the next. How can the constancy of Gaussian distribution be reconciled with a dominant effect of certain alleles? The answer is in part explained by the Hardy-Weinberg Law, which I have discussed before [6]. According to the Law, allele distributions are fixed for all generations (given conditions such as random mating and no breeder selection). For a characteristic governed only by two alleles, the relative proportions of homozygote recessives (pp), heterozygotes (pq) and homozygote dominants (qq) follow the distribution p²:2pq:q². These are (as Hardy well knew) the same proportions that describe the outcome of tossing two coins, represented by a binomial distribution (Equation 3, above, whose coefficients also correspond to the 2nd line of Pascal’s triangle – 1:2:1). Thus for binary characteristics, distributions should follow the proportions predicted by the Hardy-Weinberg Law, and any other proportions reported by an author must be regarded as unusual. Furthermore, Fisher extended this argument to multi-allele traits [7, 8] to show that where a large number of alleles made a small contribution to a continuous trait (e.g. height), the trait (i.e. phenotype in the population) would be normally distributed but each of the allele pairs would nonetheless follow the Hardy-Weinberg equilibrium (Fig. 5).
Therefore, if we wished to fabricate a dataset, we would face the difficult task of ensuring that the phenotype distribution in the population was Gaussian, but that the corresponding allele distributions in our invented data (if later discoverable from the information provided) conformed to the predictions of the Hardy-Weinberg Law (adapted for multiple alleles). This is easy for nature, but not so easy for us. Carlisle did not analyse Fujii’s data in this way, but the recent discovery of at least one allele associated with postoperative nausea and vomiting (and whose distribution in the population follows Hardy-Weinberg equilibrium) makes possible further analysis of Fujii’s voluminous data using these genetic principles [9].
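The two-allele proportions given by the Hardy-Weinberg Law, and the fact that allele frequencies do not drift under random mating, can be sketched as follows. This is illustrative Python only; the function names are invented, and the idealised conditions of the Law (random mating, no selection) are assumed.

```python
def hardy_weinberg(p):
    """Genotype proportions pp : pq : qq for allele frequencies p and q = 1 - p."""
    q = 1 - p
    return {"pp": p * p, "pq": 2 * p * q, "qq": q * q}


def next_generation_allele_frequency(p):
    """Frequency of the first allele among the gametes of the next generation:
    all gametes from pp individuals plus half of those from pq individuals."""
    genotypes = hardy_weinberg(p)
    return genotypes["pp"] + genotypes["pq"] / 2


if __name__ == "__main__":
    print(hardy_weinberg(0.5))  # equal allele frequencies give the 1:2:1 pattern
    print(next_generation_allele_frequency(0.3))  # stays at about 0.3
```

With p = q = 0.5 the proportions are exactly those of tossing two coins, which is the correspondence with Pascal's triangle noted in the text.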

Figure 5.

 Fisher’s argument to demonstrate how a normal distribution in phenotype can arise even when alleles follow the proportions predicted by Hardy-Weinberg equilibrium (for multiple alleles). Suppose three alleles determine height (average 68 cm): h0 (neutral), h+ (which adds 2 cm) and h− (which subtracts 2 cm); h0 is twice as frequent as each of the others, which are equally frequent. (a): Punnett square for the population (explained in ref [6], where the characters in bold represent the gametes that combine) with the relative resulting proportions. (b): the histogram of the resulting heights in the population resembling a normal distribution. Adding more loci to the model results in an even smoother histogram.
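The three-allele model in this caption is small enough to enumerate by hand or by machine. The sketch below (Python) assumes the allele frequencies implied by the caption (h0 at 1/2, the other two at 1/4 each) and random pairing of gametes; the names `ALLELES` and `height_distribution` are invented for illustration, and heights are in the units given in the caption.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# Allele name -> (effect on height, frequency among gametes).
ALLELES = {
    "h0": (0, Fraction(1, 2)),
    "h+": (+2, Fraction(1, 4)),
    "h-": (-2, Fraction(1, 4)),
}


def height_distribution(base=68):
    """Exact distribution of height when two gametes combine at random."""
    dist = defaultdict(Fraction)
    for (effect1, freq1), (effect2, freq2) in product(ALLELES.values(), repeat=2):
        dist[base + effect1 + effect2] += freq1 * freq2
    return dict(sorted(dist.items()))


if __name__ == "__main__":
    for height, prob in height_distribution().items():
        print(f"{height}: {'#' * int(prob * 16):<6} {prob}")
```

The resulting proportions are 1:4:6:4:1, a row of Pascal's triangle, which is why the histogram already looks bell-shaped with only one locus.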

A second reason why it is difficult to invent biology is that many biological traits are themselves inter-related. If we invent one trait then we commit ourselves automatically to inventing several others. Simple examples might be the relationships between height, weight and body mass index, or those between tidal volume, frequency and inspiratory/expiratory time. Other biological traits are exclusive. To adopt the example used by Haldane [1]: suppose three classes of animal have frequencies p1, p2, p3, and the total is 200. If we invent the counts 50 for p1 and 40 for p2 to satisfy the conclusion we wish to reach, then p3 (perhaps of no immediate interest) has to be 110. Yet, a different value may be needed to satisfy other biological ratios and interactions.


“For a successful technology, reality must take precedence over public relations, for nature cannot be fooled.” [11].

Those wishing to invent data have a hard task. They must ensure that all the data satisfy several layers of statistical cross-examination. Haldane referred to these as the ‘orders of faking’ [1]. In his words, ‘first-order faking’ is to ensure simply that the mean values match what is expected. For his ‘second-order faking’, things become more difficult since the variances of these means must also be within those expected, and further consistent with several possibly inter-related variables. His ‘third-order faking’ is extremely difficult because the results must also match several established laws of nature or mathematics, described by patterns like central limit theorem, the Hardy-Weinberg Law, the law of conservation of energy or mass, and so on. It is therefore always so much easier actually to do the experiment than to invent its results.

It is the very motivation to publish so much that is the undoing of those whose work is questioned or retracted. High publication rates are evident in the retracted work of Reuben and Boldt [12], and the sheer volume of data produced by Fujii is astonishing [2]. Toss a coin just twice and if it gives two heads then nobody notices the loading (the chance of this in a fair coin is anyway 25%). But 100 heads in 100 tosses is probably more than chance (Fig. 1). These high publication rates leave a rich source of data for us to analyse, so that we can learn aberrant patterns and, in time, detect the warning signs much earlier. Carlisle is to be congratulated: his is an astonishing, altruistic piece of work that helps expunge the literature of some (at best) highly unusual data.

The purpose of experimentation is to learn about nature. If the results of experiments are not genuine, then however prolific, influential or politically powerful their author, the results will not withstand statistical scrutiny, cannot be repeated, or will lead to models for our understanding of nature that are so bizarre as to be proven false. For nature cannot be fooled.

Competing interests

No external funding or competing interests declared.