In the 1920s in Cambridge, England, Muriel Bristol, a biologist at the new Rothamsted Experimental Station, claimed that she could tell by the taste of a cup of tea whether the milk or the tea had been poured into the cup first (Fig. 1). Pompous dons scorned the idea: “It just can't be done, don't y'know!” But why not? Could it be true? And if so, how could we tell?
Dr. Bristol's future husband, William Roach, suggested that she be given a chance to prove her claim, an event that not only makes entertaining reading, but is important in the history of science.1, 2 That is because another Rothamsted scientist, Ronald A. Fisher (1890-1962), who was a founder of modern statistics, suggested a way to make sense or nonsense of Dr. Bristol's claim.
If Dr. Bristol was just making an attention-seeking claim, she should have been able to taste a cup of tea and have a 50-50 probability of getting or guessing, just by chance, the right pouring order. By itself, however, a single sip between cup and lip would be quite useless in evaluating her claim. On the other hand, Fisher thought he could formalize a test that might be more convincing. What would that require, and what would the result mean? What do we even mean by a ‘test’? And by the way, where does that obvious-sounding 50-50 probability come from?
In this case, the test would have two possible outcomes: tea first, then milk, or milk first, then tea. If you were taking the test and you really were just guessing, your guess should bear no resemblance to the truth. But you might get lucky. The key to Fisher's strategy lies in the word ‘probability,’ or the idea of getting it right ‘just by chance.’ The classical way to determine the difference between chance and causal truth is to undertake a series of repeated trials of the same condition, in which probability refers to the frequency of hits, or correct responses.
In the event, Fisher confronted Dr. Bristol with eight cups of tea (Fig. 2), four of which were milk-first, and the other four tea-first. She guessed right all eight times. Or was it guessing? (The Brits do know their tea, after all.) Fisher worked out the number of ways in which a person who was just guessing could get one, two, three … up to all eight guesses correct. He did this by assuming a fixed probability ½ that any given guess would be correct ‘just by chance.’
For example, there are eight ways to get one guess right and seven wrong: RWWWWWWW, WRWWWWWW, and so on to WWWWWWWR. If we assume that the sip from each cup was an independent tasting, the probability of each guess being right just by chance is 1/2; the probability of any one specific string of eight answers, such as those just listed, is (1/2)^8, or 1/256. There is a similarly specifiable number of ways that one could make seven of the eight guesses right, again “just by chance.” If you total up all of these sets of possibilities, from none right to eight right, they have to add up to 1.0; that is, all possible outcomes. Then you can ask, if Dr. Bristol was really just guessing, what is the chance that she would be as successful as she in fact was? Does her palate match her performance?
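The counting just described can be sketched in a few lines of Python. This is only an illustration of the simplified model in the text, in which each of the eight guesses is independent and right with probability 1/2; the names `N_CUPS`, `P_GUESS`, `ways_right`, and `prob_right` are invented here for the purpose.

```python
from math import comb

N_CUPS = 8      # cups tasted
P_GUESS = 0.5   # assumed chance that any single guess is right by luck alone


def ways_right(k: int, n: int = N_CUPS) -> int:
    """Number of distinct right/wrong strings of length n with exactly k rights."""
    return comb(n, k)


def prob_right(k: int, n: int = N_CUPS, p: float = P_GUESS) -> float:
    """Probability of exactly k right guesses out of n independent guesses."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Tabulate every outcome from none right to all eight right.
distribution = {k: prob_right(k) for k in range(N_CUPS + 1)}
for k, pr in distribution.items():
    print(f"{k} right: {ways_right(k):2d} ways, probability {pr:.4f}")

# The outcomes are exhaustive, so their probabilities total 1.
print("total:", sum(distribution.values()))
```

Under these assumptions there are eight strings with one right and eight with seven right, and the nine probabilities sum to 1.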
In 1925, Fisher worked out all of these probabilities in a famous book, Statistical Methods for Research Workers (http://psychclassics.yorku.ca/Fisher/Methods/). Later, in a retrospective chapter, he described the tea-tasting experiment.3 I've simplified it a bit because, for example, one might also take into consideration the order in which the cups were presented or the fraction of cups that were poured milk-first.
It seems natural or even obvious that the set-up would be four cups tea-first and four milk-first, but that, perhaps, is just a cultural bias. The cups could be, say, six with milk first and two with tea first. But 50-50 optimizes the power of the test to discriminate luck from Lapsang. In fact, Fisher told Dr. Bristol that there were four cups of each kind, though not the order in which they were presented to her. Could she have detected the knowing but inadvertent expressions on Fisher's countenance, the way that soothsayers do? Could the pourer have used a not-exactly-halfway line in the cup for the first item put in, leaving a color difference between the two pouring orders? Were they all stirred thoroughly? Could knowing that half the cups were milk-first affect her guessing and the probabilities used to evaluate the results? Presumably Dr. Bristol wasn't told how she was doing, because then her guesses really would not have been independent. She would have known, for example, that when she came to the last cup it had to be whichever state she had identified three times up to that point. (Remembering what's already been played is one way people try to win at cards.)
REPEATABLE EXPERIMENTS OR OBSERVATIONS AND STATISTICAL SIGNIFICANCE
In the end, Fisher showed that the chance that Dr. Bristol would get all her assessments right is only 1 in 70, because there are 70 ways of choosing which four of the eight cups are milk-first, and only one of those choices is entirely correct. That's about 1.4%, which can be written p = 0.014. Not impossible, but very lucky indeed! Or was it luck?
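The figure of 1 in 70 can be checked by brute force. The sketch below is an illustration only: it assumes the taster knows there are four cups of each kind and simply nominates four of the eight as milk-first. The cup labels 0 to 7 and the choice of which four were truly milk-first are arbitrary inventions for the example.

```python
from itertools import combinations

CUPS = range(8)
# Hypothetical labelling: which four cups were truly poured milk-first.
TRUE_MILK_FIRST = frozenset({0, 1, 2, 3})

# Every way a taster could nominate four of the eight cups as milk-first.
selections = [frozenset(c) for c in combinations(CUPS, 4)]

# Only a selection matching the truth exactly gets all eight cups right.
all_correct = [s for s in selections if s == TRUE_MILK_FIRST]

p_all_correct = len(all_correct) / len(selections)
print(len(selections), len(all_correct), round(p_all_correct, 4))
```

Whichever four cups are taken as the true milk-first set, exactly one of the 70 possible selections matches it.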
There are many criteria for making decisions in science, but one of the most central is the significance test. Significance testing has become so entrenched in our thinking that it is, in many ways, at the very heart of the scientific method. We do such tests because probabilistic aspects of research involve not just the causal phenomena (rolling dice, mutations, or transmission of genetic variants from parent to offspring), but also because we make observational errors. Even then, if we want to be able to interpret our results meaningfully relative to the real world, we have to assume that we are collecting random results; that is, they are representative of the actual ‘trials’ taking place in the real world.
Significance testing usually also makes the assumption that our work, at least in principle, could be repeated many times over. Given that, we choose some cutoff in the results by which we judge that the result is suspiciously different from nothing-going-on. We express this cutoff criterion in terms of a p-value, the chance that the result we have observed, or something even more extreme, would have occurred even if, in fact, nothing is going on. The classic decision-making cutoff value, and by far the predominant one today, is p = 0.05.
Fisher was the first to state the 5% solution formally, but similar informal criteria had been used earlier in the twentieth century.4 The idea is that we're willing to be wrong once in 20 times (5% of the time) in deciding that our result is unusual enough to support our hypothesis. Some authors use this as a strict cutoff; others simply report the p-value and let readers make their own judgments; still others use more stringent criteria, such as p = 0.01 or even 0.001. But in every case, these are judgments.
Sherlock Holmes made the “7% solution” (of cocaine) a household phrase, but p = 0.05 is so deeply engrained in scientific work that it has attained cosmic, almost religious, status. Why not use p = 0.06 or 0.0734567588? To find the answer to that question, I have thoroughly searched the ultimate authorities: the Bible, the Koran, the Book of Mormon, Science and Health, Analects of Confucius, Vedas, and even Quotations from Chairman Mao Tse-Tung. But all my searching has been in vain: Nowhere have I found it ordained that “p shalt be 0.05.” This so cries out by its absence that I even checked Darwin, but he, too, was silent on the subject.
Since the 5% criterion had an informal history before the Cambridge tea party, two Canadian researchers have investigated whether it might have a psychological and hence, potentially biological or even evolutionary basis.5 They found that pre-Fisherian discussions revealed that people (at least Western educated people) feel that events that happen only between 10% and 1% of the time are suspiciously unlikely. Five percent is midway between those two intuitive levels of doubt. But if people tend to believe that 5% is a rare event, does this also make them think that something other than chance is responsible?
Cowles and Davis felt that students in their statistics classes had such views, so they tested their idea by giving 36 students a version of a shell game in which, although the students did not know it, one could never get a successful hit; that is, there was no button under any of the shells. Cowles and Davis then determined when the students decided that the game was rigged. The data showed that, on average, the students became suspicious when the run of incorrect guesses was so long that its probability was only about 0.098. They quit the game (in disgust?) when that value reached 0.0093. The students played individually and without working out the statistics, but again, their intuitive range of suspicious rarity was between 1% and 10%.
Whether such a result would have cross-cultural validity is unclear, but suppose it did. Would this indicate that we have somehow evolved to live with a vaguely 5% level of risk? If the chance of a lion or armed enemy hiding in ambush round the next tree was less than that, did our brutish but wiser ancestors press confidently ahead in pursuit of the wildebeest while the too-daring were eaten? Or is this entirely too vague for evolutionary hypotheses? Or, more likely, is it entirely untestable in such terms?
NOT EVERYTHING THAT IS SIGNIFICANT HAS TO BE IMPORTANT…
When I teach human genetics, I orient my course around questions of epistemology: How do we know what we think we know about genetics? I point out that most of the things we study require understanding probabilistic data, at least conceptually. That is because probability is fundamental to how genes affect traits, including disease, in current populations, and because evolution works probabilistically through mutation, genetic drift, and natural selection. Many students are not happy. They protest that they are premeds and are paying tuition for answers, not questions!
Unfortunately, the role of probability in nature is subtle but inescapable.6 If we are going to use our significance-based decision, then it could matter greatly whether we have designed our study well enough that we can accurately interpret the probabilistic result. Yet the decision is inevitably subjective. A critical but vastly misunderstood fact of life is that even if a significance test yields a probabilistic judgment about rarity, it says nothing at all about importance. That is another judgment call. So we should not let our students escape from a realistic sense of the unavoidable probabilism in nature. From doctors' prognoses to evaluation of news stories about life-style risks, and on to evolutionary scenarios about what was adaptive and why and what genes do what, statistical issues are inescapable. Yet all p-values can do is help us frame our subjective judgments.
AND NOT EVERYTHING THAT IS IMPORTANT HAS TO BE SIGNIFICANT
The era in the seventeenth century known as the Enlightenment marks the beginning of modern science. Since then, we have systematically developed ever more rigorous, or perhaps rigid criteria for methodologically deciding what we accept as truth. We call this the scientific method; it involves testing hypotheses, which inevitably involves subjective decision making. In practice, being only human, we usually do not stick too close to the book: When an experiment designed to prove our brilliant idea generates a p-value of 0.07 instead of 0.05, we call it “suggestive” and plow ahead rather than dropping the idea and taking our students out for a cup of tea. But there is something much more serious afoot than our hedging on our favorite beliefs. We have become convinced that if an idea is true, then data should be able to show it in a properly designed study. If we can't obtain adequate significance levels in a given study, we look for faults in its design. Commonly, we take the most fundamental, classical route and increase our sample size, because p-values inherently depend on sample size.
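The dependence of p-values on sample size is easy to see with coins. The sketch below is illustrative rather than a recommended analysis: it holds the observed fraction of heads at 55% and computes an exact two-sided p-value against a fair coin for several sample sizes. The function name `two_sided_p` and the particular sample sizes are assumptions made for the example.

```python
from math import comb


def two_sided_p(heads: int, n: int) -> float:
    """Exact two-sided binomial p-value against a fair coin.

    Sums the probability of every head count at least as far from n/2
    as the observed count.
    """
    dist = abs(heads - n / 2)
    extreme = sum(comb(n, k) for k in range(n + 1) if abs(k - n / 2) >= dist)
    return extreme / 2**n


SAMPLE_SIZES = (20, 100, 500, 2000)

# Same observed fraction of heads (55%) at every sample size.
p_values = [two_sided_p(55 * n // 100, n) for n in SAMPLE_SIZES]

for n, p in zip(SAMPLE_SIZES, p_values):
    print(f"n = {n:4d}, heads = {55 * n // 100:4d}, p = {p:.3g}")
```

The same 55% excess of heads goes from unremarkable at the small sample sizes to below the 0.05 cutoff at the larger ones, even though the apparent effect has not changed at all.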
It may sometimes be, however, that even a true idea simply cannot be tested in this way. In genetics, everyone is different. The two copies of our genome that we carry differ by millions of nucleotides. Moreover, most of our 3 billion nucleotides probably vary among populations. Ignoring environmental effects, it often seems clear that hundreds or even thousands of genetic variants contribute to interesting physical, metabolic, or behavioral traits, in disease as well as the normal state. The same applies to the number of variants that respond to natural selection.7 Most of these variants will have low to very low frequency in the population, and small to very small effects on a trait or its evolutionary fitness.
Under these conditions, it may be literally impossible, even in principle, to obtain adequate samples for the bulk of these effects to generate even remotely convincing p-values; that is, the causal effects may be true but not demonstrably “significant” by the scientific method. In addition, many effects that do reach some nominal significance in one sample will not do so in another sample. A routine reason for this is that we expect chance variation from sample to sample. For example, if we are counting heads and tails in flips of identical coins, we expect that, if the coins are fair, 5% of experiments will have such an unequal head-to-tail ratio as to seem to be unfair. But one fundamental tenet of genetic and evolutionary life is that we can never actually repeat an experiment because every individual is functionally different. Accordingly, natural genetic and evolutionary effects may be true, but not replicable. Yet we tend to rely on significance tests, which amounts to saying that something in nature is real only if it is ‘significant;’ that is, if it passes some subjective judgment on our part.
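The coin example can be made concrete with a short sketch. It assumes, purely for illustration, experiments of 100 flips each and a cutoff of 0.05; the names `N_FLIPS`, `ALPHA`, `tail_p`, and `one_experiment` are invented here. It first works out which head counts would be declared "unfair," then checks by exact calculation and by simulation how often a perfectly fair coin lands in that region.

```python
import random
from math import comb

N_FLIPS = 100   # assumed flips per experiment
ALPHA = 0.05    # the conventional cutoff


def tail_p(heads: int, n: int = N_FLIPS) -> float:
    """Two-sided p-value: chance of a head count at least this far from n/2."""
    dist = abs(heads - n / 2)
    return sum(comb(n, k) for k in range(n + 1) if abs(k - n / 2) >= dist) / 2**n


# Head counts that would lead us to call the coin unfair.
rejected = {k for k in range(N_FLIPS + 1) if tail_p(k) <= ALPHA}

# Exact chance that a fair coin produces one of those counts.
exact_rate = sum(comb(N_FLIPS, k) for k in rejected) / 2**N_FLIPS


def one_experiment(rng: random.Random) -> int:
    """Flip a fair coin N_FLIPS times and return the number of heads."""
    return sum(rng.random() < 0.5 for _ in range(N_FLIPS))


rng = random.Random(0)  # fixed seed so the run is repeatable
n_experiments = 2000
false_alarms = sum(one_experiment(rng) in rejected for _ in range(n_experiments))
simulated_rate = false_alarms / n_experiments

print(f"exact: {exact_rate:.4f}, simulated: {simulated_rate:.4f}")
```

Because head counts are whole numbers, the exact rate comes out somewhat below 5% rather than exactly 5%, but the point stands: a fair coin will look "unfair" in a predictable fraction of experiments.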
If things are true but we can't show it, it's no joke, because it means that the fundamental criteria that we have come to know as the powerful core of Enlightenment, empirical, systematic science may simply be inadequate for understanding nature beyond a certain level of resolution. Some phenomena, such as the probabilistic position of electrons in an atom in quantum mechanics, have orderly distributions, but biological causation is the result of heterogeneous, historical, ad hoc evolutionary events in which every genetic element in DNA is different. These genetic elements do not have those kinds of neatly specifiable probabilities. We can specify what size of effect is undetectable in principle but, by the same token, we cannot identify the causal elements that we're dealing with.
HOW DO YOU TAKE YOUR TEA, MADAM?
Statisticians have raised many criticisms of the tea-tasting experiment, as well as others like it. First, the test was only done once, making it vulnerable to small sample size and other issues. More profoundly, debates rage in statistics over the best way to evaluate hypotheses. In genetics and evolution, the most relevant issues include the question of whether data can ever be truly replicated. If they cannot, then the idea of probabilities in repeat experiments loses cogency. That is an important loss.
In the founding experiment, Dr. Bristol may have been more than just lucky.8 In 1982, the tea-tasting test was repeated, but using a panel of 155 people who did their testing in the dark to avoid extraneous cues and clues. Their success rate was ‘well above the chance level,’ suggesting that there were, in fact, some sensory cues. Similar results were obtained in a different test of 131 people, not even tea aficionados, who did their testing in the light. It's a strange world!
Now suppose that Dr. Bristol had correctly identified, say, 6 cups rather than all 8. (Because she knew there were four cups of each kind, one misplaced milk-first cup forces one misplaced tea-first cup, so exactly 7 right is not possible.) The chance of doing at least that well by guessing is p = 0.24, which is still at least somewhat unusual, and we'd perhaps say she could have had a fine career working for Tetley's. But then why didn't she get all of the tests right? It's easy enough to say that everybody makes mistakes, but what does that mean about the causal phenomenon being tested? It's not a very satisfying scientific conclusion that “some people can sort of tell.” One can concoct countless possible explanations, but they are largely untestable. If you think about your work, it should be easy for you to see how, on a daily basis, you are affected by these kinds of unknowns and uncertainties. You can also see the seriousness of the challenge to grasp Nature's mysteries.
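The value p = 0.24 can be reproduced under the four-and-four design, as the chance that a guesser picks at least three of the four milk-first cups (which amounts to labelling at least six of the eight cups correctly). The following is a hedged sketch; the function name `prob_milk_first_hits` and its parameter names are invented for the example.

```python
from math import comb


def prob_milk_first_hits(k: int, chosen: int = 4, milk: int = 4, total: int = 8) -> float:
    """Chance that exactly k of the cups nominated as milk-first truly are,

    when `chosen` cups are picked at random from `total`, of which `milk`
    were poured milk-first.
    """
    return comb(milk, k) * comb(total - milk, chosen - k) / comb(total, chosen)


p_exactly_three = prob_milk_first_hits(3)                     # 16 of the 70 selections
p_three_or_more = p_exactly_three + prob_milk_first_hits(4)   # 17 of the 70 selections

print(round(p_exactly_three, 3), round(p_three_or_more, 3))
```

Sixteen of the 70 possible selections contain exactly three of the true milk-first cups, and one contains all four, giving 17/70, or about 0.24.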
Whether we like it or not, our world is probabilistic. Adding to the difficulties, the assumptions on which testing methods are based usually cannot be met exactly, the theory we have is not as precise as it would ideally be, and the decision-making criteria inevitably remain subjective. This is true enough when we try to account for the genetic basis of traits. But it is even more true in efforts to reconstruct evolution, which involves fundamentally unusual past events that we cannot observe directly.
Clearly, a lot of extremely unusual things in nature really are true: You are true! The many issues in statistical inference go well beyond the idea of significance that we've been discussing here and that have been argued about voluminously (for cogent discussions, see Cohen9, 10). These issues and challenges are not at all new, but they are mind-bending and quite tiring. But we've been thinking hard, and it's time for a break. One lump or two?
I welcome comments on this column: firstname.lastname@example.org. I co-author a blog on relevant topics at EcoDevoEvo.blogspot.com. I thank John, an otherwise anonymous regular visitor to our blog, for once having pointed out the papers by Cowles and Davis in a different context, and Bill Jungers for pointing out the Cohen references. I also thank Anne Buchanan, John Fleagle, and Jim Rohlf for critically commenting on the manuscript. This column is written with financial assistance from funds provided to Penn State Evan Pugh professors.