## Introduction

Biologists routinely need to evaluate the significance of multiple hypotheses simultaneously. This need has always existed for experiments more complicated than a simple comparison between a control group and a single treatment group. As large-scale factorial experiments have become more common, and as developments in informatics have allowed us to use increasingly vast amounts of data to address problems in ecology and evolution, effective multiple comparison procedures have become ever more necessary and the number of hypotheses requiring simultaneous testing is larger than ever.

The problem with performing multiple simultaneous hypothesis tests is that, as the number of hypotheses increases, so too does the probability of wrongly rejecting a null hypothesis by random chance. The traditional solution, advocated by Fisher (1935) in his classic text, is to reduce the threshold *P*-value (usually called α) used to decide what we call a significant difference. If we were instead to leave α at the conventional level of 0·05 while undertaking simultaneous statistical inference, one in every 20 true null hypotheses would, on average, be wrongly rejected.

The most renowned method for reducing the threshold of significance (named in honour of the probability theorist Carlo Bonferroni) follows directly from Boole’s inequality: one simply divides α by the number of hypotheses being simultaneously tested, *m* (Miller 1966). This approach has been refined by a number of authors including Keuls (1952), Scheffé (1953), Tukey (1953), Duncan (1955), Dunnett (1955), Šidák (1967), Dunn (1974), Holm (1979), Simes (1986) and Hochberg (1988). All of these procedures take as the deciding criterion for significance that the probability of wrongly rejecting even one of the many null hypotheses (the family-wise error rate) be kept below α. The significance of a single hypothesis is thus firmly tied to our statistical confidence in each and every comparison that we call non-significant. This approach is entirely appropriate when there are serious consequences to occasionally making a wrong declaration of significance. When that is not the case, testing the significance of multiple hypotheses by controlling the probability of making even one wrong claim of significance can be overly stringent and misleading: as the number of hypotheses increases, the probability of wrongly *accepting* null hypotheses becomes unacceptably large while the probability of wrongly *rejecting* null hypotheses becomes very small. Furthermore, it will often be more costly to use multiple comparison procedures that ignore real differences than to use procedures that occasionally misclassify non-significant differences.
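The Bonferroni division, and Šidák’s (1967) exact refinement for independent tests, can be sketched in a few lines of Python (the function names here are ours, chosen for illustration):

```python
def bonferroni_threshold(alpha, m):
    """Bonferroni-corrected significance threshold for m simultaneous tests."""
    return alpha / m

def sidak_threshold(alpha, m):
    """Sidak's (1967) exact threshold, assuming the m tests are independent."""
    return 1.0 - (1.0 - alpha) ** (1.0 / m)

# With 10 simultaneous tests at the conventional alpha = 0.05,
# Bonferroni gives 0.005 and Sidak a very slightly more liberal ~0.00512.
bonf = bonferroni_threshold(0.05, 10)
sidak = sidak_threshold(0.05, 10)
```

Šidák’s threshold is always slightly larger than the Bonferroni one, but the difference is negligible for small α, which is why the simple division remains the best-known correction.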

Sorić (1989) proposed that, when making simultaneous inferences, it will often be more interesting to consider what proportion of rejected null hypotheses have been wrongly rejected. Sorić called a rejected null hypothesis a ‘discovery’ and a wrongly rejected null hypothesis a ‘false discovery’. It follows then that, in developing a method of estimating the proportion of wrongly rejected null hypotheses, Benjamini & Hochberg (1995) opted to call this value the false discovery rate (FDR).

A better understanding of the quantity we call the FDR can be gained by considering a histogram of *P*-values obtained for a set of simultaneous multiple comparisons (Fig. 1). In Fig. 1, 25% of inferences are truly significant and the other 75% accord with the null hypotheses. Provided that the statistical analyses were appropriate to the distribution and error structure of the data, we would expect the *P*-values to be uniformly spread between 0 and 1 for those inferences for which the null is true. This expectation derives from the fact that all *P*-values are equally likely under the null hypothesis if the statistical test used to calculate them employs the correct distribution for the sample population. (For example, if the *F*-statistics of the familiar ANOVA procedure are used appropriately to compare group means with residuals that are normally distributed and homoscedastic, the distribution of calculated *P*-values will be uniform if there is no difference among the groups. The uniformity comes about because we are essentially sampling at random from the same population.) A uniform null distribution of *P*-values is likely to hold for the vast majority of studies in ecology and evolution and this distribution will, of course, hold true regardless of sample size.
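The uniformity of null *P*-values is easy to verify by simulation. The sketch below, which uses only the Python standard library, repeatedly compares two samples drawn from the *same* normal population; the two-sample *z*-test with known unit variance is our choice here purely for self-containedness, not a method discussed in the text:

```python
import math
import random

def two_sample_z_pvalue(x, y):
    """Two-sided P-value for a difference in means, assuming both samples
    come from normal populations with known unit variance."""
    n = len(x)
    z = (sum(x) / n - sum(y) / n) / math.sqrt(2.0 / n)
    # Phi(z) = 0.5 * (1 + erf(z / sqrt(2))) is the standard normal CDF
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))

random.seed(1)
pvals = []
for _ in range(2000):
    x = [random.gauss(0, 1) for _ in range(20)]
    y = [random.gauss(0, 1) for _ in range(20)]  # same population: null is true
    pvals.append(two_sample_z_pvalue(x, y))

# Under the null, the P-values are uniform on [0, 1]: roughly 5% of them
# should fall below 0.05 regardless of sample size.
frac_below_alpha = sum(p < 0.05 for p in pvals) / len(pvals)
```

Plotting `pvals` as a histogram reproduces the flat baseline represented by the dotted line in Fig. 1.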

The area underneath the dotted horizontal line in Fig. 1 thus represents those comparisons for which there is truly no significant difference. The dashed vertical line represents the threshold value, α, which we are accustomed to defining before deciding which *P*-values to call significant. Having overlaid these intersecting lines on our distribution of *P*-values, we can assign labels to each of the four demarcated areas. Bearing in mind that comparisons that are *truly* significant are above the dotted line and that the comparisons which we elect to *call* significant are to the left of the dashed line, we can identify the true positives (TP), the false positives (FP), the true negatives (TN), and the false negatives (FN). (FP is also known as Type I error, while FN is also known as Type II error.) The FDR is the error rate in the set of comparisons that are called significant, or, in other words, the proportion of comparisons which are wrongly called significant: FDR = FP/(TP + FP).

It is worth pointing out that, although a false positive and a false discovery are the same thing, the false positive rate (FPR) is crucially different from the FDR. The FPR (which is the basis from which *P*-values become measures of significance) is the error rate in the set of comparisons that are truly not significant, or, equivalently, the proportion of non-significant comparisons that are wrongly called significant: FPR = FP/(FP + TN).
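The distinction between the two rates can be made concrete with a few lines of code. The counts below are hypothetical, chosen only to illustrate the two formulae:

```python
def error_rates(tp, fp, tn, fn):
    """False discovery rate and false positive rate computed from the four
    cells of the partition in Fig. 1."""
    fdr = fp / (tp + fp)  # error rate among comparisons *called* significant
    fpr = fp / (fp + tn)  # error rate among comparisons that are *truly* null
    return fdr, fpr

# Hypothetical counts: 25 comparisons called significant, of which 5 are wrong.
fdr, fpr = error_rates(tp=20, fp=5, tn=70, fn=5)
# The same 5 false positives yield fdr = 5/25 = 0.20 but fpr = 5/75 ≈ 0.067:
# two very different error rates from one set of results.
```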

Of course, in real experiments, we do not know the proportion of truly significant comparisons. It is thus necessary to estimate this proportion in order to make an estimate of the FDR. Various methods of estimation have been developed: three methods which are likely to be most useful in contexts of ecology and evolution are presented in detail in the following section.

So what benefits are to be had from using the FDR to decide which comparisons are significantly different? The key benefit is that FDR-based comparison procedures are much more powerful (i.e. much less likely to misclassify a hypothesis that is actually significant) than Bonferroni-type comparisons. If we plot the threshold values obtained from an FDR analysis against each comparison ranked in order of ascending *P*-values, the thresholds increase linearly with rank. In contrast, a classical Bonferroni correction produces only one fixed and highly conservative threshold value. For most ‘improved’ Bonferroni-type corrections, although the significance threshold does increase with rank, it remains unduly conservative over the vast majority of *P*-values. (These different thresholding functions are plotted in Fig. 2.) FDR procedures provide a mechanism to avoid neglecting truly significant comparisons on a large scale. This increase in power comes at the cost of some increase in the number of comparisons wrongly called significant, and the FDR is used to ensure that this error rate remains within an acceptable limit.
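The linearly increasing threshold can be sketched as the Benjamini & Hochberg (1995) step-up procedure, in which the *i*-th smallest *P*-value is compared against (*i*/*m*)α and all hypotheses up to the largest rank that passes are rejected:

```python
def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up procedure.
    Returns a list of booleans (True = reject) in the original order."""
    m = len(pvals)
    # Rank the P-values in ascending order
    order = sorted(range(m), key=lambda i: pvals[i])
    # Find the largest rank k with P_(k) <= (k / m) * alpha
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= rank / m * alpha:
            k_max = rank
    # Reject every hypothesis whose P-value ranks at or below k_max
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= k_max:
            reject[idx] = True
    return reject

rejected = benjamini_hochberg([0.04, 0.01, 0.2, 0.03, 0.02], alpha=0.05)
```

With these five *P*-values the procedure rejects four hypotheses, whereas a Bonferroni threshold of 0·05/5 = 0·01 would reject only the single smallest one.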

False discovery rate based multiple comparison procedures are also robust to the false positive paradox to which traditional Bonferroni-type procedures are susceptible. The false positive paradox occurs when declarations of significance are more likely to be wrong than right, i.e. when false positives outnumber true positives. Because FDR procedures estimate the baseline distribution of null comparisons (the dotted line in Fig. 1), false positives are accurately incorporated into the FDR that is deemed acceptable as a significance threshold, and the bias against true positives is thus alleviated.

False discovery rate based multiple comparison procedures are appropriate in situations in which comparisons that are declared significant can be meaningfully interpreted even if one (or several) of these declarations is wrong. They are ideal for researchers who subscribe to Fisher’s opinion that tests of significance ‘are *provisional*, and involve an intelligent attempt to *understand* the experimental situation’ (Fisher 1955). Like all significant results, comparisons called significant based on FDR-based analyses should thus be earmarked for further study and verification.