Using false discovery rates for multiple comparisons in ecology and evolution


Author & editor comment [added after online publication, 17 November 2010]

    It has come to our attention that a recently accepted article in Methods in Ecology and Evolution (Pike 2010) covers the same topic as a previous article in the journal Oikos (Verhoeven et al. 2005, Oikos 108, 643–647).

    Both articles explain and advocate the use of False Discovery Rate corrections for an audience of ecologists and evolutionary biologists. Both papers have a similar goal, approach, and form, and there is overlap in their content. Unfortunately, Pike (2010) did not cite the paper by Verhoeven et al. (2005), but this was a genuine oversight.

    Pike (2010) received a rigorous and independent review by two established experts, overseen by the journal’s experienced editorial team, none of whom were aware of the Oikos publication.

    To remedy the oversight, we encourage readers to use and reference the 2005 Oikos paper, as it has precedence in raising ecologists’ awareness of the use of False Discovery Rates.

    N. Pike, K. Verhoeven, L. McIntyre, K. Simonsen, T. Benton & R. Freckleton

Correspondence author. Email: nathan.pike.1998@pem.cam.ac.uk

Summary

1. Ecologists and evolutionary biologists often need to evaluate the significance of multiple related hypotheses simultaneously. Corrections for multiple comparisons are needed to avoid inappropriately increasing the number of null hypotheses that are wrongly rejected. The traditional corrections are Bonferroni-type multiple comparison procedures, which are highly conservative: the probability of failing to reject false null hypotheses grows as the number of hypotheses being simultaneously tested increases.

2. Newer procedures based on False Discovery Rates (FDRs), which do not suffer the same loss of power as traditional methods, are described. Algorithms and spreadsheet-based software routines for three procedures that are especially useful in ecology and evolution are provided.

3. The strengths and potential pitfalls of FDR-based analysis and of presenting results as FDR-adjusted P-values are discussed with reference to traditional methods such as the sequential Bonferroni correction.

4. FDR-based multiple comparison procedures should be more widely adopted because they are often more appropriate than traditional methods for identifying truly significant results.

Introduction

Biologists routinely need to evaluate the significance of multiple hypotheses simultaneously. This need has always existed for experiments that are more complicated than a simple comparison between a control group and a single treatment group. As large-scale factorial experiments have become more common and developments in informatics have allowed us to use increasingly vast amounts of data to address problems in ecology and evolution, the need for effective multiple comparison procedures has become more prevalent and the number of hypotheses requiring simultaneous comparison is larger than ever.

The problem with performing multiple simultaneous hypothesis tests is that, as the number of hypotheses increases, so too does the probability of wrongly rejecting a null hypothesis by random chance alone. The traditional solution, advocated by Fisher (1935) in his classic text, is to reduce the threshold P-value (usually called α) that is used to determine what we call a significant difference. Of course, if we were to leave α at the conventional level of 0·05 while undertaking simultaneous statistical inference, on average one in every 20 true null hypotheses would be wrongly rejected.

The most renowned method for reducing the threshold of significance (which is named in honour of the probability theorist Carlo Bonferroni) is an outcome of observing Boole’s inequality: one simply divides α by the number of hypotheses being simultaneously tested, m (Miller 1966). This approach has been refined by a number of authors including Keuls (1952), Scheffé (1953), Tukey (1953), Duncan (1955), Dunnett (1955), Šidák (1967), Dunn (1974), Holm (1979), Simes (1986) and Hochberg (1988). All of these procedures take as the deciding criterion for calling a hypothesis significant that the probability of wrongly rejecting even one of the many null hypotheses (the family-wise error rate) is kept below α. The significance of a single hypothesis is thus firmly tied to our statistical confidence in each and every comparison that we call non-significant. This approach is entirely appropriate if there are serious consequences to occasionally making a wrong declaration of significance. When this is not the case, testing the significance of multiple hypotheses by controlling the probability of making even one wrong claim of significance can be overly stringent and misleading: with an increasing number of hypotheses, the probability of wrongly accepting false null hypotheses becomes unacceptably large as the probability of wrongly rejecting true null hypotheses becomes very small. Furthermore, it will often be more costly to use multiple comparison procedures that ignore real differences than to use procedures that occasionally misclassify non-significant differences.

Sorić (1989) proposed that, when making simultaneous inferences, it will often be more interesting to consider what proportion of rejected null hypotheses have been wrongly rejected. Sorić called a rejected null hypothesis a ‘discovery’ and a wrongly rejected null hypothesis a ‘false discovery’. It follows then that, in developing a method of estimating the proportion of wrongly rejected null hypotheses, Benjamini & Hochberg (1995) opted to call this value the false discovery rate (FDR).

A better understanding of the quantity which we call the FDR can be had by considering a histogram of P-values obtained for a set of simultaneous multiple comparisons (Fig. 1). In Fig. 1, 25% of inferences are truly significant and the other 75% accord with the null hypotheses. Provided that the statistical analyses were appropriate to the distribution and error structure of the data, we would expect that the P-values would be uniformly spread between 0 and 1 for those inferences for which the null is true. This expectation is derived from the fact that all P-values are equally likely under the global null hypothesis if the statistical test used to calculate the P-values employs the correct distribution for the sample population. (For example, if the F-statistics of the familiar anova procedure are used appropriately to compare group means with residuals that are normally distributed and homoscedastic, the distribution of calculated P-values will be uniform if there is no difference among the groups. The uniformity comes about because we are essentially sampling at random from the same population.) A uniform null distribution of P-values is likely to be true for the vast majority of studies in ecology and evolution and this distribution will, of course, hold true regardless of sample size.
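This expectation is straightforward to check by simulation. The sketch below (Python, using NumPy and SciPy; the sample size, number of tests and random seed are arbitrary illustrative choices, not values from this paper) runs many two-sample t-tests in which the null hypothesis is true and confirms that the resulting P-values fall roughly evenly across the interval (0, 1).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Simulate many two-sample t-tests for which the null hypothesis is true:
# both samples are drawn from the same normal population.
m = 10_000
a = rng.normal(0.0, 1.0, size=(m, 20))
b = rng.normal(0.0, 1.0, size=(m, 20))
p = stats.ttest_ind(a, b, axis=1).pvalue

# Under the global null, the P-values should be approximately uniform:
# each of the ten histogram bins should hold roughly 10% of the values.
counts, _ = np.histogram(p, bins=10, range=(0.0, 1.0))
print(counts / m)  # every entry should be close to 0.10
```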

Figure 1.

A frequency distribution of P-values obtained from a large set of simultaneous multiple comparisons of which 25% are truly significant. The truly significant comparisons are located above the dotted horizontal line and the comparisons that are called significant are located to the left of the dashed vertical line (which indicates α, the threshold P-value for determining declarations of significance). True positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) are marked on the distribution. The false discovery rate (FDR) is given by FP/(TP + FP) and the false positive rate (FPR) (which is used to arrive at P-values) is given by FP/(FP + TN).

The area underneath the dotted horizontal line in Fig. 1 thus represents those comparisons for which there is truly no significant difference. The dashed vertical line represents the threshold value, α, which we are accustomed to defining before deciding which P-values to call significant. Having overlaid these intersecting lines on our distribution of P-values, we can assign labels to each of the four demarcated areas. Bearing in mind that comparisons that are truly significant are above the dotted line and that the comparisons which we elect to call significant are to the left of the dashed line, we can identify the true positives (TP), the false positives (FP), the true negatives (TN), and the false negatives (FN). (FP is also known as Type I error while FN is also known as Type II error.) The FDR is the error rate in the set of comparisons that are called significant, or, in other words, the proportion of comparisons which are wrongly called significant: FDR = FP/(TP + FP).

It is worth pointing out that, although a false positive and a false discovery are the same thing, the false positive rate (FPR) is crucially different from the FDR. The FPR (which is the basis from which P-values become measures of significance) is the error rate in the set of comparisons that are truly not significant, or, equivalently, the proportion of non-significant comparisons that are wrongly called significant: FPR = FP/(FP + TN).
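The distinction is compact enough to state in code. The following sketch (Python; the cell counts are hypothetical numbers chosen purely to illustrate the arithmetic, not values taken from Fig. 1) computes both rates from the four demarcated areas.

```python
def false_discovery_rate(tp: int, fp: int) -> float:
    """FDR: the proportion of comparisons called significant that are wrong."""
    return fp / (tp + fp) if (tp + fp) > 0 else 0.0


def false_positive_rate(fp: int, tn: int) -> float:
    """FPR: the proportion of truly null comparisons wrongly called significant."""
    return fp / (fp + tn) if (fp + tn) > 0 else 0.0


# Hypothetical counts for a distribution like that of Fig. 1:
print(false_discovery_rate(tp=200, fp=40))  # FDR = 40/240 ≈ 0.167
print(false_positive_rate(fp=40, tn=710))   # FPR = 40/750 ≈ 0.053
```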

Of course, in real experiments, we do not know the proportion of truly significant comparisons. It is thus necessary to estimate this proportion in order to make an estimate of the FDR. Various methods of estimation have been developed: three methods which are likely to be most useful in contexts of ecology and evolution are presented in detail in the following section.

So what benefits are to be had from using the FDR to decide which comparisons are significantly different? The key benefit is that FDR-based comparison procedures are much more powerful (i.e. they are much less likely to miss a truly significant difference) than Bonferroni-type comparisons. If we plot the threshold values obtained from an FDR analysis against the comparisons ranked in order of ascending P-values, the threshold increases linearly and monotonically with rank. In contrast, a classical Bonferroni correction produces only one fixed and highly conservative threshold value. For most ‘improved’ Bonferroni-type corrections, although the significance threshold does increase with rank, it remains unduly conservative over the vast majority of P-values. (These different thresholding functions are plotted in Fig. 2.) FDR procedures thus provide a mechanism to avoid neglecting truly significant comparisons on a large scale. This increase in power comes at the cost of some increase in the number of comparisons that are wrongly called significant, and the FDR is used to ensure that this error rate remains within an acceptable limit.

Figure 2.

The threshold values for declaring significance obtained from three different multiple comparison procedures: the classical Bonferroni procedure, Holm’s (1979) improved stepwise Bonferroni procedure, and the classical false discovery rate (FDR) procedure (Benjamini & Hochberg 1995). Both α and the maximum FDR are set at 0·05. Note that the FDR-based procedure (solid line) gives a threshold that increases monotonically and is much less conservative than the thresholds produced by the other procedures.
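The three thresholding functions plotted in Fig. 2 follow directly from their definitions, as the short sketch below shows (Python with NumPy; m = 100 comparisons is an arbitrary example size).

```python
import numpy as np

m = 100                  # number of simultaneous comparisons (arbitrary example)
alpha = q = 0.05         # both alpha and the maximum FDR set at 0.05
i = np.arange(1, m + 1)  # ranks of the P-values, smallest first

bonferroni = np.full(m, alpha / m)  # one fixed threshold: alpha/m
holm = alpha / (m - i + 1)          # stepwise threshold: alpha/(m - i + 1)
fdr = i * q / m                     # Benjamini & Hochberg: rises linearly with rank

# At rank 50, the FDR threshold is already 50 times the Bonferroni threshold:
print(bonferroni[49], holm[49], fdr[49])  # 0.0005, ~0.00098, 0.025
```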

False discovery rate based multiple comparison procedures are also robust to the false positive paradox to which traditional Bonferroni-type procedures are susceptible. The false positive paradox occurs when declarations of significance are more likely to be wrong than right (i.e. false positives are more common than true positives). Because FDR procedures estimate the baseline distribution of null comparisons (i.e. the dotted line in Fig. 1), false positives are accurately incorporated into the FDR that is deemed acceptable as a significance threshold, and the bias against true positives is thus alleviated.

False discovery rate based multiple comparison procedures are appropriate in situations in which comparisons that are declared significant can be meaningfully interpreted even if one (or several) of these declarations is wrong. They are ideal for researchers who subscribe to Fisher’s opinion that tests of significance ‘are provisional, and involve an intelligent attempt to understand the experimental situation’ (Fisher 1955). Like all significant results, comparisons called significant based on FDR-based analyses should thus be earmarked for further study and verification.

FDR-based multiple comparison procedures

Since 1995, a variety of methods have been developed for controlling the FDR such that the significance of multiple comparisons can be simultaneously inferred. FDR-based methods are also gradually becoming available in statistical software packages like R and MATLAB, largely because advanced users have authored routines to fulfil their own needs (Paciorek 2004; Strimmer 2009; MathWorks Inc 2010). Thankfully, the methods which are likely to be of most interest to ecologists and evolutionary biologists are amongst those that can be calculated using only simple (if somewhat repetitive) arithmetic.

The algorithms underlying three FDR-based multiple comparison procedures are presented below. While knowledge of these algorithms is useful in evaluating the veracity of each procedure’s results, it is clearly unnecessary to carry out these procedures by hand. To enable automatic computation for any set of P-values, spreadsheet programs have been created and are available at the following website: http://www.webcitation.org/5s004b7CI. These programs can be run as an online application or downloaded as a file for use with spreadsheet software (Appendix S1, Supporting information). To run the analyses, one simply inserts the P-values for all comparisons into the appropriate input column of the application. (The second input column can optionally be used to provide labels for these P-values.) In addition to providing threshold values for determining the significance of multiple comparisons, the software routines provide FDR-adjusted P-values (which are described later) and plots that facilitate graphical comparison of the results produced by the three methods.

The first algorithm for controlling FDRs was provided by Benjamini & Hochberg (1995). Although this exact algorithm previously took the guise of an improved Bonferroni-type correction (Simes 1986), it also provides control of the FDR. After deciding on an acceptable value, q, for the FDR, a set of comparisons that is arranged in order of increasing P-value from i = 1 to m can be evaluated for significance using the following algorithm:

Classical one-stage method

  • (i) Set the significance threshold at the highest P-value at which the inequality P(i) ≤ iq/m holds true, declaring that this and all smaller P-values correspond to significant comparisons. If the inequality never holds true, no comparison is declared significant.
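For readers who prefer a scripted route to the spreadsheet routines described above, the following is a minimal sketch of the one-stage algorithm (Python with NumPy; the function name and the example P-values are our own illustrative choices).

```python
import numpy as np

def fdr_onestage(pvals, q=0.05):
    """Classical one-stage FDR procedure (Benjamini & Hochberg 1995).

    Returns a boolean array marking the comparisons declared significant.
    """
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)                     # ranks i = 1..m, smallest P-value first
    thresholds = np.arange(1, m + 1) * q / m  # iq/m for each rank i
    passing = np.nonzero(p[order] <= thresholds)[0]
    significant = np.zeros(m, dtype=bool)
    if passing.size > 0:                      # highest rank at which P(i) <= iq/m
        significant[order[:passing[-1] + 1]] = True
    return significant

# The three smallest of these eight P-values are declared significant at q = 0.05.
print(fdr_onestage([0.001, 0.008, 0.015, 0.041, 0.170, 0.300, 0.520, 0.740]))
```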

When the number of truly null comparisons, m0, is less than m, the algorithm above actually controls the FDR at the level qm0/m, which is more conservative than the desired q. A less conservative method can achieve a closer (or ‘sharper’) correspondence with the desired FDR by first estimating the number of non-significant comparisons. Benjamini, Krieger & Yekutieli (2006) provide a two-stage method which does exactly this:

Two-stage sharpened method

  • (i) Set the significance threshold at the highest P-value at which the inequality P(i) ≤ iq′/m holds true, where q′ = q/(1 + q). Obtain an estimate, m̂0, of the number of non-significant comparisons by counting the P-values for which this inequality does not hold. If m̂0 = m, stop and declare that no comparison is significant. Otherwise, continue to the next step.
  • (ii) Set the significance threshold at the highest P-value at which the inequality P(i) ≤ iq*/m holds true, where q* = q′m/m̂0. Declare that this and all smaller P-values correspond to significant comparisons.
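A sketch of the two-stage procedure in the same style follows (again Python with NumPy; the function name is our own). Note that it takes m̂0 to be m minus the number of first-stage rejections, the estimate used by Benjamini, Krieger & Yekutieli (2006).

```python
import numpy as np

def fdr_twostage(pvals, q=0.05):
    """Two-stage sharpened FDR procedure (Benjamini, Krieger & Yekutieli 2006)."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    ranks = np.arange(1, m + 1)
    order = np.argsort(p)
    p_sorted = p[order]

    def n_significant(level):
        # Highest rank i at which P(i) <= i*level/m (0 if the inequality never holds).
        passing = np.nonzero(p_sorted <= ranks * level / m)[0]
        return passing[-1] + 1 if passing.size > 0 else 0

    q1 = q / (1.0 + q)              # stage (i): stricter level q' = q/(1 + q)
    m0_hat = m - n_significant(q1)  # estimated number of non-significant comparisons
    significant = np.zeros(m, dtype=bool)
    if m0_hat == m:                 # nothing rejected at stage (i): stop
        return significant
    r = n_significant(q1 * m / m0_hat)  # stage (ii): sharpened level q* = q'm/m0_hat
    significant[order[:r]] = True
    return significant
```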

In the above two-stage procedure, a conservative estimate of the number of non-significant comparisons is obtained by a very slightly modified version of the classical one-stage method. A less conservative estimate of m0 can be obtained by consulting a plot of P-values vs. their ranks (Schweder & Spjøtvoll 1982). The larger P-values, which correspond to non-significant comparisons, tend to adhere to a linear relationship. The absolute value of the slope of the line that best approximates this linear relationship provides the estimate of m0. Benjamini & Hochberg (2000) incorporated this ‘graphical’ method of estimating the number of non-significant comparisons into an FDR-based multiple comparison procedure:

Graphically sharpened method

  • (i) Conduct the classical one-stage procedure. If one or more comparisons are called significant, continue to next step. Otherwise, declare that no comparison is significant and stop.
  • (ii) Without necessarily plotting P-values against their ranks, obtain point estimates of m0, m̂0(i), for each P-value, P(i), by calculating the slope of the line between (1, m + 1) and (P(i), i): m̂0(i) = (m + 1 − i)/(1 − P(i)).
  • (iii) Starting at i = 2, compare consecutive point estimates until the first level at which the inequality m̂0(i) > m̂0(i − 1) holds true. Set the best estimate, m̂0, to either m̂0(i) at that level (rounded up to the nearest integer) or m, whichever value is smaller.
  • (iv) Set the significance threshold at the highest P-value at which the inequality P(i) ≤ iq*/m holds true, where q* = qm/m̂0. Declare that this and all smaller P-values correspond to significant comparisons.
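The same scaffolding accommodates the graphically sharpened method, as sketched below (Python with NumPy; the code assumes that no P-value is exactly 1, which would make a slope estimate undefined).

```python
import math
import numpy as np

def fdr_graphical(pvals, q=0.05):
    """Graphically sharpened FDR procedure (Benjamini & Hochberg 2000)."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    ranks = np.arange(1, m + 1)
    order = np.argsort(p)
    p_sorted = p[order]

    def n_significant(level):
        passing = np.nonzero(p_sorted <= ranks * level / m)[0]
        return passing[-1] + 1 if passing.size > 0 else 0

    significant = np.zeros(m, dtype=bool)
    if n_significant(q) == 0:  # step (i): classical one-stage procedure
        return significant

    # Step (ii): slope-based point estimates m0_hat(i) = (m + 1 - i)/(1 - P(i)).
    slopes = (m + 1 - ranks) / (1.0 - p_sorted)

    # Step (iii): advance while the estimates are decreasing; take the first
    # estimate that exceeds its predecessor, capped at m.
    i = 1
    while i < m and slopes[i] <= slopes[i - 1]:
        i += 1
    m0_hat = min(math.ceil(slopes[i]), m) if i < m else m

    r = n_significant(q * m / m0_hat)  # step (iv): sharpened level q* = qm/m0_hat
    significant[order[:r]] = True
    return significant
```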

The accuracy of all three of the above methods relies on the P-values being uniformly distributed under the global null hypothesis. This assumption will be legitimate provided that the analysis used to obtain the P-values is appropriate to the data. It is nevertheless worth noting that, even in exceptional circumstances in which the P-values are not uniformly distributed, FDR estimates can still be obtained through procedures that use more computationally demanding bootstrapping methods (Storey 2002; Storey & Tibshirani 2003; Storey, Taylor & Siegmund 2004).

False discovery rate based multiple comparison algorithms have the perhaps unwanted potential to result in a declaration of significance even if a comparison’s original P-value was greater than α (Benjamini & Hochberg 2000; Holland & Cheung 2002). If this potential is of concern, it is a simple matter to adjust an algorithm to disregard comparisons whose P-values exceed the original threshold. In the FDR-based software routines that accompany this paper, comparisons which are called significant but have P-values greater than 0·05 are highlighted with exclamation marks (!) while comparisons that are significant without this violation are marked with asterisks (*).

The efficiency of FDR-based procedures does, of course, depend on having a sufficiently large number of comparisons from which the baseline rate of false discoveries can be estimated. When only a handful of comparisons is being tested for significance, FDR procedures may produce results that are less accurate than those obtained via sharpened Bonferroni-type procedures. It should also be noted that FDRs for a set of simultaneous inferences are biased downward (leading to inflation of the number of claimed discoveries) when comparisons that are already known to be significant are included. Of course, this sort of weakness is not unique to FDR-based procedures. Bonferroni-type procedures can also be manipulated in favour of discovery by illegitimately excluding comparisons whose hypotheses are only weakly supported. Either type of manipulation defeats the purpose of multiple comparison procedures and forgoes the statistical confidence that the right procedure can provide.

FDR-adjusted P-values and q-values

Once an FDR-based multiple comparison procedure has been used to obtain threshold values for declaring significance, it is usually a relatively simple matter to adjust the original P-values so that they reflect the multiplicity correction (Yekutieli & Benjamini 1999; Troendle 2000). Such adjusted values are routinely used as a convenient means of presenting the implications of Bonferroni-type multiple comparison procedures in FPR contexts: adjusted P-values that are less than α are understood to be significant (Shaffer 1995). Similarly, in FDR contexts, adjusted P-values that are less than q (the FDR that has been set as the threshold value) are understood to be significant. Troendle (2000) pointed out that applying the same terminology of ‘adjusted P-value’ to procedures that control different quantities (FDRs vs. false rejections of null hypotheses) may lead to confusion, and suggested that the term ‘adjusted FDR-value’ could be usefully employed. Adherence to this suggestion in the subsequent literature has been only partial.
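For the classical one-stage procedure, the adjustment has a simple closed form: the adjusted value for the comparison of rank i is the smallest value of (m/j)P(j) over all ranks j ≥ i, capped at 1 (this is, for instance, what R’s p.adjust function computes under its ‘BH’ method). A minimal sketch in Python with NumPy:

```python
import numpy as np

def fdr_adjusted_pvalues(pvals):
    """FDR-adjusted P-values for the classical one-stage procedure: a
    comparison is significant whenever its adjusted value is below q."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    raw = p[order] * m / np.arange(1, m + 1)  # (m/i) * P(i) for each rank i
    # A running minimum from the largest rank downwards enforces monotonicity.
    adj = np.minimum.accumulate(raw[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(adj, 1.0)
    return adjusted

# Example: the first three adjusted values fall below q = 0.05.
print(fdr_adjusted_pvalues([0.001, 0.008, 0.015, 0.041, 0.170]))
# -> [0.005, 0.02, 0.025, 0.05125, 0.17]
```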

In 2003, Storey introduced the related concept of a q-value. The definition of this value is based not on the FDR but on the pFDR (or positive FDR, a quantity that requires that at least one comparison has been called significant). The only difference between the FDR and the pFDR is the way in which each quantity deals with the rare situation in which no comparison is called significant: the FDR is set to zero while the pFDR is undefined (Black 2004). The subtle difference between these definitions rarely matters in practice, and the distinction is important mainly because pFDRs are usually estimated while FDRs are strictly controlled. The q-value of a given comparison is the minimum pFDR at which that comparison, along with all other comparisons that have smaller P-values, can be called significant (Storey 2003; Storey & Tibshirani 2003). It deserves to be explicitly noted that, just as the q-value of a comparison is the minimum pFDR required to call that comparison significant, a P-value is the minimum FPR required to call a comparison significant (Lehmann & Romano 2005). To reiterate using the familiar terminology of probability, a P-value is the probability that a comparison is called significant given that it truly was not significant. Equivalently, a q-value is the probability that a comparison is truly not significant given that it was called significant.

A property of the q-value metric is that it gives an overall FDR for a set of comparisons without taking into account that comparisons whose P-values lie close to the threshold characterizing the set will actually produce higher error rates than comparisons whose P-values lie close to zero. This is not a weakness that is unique to the FDR context: the part of a rejection region that lies close to the threshold defined by a P-value will produce an FPR somewhat greater than α while the part close to zero will produce an FPR that is considerably smaller. Nevertheless, the fact that q-values give FDRs for sets of comparisons (but not for individual comparisons) should be considered when interpreting FDR-based analyses.

Because pFDR and FDR are often equivalent in practical terms and because it is appealing to find the same correspondence between q-values and FDRs as occurs between P-values and FPRs, the terms ‘q-value’ and ‘adjusted P-value’ are often used interchangeably in FDR-based analyses. (To reflect this common usage, the spreadsheet program that supports this paper uses both terms side by side in referring to the same quantity.)

Conclusion

The FDR is a powerful concept by which one can retain the statistical power that would otherwise be lost to simultaneous comparisons made with Bonferroni-type procedures. Bonferroni-type procedures do control the sort of error that should ideally be controlled in analyses that seek to be confirmatory. However, controlling this error often leads to dismissing real discoveries and, of course, no multiple comparison procedure can approach the confirmatory power of further empirical study. FDR-based procedures gain sizeable power to identify truly significant comparisons in exchange for less control over wrong claims of significance. In many studies of ecology and evolution, this arrangement is the best choice available.
