Ensuring that sound science informs policy decisions has been suggested as one of the most important societal issues faced by plant scientists in the twenty-first century (Grierson et al., 2011). Plant scientists are therefore responsible for ensuring that data are analyzed and presented in a way that facilitates good decision-making. Good statistical practices can be an important tool for ensuring objective and transparent data analysis. Despite frequent criticism over the last few decades (Cohen, 1994; Gigerenzer, 2004; Rinella & James, 2010), null hypothesis significance testing (NHST) remains widely used in plant science. While others have argued that the persistence of NHST is due to ignorance of alternative approaches or resistance to change (Fidler et al., 2004; Gigerenzer et al., 2004), its persistence is likely at least partly due to its usefulness as a decision-making tool (Robinson & Wainer, 2002; Mogie, 2004; Stephens et al., 2007). Are different plant species equivalent in their ability to remediate toxic land? Are different management actions equivalent in their ability to control invasive plant species? Are there differences in antibiotic activity among different plant metabolites? Do more diverse plant communities provide greater levels of a particular ecosystem service than less diverse plant communities? Each of these questions could be appropriately answered using null hypothesis significance tests. Other tests might also be appropriate, but it would not be wrong to use NHST. The utility of NHST as a decision-making tool does not, however, warrant ignoring the problems that are associated with these statistical tests. The problems associated with NHST need to be addressed in order for it to better inform policy decisions that involve plant science.
The major flaws of NHST primarily surround the use of an arbitrary threshold for judging statistical significance. Although the use of α = 0.05 does represent a common criterion for Type I errors that everyone must adhere to, consistent use of this threshold results in Type II error rates that vary wildly among studies. The value of 0.05 as a significance criterion has no logical foundation, nor does the practice of holding Type I errors consistent while allowing Type II errors to vary (Cowles & Davis, 1982). Consistent use of an arbitrary significance threshold also causes statistical significance to frequently differ from biological significance (Martínez-Abraín, 2008) and makes significance heavily influenced by sample size (Johnson, 1999), with high sample sizes tending to make even trivial effects statistically significant (Nakagawa & Cuthill, 2007) and low sample sizes leading to even large effects being considered non-significant (Sedlmeier & Gigerenzer, 1989). Use of a consistent significance threshold regardless of sample size has contributed to frequent misinterpretations of P-values as being ‘highly significant’ or ‘marginally significant’, and/or as measures of how likely the alternate hypothesis is to be true (Hubbard & Bayarri, 2003).
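The dependence of the Type II error rate on sample size at a fixed α can be illustrated numerically. The following Python sketch is not drawn from any of the cited studies. It assumes a two-sided, two-sample comparison of means with equal group sizes, approximates the test statistic as normal (a z-test rather than a t-test), and uses an illustrative critical effect size of 0.5 standard deviations. The function name `type_ii_error` and all parameter values are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def type_ii_error(alpha, effect_size, n_per_group):
    """Approximate Type II error rate (beta) of a two-sided, two-sample z-test.

    effect_size is the critical difference between group means divided by
    the within-group standard deviation (an assumed, illustrative input).
    """
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    # Expected value of the z statistic if the critical effect really exists.
    shift = effect_size * (n_per_group / 2) ** 0.5
    # Probability that the statistic falls inside the non-rejection region.
    return STD_NORMAL.cdf(z_crit - shift) - STD_NORMAL.cdf(-z_crit - shift)


if __name__ == "__main__":
    # Same alpha and same critical effect size, different sample sizes.
    for n in (5, 20, 100):
        beta = type_ii_error(alpha=0.05, effect_size=0.5, n_per_group=n)
        print(f"n per group = {n:3d}  alpha = 0.05  beta = {beta:.3f}")
```

Under these assumptions, the Type I error rate is identical in all three hypothetical studies, while the chance of missing a biologically relevant effect falls from very high at the smallest sample size to low at the largest.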
If using a consistent arbitrary significance level is problematic, how should significance levels be set? I argue that the significance threshold for a null hypothesis significance test should be set to achieve the goal of the statistical test. One (and perhaps the only) reasonable goal of NHST is to minimize the chances and/or costs of making wrongful conclusions concerning a set of collected data. If the goal of NHST is to minimize the chances and/or costs of errors, then the decision-making threshold (α) should be set to minimize the combined probabilities and/or combined costs of Type I and Type II errors. Mudge et al. (2012a) describe a general approach for calculating study-specific optimal significance levels that minimize the combined probabilities and/or costs of Type I errors under the null hypothesis and Type II errors under the alternate hypothesis. The optimal α approach for null hypothesis significance tests is paralleled by signal detection theory in electrical engineering (Peterson et al., 1954) and psychology (Green & Swets, 1966) and has recently been applied in environmental monitoring (Mudge et al., 2012b) and physiology (Mudge et al., 2012c).
The calculation of an optimal significance level requires estimates of the same parameters that are needed to calculate statistical power for a null hypothesis significance test (sample size and a critical effect size relative to variability). It also requires an estimate of the relative costs of Type I vs Type II error, so that the combined costs of the two types of error can be minimized. The relative prior probabilities of the null and alternate hypotheses, if known, can also be incorporated into the calculation of an optimal significance level, although studies with prior probability estimates typically employ Bayesian statistical methods. The need to explicitly consider and specify a critical effect size relative to variability and the relative costs of Type I vs Type II error may appear to constitute a barrier to calculating optimal significance levels in plant science, but it should not be viewed as such. Critical effect sizes, costs of errors and prior probabilities of null and alternate hypotheses are important to consider for any research question, and failure to incorporate them into the statistical decision-making threshold leads to implicit and unexamined assumptions about them when using α = 0.05.
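A minimal sketch of such a calculation is given below. It is an illustration under simplifying assumptions, not the procedure of Mudge et al. (2012a): it assumes a two-sided, two-sample z-test with equal group sizes, equal prior probabilities for the null and alternate hypotheses, and a cost-weighted average of α and β as the quantity to be minimized. The function names, the `cost_ratio` parameter (cost of a Type I error divided by cost of a Type II error) and the simple grid search are all hypothetical choices made for clarity.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def type_ii_error(alpha, effect_size, n_per_group):
    """Approximate beta of a two-sided, two-sample z-test (normal approximation)."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = effect_size * (n_per_group / 2) ** 0.5
    return STD_NORMAL.cdf(z_crit - shift) - STD_NORMAL.cdf(-z_crit - shift)


def optimal_alpha(effect_size, n_per_group, cost_ratio=1.0, grid_points=9999):
    """Grid-search the alpha that minimizes the cost-weighted average error.

    cost_ratio is the assumed cost of a Type I error relative to a Type II
    error; cost_ratio=1.0 treats both errors as equally serious.
    Returns (alpha, beta, weighted_error) at the minimum found on the grid.
    """
    best = None
    for i in range(1, grid_points + 1):
        alpha = i / (grid_points + 1)  # candidate alphas strictly inside (0, 1)
        beta = type_ii_error(alpha, effect_size, n_per_group)
        weighted = (cost_ratio * alpha + beta) / (cost_ratio + 1)
        if best is None or weighted < best[2]:
            best = (alpha, beta, weighted)
    return best


if __name__ == "__main__":
    # Hypothetical study designs: critical effect of 0.5 sd, two sample sizes.
    for n in (20, 100):
        alpha, beta, weighted = optimal_alpha(effect_size=0.5, n_per_group=n)
        print(f"n = {n:3d}: alpha = {alpha:.4f}, beta = {beta:.4f}, "
              f"average error = {weighted:.4f}")
    # Treating a Type I error as four times as costly lowers the threshold.
    print(optimal_alpha(effect_size=0.5, n_per_group=20, cost_ratio=4.0))
```

In this sketch the threshold depends on the design: with equal costs, the smaller hypothetical study receives a more lenient α than the larger one, and raising the assumed cost of a Type I error lowers α.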
A critical effect size is the smallest size of effect that would be considered biologically meaningful if it were to exist. What is the minimum meaningful magnitude of antibiotic activity for a plant metabolite? How much variability among different management actions is enough to warrant a change in management policy for controlling invasive plant species? What is a biologically meaningful strength of relationship between plant species diversity and levels of particular ecosystem services? Currently in plant science, biological relevance is often discussed only after the presentation of results, creating pressure to justify statistical significance as being biologically relevant (i.e. arguments that statistically significant results are meaningful and that nonsignificant results do not reflect a meaningful observed effect size). Using α = 0.05, it is common to have statistical significance without biological significance and vice versa. Setting critical effect sizes a priori and incorporating them into the statistical decision-making process removes a component of subjectivity in results interpretation (the post hoc evaluation of biological relevance), and the subjectivity that remains (the selection of the critical effect size itself) is at least transparently stated. There may also be some objective methods for setting critical effect sizes in plant science. Previous studies can sometimes be helpful when setting a critical effect size for a new study. The magnitude of antibiotic activity of compounds used in current drugs may inform the level that would be considered meaningful for a newly studied plant metabolite. Setting critical effect sizes such that they fall outside some specified range of natural variability may also be appropriate. The amount of variability in effectiveness of any single management action for controlling invasive plant species may inform the amount of variability tolerated among different management approaches.
When there is no strong justification for any single critical effect size, significance thresholds can be calculated for multiple potential effect sizes representing small, intermediate and large effects, enabling readers to evaluate results based on the effect size that they feel is most appropriate. Ultimately, although specifying a critical effect size for a research question may be a difficult task, the authors themselves (as experts in their study subject) are nearly always among the most qualified to make this important estimate, and the responsibility for doing so should fall on them. Presenting data but shifting the entire burden of judging biological significance onto the reader in an attempt to circumvent the possibility of criticism discourages discussion about biological relevance among plant scientists.
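Reporting thresholds for several candidate effect sizes could look like the following sketch. It rests on assumptions that are not part of the argument above: a two-sided, two-sample z-test with equal group sizes, equal costs and equal prior probabilities for the two error types, and standardized effect sizes of 0.2, 0.5 and 0.8 standard deviations as stand-ins for small, intermediate and large effects. The sample size, the reported P-value and all function names are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def beta_at(alpha, effect_size, n_per_group):
    """Approximate Type II error rate of a two-sided, two-sample z-test."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = effect_size * (n_per_group / 2) ** 0.5
    return STD_NORMAL.cdf(z_crit - shift) - STD_NORMAL.cdf(-z_crit - shift)


def optimal_alpha(effect_size, n_per_group, steps=9999):
    """Alpha on a regular grid that minimizes alpha + beta (equal error costs)."""
    grid = (i / (steps + 1) for i in range(1, steps + 1))
    return min(grid, key=lambda a: a + beta_at(a, effect_size, n_per_group))


def thresholds_table(p_value, n_per_group, effect_sizes):
    """For each labelled effect size, return (optimal alpha, significant?)."""
    table = {}
    for label, d in effect_sizes.items():
        alpha = optimal_alpha(d, n_per_group)
        table[label] = (alpha, p_value < alpha)
    return table


if __name__ == "__main__":
    # Illustrative values only: 30 samples per group and a reported P of 0.10.
    candidates = {"small": 0.2, "intermediate": 0.5, "large": 0.8}
    for label, (alpha, significant) in thresholds_table(0.10, 30, candidates).items():
        verdict = "significant" if significant else "not significant"
        print(f"{label:>12}: optimal alpha = {alpha:.3f} -> {verdict}")
```

A table of this kind lets a reader pick the row that matches the effect size they consider biologically meaningful and read off the corresponding conclusion for the reported P-value.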
Regardless of research question, statistical decision errors in plant science research always have associated costs. These costs may be financial and/or any combination of nonmonetary costs such as wasted time, loss of scientific progress, or damage to the environment. Some costs, such as wasted time, can be more easily translated into monetary terms than others, such as loss of scientific progress or environmental damage. Although such costs are often difficult to quantify, the consideration of costs of errors is inherent in good decision-making, and plant science is no exception. It is only rational to accept a greater probability of the more costly type of error, when some of that probability could instead be shifted to the less costly type, if even a small reduction in the more costly error would require so large an increase in the likelihood of the less costly error that the added chance of error outweighs the gain from avoiding the more costly one. There have been many attempts to quantify costs of environmental degradation (see Costanza et al., 1997 for a review of > 100 attempts to quantify the value of ecosystem goods and services). For ecosystem management, costs of errors represent potential degradation of ecosystem value vs potential unnecessary management costs and potential lost economic opportunities. For basic plant science research, costs of errors are more difficult to quantify because implications of Type I and Type II errors are often not easily foreseen. Plant scientists are not, however, totally unaccustomed to considering the relative costs of errors, as judgments concerning relative costs of errors are often made when ‘conservative’ estimates are made for particular unknown parameters. Choosing to make a ‘conservative’ estimate implies that any errors associated with underestimation are less serious than errors associated with overestimation.
For cases where costs of errors are truly unknown, setting equal costs of errors assumes that any errors are equally serious and allows for the calculation of an optimal significance level that minimizes the combined probabilities of Type I and Type II error. Setting relative costs of Type I vs Type II error equal may, therefore, often be the best practice for purely scientific questions.
While determining an optimal decision-making threshold separately for each study through explicit consideration of biological relevance and relative costs of Type I vs Type II errors is a good thing, pressures to publish and self-delusion may sometimes tempt researchers to choose an inappropriate critical effect size or relative error cost ratio to produce a desired statistical outcome. However, the practice of using α = 0.05 is also not immune to the manipulation of statistical significance. When using α = 0.05, one can virtually guarantee statistical significance by taking many samples and focusing on endpoints with low variability. Similarly, nonsignificance can be nearly assured by taking few samples and introducing measurement error to increase variability. Use of α = 0.05 only creates the illusion of consistent statistical rigor among studies while hiding potential manipulation of significance behind combinations of sample size and endpoint variability. By contrast, it is more difficult to disguise the manipulation of significance when using the optimal α approach, for two reasons. First, authors would have to explicitly state and provide a rationale for their critical effect size and relative error cost estimate, and unusual critical effect sizes or error cost estimates would (hopefully) raise the eyebrows of both reviewers and readers. Second, researchers would essentially ‘shoot themselves in the foot’ by attempting to increase the chance of a significant conclusion by claiming that inappropriately small effect sizes would be important to detect, because doing so would result in a higher α level, which would weaken confidence in the statistical outcome.
Also, if sample sizes and P-values are stated in a paper where a reader disagrees with the authors’ choice of critical effect size or relative error cost estimate, there is nothing preventing the reader from calculating another optimal significance level for their own critical effect size and/or error cost estimate and re-evaluating the significance accordingly.
Null hypothesis significance tests remain commonly used as a statistical decision-making technique in plant science, but consistently relying on the traditional α = 0.05 significance level results in unnecessarily high combinations of Type I and biologically relevant Type II errors and results in conclusions not tied to biological relevance or the relative costs of Type I vs Type II errors. Critical effect sizes and costs of errors are imperative to making good decisions in plant science and there is no rationale for continuing to use a consistent but arbitrary statistical decision-making threshold in plant science when an alternative exists that incorporates critical effect sizes and relative error costs to set study-specific significance thresholds that minimize the combined probabilities or costs of Type I and biologically relevant Type II errors. I believe that incorporation of explicitly considered and transparently stated estimates of critical effect sizes and relative costs of errors into statistical decision-making thresholds will improve the relevance of plant science research for policy-makers in the twenty-first century.