Insights & Perspectives
Optimizing α for better statistical decisions: A case study involving the pace-of-life syndrome hypothesis
Optimal α levels set to minimize Type I and II errors frequently result in different conclusions from those using α = 0.05
Setting optimal significance levels that minimize Type I and Type II errors allows for more transparent and well-considered statistical decision making than relying on the traditional α = 0.05 significance level. We use the optimal α approach to re-assess conclusions reached by three recently published tests of the pace-of-life syndrome hypothesis, which attempts to unify occurrences of different physiological, behavioral, and life history characteristics under one theory, over different scales of biological organization. While some of the conclusions reached using optimal α were consistent with those previously reported using the traditional α = 0.05 threshold, opposing conclusions were also frequently reached. The optimal α approach reduced probabilities of Type I and Type II errors, and ensured statistical significance was associated with biological relevance. Biologists should seriously consider their choice of α when conducting null hypothesis significance tests, as there are serious disadvantages with consistent reliance on the traditional but arbitrary α = 0.05 significance level.
A better alternative to using α = 0.05 for null hypothesis significance tests in biological research
For several decades, null hypothesis significance testing has been under attack 1–3. It nevertheless remains widely used in biological research despite suggestions that Bayesian 4, confidence interval 5, or AIC 6 approaches are more valid and/or appropriate. One explanation for the continued use of null hypothesis significance testing in biological research is that biologists are ignorant of alternative approaches and/or resistant to change. Another explanation is that null hypothesis significance tests have real utility for biologists as a statistical decision-making tool 7–9. While the former may occasionally be true, we think the latter more commonly explains the use of null hypothesis testing. Many of the drawbacks attributed to null hypothesis significance testing revolve around the use of an arbitrary significance threshold (i.e. α = 0.05). A better alternative has recently been described 10 that allows the user to find the best possible compromise between Type I and Type II error rates for their particular study design. Application of the optimal α approach has the potential to improve the interpretation of results and reduce overall error rates in biological research.
Problems arising from using the α = 0.05 significance level can be easily illustrated by re-evaluating published null hypothesis significance tests with the optimal α approach and comparing the conclusions with those obtained using α = 0.05. Mudge et al. 11 have shown that, for null hypothesis significance tests conducted under the Canadian Environmental Effects Monitoring program, 12% of tests would have reached different conclusions had they used an optimal α that minimized the a priori probability of making an error (i.e. Type I or II). Wrong conclusions caused by an inappropriate statistical decision-making threshold have clear consequences for applied environmental monitoring research. Wrong conclusions caused by an inappropriate α level can also have important consequences for pure, theoretical biological research. Here, we use three recently published papers 12–14 examining the pace-of-life syndrome (POLS) hypothesis, which predicts specific linkages among life-history, physiological, and behavioral characteristics at among-species, among-population, and within-population levels of biological organization, to compare how conclusions would change using optimal α levels versus the traditional α = 0.05 statistical threshold.
The pace-of-life syndrome hypothesis: Live fast, die young?
The POLS hypothesis suggests that suites of physiological characteristics of species have coevolved with associated behavioral and life history characteristics. Under this hypothesis, correlations among physiological, behavioral, and life history characteristics should be observable at within-population, among-population, and among-species levels of biological organization 15, 16. Proponents predict that organisms with lower metabolism and/or higher immune responses tend to be longer-lived and exhibit lower levels of aggressive behavior than similar organisms with higher metabolism and/or lower immune response. That is to say, some organisms exhibit a slower, more cautious pace of life with significant investment in survival, while others exhibit a faster, more reckless pace of life with significant investment in rapid reproduction. Tests of the POLS hypothesis typically involve checking for these expected correlations at different levels of biological organization. Some level of correlation among physiological, behavioral, and life history traits can arise simply by chance. As such, an important question when testing the POLS hypothesis becomes “What level of observed correlation should be considered support for the POLS hypothesis?” This has usually been determined using null hypothesis significance tests with the traditional α = 0.05 significance threshold.
Application of the optimal α approach to re-analyze tests of the pace-of-life syndrome hypothesis
The optimal α approach determines the most appropriate significance level for a null hypothesis significance test by calculating the average of α (the probability of Type I error under the null hypothesis) and β (the probability of Type II error under the alternate hypothesis) over a range of possible α levels, and then choosing the α level associated with the lowest average of Type I and Type II error rates 10. Probabilities of Type II error under the alternate hypothesis depend on α, sample size, and the critical effect size (i.e. the smallest effect that would be considered biologically relevant, if real) relative to the amount of variability in the data. Critical effect sizes precisely specify a minimum level of biological relevance for the alternate hypothesis. This allows for the calculation of an optimal α level that separates the p-values at which we should behave as if the null hypothesis were true (i.e. conclude “non-significant”) from the p-values at which we should behave as if an effect as large as or larger than the critical effect size is true (i.e. conclude “significant”).
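The search described above can be sketched in a few lines of Python. This is a minimal illustration only, not the procedure of reference 10. It assumes a two-sided test of zero correlation, approximates β with the Fisher z transformation, and scans an evenly spaced grid of α values. The function names `type_ii_error` and `optimal_alpha` are hypothetical, and because of the approximation the output will not reproduce the values in Tables 1–3.

```python
from math import atanh, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution


def type_ii_error(alpha: float, n: int, r_squared: float) -> float:
    """Approximate beta for a two-sided test of zero correlation.

    Assumption of this sketch: the Fisher z approximation, in which
    atanh(r) is roughly normal with standard error 1 / sqrt(n - 3).
    """
    shift = atanh(sqrt(r_squared)) * sqrt(n - 3)
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    # Probability that the statistic falls inside the acceptance region
    # when the true effect equals the critical effect size.
    return _STD_NORMAL.cdf(z_crit - shift) - _STD_NORMAL.cdf(-z_crit - shift)


def optimal_alpha(n: int, r_squared: float, steps: int = 10_000):
    """Grid-search the alpha that minimizes the average of alpha and beta.

    Returns (alpha, beta, average). The grid resolution is 1 / steps, so
    optima smaller than that cannot be resolved.
    """
    best = None
    for i in range(1, steps):
        alpha = i / steps
        beta = type_ii_error(alpha, n, r_squared)
        average = (alpha + beta) / 2
        if best is None or average < best[2]:
            best = (alpha, beta, average)
    return best


if __name__ == "__main__":
    # Sample size and critical effect sizes taken from the first case study.
    for r2 in (0.33, 0.5, 0.67):
        alpha, beta, average = optimal_alpha(n=13, r_squared=r2)
        print(f"R2={r2}: alpha={alpha:.4f} beta={beta:.4f} average={average:.4f}")
```

Larger critical effect sizes and larger samples push the optimal α down, because β falls quickly and there is less to gain from tolerating Type I errors.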
As critical effect sizes were not clearly stated in our case studies (or in most biological research), we chose to report optimal α test conclusions for three potential critical effect sizes. This was to show how optimal α levels change when researchers consider small, intermediate, and large effects to be biologically relevant. The small critical effect size was chosen to represent a scenario where, if there were a direct relationship between the two variables, no more than two other variables could individually explain more of the variance in the dependent variable than the variance explained by the independent variable (i.e. an R² = 0.33). The intermediate critical effect size was chosen to represent a scenario where, if there were a direct relationship between the two variables, no other variable could explain more of the variance in the dependent variable than the variance explained by the independent variable (i.e. an R² = 0.5). The large critical effect size was chosen to represent a scenario where, if there were a direct relationship between the two variables, the independent variable would explain at least twice as much of the variance in the dependent variable as any other variable (i.e. an R² = 0.67). In these cases, we have chosen critical effect sizes based on statistical criteria, but if clear biological criteria would supply superior critical effect sizes, we recommend using them. Here, it is not clear that any absolute correlation strength between dependent and independent variables would have any particular biological relevance as a threshold for support of the POLS hypothesis.
For each test, we calculated the optimal α and associated β for each of our three chosen effect sizes and determined whether the result would be considered “significant” or “non-significant” using the optimal α approach. We also calculated the β level using α = 0.05 and the associated average of α and β for α = 0.05. This allowed us to calculate the percent reduction in the average a priori probability of Type I and II errors (assuming equal prior probabilities of null and alternate hypotheses) associated with switching to the optimal α approach.
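The percent reduction is simple arithmetic and can be checked with a short sketch. The function names below are hypothetical. The numbers in the usage example are those reported in Table 1 for the small critical effect size (β = 0.401 at α = 0.05, optimal α = 0.164, optimal β = 0.175).

```python
def average_error(alpha: float, beta: float) -> float:
    """Average of Type I and Type II error rates (equal priors assumed)."""
    return (alpha + beta) / 2


def percent_reduction(beta_at_005: float, alpha_opt: float, beta_opt: float) -> float:
    """Percent drop in the average error when moving from alpha = 0.05
    to the optimal alpha."""
    traditional = average_error(0.05, beta_at_005)
    optimal = average_error(alpha_opt, beta_opt)
    return 100 * (traditional - optimal) / traditional


# Table 1, small critical effect size: averages of 0.2255 and 0.1695.
reduction = percent_reduction(beta_at_005=0.401, alpha_opt=0.164, beta_opt=0.175)
print(round(reduction, 1))  # 24.8
```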
Conclusions from pace-of-life syndrome tests sometimes differ between α = 0.05 and optimal α
Johnson et al. 12 tested whether life history characteristics can be used to predict the number and seriousness of parasite infections in different amphibian species. Under the POLS hypothesis, slow-developing, long-lived amphibian species are predicted to have fewer and less serious parasite infections than fast-developing, short-lived species. The authors used a general linear model approach with data from 13 amphibian species. They focused on the relationships between PCA axis scores representing a pace-of-life life history continuum and: (1) parasite loads (p = 0.026), (2) mortality risk (p = 0.023) and (3) malformation rates (p = 0.069). Seven of nine conclusions associated with the optimal α approach agree with conclusions made by the authors using α = 0.05 (Table 1). For the cases where α = 0.05 and optimal α result in different conclusions, the optimal α approach resolves an interpretation issue faced by the authors. The authors discuss the relationship between malformations and pace-of-life as if it were biologically meaningful despite not being significant at α = 0.05. Using the optimal α approach, the result would be interpreted as statistically significant for small and intermediate effect sizes, which corresponds with the authors' interpretation of the data.
Table 1. α, β and a priori average of α and β for α = 0.05 and optimal α approaches at small (R² = 0.33), intermediate (R² = 0.5), and large (R² = 0.67) potential critical effect sizes, associated with null hypothesis tests for relationships between amphibian species' parasite load and their pace-of-life (as described by their life-history characteristics), between amphibian species' mortality and their life-history pace-of-life, and between amphibian species' malformation rate and their life-history pace-of-life (from 12). Also shown are the % reduction in the a priori average of α and β associated with using the optimal α instead of α = 0.05, and the test conclusions associated with the optimal α for each potential critical effect size. Test conclusions that differ from the conclusion reached using α = 0.05 are marked with an asterisk.

| Critical effect size | β at α = 0.05 | Average of α and β at α = 0.05 | Optimal α | β at optimal α | Average of α and β at optimal α | % reduction in average | Conclusion: parasite load | Conclusion: mortality | Conclusion: malformation rate |
|---|---|---|---|---|---|---|---|---|---|
| R² = 0.33 | 0.401 | 0.2255 | 0.164 | 0.175 | 0.1695 | 24.8 | Significant | Significant | Significant* |
| R² = 0.5 | 0.124 | 0.087 | 0.084 | 0.072 | 0.078 | 10.3 | Significant | Significant | Significant* |
| R² = 0.67 | 0.008 | 0.029 | 0.028 | 0.019 | 0.0235 | 19.0 | Significant | Significant | Non-significant |
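The conclusions in Table 1 follow from comparing each reported p-value with the threshold for the chosen critical effect size. The sketch below, with hypothetical names, reruns that comparison using the p-values reported by Johnson et al. 12 and the optimal α levels from Table 1, and flags conclusions that differ from those at α = 0.05.

```python
# p-values reported for the three relationships in the first case study
P_VALUES = {"parasite load": 0.026, "mortality risk": 0.023, "malformation rate": 0.069}

# Optimal alpha for each potential critical effect size (R squared), from Table 1
OPTIMAL_ALPHA = {0.33: 0.164, 0.5: 0.084, 0.67: 0.028}


def conclusion(p_value: float, alpha: float) -> str:
    """Standard decision rule: significant when p falls below the threshold."""
    return "Significant" if p_value < alpha else "Non-significant"


def compare(p_values: dict, optimal_alphas: dict, traditional: float = 0.05) -> list:
    """Return (r_squared, test, optimal-alpha conclusion, differs-from-0.05) rows."""
    rows = []
    for r_squared, alpha in optimal_alphas.items():
        for test, p_value in p_values.items():
            optimal = conclusion(p_value, alpha)
            rows.append((r_squared, test, optimal, optimal != conclusion(p_value, traditional)))
    return rows


for r_squared, test, result, differs in compare(P_VALUES, OPTIMAL_ALPHA):
    print(f"R2={r_squared} {test}: {result}{'*' if differs else ''}")
```

Only the malformation rate test changes its conclusion, and only at the small and intermediate critical effect sizes. This matches the seven of nine agreements noted in the text.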
David et al. 13 tested whether body condition and/or personality characteristics summarized onto a pace-of-life axis can be used to predict how long it takes for zebra finches (Taeniopygia guttata) to begin feeding after deprivation. The pace-of-life hypothesis predicted relationships between latency to feed after food deprivation and: (1) PCA axis scores composed of activity, exploration, neophobia and risk-behavior levels, representing a reactive-proactive behavioral continuum, and (2) body condition. Three of six conclusions based on the optimal α approach were consistent with those reached using α = 0.05, and two were inconsistent. For the remaining one, there is insufficient evidence to conclude whether there is a large relationship between proactivity and latency to feed (Table 2), because the authors did not provide an exact p-value for the relationship between proactivity and feeding latency.
Table 2. α, β and a priori average of α and β for α = 0.05 and optimal α approaches at small (R² = 0.33), intermediate (R² = 0.5), and large (R² = 0.67) potential critical effect sizes, associated with null hypothesis tests for relationships between zebra finch proactivity and their feeding latency after food deprivation, and between zebra finch body condition and feeding latency (from 13). Also shown are the % reduction in the a priori average of α and β associated with using the optimal α instead of α = 0.05, and the test conclusions associated with the optimal α for each potential critical effect size. Test conclusions that differ from the conclusion reached using α = 0.05 are marked with an asterisk.

| Critical effect size | β at α = 0.05 | Average of α and β at α = 0.05 | Optimal α | β at optimal α | Average of α and β at optimal α | % reduction in average | Conclusion: proactivity | Conclusion: body condition |
|---|---|---|---|---|---|---|---|---|
| R² = 0.33 | 0.01 | 0.03 | 0.024 | 0.023 | 0.0235 | 21.7 | Significant | Significant |
| R² = 0.5 | 0.00002 | 0.02501 | 0.003 | 0.002 | 0.0025 | 90.0 | Significant | Non-significant* |
| R² = 0.67 | 1 × 10⁻¹¹ | 0.025 | 0.00007 | 0.00005 | 0.00006 | 99.8 | ?* | Non-significant* |
Niemelä et al. 14 tested whether the immune response of individual field crickets (Gryllus integer) is rank-correlated with boldness, growth rate and maturation time. The authors used a Spearman rank correlation approach with data from 46 individuals and tested for relationships between the strength of immune response and: (1) boldness, (2) growth rate and (3) maturation time. The optimal α approach resulted in a different conclusion for eight of the nine combinations of test and potential critical effect size (Table 3). There is statistical evidence only for a weak relationship (i.e. R² = 0.33) between maturation time and strength of immune response.
Table 3. α, β and a priori average of α and β for α = 0.05 and optimal α approaches at small (R² = 0.33), intermediate (R² = 0.5), and large (R² = 0.67) potential critical effect sizes, associated with null hypothesis tests for relationships between rank of field cricket non-boldness and their rank strength of immune response following implantation of a foreign body, between rank of field cricket growth rate and their rank strength of immune response, and between rank of field cricket maturation time and their rank strength of immune response (from 14). Also shown are the % reduction in the a priori average of α and β associated with using the optimal α instead of α = 0.05, and the test conclusions associated with the optimal α for each potential critical effect size. Test conclusions that differ from the conclusion reached using α = 0.05 are marked with an asterisk.

| Critical effect size | β at α = 0.05 | Average of α and β at α = 0.05 | Optimal α | β at optimal α | Average of α and β at optimal α | % reduction in average | Conclusion: boldness | Conclusion: growth rate | Conclusion: maturation time |
|---|---|---|---|---|---|---|---|---|---|
| R² = 0.33 | 0.009 | 0.0295 | 0.021 | 0.022 | 0.0215 | 27.1 | Non-significant* | Non-significant* | Significant |
| R² = 0.5 | 0.00006 | 0.02503 | 0.003 | 0.003 | 0.003 | 88.0 | Non-significant* | Non-significant* | Non-significant* |
| R² = 0.67 | 1 × 10⁻⁸ | 0.025 | 0.0001 | 0.0001 | 0.0001 | 99.6 | Non-significant* | Non-significant* | Non-significant* |
The optimal α approach improves inferences in tests of the pace-of-life syndrome hypothesis
It is difficult to predict the impact of using optimal α instead of traditional statistical thresholds for any individual study. In the first study, we found two cases in which results originally interpreted as non-significant were statistically significant. Based on its use of α = 0.05, the original study would most appropriately have concluded that only two of the three dependent variables showed evidence consistent with the pace-of-life hypothesis. Re-evaluation using optimal α levels suggests that all three dependent variables show responses consistent with the pace-of-life hypothesis at small and intermediate critical effect sizes. The remaining two studies were well-designed, powerful studies, and reinterpretation using optimal α showed that many of the tests found statistically significant using the traditional α = 0.05 should have resulted in non-significant conclusions. In fact, in the case of Niemelä et al. 14 we would have reached conclusions that are almost completely opposed to those reached by the authors. This demonstrates the dramatic impact that using the optimal α approach can have on statistical inference.
We suspect that authors are going to be more receptive to using optimal α when it results in a switch from non-significant to significant than the reverse. This is, in part, because intuition asks: if the probability of making an error in rejecting the null is <0.05, how can the correct decision be to fail to reject the null? For example, when we reinterpret Niemelä et al. 14, optimal α leads us to not reject the null while α = 0.05 led the authors to reject the null. If the null hypothesis is true, the probability that a relationship between boldness and immune response as large as the one observed in Niemelä et al. 14 occurred by chance is 0.036, and yet optimal α recommends that we fail to reject the null. How is it possible that it is a mistake to reject the null when it is so unlikely that the observed results could have occurred by chance? In this situation, Niemelä et al. 14 can only have made a Type I error (because they rejected the null), and using the traditional statistical threshold they have a 5% chance of making a mistake if the null is true. On the other hand, we can only make a Type II error using optimal α (because we have failed to reject the null), and there is only a 0.3% chance of having made a Type II error if a moderate size effect is considered to be biologically meaningful. This means that Niemelä et al. could conclude their result is significantly different from the null hypothesis with a relatively low chance of Type I error if the null hypothesis is really true (α = 0.05). However, the authors could, instead, conclude that there is insufficient evidence for a biologically meaningful effect with an even lower chance of Type II error if a biologically meaningful effect really does exist (β = 0.003). Ultimately, the correct course of action in this situation depends on the prior probabilities of the null and alternate hypotheses being true.
In the absence of other such prior probability information, we may invoke Laplace's principle of indifference and assume equal prior probabilities. Under this principle, Niemelä et al. are over sixteen times more likely to have made a wrong conclusion using α = 0.05 than we are by using an optimal α.
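The "over sixteen times" figure is the ratio of the two conditional error probabilities discussed above: α = 0.05 for rejecting the null, and β = 0.003 for failing to reject it at the intermediate critical effect size (Table 3). A minimal check follows. The function name is hypothetical and equal prior probabilities are assumed.

```python
def relative_error_risk(alpha_if_rejecting: float, beta_if_not_rejecting: float) -> float:
    """Ratio of the chance of a Type I error (for a rejection) to the chance
    of a Type II error (for a non-rejection), assuming equal prior
    probabilities for the null and alternate hypotheses."""
    return alpha_if_rejecting / beta_if_not_rejecting


# alpha = 0.05 versus beta = 0.003 at the optimal alpha for R squared = 0.5
ratio = relative_error_risk(0.05, 0.003)
print(f"{ratio:.1f}")  # 16.7
```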
A call for study-specific consideration of significance levels for null hypothesis tests in biology
The results of these examples imply that it is critically important that we choose the most appropriate method for selecting a statistical decision-making threshold. There is no doubt that the approach that minimizes the combined probabilities of making Type I or Type II errors is the optimal α approach. We are convinced that optimal α is always more appropriate than using a constant value for which there is no rationale other than “this is the way we have always done it”. In this application of optimal α, the averages of α and β associated with the optimal α approach were 10.3–99.8% smaller than the averages of Type I and Type II error resulting from the use of α = 0.05 (Tables 1–3). That is, the use of optimal α can lower the probability of making wrong conclusions. Relative costs of Type I versus Type II errors and relative prior probabilities of null and alternate hypotheses, if known, can also be incorporated into the calculation of optimal α levels and can lead to better decisions. However, relative costs of error and prior probabilities of hypotheses are often unknown in biological research, and assuming equal costs of Type I and Type II error and equal prior probabilities offers a reasonable default.
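One plausible way to write the weighted objective is sketched below. It is offered as an assumption for illustration and is not necessarily the formulation used in reference 10. The symbols are hypothetical: P(H_0) and P(H_A) are the prior probabilities of the null and alternate hypotheses, and C_I and C_II are the relative costs of Type I and Type II errors.

```latex
\omega(\alpha) =
  \frac{P(H_0)\, C_{\mathrm{I}}\, \alpha \;+\; P(H_A)\, C_{\mathrm{II}}\, \beta(\alpha)}
       {P(H_0)\, C_{\mathrm{I}} \;+\; P(H_A)\, C_{\mathrm{II}}},
\qquad
\alpha_{\mathrm{opt}} = \arg\min_{\alpha}\, \omega(\alpha)
```

With equal priors and equal costs, the weights cancel and ω(α) reduces to the simple average (α + β)/2 used throughout this paper.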
The test outcomes generated using the optimal α approach differ depending on the critical effect sizes considered to be biologically meaningful. As such, this paper highlights the importance of explicitly considering biologically meaningful critical effect sizes instead of making implicit and unexamined assumptions about them, as occurs when using α = 0.05. Given the ability to calculate optimal statistical decision-making thresholds, we see no reason for or benefit from continuing to rely on an arbitrary constant for interpreting the results of biological research.