A call for statistical pluralism answered

Authors

Philip A. Stephens, Department of Mathematics, University of Bristol, University Walk, Bristol BS8 1TW, UK (fax: 0117 9287999; e-mail: Philip.Stephens@bristol.ac.uk).

We welcome Lukacs et al.'s (2007) response to our paper calling for pluralism in inferential approaches. These are important issues, and we sought to clarify the strengths and weaknesses of two inferential approaches without denigrating either, while emphasizing that poor application of any statistical approach is a weak basis for disregarding it as a tool for science. Lukacs et al.'s (2007) contribution is helpful, clarifying the arguments in favour of information-theoretic (IT) approaches. The single-parameter example is useful and does much to illustrate the application of the approach. In general, we applaud statistical formalization of the method of multiple working hypotheses, as well as the focus on acknowledging model selection uncertainty, which we see as a principal advantage of that method.

In spite of our broad concurrence with Lukacs et al. (2007), it is unsurprising that areas of disagreement remain. Here, we focus on four. First, we question their apparent view that arguments regarding null hypothesis testing (NHT) and IT are widely understood, and that confusion over statistical methods is dissipating. Second, we believe that, whether or not it is the best method for a given problem, NHT can represent a far richer approach to analysis than that portrayed by Lukacs et al. (2007), and we clarify why this is the case. Third, we are concerned that, by denigrating the statistical theory underlying NHT as relatively weak, Lukacs et al. (2007) overstate the degree to which elements of their suggested IT algorithms are established and their performance known. Last, in disparaging exploratory data analysis (EDA), Lukacs et al. (2007) confuse different stages of scientific endeavour. We explain what we see as the purpose of EDA and its role in science.

Lukacs et al. (2007) take a positive view of statistics in ecology and see confusion over statistical methods ‘dissipating’. Although we would like to be equally optimistic, this is not yet our perception. Whittingham et al. (2006) surveyed articles published in 2004 in three leading ecological journals, and assessed approaches to multiple regression analyses (an analytical approach widely associated with the application of IT methods). They found that NHT approaches were used more than three times as often as approaches based on IT (Whittingham et al. 2006). More strikingly, they also found that, of the relatively small number of cases where IT approaches were used, the majority used IT as part of an automated, stepwise procedure. This is in sharp contrast to the recommendations of Burnham & Anderson (2002), and serves as a reminder that IT does not inherently motivate the rigorous development of biologically plausible candidate models. Hobbs & Hilborn (2006) assessed statistical methods used in literature published by the Ecological Society of America. From 1984 to 2003, they found little change in the frequency with which NHT methods had been used. In the same period, the number of articles that include the words ‘Bayesian’, ‘model selection’ or ‘likelihood’ in their text increased, but evidence of an upward trend since 1996 is lacking (Hobbs & Hilborn 2006). Our own assessment of recent issues of four ecological and evolutionary journals showed that, overall, NHT techniques were used in at least 90% of data-based papers, while IT techniques were used in less than 10% (Stephens et al. in press). Clearly, the widespread adoption of new methods, even those that have been vigorously promoted, takes time. Nevertheless, these data suggest that we must beware of complacency; statistical approaches remain a source of uncertainty and disagreement. Given prevailing practices among ecologists in a position to mentor students, novice practitioners of ecology, in particular, may be confused by the inferential options available to them.

Our second concern regarding Lukacs et al. (2007) relates to their characterization of the process of NHT. In our original paper, we argued both that null hypotheses should often be framed more imaginatively than ‘no effect’, and that the interpretation of NHT can be improved (Stephens et al. 2005). By caricaturing NHT as a process of using ‘an arbitrary α level and the resulting P-value’ to assess a default null (of ‘no effect’), Lukacs et al. (2007) overlooked a substantial part of our original arguments. Those arguments can be aptly restated with reference to their dipper Cinclus cinclus example. Using that example, Lukacs et al. (2007) show that IT can be used to compare evidence ratios for different models regarding the survival of the dipper in flood and non-flood years. They also demonstrate that uncertainty in model selection can be incorporated in parameter estimates using multimodel averaging. In spite of this, the primary outcome of the IT approach is the ability to state that, ‘Given the available data, a difference in survival probability having occurred between normal and flood years is 10·52 times more likely than no difference having occurred’.
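For readers unfamiliar with the mechanics behind such a statement, the sketch below shows how an evidence ratio of this kind is conventionally obtained from AIC differences via Akaike weights (Burnham & Anderson 2002). The two AIC values are hypothetical placeholders of our own, chosen only so that the ratio lands near 10·52; they are not the values underlying Lukacs et al.'s (2007) dipper analysis.

```python
import math

# Hypothetical AIC values for two competing dipper survival models
# (placeholders only; not the values reported by Lukacs et al. 2007).
aic = {
    "flood_effect": 120.00,  # survival differs between flood and normal years
    "no_effect": 124.71,     # survival identical across years
}

best = min(aic.values())
delta = {m: a - best for m, a in aic.items()}              # Delta_i = AIC_i - AIC_min
rel_lik = {m: math.exp(-d / 2) for m, d in delta.items()}  # relative model likelihoods
total = sum(rel_lik.values())
weight = {m: r / total for m, r in rel_lik.items()}        # Akaike weights w_i

# Evidence ratio of the flood-effect model over the no-effect model,
# w_1 / w_2, which reduces to exp(Delta_2 / 2) when model 1 is the best model.
er = weight["flood_effect"] / weight["no_effect"]
print(f"evidence ratio = {er:.2f}")  # approximately 10.5 here
```

Because the ratio depends only on the difference in AIC, a statement such as '10·52 times more likely' concerns relative support within the candidate set, not the absolute adequacy of either model.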

We concede that the authors may have used a highly simplistic example for illustrative purposes. However, given the authors’ recurrent criticisms of trivial or implausible nulls (e.g. Anderson et al. 2000), demonstrating the application of IT with a question framed as a ‘silly null’ seems counter-productive. By contrast, in our paper (Stephens et al. 2005), we argued for a stronger emphasis on critical forethought. In particular, we argued that the null should be framed in terms of a predetermined, consequential difference, and that the choice of α-level should be motivated by considerations of the trade-off between the consequences of Type I and Type II errors for the particular situation. These considerations should take place prior to analysis, and must be informed by the biology of the situation and the context of the analysis (e.g. management or research). Such a process would permit stronger inferences regarding the biological significance of the data, rather than only their implications for the statistical models considered.
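To make that trade-off concrete, the sketch below tabulates how the Type II error rate (β) falls as the α-level is relaxed, assuming a one-sided one-sample z-test with known standard deviation. The consequential difference, standard deviation and sample size are hypothetical values of our own, not taken from any study discussed here.

```python
from scipy.stats import norm

# Hypothetical design (all numbers illustrative): the smallest difference in
# survival probability judged biologically consequential is 0.1, the
# anticipated standard deviation is 0.25, and n = 40 individuals are marked.
delta, sigma, n = 0.1, 0.25, 40
shift = delta * n**0.5 / sigma  # standardized magnitude of the true effect

# For a one-sided z-test, beta = Phi(z_{1 - alpha} - shift).
for alpha in (0.01, 0.05, 0.10, 0.20):
    beta = norm.cdf(norm.ppf(1 - alpha) - shift)
    print(f"alpha = {alpha:.2f} -> beta = {beta:.2f} (power = {1 - beta:.2f})")
```

Seen this way, the choice of α is anything but arbitrary: an investigator who regards a missed consequential effect as more costly than a false alarm might deliberately accept a larger α in order to shrink β.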

Some (e.g. Link & Barker 2006) have interpreted our original paper (Stephens et al. 2005) to suggest that we are resistant to model averaging. This is not the case. Rather, we believe that model averaging is an important and valuable process, but one that requires further investigation before it can be applied with confidence in every area. Until its behaviour in widespread application is well understood, we would be cautious about promoting any one approach with too much zeal. Related to this, our third concern arises from Lukacs et al.'s (2007) assertion that 'the theory underlying NHT is weak'. The implication is that the theory underlying the inferential procedures that they recommend (model formulation, model selection and model averaging) is strong and well established. Other sources, by contrast, suggest that certain aspects of the model comparison process are, as yet, relatively poorly understood. For example, Akaike's information criterion (AIC) is only one of several information criteria that could be used for model selection, and statisticians remain divided over which criterion is most suitable in a given set of circumstances (Kass & Raftery 1995; Guthery et al. 2005; Link & Barker 2006). Where AIC is used, rules of thumb for interpreting Δi (differences in AIC values for competing models) are unclear in some cases (Burnham & Anderson 2002: 71), and the performance of AICc (AIC corrected for small sample sizes) is also debatable (Richards 2005). In addition, more work on when and how to apply AIC-derived model weights to model averaging is likely to be necessary before that technique can be used routinely (Buckland et al. 1997; Burnham & Anderson 2002: 152–3; Richards 2005). One area ripe for rapid improvement is the development of clear, consistent approaches to describing AIC-based algorithms. For example, Lukacs et al. (2007) appear to use 'model weight', 'model probability' and 'model likelihood' synonymously. Although there may be valid reasons for doing so, such varied terminology is likely to contribute to confusion surrounding emerging methods.
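To fix ideas and terminology, the sketch below computes the quantities at issue using the standard formulas of Burnham & Anderson (2002): AICc, Δi, the relative model likelihoods exp(−Δi/2), the Akaike weights obtained by normalizing those likelihoods, and a model-averaged parameter estimate. The candidate set, its log-likelihoods and its per-model estimates are hypothetical and purely illustrative.

```python
import math

# Hypothetical candidate set (illustrative numbers only): maximized
# log-likelihoods, parameter counts k, and per-model estimates of theta.
models = [
    {"name": "M1", "loglik": -58.2, "k": 2, "theta": 0.61},
    {"name": "M2", "loglik": -57.9, "k": 3, "theta": 0.56},
    {"name": "M3", "loglik": -57.8, "k": 4, "theta": 0.52},
]
n = 50  # sample size

for m in models:
    aic = -2 * m["loglik"] + 2 * m["k"]
    # Small-sample correction: AICc = AIC + 2k(k + 1)/(n - k - 1).
    m["aicc"] = aic + 2 * m["k"] * (m["k"] + 1) / (n - m["k"] - 1)

best = min(m["aicc"] for m in models)
for m in models:
    m["delta"] = m["aicc"] - best             # Delta_i
    m["rel_lik"] = math.exp(-m["delta"] / 2)  # relative model likelihood

total = sum(m["rel_lik"] for m in models)
for m in models:
    m["weight"] = m["rel_lik"] / total        # Akaike weight w_i

# Model-averaged estimate: theta_bar = sum_i w_i * theta_hat_i.
theta_bar = sum(m["weight"] * m["theta"] for m in models)
print(f"model-averaged theta = {theta_bar:.3f}")
```

Note that what the code labels a relative model likelihood becomes a 'weight' only after normalization over the candidate set; keeping such terms distinct is precisely the kind of consistency we advocate.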

Finally, we find it curious that Lukacs et al. (2007) regard EDA as a 'risky method for developing scientific hypotheses' and imply that rigorous application of EDA fails to involve focusing 'substantial mental effort to derive a set of plausible scientific hypotheses'. We suggest that Lukacs et al. (2007) conflate the use of EDA as an aid to devising new hypotheses with its use as a tool to test them. EDA cannot be used to test hypotheses. At its most basic level, however, science is the process of observing patterns, developing hypotheses to explain those patterns, and testing those hypotheses. We contend that EDA plays a strong role in pattern recognition and, hence, is an important tool of hypothesis generation. As Sir Peter Medawar observed, the source of scientific hypotheses is the human imagination and all that aids it (e.g. Medawar 1996). EDA is one of the powerful aids to our scientific imagination, and curtailing its use would be counterproductive and stifling. Science is an iterative process, in which existing models can often be improved upon, and EDA often indicates how a model might be modified for further testing. Of course, the results of a priori analyses and the hypotheses that result from EDA must be kept separate. Indeed, most scientific publications report in their 'Results' the outcome of a priori analyses (often with the use of very simple descriptive statistics) and, in their 'Discussion', pose the hypotheses generated by EDA. Often, EDA forms the observational foundation for the 'hard thinking' advocated by Lukacs et al. (2007).

In his paper on the method of multiple working hypotheses, T. C. Chamberlin supported a diverse approach to tackling scientific questions. Of the working hypothesis approach, he stated, 'This has been affirmed to be the scientific method. But it is rash to assume that any method is the method, or at least that it is the ultimate method' (italics in the original; Chamberlin 1890, in Hilborn & Mangel 1997: 286). Our original paper also called for the acceptance of different approaches, recognizing that inferential statistics are tools, and that no current inferential toolbox holds all the tools needed by ecologists. We are pleased that, in spite of their strong preference for IT methods, Lukacs et al. (2007) seem to have endorsed pluralism. We hope that this exchange will stimulate discussion among ecologists and statisticians about how inference is best conducted, and help to reduce the confusion that remains within the field of ecological statistics.