Introduction
 Top of page
 Summary
 Introduction
 Problems with multiple regression
 Current use of stepwise regression
 Example
 Discussion
 Acknowledgements
 References
 Supporting Information
In the face of complexity, ecologists often strive to identify models that capture the essence of a system, explaining the observed distribution and perhaps ultimately permitting prediction. A first step toward this aim is to collect data on the response of interest, together with data on factors that it is believed might influence that response. Frequently data are observational (i.e. the variance in the data set has not been generated by experimental manipulation) leading to difficulties in determining which causal factor or factors best explain the observed responses. In these situations, scientific possibility is limited to describing the system and identifying models consistent with the observed phenomenon. One of the most commonly used techniques for this purpose is multiple regression or, more generally, a general linear model with multiple predictors. The statistical theory underlying this methodology is well understood (e.g. Draper & Smith 1981; McCullagh & Nelder 1989), as are the assumptions and limitations of the approach (e.g. Derksen & Keselman 1992; Burnham & Anderson 2002).
Although the scientific primacy of a principle of parsimony is without clear support (Guthery et al. 2005), it is usually the case that models with fewer variables also contain fewer nuisance variables and have greater generality (Ginzburg & Jensen 2004). For that reason, research is usually directed towards identifying a relatively parsimonious model that is in general agreement with observed data. A suite of model simplification techniques has been developed, and the notion of a minimum adequate model (MAM) has become commonplace in ecology. A MAM is defined as the model that contains the minimum number of predictors that satisfy some criterion, for example, the model that only contains predictors that are significant at some prespecified probability level. Finding such a model is not straightforward, and most statistical packages offer algorithms for model selection in multiple regression. These include algorithms that operate by successive addition of significant terms or removal of non-significant terms (forward selection and backward elimination, respectively), and those that operate by forward selection but also check the previous term to see if it can now be eliminated (stepwise regression). Collectively, these algorithms are usually referred to as stepwise multiple regression.
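To make the mechanics of these algorithms concrete, the following is a minimal sketch of backward elimination, written only to illustrate the procedure that this paper criticizes. The function names, the significance threshold and the toy P-values are all assumptions introduced for illustration; in particular, `toy_fit` is a hypothetical stand-in for a real regression fit and is not part of any statistical package.

```python
from typing import Callable, Dict, List, Sequence


def backward_eliminate(
    predictors: Sequence[str],
    fit_pvalues: Callable[[List[str]], Dict[str, float]],
    alpha: float = 0.05,
) -> List[str]:
    """Drop the least significant term until all remaining terms have P < alpha."""
    current = list(predictors)
    while current:
        pvalues = fit_pvalues(current)  # refit using only the current terms
        worst = max(current, key=lambda name: pvalues[name])
        if pvalues[worst] < alpha:
            break  # every remaining term is significant: stop
        current.remove(worst)  # remove the weakest term, then refit
    return current


# Hypothetical stand-in for a model fit. The P-values are made up and, unlike
# a real fit, do not change when other terms are added or removed.
def toy_fit(terms: List[str]) -> Dict[str, float]:
    made_up = {"hedge": 0.03, "ditch": 0.001, "road": 0.40, "beans": 0.07}
    return {term: made_up[term] for term in terms}


selected = backward_eliminate(["hedge", "ditch", "road", "beans"], toy_fit)
print(selected)
```

In a real analysis the P-values would shift at every refit, which is one reason why forward, backward and stepwise variants can end at different models.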
In spite of wide recognition of the limitations of stepwise multiple regression (Hurvich & Tsai 1990; Steyerberg et al. 1999; Grafen & Hails 2002; Wintle et al. 2003; Johnson et al. 2004; Stephens et al. 2005), use of the technique in ecology remains widespread (see further below for a review of applications in major journals). In particular, three problems with the approach are frequently overlooked in ecological analyses, all of which may lead to erroneous conclusions and, potentially, misdirected research. These include bias in parameter estimation, inconsistencies among model selection algorithms, and an inappropriate focus or reliance on a single best model, where data are often inadequate to justify such confidence.
In this paper, we give a brief review of the major problems with stepwise multiple regression and we analyse how frequently the technique is used in leading ecological and behavioural journals. We present an example of how focusing on a single model may lead to difficulties of interpretation. Finally, we discuss the problems of analysing and modelling data from complex multivariable ecological data sets.
Current use of stepwise regression
Recognition of all of the problems outlined above is not widespread among ecologists. Recent publications have drawn attention to the problems of bias arising from variable selection on the basis of statistical significance (e.g. Anderson, Burnham & Thompson 2000; Burnham & Anderson 2002) and, as a result, alternative model selection protocols are increasingly used. In particular, use of information theoretic (IT) model selection based on Akaike's Information Criterion (AIC, see further below) has increased substantially over recent years (Johnson & Omland 2004; Rushton, Ormerod & Kerby 2004; Guthery et al. 2005). In spite of this, two of the central messages of Burnham and Anderson (e.g. Burnham & Anderson 2002) have been widely overlooked. These are that models representing different hypotheses should be compared in their entirety, rather than through automated selection procedures, and that further analysis should not be based on a single best model, but should explicitly acknowledge uncertainty among models that are similarly consistent with the data. That these points have been overlooked means that even where authors have used IT model selection, they have often retained the use of stepwise procedures, and based inference on a single best model. Some authors have attempted to overcome some of the limitations of stepwise procedures by checking for consistency between stepwise algorithms (e.g. Post 2005), but this approach is seldom explicit.
To assess the prevalence of different stepwise approaches in current literature, MJW reviewed 508 papers published in 2004 in three leading journals: Journal of Applied Ecology, Animal Behaviour and Ecology Letters. In all cases in which a multiple regression approach (excluding ordination techniques) was used, the analytical approach was identified as stepwise or other. Among papers employing stepwise techniques, studies were further subdivided into those that used least squares approaches and those that used IT techniques. Multi-predictor regression analyses that did not use stepwise techniques were divided among those that based inference on a global model (i.e. inferences were drawn with all predictors present), and those that used other techniques (typically IT-AIC) to determine a set of well-supported models for inference.
Results of this analysis are presented in Table 1. Overall, 65 papers used a multiple regression approach, of which 57% used a stepwise procedure; there was no statistically significant difference among the three journals in the proportion of studies using stepwise regression (χ^{2} = 0·145, P = 0·98). Of the studies that used stepwise procedures, six of 37 (16%) used IT-AIC, while the remainder used least squares techniques.
Table 1. Proportion of studies from a range of primary ecological and behavioural journals (all issues in 2004 included in this analysis) that used stepwise multiple regression for at least one component of their study. Studies using two-way ANOVA (or similar) for replicated experiments are not included, as they are not really multivariate analyses that would require this approach (see Discussion)

| Journal | % of studies using stepwise regression | No. of papers published by journal in 2004 | Ratio of predictors to sample size for analyses using stepwise regression (no. of cases on which this is based in parentheses) | Alternative approaches |
|---|---|---|---|---|
| Journal of Applied Ecology | 52% (12/23)* | 88 | 24 (8) | 7 studies fitted full model, 1 used hierarchical partitioning and 3 used an IT approach. |
| Ecology Letters | 58% (7/12) | 139 | 66 (3) | 4 studies fitted full model, 1 used an IT approach. |
| Animal Behaviour | 60% (18/30) | 281 | 9 (6) | All 12 studies fitted full model. |
Example
As an empirical example of the problems of using stepwise multiple regression, we reanalysed a published data set, collected to determine which factors influence the occurrence of yellowhammers Emberiza citrinella L. on lowland farms in the UK (Bradbury et al. 2000; see the accompanying electronic supplement for further details of the data and the analytical methods). Previous analyses were conducted using least squares stepwise regression (Bradbury et al. 2000). Here we were primarily interested in the limitations of using a single best model for inference, rather than in the limitations of the stepwise approach (which are well established, see above).
We fitted models to our data set using least squares procedures (e.g. procedure ‘lm’ in ‘R’) and compared them using AIC. AIC is a likelihood-based measure of model fit that accounts for the number of parameters estimated in a model (i.e. models with large numbers of parameters are penalized more heavily than those with smaller numbers of parameters), such that the model with the lowest AIC has the ‘best’ relative fit, given the number of parameters included (Akaike 1974).
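For reference, the standard definition of AIC can be written out as follows. The symbols are not used in the text above and are introduced here as assumptions: \hat{L} is the maximized likelihood of the model, k is the number of estimated parameters (including the error variance), n is the number of observations and RSS is the residual sum of squares.

```latex
\mathrm{AIC} = -2\ln\hat{L} + 2k

% For a least squares fit with normally distributed errors:
\mathrm{AIC} = n\ln\!\left(\frac{\mathrm{RSS}}{n}\right) + 2k + \text{constant}
```

The constant depends only on n, so it cancels when models fitted to the same data are compared; only differences in AIC between models carry information.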
The IT methodology developed by Burnham & Anderson (2002) is designed to conduct a comparative model fit analysis for a group of competing models. Specifically, for each model a likelihood weight (for model i termed w_{i}) is calculated. This value has a simple interpretation: it is the probability that, of the set of models considered, model i would be the AIC-best model, were the data collected again under identical circumstances. For a set of models the likelihood weights sum to one.
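The calculation of these weights from a set of AIC values is short enough to sketch. The version below follows the usual Akaike-weight formula, in which each model's weight is proportional to exp(−Δ_i/2), where Δ_i is the difference between its AIC and the lowest AIC in the set. The function name and the AIC values are made up for illustration and are not taken from the yellowhammer analysis.

```python
import math
from typing import List, Sequence


def akaike_weights(aic_values: Sequence[float]) -> List[float]:
    """Convert AIC values into likelihood weights that sum to one."""
    best = min(aic_values)
    # Relative likelihood of each model, given its AIC difference from the best.
    relative = [math.exp(-0.5 * (aic - best)) for aic in aic_values]
    total = sum(relative)
    return [value / total for value in relative]


aics = [100.0, 101.0, 104.0]  # made-up AIC values for three candidate models
weights = akaike_weights(aics)
print([round(w, 3) for w in weights])
```

Because only differences in AIC enter the formula, adding the same constant to every AIC value leaves the weights unchanged.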
For a data set in which there is a clear ‘best’ model, one model would have a very high likelihood weight, and all other models would have very low weights. On the other hand, if all the models are poor, or if most have similar fit, then a number of models will share a similarly low probability. If there is no single model that clearly outperforms all others, the IT methodology may be used to perform model averaging, in which the parameter estimates of all models are combined, the contribution of each model being proportional to its likelihood weight. By contrast, stepwise methodology would identify a single model as pre-eminent, encouraging all further interpretation to be based on that model alone, ignoring the other models with similar fit to the data.
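A minimal sketch of model averaging is given below. It makes one assumption that the text does not specify: a predictor that is absent from a model is treated as having an estimate of zero in that model (Burnham & Anderson also describe averaging only over the models that contain the predictor). The function name, predictor names, estimates and weights are all hypothetical.

```python
from typing import Dict, List, Sequence


def model_average(
    models: Sequence[Dict[str, float]], weights: Sequence[float]
) -> Dict[str, float]:
    """Weight each model's parameter estimates by its likelihood weight.

    Assumption: a predictor missing from a model contributes an estimate of 0.
    """
    names = sorted({name for model in models for name in model})
    return {
        name: sum(w * model.get(name, 0.0) for model, w in zip(models, weights))
        for name in names
    }


# Two made-up models with made-up weights that sum to one.
models: List[Dict[str, float]] = [
    {"margin": 2.0, "ditch": 1.0},
    {"margin": 1.0},
]
model_weights = [0.6, 0.4]
averaged = model_average(models, model_weights)
print(averaged)
```

Under this convention, a predictor that appears only in poorly supported models is shrunk towards zero rather than being dropped outright.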
For the yellowhammer data set, there were nine predictors, and we fitted all possible subsets of these predictors. For each model we generated a likelihood weight, and we ranked all models from best fitting to worst fitting on the basis of AIC values. We plotted summed likelihood weights against model rank (Fig. 2). These plots are effectively cumulative probability plots, with the summed probability measuring the probability that the cumulative set of models would include the AIC-best model were the data recollected. The set of models needed to reach a given cumulative probability level (e.g. 95%) is sometimes termed a confidence set.
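The two steps described above, enumerating all subsets and accumulating ranked weights into a confidence set, can be sketched as follows. The predictor labels and the example weights are placeholders, and counting the model with no predictors as one of the subsets is an assumption; the text does not say whether it was included.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def all_subsets(predictors: Sequence[str]) -> List[Tuple[str, ...]]:
    """Every subset of the predictors, including the empty (intercept-only) one."""
    return [
        subset
        for size in range(len(predictors) + 1)
        for subset in combinations(predictors, size)
    ]


def confidence_set_size(weights: Sequence[float], level: float = 0.95) -> int:
    """Number of top-ranked models needed for summed weights to reach `level`."""
    total = 0.0
    for count, weight in enumerate(sorted(weights, reverse=True), start=1):
        total += weight
        if total >= level:
            return count
    return len(weights)


predictors = [f"x{i}" for i in range(1, 10)]  # nine placeholder predictors
subsets = all_subsets(predictors)
print(len(subsets))  # 2**9 = 512 candidate models

example_weights = [0.5, 0.3, 0.1, 0.06, 0.04]  # made-up weights
print(confidence_set_size(example_weights))
```

A confidence set containing many models, as in the yellowhammer example, indicates that the data cannot distinguish clearly among the candidates.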
The yellowhammer data set was collected over 4 years. We analysed the data separately for each year, and for all years combined. The data from the 4 years analysed separately failed to yield a model that, in terms of likelihood weights, was clearly better than the alternative models (Fig. 2a,b). For instance, in Fig. 2(a) the 4 years of study required 77, 114, 172 and 159 models to yield a summed probability of 0·95. The implication is, therefore, that any one of a large number of models could have been selected as the best-fitting model in each year. The best-fitting model is, in a sense, a random draw from this set of similarly well-supported models. This interpretation is backed up by Table 2, which shows the minimum adequate models selected for the four separate years. The models selected are highly variable from year to year, with no variable selected in all 4 years.
Table 2. Minimum adequate models constructed to explain the distribution of yellowhammers in four separate years. Data were collected from a variable number of farms in each year and these are indicated in brackets after each year

| Predictor | 1994 (5) | 1995 (5) | 1996 (8) | 1997 (9) | 1994–97 | IT Selection probability† |
|---|---|---|---|---|---|---|
| Hedge presence | * | ** | | | P = 0·058 | 0·73 |
| Treeline presence | | | * | * | *** | 0·67 |
| Ditch presence | ** | * | | * | *** | 1·00 |
| Road adjacent | * | | | | * | 0·61 |
| Width of margin | *** | * | *** | | *** | 1·00 |
| Pasture adjacent | ** | | * | *** | *** | 1·00 |
| Silage ley adjacent | | | | | | 0·48 |
| Winter rape | | | | | | 0·64 |
| Beans adjacent | | * | | | | 0·37 |
| n | 185 | 185 | 347 | 387 | 1103 | |
| Ratio of sample size to predictors | 21 | 21 | 32 | 35 | 123 | |
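The footnote defining the IT selection probability in Table 2 is not reproduced here. One common way to compute such a quantity, assumed in the sketch below rather than confirmed by the text, is to sum the likelihood weights of every model that contains a given predictor. The function name, models and weights are made up for illustration.

```python
from typing import Dict, Sequence, Tuple


def selection_probabilities(
    models: Sequence[Tuple[str, ...]], weights: Sequence[float]
) -> Dict[str, float]:
    """Sum the weights of all models in which each predictor appears."""
    names = sorted({name for model in models for name in model})
    return {
        name: sum(w for model, w in zip(models, weights) if name in model)
        for name in names
    }


# Three made-up candidate models and their made-up likelihood weights.
candidate_models = [("ditch", "margin"), ("ditch",), ("margin", "road")]
candidate_weights = [0.5, 0.3, 0.2]
probabilities = selection_probabilities(candidate_models, candidate_weights)
print(probabilities)
```

On this definition, a value near 1·00 means that a predictor appears in essentially every well-supported model, whereas intermediate values reflect predictors whose inclusion is equivocal.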
The analysis of the combined data set yielded a smaller set of credible models, with only 42 models required to reach a probability of 0·95. However, this is still too large a number to be able to base all inference and conclusions on one model with any confidence. The MAM for this data set includes most of the variables found to be significant in the analysis of the single years. However, the likelihood weight for this model was only 0·028; it was not the AIC-best model, which itself had an AIC weight of only 0·048. Either of these models would be a poor one on which to base inference.
Discussion
Biases and shortcomings of stepwise multiple regression are well established. Surprisingly, however, we found that of recent papers in three leading ecological and behavioural journals, approximately half of those that employed multiple regression did so using a stepwise procedure (Table 1). Our example, using detailed data on yellowhammer habitat selection, highlights the dangers of this approach. In particular, although the yellowhammer field study was conducted on a large scale, a single year's data were clearly insufficient to identify a single best model to explain yellowhammer territory occupancy, or even a small number of similarly well-supported models for that purpose. Even with 4 years’ data, representing a comprehensive autecological study, as many as 42 models provided similarly good explanations of the observed data. To select a single MAM from this set without acknowledging the considerable uncertainty that remains would be entirely misleading. A full model approach (i.e. including all predictors and all 4 years’ data) gives, in this case, a very similar result to one derived using the IT methodology (see Table 2). This reinforces the point that conclusions based on data collected in any one year may be erroneous.
Multiple regression is a widely used statistical method within ecology, with 13% of the papers we reviewed using this method. It was notable that within two of the journals sampled (Animal Behaviour and Ecology Letters) only between 8 and 9% of studies used a multiple regression approach, whereas in Journal of Applied Ecology 26% (23 of 88) used such an approach. The problems we report are therefore likely to be more widespread within landscape studies (which tend to collect large numbers of potentially explanatory factors) than in studies with more restricted experimental designs (e.g. laboratory experiments that are common within behavioural science).
As with our example, it is likely that many studies employing stepwise procedures conceal much uncertainty when selecting a single MAM. Ecological data sets usually include a set of predictors with a tapered distribution of effect sizes (Burnham & Anderson 2002), and almost all analyses will therefore contain equivocal variables close to statistical significance. Estimated effects are likely to range from strong, through intermediate and weak, to zero. For predictors with zero or weak effects, MAMs are likely to yield biased estimates of parameters (e.g. Fig. 1) and a high Type I error rate. Furthermore, when correlations exist between the predictors, different combinations of predictors may yield models with similar explanatory power (e.g. Grafen & Hails 2002). The methodology underlying MAMs is generally not designed to analyse marginal effects.
Instead of using stepwise procedures, two analyses are arguably valid: a full model including all effects, or an analysis using IT-AIC methods (the approach that we demonstrated here). The full model tests a single set of hypotheses on a single model. The expected parameter estimates are unbiased (e.g. Fig. 1), and the statistical properties of the generalized linear model are well understood (e.g. McCullagh & Nelder 1989). If the main aim of the study in question were to analyse whether each of the predictors affected the distribution of birds, and whether the effects were consistent between years, this analysis should be entirely justifiable.
The downsides of using the full model for analysis and inference are that: (1) the model may not be the ‘best’ model for the data in question, as other models may fit the data equally well; (2) if we wished to use the model predictively, it includes variables that are non-significant; (3) the analysis would rely on null hypothesis testing. The first argument is not relevant to comparisons of the effects of different predictors. The reason why this model may not be the best model is precisely that it includes predictors that are non-significant. The analysis is designed to reveal those predictors that are significant, and those that are not. Hence we would not expect this model to be the best model.
The second problem is that a full model will contain estimates for all parameters, irrespective of whether they are statistically significant or not. This can generate an excess of noise, resulting in a model that is unsatisfactory for prediction. By contrast, techniques exist for multimodel parameter estimation, particularly within the IT framework (e.g. Burnham & Anderson 2002). This approach allows model uncertainty to be measured at the same time as parameter uncertainty to assess the likely bias in parameters resulting from selection. The advantage of using this approach for prediction, rather than the full model, is that the contribution of each predictor (in making predictions) is determined by its performance across the whole suite of models.
The third problem with basing inference on the global model arises where tests of individual parameters (designed to determine how important they are) are conducted using null hypothesis testing (NHT). NHT has been the focus of much criticism in recent decades (e.g. Carver 1978; Cohen 1994; Johnson 1999; Anderson et al. 2000). In particular, two problems of NHT apply directly to the issue of parameter testing within the global model. First, NHT is essentially binary in nature; either the tested parameter is (statistically) ‘significant’ or it is not. Wherever the threshold for significance is drawn, this can lead to dramatic differences in inference arising from very small differences in the data set. For example, consider a threshold for significance drawn at P = 0·05. Imagine that our estimate for a parameter coefficient, β, was 2·5, with a 95% confidence interval of −0·1 < β < 5·1. Here, we would reject the estimate of β and assume that β = 0 was a more reliable estimate. However, if the estimate of β was the same but with a confidence interval of 0·1 < β < 4·9, then we would accept that β = 2·5. The second problem of NHT that applies to analyses of the global model is that, assuming we have reason to include the variable of interest in the model, a null hypothesis of ‘no effect’ (representing a coefficient estimate of β = 0) is a ‘silly null’. Indeed, in the previous example, an estimate of β = 5·0 is as plausible as an estimate of β = 0·0, and is arguably more plausible, given that we had a priori reasons to believe that the tested parameter should be important.
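The binary nature of this decision can be shown in a few lines. The sketch below uses the two confidence intervals from the example above and relies on the standard correspondence between a 95% confidence interval that excludes zero and significance at P = 0·05 for the matching two-sided test. The function name is an assumption made for illustration.

```python
def is_significant(lower: float, upper: float) -> bool:
    """A parameter is 'significant' if its confidence interval excludes zero."""
    return not (lower <= 0.0 <= upper)


# Same point estimate (2.5) in both cases; only the interval width differs.
first = is_significant(-0.1, 5.1)
second = is_significant(0.1, 4.9)
print(first, second)
```

A small change in the width of the interval reverses the verdict, even though the point estimate and most of the range of plausible values are the same in both cases.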
The full model is appropriate if the data are taken from an experiment (Burnham & Anderson 2002). This is because an experiment will be designed in order to examine all main effects as well as, potentially, some of the interactions. In this case the parameter estimates for one variable should be unaffected by the inclusion (or otherwise) of other factors.
Stepwise regression is most likely to lead to problems when it is used for data mining exercises. For example, it is common within landscape ecology studies for large numbers of predictors to be collected that are potentially associated with a particular organism or group of organisms. This is often the case when the underlying ecology of an organism is poorly known. Such studies sometimes use MAMs to reduce the list of predictors to a manageable number. As we have shown, the MAM approach will lead to errors for such data sets.
In our IT analysis we considered all possible subsets of the nine predictors. This might be considered a large number of competing models. The key issue with the data set we explored here (and another discussed elsewhere by Whittingham et al. 2005) is that the variables included in the analysis represent a small proportion of the possible variables that could have been included. This subset was selected on the basis of a priori considerations (i.e. with reference to the known ecology of yellowhammers and similar farmland birds). Consequently, the analysis is not a ‘shotgun’ attempt to find significant variables, but is more precisely testing the relative effects of a realistic set of candidate predictors (a form of magnitude of effects estimation, sensu Guthery et al. 2005). That this set is large is a typical problem in ecological analyses.
We have dealt in this paper with problems in formal model selection. However, a great deal of selection occurs informally in exploratory data analysis. For example, researchers may conduct preliminary analyses to reduce the set of predictors examined and reported in publications, or may use statistical tests in the exploratory phase to guide them towards the final model. This part of the analytical process is generally not reported; however, it is clear that a great deal of selection may occur prior to the final output. Such an approach (termed ‘data-dredging’ by Burnham & Anderson 2002) may suffer from all of the limitations we have outlined above, although it is less straightforward to recognize or correct. It cannot be stressed enough how important it is either to specify hypotheses a priori, or to describe in detail how the final reported analysis was determined.
In summary, we have demonstrated that use of stepwise multiple regression is widespread within ecology and some areas of behavioural science. We have outlined the three main weaknesses of this technique (namely: bias in parameter estimation, inconsistencies among model selection algorithms, and an inappropriate focus or reliance on a single best model) and shown with a worked example how erroneous conclusions can be drawn. We suggest that use of stepwise multiple regression is bad practice. Ecologists and behavioural scientists should make use of alternative (e.g. IT) methods or, where appropriate, should fit a full model (i.e. one containing all predictors). Full (or global) models are unlikely to be well suited for prediction, however, and we recommend multimodel averaging techniques where prediction is the desired end.