Abstract
Information theoretic approaches and model averaging are increasing in popularity, but these methods can be difficult to apply to the realistic, complex models that typify many ecological and evolutionary analyses. This is especially true for those researchers without a formal background in information theory. Here, we highlight a number of practical obstacles to model averaging complex models. Although not meant to be an exhaustive review, we identify several important issues with tentative solutions where they exist (e.g. dealing with collinearity amongst predictors; how to compute model-averaged parameters) and highlight areas for future research where solutions are not clear (e.g. when to use random intercepts or slopes; which information criteria to use when random factors are involved). We also provide a worked example of a mixed model analysis of inbreeding depression in a wild population. By providing an overview of these issues, we hope that this approach will become more accessible to those investigating any process where multiple variables impact an evolutionary or ecological response.
Introduction
There has been a recent and significant change in the way that ecologists and evolutionary biologists analyse and draw biological inferences from their data. As an alternative to traditional null hypothesis testing (sometimes referred to as the ‘frequentist’ approach), an information theoretic or ‘IT’ approach examines several competing hypotheses simultaneously to identify the best set of models (i.e. hypotheses) via information criteria such as Akaike’s information criterion (Burnham & Anderson, 1998, 2002; Anderson et al., 2000). In addition, the IT approach makes inferences based on weighted support from several models, i.e. model averaging (detailed below).
The IT approach, and specifically model averaging, has numerous advantages over traditional hypothesis testing of a single null model where support is measured by an arbitrary probability threshold. Instead, similar to Bayesian approaches, several models can be ranked and weighted to provide a quantitative measure of relative support for each competing hypothesis. In cases where two or more models achieve similarly high levels of support, model averaging of this ‘top model set’ can provide a robust means of obtaining parameter estimates (both point and uncertainty estimates) and making predictions (Burnham & Anderson, 2002). By comparison, more traditional approaches such as stepwise methods, although also resulting in a final model, completely ignore model uncertainty (e.g. Whittingham et al., 2006). Starting with a strong base in the field of wildlife management and mark-recapture studies to estimate population abundance and survival probabilities (Lebreton et al., 1992; Schwarz & Seber, 1999), the IT approach is now being used in many areas of ecology and evolution including landscape ecology, behavioural ecology, life history evolution, phylogenetics and population genetics (Johnson & Omland, 2004; Carstens et al., 2009). Although many biologists agree with the principles behind using this approach, the ways and means of applying a multimodel procedure and model averaging to various types of biological problems are still in their infancy (see also Richards, 2005).
Meanwhile, linear mixed-effects modelling and its extension to generalized linear mixed-effects models (GLMMs) are now used widely in ecology and evolutionary biology (Paterson & Lello, 2003; Bolker et al., 2009). GLMMs are extremely useful as they permit the fitting of random effects as well as fixed effects to complex and realistic hierarchical biological systems, while simultaneously dealing with non-normal response variables (such as binary and count data). The recent popularity of GLMMs is not surprising, as they are an overarching statistical tool that encompasses older tools such as t-tests, ANOVA, ANCOVA and generalized linear models (GLMs), and indeed many of the issues we discuss herein can be applied to other modelling approaches. Unfortunately, the handling of random effects in the IT environment, especially when model averaging is employed, is not straightforward, as the best method of estimating the Akaike Information Criterion (AIC) (see Box 1) when random effects are included is unclear (Bolker, 2009). Additional difficulties of the IT approach quickly become evident when compiling a biologically meaningful model set [i.e. the difficulties of translating biological hypotheses into statistical models (Dochtermann & Jenkins, 2010)]. Even if one succeeds in compiling a model set, the model averaging procedure is complicated when interaction and polynomial terms are included (Dochtermann & Jenkins, 2010). Furthermore, it is not entirely clear how to proceed when a top model set for averaging does not include a particular factor of interest.
Despite having a relatively good understanding of the basic theory behind the IT approach, we encountered a number of problems when applying this approach to what initially appeared as a relatively straightforward but fundamental analysis: modelling the effects of inbreeding in wild populations (Grueber et al., 2010; Laws & Jamieson, 2010; Laws et al., 2010). It is these difficulties, and the general lack of specific guidelines for overcoming these in the literature at present, that led to this paper.
The aim of this paper is to highlight some of the common practical obstacles and challenges faced when performing mixed modelling under IT and to recommend potential solutions where they exist. Our manuscript is intended to accompany recent papers that review particular statistical issues with the IT approach (for example Johnson & Omland, 2004; Richards, 2005; Link & Barker, 2006; Bolker et al., 2009; Carstens et al., 2009; and a recent ‘special issue’ of Behavioral Ecology and Sociobiology [2011, Vol. 65, No. 1]). The current manuscript provides methodological guidelines for practitioners in ecology and evolution who have already decided that the IT approach is appropriate for their data, and the reader is directed to relevant reviews for additional detail. The issues addressed here, with their tentative solutions, are summarized in Table 2. We further illustrate the practical difficulties posed when using IT and model averaging approaches through reference to a worked example (see Appendix), which provides clear, step-by-step instructions for effective analysis and standardization of reporting using the IT method. The worked example focuses on modelling the fitness effects of inbreeding on a life history trait that is also affected by several demographic variables, and in which the analysis requires model averaging to predict survival estimates for different levels of inbreeding (Grueber et al., 2010). By providing a systematic overview of tentative solutions to practical challenges faced, we hope that the IT approach will become more accessible to those interested in the analysis of any process where multiple variables impact an evolutionary or ecological response.
Table 2. Overview of practical issues associated with IT approaches and model averaging in evolution and ecology covered in this manuscript, with their tentative solutions.

Practical problem | Tentative solution

General challenges in the IT approach
Translating biological hypotheses into statistical models | This is likely to remain the most difficult aspect of using an IT approach with model averaging in ecology and evolution, because of the complexity of biological processes
Which information criterion to use when comparing models | AIC_{C} is most widely used; where random effects are present, this problem is at present unresolved. See also Box 1
Whether to model average | If the weight of the ‘best’ model < 0.9, model averaging is recommended

Practical challenges for model averaging an ecological data set
Narrowing a list of predictors from the measured input variables | Use ‘biologically reasonable’ variables; only transform if there is an a priori justification. Consider whether a priori examination and/or removal of individual variables is appropriate
Presence of strongly correlated variables | Depends on the nature of the correlation (see text); aim to select the variables that are most biologically important
Generating a model set | One method is to generate a global model of all biologically relevant parameters, and then generate all possible submodels from this. However, if the global model fails to converge, it may be necessary to reduce its complexity/size
Incompatibility of global model parameters | Tailor the model set to include only plausible models
How to compute the model average (natural average or zero method) | Depends on the aim of the study (see text)
How to define a top model set (what cut-off to use) | Consider how many models (S) will be captured by a given cut-off. ‘Too many’ (based on N) is discouraged because of the risk of spurious results, but specific recommendations for S are lacking
How to evaluate model goodness-of-fit | In non-mixed models one can calculate R^{2}; however, calculation of model fit is much more technical in mixed models, thus presenting a practical difficulty
How to use the model for prediction | The model can give ‘conditional estimates’, e.g. predictions for a factor of interest at the mean of all other parameters

Special issues for complex models
Defining random intercepts or slopes | Always fit slope if possible, otherwise use just the intercept
Nested models in the top model set | It is recommended to remove models from the set that are complex versions of simpler ones, but clear guidelines are currently lacking
Whether to force inclusion of a parameter of interest in the model set/final model | Perform with caution if using the zero method of model averaging. Also, forcing inclusion of a parameter fixes its relative importance at 1, making this metric no longer useful
How to interpret the effect sizes of interactions and their main effects | Centring variables permits interpretation of main effects when interactions are present
How to interpret effect sizes when predictors are on different scales | Standardization on 0.5 SD results in effect sizes that are on comparable scales
Box 1: a summary of the alternatives to AIC
Forms of the AIC, such as AIC_{C} (small sample size correction, Table 1) and Quasi-AIC (QAIC: controls for overdispersion), remain the most widely used information criteria for ranking models in the IT approach. However, there is debate surrounding the utility of AIC (e.g. Spiegelhalter et al., 2002; Stephens et al., 2007), and various alternatives have been proposed. The different criteria in use today may be appropriate in different circumstances (Murtaugh, 2009), but all information criteria are in fact approximations of Bayes Factors (BFs) (Congdon, 2006a) under certain assumptions such as large sample sizes. The BF is a ratio between two models, reflecting ‘true’ model probabilities given data support, i.e. posterior model probabilities (other information criteria approximate these posterior model probabilities) (Jeffreys, 1961, in Congdon, 2006b):
BF_{ij} = p(y|M_{i})/p(y|M_{j})  (1)
where p(y|M_{i}) is the marginal likelihood of model i. Therefore, BFs would seem the ideal index for model selection and averaging. However, direct calculation of BFs quickly becomes complicated when comparing more than two models. Although several methods for using BFs for model averaging have been suggested, currently available methods are highly technical and difficult to implement (Congdon, 2006a). Practical implementations of BFs for multimodel comparisons are an active frontier of statistical research (R. Barker, personal communication), and thus advances in the area are anticipated in the near future.
In the interim, a particular alternative to AIC, the weighted Bayesian Information Criterion (BIC), has been proposed as superior to AIC in IT model averaging approaches (Link & Barker, 2006), as it tends to favour more parsimonious models [cf. AIC, which tends to favour complex models (Burnham & Anderson, 2002; Link & Barker, 2006)] and does not require approximation of likelihood. However, BIC still does not accurately quantify k for random effects (Table 1), and AIC and BIC can in fact give similar results for particular data sets (Murtaugh, 2009). Another criterion, also in the Bayesian context, is the Deviance Information Criterion [DIC (Spiegelhalter et al., 2002)], which improves on BIC by incorporating the term k_{D}: the effective number of parameters. DIC is a promising metric for use with mixed models; however, its application to model averaging is not yet implemented in widely used statistical packages, nor has it been widely tested with either simulations or empirical data. DIC is both philosophically and mathematically more similar to AIC than to BIC (Spiegelhalter et al., 2002), and it suffers similar problems to AIC (R. Barker, personal communication; Table 1). Conditional AIC [cAIC (Vaida & Blanchard, 2005; Liang et al., 2008)] is another interesting prospect in that it too can control for the number of effective parameters. However, Vaida & Blanchard (2005) state that specification of the number of parameters (i.e. whether to count each random effect as 1, as per AIC_{C}, or to use the effective number of parameters, as per cAIC) depends on the question being investigated. Notably, cAIC is yet to be widely implemented in statistical packages allowing its use for model averaging.
Table 1 presents the formulae for the aforementioned information criteria, although this is by no means an exhaustive list of information criteria. Other information criteria found in the statistical literature include: the Focused Information Criterion (FIC) (Claeskens & Hjort, 2003; Claeskens et al., 2007), Akaike’s Bayesian Information Criterion, the Generalized Information Criterion (GIC), the Extended (Bootstrap) Information Criterion (EIC), the Predictive Information Criterion and Takeuchi’s Information criterion [TIC; reviewed in Konishi & Kitagawa (2008)]. Alternatives to AIC that still rely on maximum likelihood estimation and k are subject to the same issues as AIC for model averaging under IT in generalized linear mixed modelling. Overall, information criteria can be assigned to either of two broad categories: those suited for model selection (such as BIC) and those suited for minimizing predictive error (such as AIC and others outlined above) (Yang, 2005). The type of criteria chosen depends on the question being answered (Yang, 2005), which in turn influences how the number of degrees of freedom should be calculated (Vaida & Blanchard, 2005; Bolker et al., 2009).
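The criteria discussed above are all simple functions of the maximized log-likelihood, the parameter count k and the sample size n. Although the worked example in the Appendix uses R, the arithmetic can be sketched in a few lines of Python (the log-likelihood value in the usage example is purely hypothetical):

```python
import math

def aic(loglik, k):
    # AIC = -2 * log-likelihood + 2k (Akaike, 1973)
    return -2 * loglik + 2 * k

def aicc(loglik, k, n):
    # Small-sample correction (Hurvich & Tsai, 1989); assumes n - k - 1 > 0
    return aic(loglik, k) + (2 * k * (k + 1)) / (n - k - 1)

def bic(loglik, k, n):
    # BIC penalizes each parameter by log(n) rather than 2
    return -2 * loglik + k * math.log(n)

# Hypothetical fitted model: log-likelihood -120.5, 4 parameters, 30 observations
print(aic(-120.5, 4), aicc(-120.5, 4, 30), bic(-120.5, 4, 30))
```

As the formulae make clear, the unresolved question for mixed models is not the arithmetic but what value of k to use when random effects are present.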
Defining appropriate input and predictor variables
The primary step is to determine which input variables to include, and whether or how to transform these into predictor variables (explanatory or independent variables) (see Appendix: Step 1). Note that we make a distinction here between input variables (raw parameters that are measured) and predictor variables (the variables used in the model, which can also include interactions and polynomial terms) (Gelman, 2008).
Burnham & Anderson (2002) suggest that only predictors with strong biological reasoning (based on a priori investigation) should be included from the outset, to prevent overparameterization. In complex ecological systems, it is plausible that any number of factors could have an important effect on the response variable; therefore, one should consider the sample size rule-of-thumb of 10 : 1 subjects to predictors in multiple regression (Harrell, 2001). In addition, there are a large number of possible second- and higher-order interactions and transformations (e.g. log-transformation) that may be applied to input variables. Unless there is an a priori biological reason for expecting such conversions to improve the fit to the data (for example, to improve the normality of residuals), there is little justification for including these in the predictor set. Incidentally, regression analysis by GLMM does not require predictors (input variables) to be normally distributed, although in some cases, normalization transformations can reduce residual variance and therefore affect inference regarding parameter estimates (Faraway, 2005).
Where there are large numbers of possible predictors, it might seem natural to explore each variable independently prior to generating models to identify factors impacting strongly on the response. Doing so informally, ideally graphically, is exactly what exploratory data analysis is about (Tukey, 1977; Zuur et al., 2010). However, advocates of the IT approach such as Burnham & Anderson (2002) are in principle against exploratory data analysis, because it results in post hoc creation of statistical models and thus biological hypotheses. They recommend that predictors should be selected on the basis of genuine prior knowledge, such as from pilot studies or the scientific literature (Burnham & Anderson, 2002).
An additional point to consider is collinearity amongst predictors, which has received little attention despite being a characteristic of many ecological studies (Freckleton, 2010). Collinearity amongst predictors can be a problem in model selection, as a number of models each containing different (but correlated) predictors may provide similar fits to the data and thus present difficulties when choosing the ‘best’ model and determining true relationships (Freckleton, 2010). Using simulations, Freckleton (2010) demonstrated that when predictors are correlated, IT approaches and model averaging performed just as well or even better than ordinary least squares methods at parameter estimation. However, Freckleton cautioned that measurement errors in correlated predictors can cause problems in any analysis. Whether to combine collinear variables (for example into principal components) depends on the nature of the variables themselves and the relationships that are expected (for examples see Freckleton, 2010). Incidentally, the high prevalence of correlated predictors in ecological data sets suggests to us the importance of exploratory data analysis of predictors.
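A practical first step for the collinearity issues raised above is to inspect pairwise correlations and variance inflation factors (VIFs) among candidate predictors before building the model set. A minimal Python sketch with simulated data (the predictor names and the correlation structure are hypothetical; in a real analysis this would be run on the actual design matrix):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
# Hypothetical predictors: plant height is constructed to correlate with rainfall
rainfall = rng.normal(size=n)
height = 0.8 * rainfall + 0.6 * rng.normal(size=n)
density = rng.normal(size=n)  # independent of the other two

X = np.column_stack([rainfall, height, density])
r = np.corrcoef(X, rowvar=False)  # pairwise correlation matrix

def vif(X):
    """VIF of each column: regress it on the others; VIF = 1 / (1 - R^2)."""
    out = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(X)), others])
        coef, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
        resid = X[:, j] - A @ coef
        r2 = 1 - resid.var() / X[:, j].var()
        out.append(1 / (1 - r2))
    return np.array(out)
```

Here the correlated pair (rainfall, height) produces inflated VIFs relative to the independent predictor, flagging exactly the situation Freckleton (2010) warns about.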
Generating a set of models to compare
Once it has been established which predictors are to be included, the next step is to generate a ‘model set’ of hypotheses (see Appendix: Step 2). The easiest way to generate a model set is to derive all possible submodels from a set of predictors of interest (but not necessarily all possible predictors, see previous section), including an intercept-only model (which should also contain any random factors), and then compare these (e.g. Symonds & Johnson, 2008). This method of generating a model set is acceptable insofar as each model is ecologically justifiable (Dochtermann & Jenkins, 2010). From a practical point of view, the easiest way to accomplish this in a statistical package such as R (R Development Core Team, 2009) is to generate a global model containing all the predictors of interest and then derive submodels from this [see Appendix: Step 2; see also Symonds & Moussalli (2010) for a summary of other software that perform AIC-based analyses].
There are, however, a number of potential obstacles to generating a model set in this way, such as what to do if the global model does not converge (possibly because of overparameterization in cases where sample size is small). There are two types of non-convergence that can occur: the first is the failure to estimate parameters; the second is the overestimation of SEs or confidence intervals, which can occur in the absence of any error messages from software (Bolker et al., 2009). One solution to either of these forms of non-convergence is to follow the recommendation of Bolker et al. (2009) and reduce the size and complexity of the global model. Interactions can be removed first (particularly those whose main effects are weak), followed by a priori investigation of individual factors, removing one-by-one those main effects that either appear to have the least impact on the response or are of least biological interest, until the model converges. An alternative is to generate a submodel set manually; for example, if 10 parameters are to be investigated but the global model cannot converge, it may be desirable to generate a model set of all submodels with a maximum of five parameters each. However, automation would be required, as this example would result in 638 possible models (not including interactions or polynomials), far too many to generate by hand. Even so, by taking this approach, one is likely to fall victim to the ‘problem of too many models’ (Burnham & Anderson, 2002; Dochtermann & Jenkins, 2010), leading to potentially spurious results. In addition, care should be taken to avoid generating submodels that may be biologically implausible. For example, in cases where predictors are mutually exclusive or otherwise incompatible, models containing combinations of these should not be included in the model set. Again, we support the recommendations of Zuur et al. (2010) and reinforce the importance of exploratory data analysis and careful consideration of predictors.
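In R, this enumeration is automated by packages such as MuMIn, but the underlying logic is just subset generation, and a short Python sketch makes the combinatorics concrete (the predictor names below are hypothetical). Capping the number of terms per model reproduces the 638-model count mentioned above:

```python
from itertools import combinations

def all_submodels(predictors, max_terms=None):
    """All subsets of the predictor list (each subset is one candidate model),
    including the empty subset, i.e. the intercept-only model."""
    limit = len(predictors) if max_terms is None else max_terms
    models = []
    for k in range(limit + 1):
        models.extend(combinations(predictors, k))
    return models

predictors = ["inbreeding", "sex", "hatch_year", "island"]  # hypothetical names
print(len(all_submodels(predictors)))          # 2^4 = 16 candidate models
print(len(all_submodels(list("abcdefghij"), 5)))  # 10 predictors, <= 5 terms: 638
```

The exponential growth of the full set (2^p models for p predictors, before interactions) is exactly why biologically implausible combinations should be filtered out rather than averaged over.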
Model selection and model averaging
If the model set is large, there may be no single best model: a number of models in the set may differ in their data fit by only small amounts, as defined by an information criterion. Under these circumstances, it is best to employ an IT model averaging approach, a procedure that accounts for model selection uncertainty to obtain robust parameter estimates or predictions. This procedure entails calculating a weighted average of parameter estimates, such that parameter estimates from models that contribute little information about the variance in the response variable are given little weight. Various information criteria have been presented to determine the amount of information contained in a given model (Table 1). At present, the most commonly used is the Akaike Information Criterion, AIC (Akaike, 1973), and its correction for small sample size [AIC_{C} (Hurvich & Tsai, 1989)] although AIC may be more suitable than AIC_{C} when modelling certain nonlinear ecological responses (Richards, 2005). Simulation studies have shown that in certain circumstances, choosing the ‘best’ model (based on AIC_{C} for example) may provide similar parameter estimates when compared to model averaging. However, modelaveraged results can be more stable than those based on choosing the best model, as the former is less likely to erroneously conclude that weak parameter estimates are zero (Richards, 2005; Richards et al., 2010). It should be borne in mind, however, that assigning the incorrect sign to a weak parameter estimate is a possibility in any regression (Gelman & Tuerlinckx, 2000), and further research as to the effects of model averaging on this type of error would be useful.
An important issue with the broad application of AIC_{C} to GLMMs is in the calculation of the number of parameters (k) when random factors are included (Spiegelhalter et al., 2002; see also Box 1). Tentative solutions are provided in the development of alternative information criteria for use in IT model averaging, especially under a Bayesian framework (Box 1). Additionally, in GLMM analysis, the residual variance of nonGaussian data may be modelled as either multiplicative overdispersion (the overdispersion parameter which appears in QAIC, see Table 1) or additive overdispersion (a residual variance as in linear mixed models; see Browne et al., 2005). These different implementations can obviously influence information criterion calculations (Nakagawa & Schielzeth, 2010). Although both methods of modelling overdispersion are suited for fitting GLMMs, different software packages may use either approach, affecting how the variance components (i.e. random effects) should be treated and interpreted (Nakagawa & Schielzeth, 2010). Overall, when focussing on linear regressiontype analysis, AIC_{C} remains the most widely used criterion; it is also the most easily applied because it is implemented in model averaging packages in R [such as MuMIn (Bartoń, 2009)] and most other major statistical packages (Symonds & Moussalli, 2010).
Once it has been identified that model averaging is necessary, the next step is to determine which models to average (see Appendix: Step 3). This can be influenced by the question being asked: for example, broad questions, such as whether inbreeding affects fitness, will require a larger model set than more specific questions, such as whether one island exhibits greater fledging success than another island. Under an IT framework, it is assumed that the ‘true’ model is in the model set (Burnham & Anderson, 2002), but averaging the full model set, or a large proportion of it, is not recommended, not only because parameter estimates from models with very poor weights are likely to be spurious (Anderson & Burnham, 2002) but also because the full model set may include redundant models (such as biologically meaningless models or nested models). Indeed, where S (the number of models in the set) is very high relative to N (the sample size), excessive model uncertainty (and thus high error associated with parameter estimation) can be expected, and even the best model will have a very small Akaike weight (Burnham & Anderson, 2002). On the other hand, limiting the model set too stringently may result in exclusion of the ‘best’ model. There are a number of recommendations for the cut-off criterion to use to delineate a ‘top model set’, such as retaining models within 2 ΔAIC_{C} of the best model (Burnham & Anderson, 2002), within 6 ΔAIC_{C} (Richards, 2008), within 10 ΔAIC_{C} (Bolker et al., 2009) or a 95% confidence set (summed weight; Burnham & Anderson, 2002).
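The quantities behind these cut-off rules (ΔAIC_C values and Akaike weights) involve only elementary arithmetic; a minimal sketch with hypothetical AIC_C values:

```python
import math

def akaike_weights(aicc_values):
    """Return (deltas, weights): delta-AICc relative to the best model, and
    Akaike weights w_i = exp(-delta_i / 2) / sum_j exp(-delta_j / 2)."""
    best = min(aicc_values)
    deltas = [a - best for a in aicc_values]
    rel = [math.exp(-d / 2) for d in deltas]
    total = sum(rel)
    return deltas, [r / total for r in rel]

aicc_values = [100.0, 101.2, 103.9, 110.5]  # hypothetical model set
deltas, weights = akaike_weights(aicc_values)

# Top model set under a 2 delta-AICc cut-off (Burnham & Anderson, 2002)
top_set = [i for i, d in enumerate(deltas) if d < 2]
```

With these values only the first two models survive the 2 ΔAIC_C cut-off; a 6 or 10 ΔAIC_C rule would retain a third, illustrating how strongly the choice of cut-off shapes the set to be averaged.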
An added complication is how to decide what to do if a particular factor of interest (such as an experimental treatment) is not present in a model captured within the top model set (see Appendix: Step 4). Solutions in such cases are to either conclude that there is little evidence that the factor of interest explains variation in the response variable, or to extend the cut-off criteria to include at least one model that contains the factor of interest (for example, in cases where a parameter estimate is essential to further analysis). The latter solution may result in very large model sets, and/or inconsistent cut-off criteria for different response variables. High cut-offs are discouraged as they can lead not only to spurious results as described earlier but also to the inclusion of overly complex models (Richards, 2008). Such overly complex models may have similar weight to simpler versions in the set, and model averaging these can potentially result in overweighting the parameters they contain. Simulation studies have shown that removing complex models from the set does not necessarily impact the chance of selecting parsimonious models and also reduces the total number of models selected (Richards et al., 2010). A tentative solution therefore is to exclude models from the set that are more complex versions of those with lower AIC_{C} (Burnham & Anderson, 2002; Richards, 2008). However, careful scrutiny of these complex models may reveal that they are characterized by the presence of unique predictors of potentially strong biological importance and therefore in such cases should not be removed. Determining how to resolve the issue of nested models is likely to depend on the context of the particular study, but there are currently few clear guidelines on this.
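The tentative nesting rule described above (exclude models that are more complex versions of models with lower AIC_C) can be operationalized as a simple subset test. A sketch, assuming each model is represented by the set of its fixed-effect terms plus its AIC_C (all values hypothetical):

```python
def drop_complex_nested(models):
    """models: list of (terms, aicc) pairs, where terms is a frozenset.
    Drop any model whose terms are a strict superset of another model's
    terms when that simpler model also has a lower AICc."""
    keep = []
    for terms, aicc in models:
        nested_better = any(t < terms and a < aicc for t, a in models)
        if not nested_better:
            keep.append((terms, aicc))
    return keep

models = [
    (frozenset({"inbreeding"}), 100.0),
    (frozenset({"inbreeding", "sex"}), 100.5),  # nested version of the model above
    (frozenset({"sex"}), 101.0),
]
filtered = drop_complex_nested(models)
```

Here the two-term model is removed because it differs from a lower-AIC_C model only by an extra term; note that the rule deliberately leaves untouched any complex model containing a unique predictor not found in a better simpler model.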
After a top model set is defined, the method used to compute the model-averaged parameters should also be chosen carefully. There are two methods by which the estimate and error for each parameter are weighted (detailed in Burnham & Anderson, 2002; Nakagawa & Freckleton, 2010). In the so-called natural average method (Burnham & Anderson, 2002; p. 152), the parameter estimate for each predictor is averaged only over models in which that predictor appears and is weighted by the summed weights of these models. Alternatively, in the so-called zero method (Burnham & Anderson, 2002), a parameter estimate (and error) of zero is substituted into those models where the given parameter is absent, and the parameter estimate is obtained by averaging over all models in the top model set. Thus, the zero method decreases the effect sizes (and errors) of predictors that only appear in models with small model weights (particularly when the predictors have weak effects), diluting the parameter estimates of these predictors (shrinkage towards zero) (Lukacs et al., 2010).
Although no clear distinction has been made as to the circumstances under which either of these two methods is more appropriate, Nakagawa & Freckleton (2010) recommend that the zero method should be used when the aim of the study is to determine which factors have the strongest effect on the response variable. Conversely, when there is a particular factor of interest and it is possible that this factor may have a weak effect compared to other covariates, the natural average method should be used to avoid shrinkage towards zero (see Appendix: Step 3). Under the natural average method, the choice of whether to include a parameter of interest is inconsequential, as this method only averages parameters over models in which they appear anyway. Thus, the presence of additional models in the set, that do not include the parameter of interest, will have no influence on the calculation of the effect size or SE of the focal parameter. However, restricting the top model set to only those models that contain a parameter of interest will fix the relative importance of this parameter at 1, making this metric no longer useful (see Appendix: Table S3).
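The difference between the two averaging methods is easiest to see in code. A sketch for point estimates only (real implementations, such as model.avg in R's MuMIn package, also average the SEs with an unconditional variance formula; the models, weights and parameter names below are hypothetical):

```python
def model_average(models, weights, parameter, method="natural"):
    """models: list of dicts mapping parameter name -> estimate.
    weights: matching Akaike weights (assumed to sum to 1).
    'natural': average only over models containing the parameter,
    renormalizing their weights. 'zero': substitute 0 where absent."""
    if method == "zero":
        return sum(w * m.get(parameter, 0.0) for m, w in zip(models, weights))
    pairs = [(w, m[parameter]) for m, w in zip(models, weights) if parameter in m]
    wsum = sum(w for w, _ in pairs)
    return sum(w * est for w, est in pairs) / wsum

# Hypothetical two-model top set: only the first model contains 'inbreeding'
models = [{"inbreeding": -0.5, "sex": 0.2}, {"sex": 0.3}]
weights = [0.6, 0.4]
```

For 'inbreeding' the natural method returns -0.5 (its estimate in the only model containing it), while the zero method shrinks it to 0.6 x (-0.5) = -0.3, illustrating the dilution towards zero described above.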
Determining whether the final model provides a good fit to the data presents technical challenges when random factors are present. In the case of non-mixed models, R^{2} can be calculated (Burnham & Anderson, 2002), but this is difficult in mixed models (Gelman & Hill, 2007). Further implementation of these methods is required in widely used statistical software such as R.
Interpretation of model estimates
When model-averaged estimates are derived, it is essential to interpret both the direction (positive or negative) of parameter estimates and their magnitudes (effect sizes) in relation to one another (see Appendix: Step 4). Such an assessment can be problematic when input variables are measured on different scales (Gelman, 2008) and interactions are present. Interactions prevent the interpretation of main effects (van de Pol & Wright, 2009), because the resultant estimates are usually not comparable with each other. These problems are common to any multiple regression analysis and are not unique to the IT approach per se. The process of model averaging can complicate these problems further, as it combines parameter estimates derived from models both with and without interaction and polynomial terms (note that model-averaged intercepts are usually not interpretable). Fortunately, these problems are largely solved by centring predictors (see Appendix: Steps 2 and 4), and there is generally a strong justification for doing so, especially where interactions and polynomials are present (Gelman, 2008; Schielzeth, 2010). Centring predictors is essential when model averaging is employed, and standardization facilitates the interpretation of the relative strength of parameter estimates.
In linear regression, the interpretation of main effects is impaired when (significant) interactions are present, but this issue is largely resolved if input variables are centred and inferences are made at points within the biologically meaningful range of the parameter, such as the mean (detailed in Schielzeth, 2010). In addition, it is recommended that input variables (not predictors) are standardized to a mean of 0 and a SD of 0.5 before model analysis (see Appendix: Step 2). The value 0.5 is used, rather than 1 SD, as this allows the standardization of binary predictors [and/or categorical variables, as 'dummy variables' are created (Schielzeth, 2010)] and continuous predictor variables to a common scale (Gelman, 2008; see also Hereford et al. (2004) for a discussion of standardization in the context of quantitative genetics). When interpreting the model, it is therefore important to remember that parameter estimates are on this scale. Such standardizations have sometimes been criticized (King, 1986; Bring, 1994; Hereford et al., 2004; Schielzeth, 2010) because parameter estimates are on the transformed scales, which are difficult to interpret biologically. However, back-transformations (described below) of these estimates are straightforward, and we recommend that where point estimates of the response variable are derived, authors present them on the original scale (see Appendix: Step 5).
Conclusion
The issues presented here are not intended as an exhaustive survey of the practical difficulties associated with the application of model averaging under an IT framework. For example, this paper has not explored the problems presented by missing data. Model comparisons using IT approaches require data sets with no missing data, as deleting cases containing missing values can severely affect the results of model selection; this topic has recently been covered in detail elsewhere (Nakagawa & Freckleton, 2010). Nonetheless, in the current discussion, we have identified a number of areas for further research:
 •
Which IT criteria should be used when comparing models, given the difficulties presented by including random factors?
 •
In determining the cut-off for a top model set when examining a factor of interest – how many models is 'too many' for model averaging?
 •
How should we decide which nested models to remove from the model set?
 •
How do we quantify model fit in mixedeffects models?
In addition, we emphasize the importance of standardizing variables where model averaging is employed, as failing to do so renders the results of model averaging uninterpretable in the presence of interactions (cf. Schielzeth, 2010).
While the debate continues amongst statisticians in this general area – Frequentists, Information Theoreticians and Bayesians (e.g. Stephens et al., 2005, 2007; Lukacs et al., 2007; McCarthy, 2007) – ecologists and evolutionary biologists continue to derive interesting and important hypotheses, collect data to test them, and analyse and (hopefully) publish their results. Resolution of some of the pertinent issues noted above may still be a considerable time away, and future work on these problems using simulated data, particularly exploring the use of AIC-based metrics (Box 1), will be a promising area of research. In the meantime, practitioners require pathways and signposts to tentatively guide them through what could be considered the analytical and statistical fog of the new era of information theory and model averaging. Until that fog lifts, it is hoped that the guidelines provided here can improve the consistency and standard of reporting of results in ecological and evolutionary studies using IT approaches.