Collinearity describes the situation where two or more predictor variables in a statistical model are linearly related (sometimes also called multicollinearity: Alin 2010). Many statistical routines, notably those most commonly used in ecology, are sensitive to collinearity (Stewart 1987, Belsley 1991, Chatfield 1995): parameter estimates may be unstable, standard errors of estimates inflated and, consequently, inference statistics biased. But even for less sensitive methods, two key problems arise under collinearity: variable effects cannot be separated and extrapolation is likely to be seriously erroneous (Meloun et al. 2002, p. 443). This means, for example, that if we want to explain net primary productivity (NPP) using mean annual temperature and annual precipitation, and we find that temperature and precipitation are negatively linearly related, we will not be able to separate the effects of the two factors. Either variable will partly account for the effect of the other. NPP might be limited only by precipitation, but we may not be able to ascertain this relationship because temperature is collinear with precipitation: our model might contain both variables or perhaps only temperature. We would then draw incorrect inferences, and predictions may be compromised. Suppose we want to predict the effect of climate change on NPP and our climate scenarios indicate no change in precipitation but an increase in temperature. Since our regression wrongly includes temperature, we would erroneously predict a change in NPP.
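
The NPP example can be made concrete with a small simulation. The following Python sketch is purely illustrative and is not part of the analyses or software accompanying this paper: the sample size, effect size, noise level, correlation values and the function names `simulate_fits` and `vif_two_predictors` are all assumptions chosen for demonstration. It generates standardised temperature and precipitation with a chosen correlation, lets NPP depend on precipitation only, and refits an ordinary least-squares regression many times to show how the spread of the coefficient estimates grows as the two predictors become more strongly correlated.

```python
import numpy as np


def simulate_fits(rho, n=50, n_rep=500, seed=0):
    """Repeatedly fit NPP ~ temperature + precipitation by least squares.

    rho is the assumed correlation between standardised temperature and
    precipitation. The assumed true model is NPP = 2 * precipitation + noise,
    so temperature has no effect. Returns an (n_rep, 2) array of slope
    estimates with columns (temperature, precipitation).
    """
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    coefs = np.empty((n_rep, 2))
    for i in range(n_rep):
        # Columns of X: temperature, precipitation (hypothetical, standardised)
        X = rng.multivariate_normal(np.zeros(2), cov, size=n)
        npp = 2.0 * X[:, 1] + rng.normal(0.0, 1.0, size=n)
        design = np.column_stack([np.ones(n), X])  # add intercept column
        beta, *_ = np.linalg.lstsq(design, npp, rcond=None)
        coefs[i] = beta[1:]  # drop the intercept
    return coefs


def vif_two_predictors(rho):
    """Variance inflation factor for the special case of two predictors."""
    return 1.0 / (1.0 - rho ** 2)


weak = simulate_fits(rho=-0.1)
strong = simulate_fits(rho=-0.95)

# Spread of the estimates across replicate data sets
sd_weak = weak.std(axis=0)
sd_strong = strong.std(axis=0)

print("SD of (temperature, precipitation) slopes, rho = -0.10:", sd_weak)
print("SD of (temperature, precipitation) slopes, rho = -0.95:", sd_strong)
print("VIF at rho = -0.10:", vif_two_predictors(-0.1))
print("VIF at rho = -0.95:", vif_two_predictors(-0.95))
```

Under these assumptions the estimates remain centred on the true values on average, but individual fits under strong correlation can assign a sizeable apparent effect to temperature, which is the instability described above.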

Collinearity is a problem recognised by most introductory textbooks on statistics, where it is often described as a special case of model non-identifiability. As demonstrated in the example above, it cannot be solved: if two highly collinear variables are both correlated with Y, without further information the ‘true’ predictor cannot be identified. Nevertheless, there are approaches for exploring it and working with it. Despite the relevance of the problem and the variety of available methods to address it, most ecological studies have not embraced measures to address collinearity (Graham 2003, Smith et al. 2009). The main reasons for this are likely to be: 1) belief that common statistical methods are unaffected by collinearity; 2) uncertainty about which method to use; 3) unsuitability of a method given the type of data to be analysed; 4) lack of interpretability of results when using approaches that combine variables; or 5) inaccessible software. The issue is by no means restricted to ecology (Murray et al. 2006, Kiers and Smilde 2007, Mikolajczyk et al. 2008).

In this paper we aim to facilitate a better understanding of collinearity and of methods for dealing with it, by reviewing and testing existing approaches and providing relevant software. The review is structured into five parts. In the first we reflect on when collinearity is, or is not, a problem. The second illustrates spatio-temporal variation in relationships between environmental variables that are commonly used as explanatory variables in regression analyses. The third part introduces the different methods we review, moving from diagnostics, through ‘pre-analysis clean-up methods’, to methods that incorporate collinearity or are tolerant to the problem (see Supplementary material Appendix 1.1 for details on their implementation). In the fourth part we carry out a large simulation study to compare all reviewed methods. We provide complementary case studies on real data in Supplementary material Appendix 1.2. The fifth part discusses our findings with respect to the scattered literature on collinearity. Most importantly, it provides advice for the appropriate choice of an approach and supporting information for its application (e.g. parameterization). Finally, we close with suggestions for further research.