The analysis of species distribution data has reached high statistical sophistication in recent years (Elith et al. 2006). However, even the most advanced and computer-intensive statistical procedures are no guarantee of improving our understanding of the determinants of species distributions, nor of our ability to predict species distributions under altered environmental conditions (Araújo and Rahbek 2006, Dormann 2007c). One critical step in statistical modelling is the identification of the correct model structure. As Hurlbert (1984) pointed out for experimental ecology, designs analysed without consideration of the nested nature of subsampling are fundamentally flawed. Spatial autocorrelation is a subtle, less obvious form of subsampling (Fortin and Dale 2005): samples taken from within the range of spatial autocorrelation around a data point add little independent information (depending on the strength of autocorrelation), but unduly inflate sample size, and thus the degrees of freedom of model residuals, thereby influencing statistical inference.
We have presented an overview of different modelling approaches for the analysis of species distribution data in which environmental correlates of the distribution are inferred. All these methods can be implemented in freely available software packages (Table 1). In choosing between the methods, the type of error distribution in the response variable will be an important criterion. For normal data, GLS-based methods (GLS, SAR, CAR) can be used efficiently. The most flexible methods, addressing SAC for different error distributions, are spatial GLMMs, GEEs and SEVM. The autocovariate method, too, is flexible, but performed very poorly with regard to coefficient estimation in our analyses. We encourage users to try a number of methods, since there is often not enough mechanistic information to choose one specific method a priori. One can use AIC or similar criteria to compare models (Link and Barker 2006). Note that a “proper” (perfectly correctly specified) model would not require the kind of correction the above methods undertake (Ripley in comments to Besag 1974). In the absence of a perfect model, however, doing something is better than doing nothing (Keitt et al. 2002).
With the exception of autocovariate regression, differences in parameter estimates and inference between spatial and non-spatial models were small for our simulated data. This was possibly a result of the type of spatial autocorrelation in, and the simplistic nature of, these data (see section “Limitations of our simulations”). However, spatial autocorrelation can also reflect failure to include an important environmental driver in the analysis or inadequate capture of its non-linear effect, so that its spatial autocorrelation cannot be accounted for by non-spatial models (Besag et al. 1991, Legendre et al. 2002). In either case, spatial autocorrelation can make a large difference to statistical inference based on spatial data (for review see Dormann 2007a, b, c; for drastic cases of this effect see Tognelli and Kelt 2004 and Kühn 2007). How to interpret these differences, especially the shifts in parameter estimates between spatial and non-spatial models commonly observed in real data, remains controversial. While Lennon (2000) and others (Tognelli and Kelt 2004, Jetz et al. 2005, Dormann 2007b, Kühn 2007) argue that spatial autocorrelation in species distribution models may well bias coefficient estimation, Diniz-Filho et al. (2003) and Hawkins et al. (2007) found non-spatial models to be robust and unbiased for several data sets. So far, no extensive simulation study has been carried out to investigate how spatial versus non-spatial methods perform under different forms and causes of SAC. Adding a lagged autocorrelation structure to simplistic data did not reveal a bias in OLS parameter estimation (Kissling and Carl 2007), consistent with the results of Hawkins et al. (2007).
One of the two most striking findings of our analyses is the high error rate of the autocovariate method. Most methods for normally distributed data yielded coefficient estimates for “rain” that were acceptable, including the non-spatial ordinary least squares regression (Fig. 1). However, two models performed poorly: both the autocovariate regression and the lag version of the simultaneous autoregressive model showed a very consistent and strong bias, leading to severe underestimation (in absolute terms) of model coefficients. A similar pattern was also found for the non-normally distributed errors, identifying autocovariate regression as a consistently worse performer than the other approaches. The poor performance of the autocovariate regression approach in our study with regard to parameter estimation contrasts with earlier evaluations of this method (Augustin et al. 1996, Huffer and Wu 1998, Hoeting et al. 2000, He et al. 2003), but is in line with more recent ones (Dormann 2007a, Carl and Kühn 2007a). These earlier studies used more sophisticated parameter estimation techniques, suggesting that the inferiority of autocovariate models in our simulation may partly result from our simplistic (but not unusual) implementation of the method. Moreover, two of the earlier studies were undertaken in the context of many missing values: Augustin et al. (1996) used only 20% of sites in their study area for model training; Hoeting et al. (2000) used between 3.8 and 5.8%. This may have diminished the influence of any autocovariate and perhaps explains why in these studies the autocovariate did not overwhelm other model coefficients (as it did in ours). A final reason for the discrepancy in findings may be that our artificial data simulated spatial autocorrelation in the error structure, whereas other simulations created spatial structure directly in the response values, which more closely reflects the assumptions underlying autocovariate models.
The second interesting finding is the overall higher variability of results for binary data. While for normal- and Poisson-distributed residuals all model approaches (apart from autocovariate regression) yielded similar results and little variance across the ten realisations (Fig. 1), a different pattern emerged for binary (binomial) data. We attribute this to the relatively low information content of binary data (Breslow and Clayton 1993, Venables and Ripley 2002), making parameterisation of the model very dependent on those data points that determine the point of inflexion of the logistic curve (McCullagh and Nelder 1989). This phenomenon has been noted before (McCullagh and Nelder 1989), and remains relevant for species distribution models, where the majority of studies are based on the analysis of presence-absence data (Guisan and Zimmermann 2000, Guisan and Thuiller 2005).
Tricks and tips
Each of the above methods has its quirks, and some require fine-tuning by the analyst. Without attempting comprehensive coverage, we here point to some areas of each method type that require attention.
In autocovariate regression, neighbourhood size and type of weighting function are potentially sensitive parameters, which can be optimised through trial and error. It seems, however, that small neighbourhood sizes (such as the next one to two cells) often turn out best, and that the type of weighting function has relatively little effect. This was the case in our analysis as well as in published studies investigating different neighbourhood sizes (for review see Dormann 2007b). Another important aspect of autocovariate models is the approach chosen for dealing with missing data, which may lead to cells without neighbours (“islands”). Since this issue arises for all modelling methods, we briefly discuss it here. Missing data can be overcome by a) omission (Klute et al. 2002, Moore and Swihart 2005); b) strategic choice of neighbourhood structure (Smith 1994); c) estimating missing response values by initially ignoring spatial autocorrelation and regressing known response values against explanatory variables other than the autocovariate (Augustin et al. 1996, Teterukovskiy and Edenius 2003, Segurado and Araújo 2004); and d) as in c), but refined through an iterative procedure known as the Gibbs sampler (Casella and George 1992). This procedure is computationally intensive, but has been found to yield the best results (Augustin et al. 1996, Wu and Huffer 1997, Osborne et al. 2001, Teterukovskiy and Edenius 2003, Brownstein et al. 2003, He et al. 2003). Simulation studies further suggest that a) parameter estimation is poor when the autocovariate effect is strong relative to the effect of other explanatory variables (Wu and Huffer 1997, Huffer and Wu 1998); b) the precision of parameter estimates varies with species prevalence, i.e. the number of presence records relative to the total sample size (Hoeting et al. 2000); and c) autocovariate models adequately distinguish between meaningful explanatory variables and random covariates (Hoeting et al. 2000; though not in our study). Both simulation and empirical studies also indicate that autocovariate models achieve better fit than equivalent models lacking the autocovariate term (Augustin et al. 1996, Hoeting et al. 2000, Osborne et al. 2001, He et al. 2003, McPherson and Jetz 2007).
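The construction of an autocovariate from neighbourhood size and weighting function can be illustrated with a minimal sketch in Python/NumPy (this is not the implementation used in our analysis; the function name, the default radius of 1.5 grid units, which captures the eight nearest cells of a unit lattice, and the inverse-distance weighting exponent are our illustrative choices):

```python
import numpy as np

def autocovariate(y, coords, radius=1.5, power=1.0):
    """Distance-weighted mean of neighbouring response values.

    y      : 1-d array of responses
    coords : (n, 2) array of x/y positions
    radius : neighbourhood radius; with `power`, one of the two
             tuning knobs discussed in the text.
    """
    y = np.asarray(y, dtype=float)
    coords = np.asarray(coords, dtype=float)
    # pairwise distances between all sites
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    with np.errstate(divide="ignore"):
        # inverse-distance weights, zero outside the neighbourhood
        w = np.where((d > 0) & (d <= radius), 1.0 / d**power, 0.0)
    wsum = w.sum(axis=1)
    # cells without neighbours ("islands") receive an autocovariate of 0
    return np.divide(w @ y, wsum, out=np.zeros_like(wsum), where=wsum > 0)
```

The resulting vector is simply added as one further predictor to an otherwise ordinary regression, which is what makes the method so easy, and, as our results show, so easy to get wrong.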
For spatial eigenvector mapping, computational speed becomes an issue for large datasets. Although the calculation of eigenvectors itself is rapid, optimising the model by permutation-based testing of combinations of spatial eigenvectors is computer-intensive. Diniz-Filho and Bini (2005) argue that the identity of the selected eigenvectors is indicative of the spatial scales at which spatial autocorrelation takes effect, making this method potentially very interesting for ecologists. The implementation used in our analysis requires few arbitrary decisions and hence should be explored more widely. Note that SEVM, as applied here, is based on a different modelling philosophy: its declared aim is to remove residual spatial autocorrelation, unlike all other methods described above, which simply provide a mathematical way to incorporate SAC into the analysis.
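The core of the method, extracting eigenvectors from a doubly centred connectivity matrix, can be sketched as follows (a conceptual illustration only, not the implementation used in our analysis; connectivity definition and radius are our assumptions):

```python
import numpy as np

def spatial_eigenvectors(coords, radius=1.5):
    """Spatial eigenvector (Moran's eigenvector) filters, minimal sketch.

    Builds a binary connectivity matrix from pairwise distances,
    double-centres it, and returns its eigenvectors ordered by
    eigenvalue: leading vectors describe broad-scale spatial patterns,
    later ones progressively finer-scale patterns.
    """
    coords = np.asarray(coords, float)
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    C = ((d > 0) & (d <= radius)).astype(float)   # binary connectivity
    n = len(C)
    H = np.eye(n) - np.ones((n, n)) / n           # centring matrix
    vals, vecs = np.linalg.eigh(H @ C @ H)        # symmetric, so eigh
    order = np.argsort(vals)[::-1]                # largest eigenvalue first
    return vals[order], vecs[:, order]
```

Selected eigenvectors then enter the regression as additional covariates; the computer-intensive part is deciding, typically by permutation tests on residual autocorrelation, which subset to retain.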
For the GLS-based methods (GLS and the spatial GLMM), estimation of the correlation structure (i.e. of the parameter r) can be rather unstable. As a consequence, some models yield r = 0 (i.e. no spatial autocorrelation incorporated) or r ≈ ∞, with the GLS model returning what is in fact a non-spatial GLM or nonsensical results, respectively. This problem can be overcome by inclusion of a “nugget” term that reduces the correlation at infinitesimally small distances to a value below 1, or, even better, by a specification of r based on a semi-variogram of the residuals (Littell et al. 1996, Kaluzny et al. 1998). The common justification for a nugget term is measurement error (on top of the spatially correlated error); including a nugget effect can stabilise the estimation of the correlation function (Venables and Ripley 2002).
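The role of the nugget is easiest to see written out. For the exponential model, one of several standard spatial correlation functions, a sketch might read (our own illustrative function, not tied to any particular package):

```python
import numpy as np

def exp_correlation(d, r, nugget=0.0):
    """Exponential spatial correlation with an optional nugget.

    Without a nugget, the correlation tends to 1 as distance tends
    to 0; the nugget drops that limit to 1 - nugget, absorbing
    measurement error and stabilising the estimate of the range
    parameter r. A site remains perfectly correlated with itself.
    """
    d = np.asarray(d, dtype=float)
    corr = (1.0 - nugget) * np.exp(-d / r)
    return np.where(d == 0, 1.0, corr)
```

The discontinuity at zero distance is exactly the stabilising device: two coincident measurements are no longer forced to agree perfectly, so the fitted range r is less easily driven to 0 or ∞.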
Autoregressive models (SAR and CAR) require a decision on the weighting scheme for the weights matrix, for which there is not always an a priori reason. The main options are row-standardised coding (sums over all rows add up to N), globally standardised coding (sums over all links add up to N), dividing globally standardised neighbours by their number (sums over all links add up to unity), or the variance-stabilising coding scheme proposed by Tiefelsdorf et al. (1999, pp. 167–168), in which sums over all links again add up to N. In our analysis, the row-standardised coding was most often the superior choice, which is in line with other studies (Kissling and Carl 2007), but the binary and the variance-stabilising coding schemes also resulted in good models. SAR and CAR models did not differ much in our analysis. According to Cressie (1993), CAR models should be preferred in terms of estimation and interpretation, although SAR models are preferred in the econometric context (Anselin 1988). Either approach can be relatively slow for large data sets (sample size > 10 000) due to the estimation of the determinant of (I–ρW) for each step of the iteration. Note that Bayesian CAR models do not require the computation of such a determinant and can therefore be particularly suitable for data on large lattices (Gelfand and Vounatsou 2003). For SAR models, identification of the correct model structure is recommended and model selection procedures can help to reduce bias (Kissling and Carl 2007). The Lagrange test (see supplementary material) can also help here. However, SAR error models generally perform better than SAR lag or even SAR mix models when tackling simulated data containing autocorrelation in lagged predictors (or response and predictors), as recently demonstrated in a more comprehensive assessment of SAR models using different spatially autocorrelated datasets (Kissling and Carl 2007).
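The first three coding schemes amount to different normalisations of the same binary neighbour matrix, as the following sketch makes explicit (the style letters loosely mirror those used in the R package spdep; the variance-stabilising scheme of Tiefelsdorf et al. is omitted, as it requires an additional row-wise variance computation):

```python
import numpy as np

def code_weights(B, style="W"):
    """Coding schemes for a binary neighbour matrix B (a sketch).

    "B": binary, as supplied.
    "W": row-standardised; each row sums to 1, so all rows together
         sum to N.
    "C": globally standardised; all links together sum to N.
    "U": "C" divided by N; all links together sum to 1.
    """
    B = np.asarray(B, dtype=float)
    n = B.shape[0]
    links = B.sum()                          # total number of links
    rows = B.sum(axis=1, keepdims=True)      # neighbours per cell
    if style == "B":
        return B
    if style == "W":
        # rows without neighbours ("islands") stay all-zero
        return np.divide(B, rows, out=np.zeros_like(B), where=rows > 0)
    if style == "C":
        return B * (n / links)
    if style == "U":
        return B / links
    raise ValueError(f"unknown coding style: {style}")
```

Because row standardisation gives every cell the same total neighbour weight regardless of how many neighbours it has, edge and island effects differ between schemes, which is one reason the choice can matter.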
Generalised estimating equations require high storage capacity when the GEE score equation is solved without clustering, as in our fixed model. Application of the fixed model will therefore be limited to data sets of moderate sample size, but the method is well suited to missing data and non-lattice data. Storage requirements are considerably reduced by cluster models, such as our user-defined model. Clustering, however, requires attention to three steps in the analysis: cluster size, within-cluster correlation structure and the allocation of cells to clusters. To find the best cluster size for the analysis, we recommend investigating clusters of 2×2, 3×3 and 4×4 cells. In real data, these cluster sizes have been sufficient to remove spatial autocorrelation (Carl and Kühn 2007a). Several different correlation structures should be computed initially, e.g. to allow for anisotropy. Finally, the allocation of cells to clusters can start in different places: depending on the starting point (e.g. top right or north-west corner), cells will be placed in different clusters. Choosing different starting points will give the analyst an idea of the (in our experience limited) importance of this issue. Computing time is short.
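The allocation step, including its dependence on the starting corner, can be sketched as follows (our own illustrative helper, not part of any GEE implementation):

```python
import numpy as np

def assign_clusters(nrow, ncol, k=3, start="NW"):
    """Assign lattice cells to k-by-k clusters for a cluster GEE (sketch).

    `start` chooses the corner where block counting begins ("NW",
    "NE", "SW" or "SE"); comparing allocations from different corners
    shows how sensitive results are to this (in our experience
    limited) arbitrariness. Returns an (nrow, ncol) array of cluster ids.
    """
    rows = np.arange(nrow)
    cols = np.arange(ncol)
    if start in ("SW", "SE"):        # start counting from the south
        rows = rows[::-1]
    if start in ("NE", "SE"):        # start counting from the east
        cols = cols[::-1]
    br = rows // k                   # block-row index per cell row
    bc = cols // k                   # block-column index per cell column
    nbc = -(-ncol // k)              # number of block columns (ceiling)
    return br[:, None] * nbc + bc[None, :]
```

Unless the lattice dimensions are multiples of k, the blocks at the far edge are smaller than k×k, and which cells end up in those partial blocks depends on the starting corner; this is exactly why trying several starting points is informative.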
Autocorrelation in a predictive setting
Spatial autocorrelation may arise for a number of ecological reasons, including external environmental and historical factors limiting the mobility of organisms, intrinsic organism-specific dispersal mechanisms, and other behavioural factors causing the spatial aggregation of populations and species in the landscape. In addition to these factors, spatial autocorrelation can also be caused by observer bias and differences in sampling schemes and sampling effort. Overall, spatial autocorrelation occurs at all spatial scales from the micrometre to hundreds of kilometres (Dormann 2007b), possibly for a whole suite of reasons. Since these reasons are mostly unknown, one cannot readily derive a spatial correlation structure for an entirely new, unobserved region. Augustin et al. (1996) and others (Hoeting et al. 2000, Teterukovskiy and Edenius 2003, Reich et al. 2004) have, however, successfully used the Gibbs sampler (Casella and George 1992) to derive predictions for unobserved areas within the study region (interpolation), and He et al. (2003) extrapolated autologistic predictions through time to examine possible effects of climate change.
Interpolation, i.e. the prediction of values within the parameter and spatial range, can be achieved by several of the presented methods. An advantage of GLS is that the spatially correlated error can be predicted for sites where no observations are available, based on the values of observed sites (e.g. kriging). The same holds true for the spatial GLMM. For autocovariate regression and spatial eigenvector mapping, in contrast, interpolation is more complicated, requiring use of the aforementioned Gibbs-sampler.
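The kriging-type prediction that GLS affords can be written in a few lines: the trend at the new site plus the spatially interpolated residual. The sketch below assumes a known exponential correlation with range r (in practice both would be estimated); it is an illustration of the principle, not the procedure used in our analysis:

```python
import numpy as np

def gls_predict(X, y, coords, x_new, coord_new, r=2.0):
    """GLS prediction at an unobserved site, interpolating the error.

    X, y, coords : design matrix, responses and positions of observed
                   sites; x_new, coord_new : covariates and position of
                   the new site. The prediction adds the kriged residual
    c0' C^-1 (y - X beta) to the trend x_new' beta, which is what makes
    GLS interpolation more informative than the bare regression surface.
    """
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    C = np.exp(-d / r)                    # correlation among observed sites
    Ci = np.linalg.inv(C)
    beta = np.linalg.solve(X.T @ Ci @ X, X.T @ Ci @ y)   # GLS estimate
    resid = y - X @ beta
    c0 = np.exp(-np.linalg.norm(coords - coord_new, axis=1) / r)
    return float(x_new @ beta + c0 @ Ci @ resid)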
When models are projected into new geographic areas or time periods, the handling of spatial autocorrelation becomes more problematic (if not impossible). Extrapolation in time, for example, is necessarily uncertain, particularly if biotic interactions – and with them spatial autocorrelation patterns – could change as each species responds differentially to climate change. However, most of the statistical methods used for prediction in time neglect important processes such as migration, dispersal, competition and predation (Pearson and Dawson 2003, Dormann 2007c), or at least assume many of them to remain constant. One might therefore argue that, while taking the autocorrelation structure as constant adds one more assumption, the use of spatial parameters at least helps to derive better models. Extrapolation in space, in contrast, is not recommended: the variance-covariance matrix parameterised in GLS approaches, for example, may look very different in other regions, even for the same organism. Hence, extrapolation can only be based on the coefficient estimates, not on the spatial component of the model. Extrapolation is further complicated by model complexity. The use of non-linear predictors and interactions between environmental variables will increase model fit, but compromises transferability of models in time and space (Beerling et al. 1995, Sykes 2001, Gavin and Hu 2006). Our study therefore did not compare methods’ abilities to either make predictions for new geographic areas or extrapolate beyond the range of environmental parameters.
Our review focused on frequentist methods. Bayesian methods, which allow prior beliefs about data to be incorporated in the calculation of expected values, offer an alternative. Experience and a good understanding of the influence of prior distributions and of convergence assessment for Markov chains are crucial in Bayesian analyses. Thus, if the question of interest can be addressed using more robust, less computationally intensive methods, there is no real need to apply the “Bayesian machinery” (Brooks 2003). The spatial analyses as presented in this paper can be done straightforwardly using non-Bayesian methods. However, Bayesian methods for the analysis of species distribution data are more flexible: they can be more easily extended to include more complex structures (Latimer et al. 2006). Models can, for example, be extended to a multivariate setting when several (correlated) counts of different species in each grid cell are to be modelled, or when both count and normally distributed data are to be modelled within the same framework (Thogmartin et al. 2004, Kühn et al. 2006). Bayesian methods are also a generally more suitable tool for inference in data sets with many missing values, or when accounting for detection probabilities (Gelfand et al. 2005, Kühn et al. 2006).
In this study, we introduced a wide range of statistical tools to deal with spatial autocorrelation in species distribution data. Unfortunately, none of these tools directly represents dynamic aspects of ecological reality (e.g. dispersal, species interaction): all the methods examined remain phenomenological rather than mechanistic. Therefore they are unable to disentangle stochastic and process-introduced spatial autocorrelation. Disentangling these sources of spatial autocorrelation in the data would be particularly important for the analysis of species that are not at equilibrium with their environmental drivers (e.g. newly introduced species expanding in range, or species that have undergone population declines due to overexploitation). Moreover, it would be desirable to extend the statistical approaches used here to model multivariate response variables, such as species composition (see Kühn et al. 2006, for an example). Similarly, presence-only data, as commonly found for museum specimens, cannot be analysed with the above methods, nor are we aware of any method suitable for such data. While in principle it is possible to incorporate temporal and/or phylogenetic components into species distribution models (e.g. into GEEs, GLMMs or Bayesian models), this has not yet been attempted. It would also be desirable to have methods available that allow the strength of spatial autocorrelation to vary in space (non-stationarity), since stationarity is a basic and strong assumption of all the methods used here (except perhaps SEVM). Finally, the issue of variable selection under spatial autocorrelation has received virtually no coverage in the statistical literature, and hence the effect of spatial autocorrelation on the identification of the best-fitting model, or candidate set of most likely models, still remains unclear.