Correspondence site: http://www.respond2articles.com/MEE/

# Fine-scale environmental variation in species distribution modelling: regression dilution, latent variables and neighbourly advice

Article first published online: 25 JAN 2011

DOI: 10.1111/j.2041-210X.2010.00077.x

© 2011 The Authors. Methods in Ecology and Evolution © 2011 British Ecological Society


#### How to Cite

McInerny, G. J. and Purves, D. W. (2011), Fine-scale environmental variation in species distribution modelling: regression dilution, latent variables and neighbourly advice. Methods in Ecology and Evolution, 2: 248–257. doi: 10.1111/j.2041-210X.2010.00077.x

#### Publication History

- Issue published online: 2 JUN 2011
- Article first published online: 25 JAN 2011
- Received 14 July 2010; accepted 18 October 2010 Handling Editor: David Orme


### Keywords:

- Bayesian
- Bioclimate
- climate envelope
- niche modelling
- observation error
- potential distribution

### Summary

**1.** Developing the next generation of species distribution modelling (SDM) requires solutions to a number of widely recognised problems. Here, we address the problem of uncertainty in predictor variables arising from fine-scale environmental variation.

**2.** We explain how this uncertainty may cause scale-dependent ‘regression dilution’, elsewhere a well-understood statistical issue, and explain its consequences for SDM. We then demonstrate a simple, general correction for regression dilution based on Bayesian methods using latent variables. With this correction in place, unbiased estimates of species occupancy vs. the true environment can be retrieved from data on occupancy vs. measured environment, where measured environment is correlated with the true environment, but subject to substantial measurement error.

**3.** We then show how applying our correction to multiple co-occurring species simultaneously increases the accuracy of parameter estimates for each species, as well as estimates for the true environment at each survey plot – a phenomenon we call ‘neighbourly advice’. With a sufficient number of species, the estimates of the true environment at each plot can become extremely accurate.

**4.** Our correction for regression dilution could be integrated with models addressing other issues in SDM, e.g. biotic interactions and/or spatial dynamics. We suggest that Bayesian analysis, as employed here to address uncertainty in predictor variables, might offer a flexible toolbox for developing such next-generation species distribution models.

### Introduction

Understanding the determinants of species distributions has been a primary interest of ecology since its inception (Darwin 1859; Rapoport 1975). Strong correlations between distributions and physical factors such as climate have long been documented (Andrewartha & Birch 1958; Gaston 2003) and have received renewed recent interest in questioning how species distributions might respond to climate change (e.g. Thomas *et al.* 2004). This has led to the applied activity of species distribution modelling (SDM), which aims to build explicit quantitative models of the relationship between species distributions and the environment to predict potential future distributions.

However, the expected relationship between species distributions and climate may not always be retrieved by current SDM (Beale, Lennon, & Gimona 2008; Chapman 2010) and the ability of current SDM methods to make plausible predictions is frequently regarded as limited (see: Austin 2002; Gaston 2003; Hampe 2004; Guisan & Thuiller 2005; Araújo & Rahbek 2006; Araújo & Guisan 2006; Heikkinen *et al.* 2006; Austin 2007; Soberón 2007; Elith & Leathwick 2009). For example, current methods do not account for the differences between observed species distributions (realised niches) and potential distributions (fundamental niches) (Soberón 2007; Soberón & Nakamura 2009) that are generated by *process errors* such as population dynamics, spatial dynamics and biotic interactions (Guisan & Thuiller 2005; Elith & Leathwick 2009). In addition, there are differences between the reality we wish to model and available data. These *observation errors* may be derived from sampling biases in data collection (e.g. Cabral & Schurr 2010) and biases introduced by imperfect measurement (Huston 2002; Clark 2005; Araújo & Rahbek 2006; Austin 2007). As there are multiple sources of both observation and process error, selecting and developing a solution to any source of error should not preclude other solutions to other errors.

In this paper, we address a problem inherent to all SDM analyses, namely uncertainty in predictor variables attributed to fine-scale environmental variation. Consider a typical application of SDM where the probability of occurrence for a species is modelled as a function of the predictor variables associated with the survey sites. The predictor variables (such as climate, soil and ‘habitat’ information) are not measured continuously through time and space, but instead are taken from interpolations of weather station data or from gridded climate re-analysis products. Thus, the *true* value of the predictor variable – the value actually affecting the species biology – has a noisy relationship with the *measured* (or apparent) value of that predictor variable used in the SDM analysis. Uncertainty in predictor variables can be introduced by both instrument error and spatial scaling (Huston 2002).

Within the coarse scale of cartographic grid cells, there are multiple possible local environments: for example, cooler, northern slopes within grid cells that are warm on average (e.g. Grime 1997) or patches of deep soil in regions of mostly shallow soil. Without accounting for this fine-scale variation, the breadth of tolerance to a predictor may be poorly estimated (Palmer & Dixon 1990). This is because a species with a true requirement for cool temperatures will appear to be tolerant of warm temperatures where it occurs on cool slopes within a grid cell with a warm average temperature. The situation is of course more complex than species simply existing in cooler sites in warm grid cells (and vice versa) because other climate-independent factors also vary at fine scales (e.g. soils, habitat loss, fire frequency; see Thomas *et al.* 1999).

These and other consequences of ignoring fine-scale variation are related to a well-explored statistical issue called ‘*regression dilution*’ (or ‘*attenuation bias*’) (e.g. Frost & Thompson 2000; Bartlett, De Stavola, & Frost 2009). Without correcting for errors in a predictor variable, regression analysis assigns errors in the estimation of that predictor variable to uncertainty in the response variable given the predictor variable. This misappropriation of variation tends to squash the apparent functional responses compared to their true values (Palmer & Dixon 1990; Frost & Thompson 2000). For example, in linear regression, errors in predictor variables decrease the estimated slope and increase the estimated intercept. Within SDM, this kind of regression dilution would flatten the estimated species’ functional responses to environmental variables compared to the true functional response.
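The linear-regression case mentioned above can be checked with a small simulation. The sketch below is illustrative only and is not taken from the paper: it assumes classical additive measurement error (measured value = true value + independent noise), and every numerical value and name in it (`true_slope`, `sd_x`, `sd_u` and so on) is a hypothetical choice. Under that assumption, the fitted slope should shrink by roughly the factor `sd_x**2 / (sd_x**2 + sd_u**2)`.

```python
import random

def ols_slope_intercept(x, y):
    """Ordinary least-squares slope and intercept for paired lists x and y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

rng = random.Random(1)
n = 20000
true_slope, true_intercept = 2.0, 1.0      # hypothetical values
sd_x, sd_u, sd_noise = 10.0, 10.0, 1.0     # spread of true predictor, measurement error, residual

x_true = [rng.gauss(50.0, sd_x) for _ in range(n)]
y = [true_intercept + true_slope * xt + rng.gauss(0.0, sd_noise) for xt in x_true]
x_meas = [xt + rng.gauss(0.0, sd_u) for xt in x_true]   # classical additive error (assumed)

slope_true_x, intercept_true_x = ols_slope_intercept(x_true, y)
slope_meas_x, intercept_meas_x = ols_slope_intercept(x_meas, y)

# Expected shrinkage of the slope under the classical error model
attenuation = sd_x ** 2 / (sd_x ** 2 + sd_u ** 2)
```

With equal variances for the true predictor and its measurement error, the slope fitted against `x_meas` should be close to half the true slope, and the intercept rises to compensate.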

Crucially, regression dilution can have scale-dependent effects for SDMs. We would expect the errors in predictor variables to differ between studies carried out at different scales [e.g. a fine-scale model with locally measured environmental variables vs. coarse-grained models with gridded climate data (Trivedi *et al.* 2008)]. This implies that the strength of regression dilution will depend on the scale of an SDM analysis. For this reason alone, models estimated at one scale cannot be expected to apply at other scales (e.g. Frost & Thompson 2000).

In the following paper, we first illustrate the problem of regression dilution by carrying out ‘virtual’ SDM on artificially generated data. This enables us to compare the estimated functional responses to known true responses. We then illustrate a practical and general solution to this problem – inspired by the study of growth-light relationships in saplings carried out by Lichstein *et al.* (2010) – relying on a Bayesian approach to SDM using latent variables (also see Clark 2005). Other methods are available for correcting regression dilution in linear regression and when estimating nonlinear response functions (e.g. Phillips & Davey Smith 1991; Frost & Thompson 2000; Bartlett, De Stavola, & Frost 2009). However, many of these rely on estimating a correction factor from repeated measures or supplementary data and are frequently restricted to simple errors in the predictor variable (Frost & Thompson 2000).

We explore the latent variable approach because it is simple to describe and implement (despite being viewed as an advanced topic); might be integrated with other solutions to process and observation errors within a Bayesian approach (see Discussion); is an adaptable method applicable to simple and complex forms of both response functions and error structures in predictor variables; and explicitly accounts for parameter uncertainty that can be subsequently incorporated readily into model predictions. As we demonstrate, the method can be further enhanced for SDM analyses by fitting multiple co-occurring species simultaneously rather than one by one (see Discussion). Using this ‘neighbourly advice’, the SDM approaches the ideal situation where predictor variables are measured perfectly and continuously through space. In the Discussion, we explore how switching to a Bayesian approach could offer solutions to problems of both process and observation error in SDM, which could be integrated into a unified next-generation set of SDM methods.

### Methods and results

#### Virtual SDM analysis

From the outset, it is important to emphasise that both the regression dilution problem and our solution are not exclusive to the simple assumptions we use to illustrate the problem and to demonstrate a solution. Non-symmetric, bimodal or arbitrarily complex functional responses to predictor variables, or indeed multiple variables, will all face individual versions of this problem and can all use a form of the following solution.

For illustration, we assume a simple Gaussian relationship between the environment and a species *j*. *G*_{jq}, the probability of occurrence for species *j* at location *q*, is a function of the *true* environment *E*_{q} at location *q*. For simplicity, we refer to *E* as temperature hereafter and use an arbitrary temperature scale of 1–100. *G*_{jq} depends on (i) the optimum temperature of the species *φ*_{j}, (ii) the maximum probability of occurrence at the optimum *K*_{j} and (iii) the rate at which probability decays away from the optimum *σ*_{j}.

$$G_{jq} = K_j \exp\left[-\frac{(E_q - \varphi_j)^2}{2\sigma_j^2}\right] \qquad \text{(eqn 1)}$$

Note that variations on this simple model can be used to investigate multiple environmental variables, combinations of different functional forms for a species’ responses, and alternative measures of abundance including count data and continuous measures.
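As a minimal sketch, the Gaussian response described above might be coded as follows. The function name and the exact scaling of the exponent (a conventional Gaussian with `2 * sigma**2` in the denominator) are assumptions made for illustration, not details confirmed by the text.

```python
import math

def occurrence_probability(E, K, phi, sigma):
    """Probability of occurrence at true temperature E.

    K     : maximum probability, reached at the optimum
    phi   : optimum temperature
    sigma : width of the response (larger means slower decay)
    The 2*sigma**2 scaling is an assumed, conventional Gaussian form.
    """
    return K * math.exp(-((E - phi) ** 2) / (2.0 * sigma ** 2))
```

For example, `occurrence_probability(50.0, 0.8, 50.0, 10.0)` returns the maximum `0.8`, and the value falls symmetrically on either side of the optimum.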

We generate data for a single species by sampling from an underlying true species distribution (eqn 1 with parameters *θ*_{j}, where *θ*_{j} denotes the set of model parameter values for species *j*). We assume a set of survey sites *q* where each site has a *true* temperature *E*_{q} derived from a normal distribution with a mean equal to the *measured* temperature *Ē*_{q} and a standard deviation at each site equal to *σ*_{E}. For clarity, we make *σ*_{E} shared across all sites, but the methods we present easily extend to cases where *σ*_{E} varies across sites even in complicated ways. We generated a set of grid cells with *Ē*_{q} values spread evenly at 0·25 increments between 1 and 100. A random true temperature (*E*_{q}) given *Ē*_{q} is drawn for a single site within each grid cell. The probability of occupancy at site *q* was determined by eqn 1 given the true environmental condition (*E*_{q}) (in contrast, an application of SDM will need to work with *Ē*_{q} only), and then a random occurrence at each site, *Z*_{jq}, was made given *G*_{jq} (*Z*_{jq} = 0 implying absence, *Z*_{jq} = 1 implying presence).
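The sampling procedure described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code (which was written in C): the function names, the seed and the Gaussian form of the response are hypothetical choices, and no parameter values from the study are implied.

```python
import math
import random

def occurrence_probability(E, K, phi, sigma):
    """Assumed Gaussian response: K at the optimum phi, width sigma."""
    return K * math.exp(-((E - phi) ** 2) / (2.0 * sigma ** 2))

def simulate_survey(K, phi, sigma, sigma_E, seed=0):
    """Generate measured temperatures, true temperatures and presences.

    One site is placed in each grid cell. Measured temperatures run from
    1 to 100 in steps of 0.25; each true temperature is a normal draw
    centred on the measured value with standard deviation sigma_E.
    """
    rng = random.Random(seed)
    measured = [1.0 + 0.25 * i for i in range(397)]        # 1, 1.25, ..., 100
    true = [rng.gauss(m, sigma_E) for m in measured]
    presence = [
        1 if rng.random() < occurrence_probability(e, K, phi, sigma) else 0
        for e in true
    ]
    return measured, true, presence
```

A call such as `simulate_survey(0.8, 50.0, 10.0, 10.0)` (hypothetical values) returns three lists of equal length; only the first and third would be available to a real SDM analysis.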

A simple visual inspection of this data set makes regression dilution clear (Fig. 1). The response of the species occupancy vs. measured temperature is flattened in comparison with that against true temperature. When viewing occupancy vs. the measured temperature only, unoccupied sites occur more frequently at temperatures away from the species’ true temperature optimum and less frequently at temperatures towards the species’ true optimum. In effect, errors in the estimates of *E*_{q} are incorrectly assigned to the species’ environmental response (Fig. 1b). Of course, having data on the true temperature would mean that any difference between the actual environmental response and our modelled response would be because of the sampling error only. For instance, in Fig. 1c, the observed occupancies follow the true pattern of the probability of occupancy, subject to some sampling error. If *σ*_{E} were to increase, each apparent species response would be further diluted away from the true response, as sites that have similar values of *Ē*_{q} can have more widely varying values of *E*_{q}.

Next, we place this illustration on a formal footing by employing a virtual SDM analysis using the presence–absence data (*Z*_{jq}). Our analysis is based on Bayesian methodology allowing for flexibility of the solution for regression dilution and potential integration with solutions to other errors. To carry out the Bayesian analysis, we need to define the log likelihood of the parameters given the data with respect to the model described by eqn 1.

#### Uncorrected individual species

We begin by making the traditional, but incorrect, assumption that the measured temperature is equal to the true temperature (*Ē*_{q} = *E*_{q}) at each location. Within the Bayesian model, the probability of occurrence is calculated for the parameter estimates (*θ̂*_{j}) using eqn 1. Assuming independence of the observations, the likelihood is then,

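$$\ln L\left(Z \mid \hat{\theta}, E_q = \bar{E}_q\right) = \sum_{j}\sum_{q} \ln P_{jq} \qquad \text{(eqn 2a)}$$

$$P_{jq} = \begin{cases} G_{jq} & \text{if } Z_{jq} = 1 \\ 1 - G_{jq} & \text{if } Z_{jq} = 0 \end{cases} \qquad \text{(eqn 2b)}$$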

Equation 2a is the sum, over all species *j* and plots *q*, of the logarithm of the probability of observed occurrence (present *Z*_{jq} = 1, or absent *Z*_{jq} = 0) given the temperature at *q*. In this case, the true temperature is assumed to be the measured temperature (*E*_{q} = *Ē*_{q}). Equation 2b shows that *P*_{jq} = *G*_{jq} when a presence is recorded at *q* (*Z*_{jq} = 1) and *P*_{jq} = 1 − *G*_{jq} when the species is absent (*Z*_{jq} = 0).
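For a single species, the uncorrected log likelihood described above can be sketched as below. The function names, the Gaussian response and the small floor used to avoid taking the logarithm of zero are assumptions added for illustration; the text does not say how the original C code handled that numerical edge case.

```python
import math

def occurrence_probability(E, K, phi, sigma):
    """Assumed Gaussian response: K at the optimum phi, width sigma."""
    return K * math.exp(-((E - phi) ** 2) / (2.0 * sigma ** 2))

def log_likelihood_uncorrected(params, measured, presence):
    """Sum over sites of ln P, treating measured temperature as the truth.

    params   : (K, phi, sigma) for one species
    measured : measured temperature at each site
    presence : 1 (present) or 0 (absent) at each site
    """
    K, phi, sigma = params
    total = 0.0
    for e_meas, z in zip(measured, presence):
        g = occurrence_probability(e_meas, K, phi, sigma)
        p = g if z == 1 else 1.0 - g          # probability of the observed outcome
        total += math.log(max(p, 1e-300))     # floor is an assumed numerical guard
    return total
```

The value is only meaningful relative to other parameter sets evaluated on the same data, which is how a sampler would use it.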

Having defined the likelihood, we can use standard methods to estimate the posterior distribution for each parameter given the data. For a Bayesian analysis, we also need to specify priors on parameters. In this case, we used extremely weakly informative, uniform priors with ranges *K*_{j}, 0·000001–1; *φ*_{j}, −100 to 200; *σ*_{j}, 1–200. From the posterior, we can then extract, for each parameter, our best estimate of the true value (the posterior mean) as well as credibility intervals. Moreover, it is a simple matter to generate an ensemble of model predictions that incorporates the uncertainty in the parameters (e.g. Fig. 2a). To carry out the analysis, we used a well-tested Metropolis–Hastings (M–H) Markov chain Monte Carlo (MCMC) numerical algorithm (e.g. Chib & Greenberg 1995; Robert & Casella 1999; see Supporting Information) coded in *C*. There are a variety of ways to achieve this kind of analysis, and it could be coded in other widely available programs such as R or WinBUGS. When carried out in *C* on a standard PC, the computational demand of this study posed no issues: a very generous period of 200 000 iterations was given to the initial ‘burn in’ to achieve convergence to the posterior distribution, after which we recorded every 100th parameter set from a subsequent 200 000 iterations.
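To make the sampling step concrete, the following is a generic random-walk Metropolis sampler, one simple member of the Metropolis–Hastings family. It is a sketch only and does not reproduce the authors' C implementation: the function names, proposal widths, toy target and the much shorter run lengths are all assumptions chosen for brevity. Uniform priors are represented by returning minus infinity outside the allowed range.

```python
import math
import random

def metropolis(log_post, start, step_sizes, n_burn, n_keep, thin, seed=0):
    """Random-walk Metropolis sampler with Gaussian proposals.

    log_post   : function mapping a parameter list to a log posterior
    start      : starting parameter list (must have a finite log posterior)
    step_sizes : proposal standard deviation for each parameter
    Returns every `thin`-th parameter set after `n_burn` burn-in iterations.
    """
    rng = random.Random(seed)
    current = list(start)
    current_lp = log_post(current)
    samples = []
    for it in range(n_burn + n_keep):
        proposal = [c + rng.gauss(0.0, s) for c, s in zip(current, step_sizes)]
        proposal_lp = log_post(proposal)
        # Symmetric proposal, so accept with probability min(1, exp(delta))
        delta = proposal_lp - current_lp
        if delta >= 0 or rng.random() < math.exp(delta):
            current, current_lp = proposal, proposal_lp
        if it >= n_burn and (it - n_burn) % thin == 0:
            samples.append(list(current))
    return samples

def toy_log_post(theta):
    """Toy target: uniform prior on [-100, 200] times a unit normal centred on 3."""
    x = theta[0]
    if not (-100.0 <= x <= 200.0):
        return -math.inf
    return -0.5 * (x - 3.0) ** 2

draws = metropolis(toy_log_post, [0.0], [1.0], n_burn=2000, n_keep=20000, thin=10)
posterior_mean = sum(d[0] for d in draws) / len(draws)
```

In an SDM application, `log_post` would be the log likelihood plus the log prior for the species parameters, and the run lengths would be set far more generously, as in the text.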

The application of this SDM to the pseudo-data reveals the regression dilution effect (Fig. 2a). As expected, the estimated environmental response function has a lower maximum (*K̂*_{j}) and a wider tolerance (*σ̂*_{j}) than the true function. The SDM analysis has generated a function that fits the data well in the sense that the relationship between occurrence and *measured* temperature is recovered. However, the parameter estimates are incorrect, and they do not recover the relationship between occurrence and *true* temperature. As such, the estimated model is a descriptive model that would not be expected to produce reliable predictions for altered temperatures, or when applied at different scales.

#### Corrected, individual species

Secondly, we use a latent variable approach to correct for the problem of fine-scale environmental variation. We assume knowledge of the degree of uncertainty (*σ*_{E}) in the predictor variables (*E*_{q}), i.e. we know the *probability distribution* for true temperature given the measured temperature at each site (see Discussion). To incorporate the uncertainty in the predictor variables into the Bayesian analysis, we need to integrate over both the uncertainty in the predictor variables and the probability of occurrence (*G*_{jq}) given the predictor variables (*E*_{q}). In principle, we could do this mathematically by deriving a closed-form solution for the integral, or by integrating numerically for each datum every time the likelihood is evaluated. However, both approaches require careful implementation, are slow and tend to be specific to particular problems. Latent variables offer a general, computationally easy and efficient way to reach the same end. In practice, latent variables are simply extra parameters that are treated equally with all other parameters by the sampling algorithm. In the same way that sampling algorithms such as Metropolis–Hastings MCMC explore different values of the model parameters, the algorithms explore values of the latent variables. In this way, posterior distributions of the true model parameters come to reflect the posterior distributions of the latent variables.

We can use latent variables to address uncertainty in predictor variables by defining for every site *q* a latent parameter called *Ê*_{q}, representing an estimate of the true value of the environment at site *q*. We then define the log likelihood in two parts:

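$$\ln L\left(Z \mid \hat{\theta}, E_q = \hat{E}_q\right) = \sum_{q} \ln P_{jq} + \sum_{q} \ln P\left(E_q = \hat{E}_q \mid \bar{E}_q, \sigma_E\right) \qquad \text{(eqn 3a)}$$

$$P_{jq} = \begin{cases} G_{jq}(\hat{E}_q) & \text{if } Z_{jq} = 1 \\ 1 - G_{jq}(\hat{E}_q) & \text{if } Z_{jq} = 0 \end{cases} \qquad \text{(eqn 3b)}$$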

The left-hand term of eqn 3a is as in eqn 2a, with one exception: it depends on, for each site *q*, the probability of occurrence for species *j* at site *q*, but given that the true temperature is *Ê*_{q}. The right-hand term is new and represents the probability distribution of the *Ê*_{q} values themselves. In this example, we know the true temperature is from a normal distribution with the mean and standard deviation specified by measurement (*P*(*E*_{q} = *Ê*_{q}|*Ē*_{q}, *σ*_{E})) (again note that this is a simplified example for illustration purposes and complex forms of environmental variation can be accounted for). Equation 3a accounts for the altered assumption (see eqn 2a) that the latent variables represent the true temperatures (*E*_{q} = *Ê*_{q}).

In this way, knowledge of error in the predictor variables allows us to state the likelihood function and so solve for the parameters given the occurrence data. In practice, the MCMC algorithm will explore a variety of values for each latent variable with the distribution for each reflecting a compromise between the prior (*Ē*_{q}, *σ*_{E}) on the one hand, and the occurrence data on the other (*Z*_{jq}). As latent variables are technically the same as any other parameter, we can examine the posterior means and credibility intervals on each value of *Ê*_{q} – i.e. on the estimated true temperature of each site.
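The two-part log likelihood described above might look like the following for one species. This is a sketch under assumptions, not the authors' code: the function names, the Gaussian response, the normal density for the latent values and the floor inside the logarithm are illustrative choices. The latent temperatures are passed in alongside the species parameters, since the sampler treats both as unknowns.

```python
import math

def occurrence_probability(E, K, phi, sigma):
    """Assumed Gaussian response: K at the optimum phi, width sigma."""
    return K * math.exp(-((E - phi) ** 2) / (2.0 * sigma ** 2))

def normal_log_pdf(x, mean, sd):
    """Log density of a normal distribution."""
    return -0.5 * math.log(2.0 * math.pi * sd ** 2) - (x - mean) ** 2 / (2.0 * sd ** 2)

def log_likelihood_latent(params, latent, measured, presence, sigma_E):
    """Occurrence term evaluated at the latent temperatures, plus the
    log probability of each latent temperature given its measured value."""
    K, phi, sigma = params
    total = 0.0
    for e_hat, e_meas, z in zip(latent, measured, presence):
        g = occurrence_probability(e_hat, K, phi, sigma)
        p = g if z == 1 else 1.0 - g
        total += math.log(max(p, 1e-300))                 # occurrence term (floor is an assumed guard)
        total += normal_log_pdf(e_hat, e_meas, sigma_E)   # latent-variable term
    return total
```

Moving a latent value away from its measured value is penalised by the second term, so it only pays off when the occurrence data favour the move. That trade-off is the compromise between prior and data described in the text.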

As shown in Fig. 2b, this approach recovers a close approximation to the true functional relationship between occurrence and true temperature. The response function is no longer ‘squashed’, and all of the true functional response is within the 95% credibility intervals of the model output. These intervals are quite large simply because of the low number of samples we used; in this case, 43 presences (*Z*_{jq} = 1). Parameter estimates for the latent variable approach are much more accurate than the standard regression approach and the 95% credibility intervals on parameter estimates contain the true values (Fig. 2c–e). As a result of the small number of presences in our pseudo-data, the latent variables themselves are poorly constrained, in the sense that the credible intervals on each *Ê*_{q} are large. This should be expected in this demonstration because there are as many latent variables as there are data. Nonetheless, the MCMC algorithm explores a range of values for each *Ê*_{q}, preventing regression dilution.

#### Corrected, multiple species

Now, we demonstrate a novel method to better constrain these latent variable estimates by including multiple co-occurring species within a single parameterisation. That is, the parameter set will include parameters for each species for eqn 1 (for *n* species, *θ̂*_{1} … *θ̂*_{n}), but only a single set of latent variables, one for each site *q*. This makes only a small change to the operational algorithm and the likelihood function.

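$$\ln L\left(Z \mid \hat{\theta}, E_q = \hat{E}_q\right) = \sum_{j}\sum_{q} \ln P_{jq} + \sum_{q} \ln P\left(E_q = \hat{E}_q \mid \bar{E}_q, \sigma_E\right) \qquad \text{(eqn 4)}$$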

Note there is no change in the prior for each parameter as we assume that species have been surveyed from the same network of plots. We have not changed the assumption that *E*_{q} = *Ê*_{q} and so still use eqn 3b.
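The multispecies version can be sketched by letting every species share one latent temperature per site, so that the latent-variable term is counted once per site while the occurrence term is summed over species. As with the other sketches, the function names, the Gaussian response and the data layout (`presence[j][q]` for species `j` at site `q`) are assumptions made for illustration.

```python
import math

def occurrence_probability(E, K, phi, sigma):
    """Assumed Gaussian response: K at the optimum phi, width sigma."""
    return K * math.exp(-((E - phi) ** 2) / (2.0 * sigma ** 2))

def normal_log_pdf(x, mean, sd):
    """Log density of a normal distribution."""
    return -0.5 * math.log(2.0 * math.pi * sd ** 2) - (x - mean) ** 2 / (2.0 * sd ** 2)

def log_likelihood_multispecies(species_params, latent, measured, presence, sigma_E):
    """species_params : list of (K, phi, sigma), one tuple per species
    latent, measured  : one value per site, shared by all species
    presence          : presence[j][q] is 1 or 0 for species j at site q
    """
    total = 0.0
    for q, (e_hat, e_meas) in enumerate(zip(latent, measured)):
        total += normal_log_pdf(e_hat, e_meas, sigma_E)   # once per site
        for j, (K, phi, sigma) in enumerate(species_params):
            g = occurrence_probability(e_hat, K, phi, sigma)
            p = g if presence[j][q] == 1 else 1.0 - g
            total += math.log(max(p, 1e-300))             # once per species and site
    return total
```

Because each latent temperature now appears in the occurrence term of every species, a site that is uninformative for one species can still be constrained by another.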

To illustrate this method, we generated data for nine further species to accompany the single species investigated from the previous method (see Figs 1 and 2) making a set of 10 species. Each of the further nine species was created with parameters randomly sampled from uniform distributions with ranges *K*_{n}, 0·4–0·8; *φ*_{n}, 15–85; *σ*_{n}, 5–20. Note that we assume no direct interactions between species (i.e. interspecific interactions) and also assume that species experience the same true temperature at each site. By using the same data set for the first species across the four methods, we can directly compare the predicted parameter estimates (*θ̂*) against the actual parameters (*θ*) and assess any trend to overestimate or underestimate responses using each methodology. This comparison is shown in Fig. 3: (i) SDM uncorrected for fine-scale variation (1st column; eqs 2a-b); (ii) corrected SDM, species responses are modelled individually (2nd column; eqs 3a-b); (iii) corrected SDM, species responses are estimated together (3rd column; eqs 4 and 3b); and (iv) SDM where we use the true temperature (*E*_{q}) at each site (4th column; eqs 2a-b).

As expected the uncorrected method overestimates tolerances (Fig. 3a) and underestimates maximum responses (Fig. 3b) for all species, but can generate a reasonable estimate of species’ optima (Fig. 3c). In contrast, the individual latent approach reduces these biases (Fig. 2, 2nd column, Fig. 3) and the uncertainty is greatly reduced (compare *y*-axis scales in Fig. 3a,c). Most importantly, although some species parameters have intervals that do not contain the true values, there is no general tendency to over- or underestimate responses. Thus, on average, the corrected species-by-species approach is unbiased even though it can be biased for particular species.

The multispecies latent method produced parameter estimates that far outperform either of these methodologies (3rd column of Fig. 3). All credibility intervals include the actual parameter values (*θ*_{j}) and the parameter estimates are more tightly constrained. Of course, the ideal situation is to know the true temperature, *E*_{q} where no regression dilution could occur (4th column Fig. 3). The multispecies latent method approaches this ideal scenario correcting, almost perfectly, for the uncertainty in the true temperature at each site.

Moreover, the latent variables themselves now become better constrained, homing in on the true temperature at each site using only data on the measured temperature. This is crudely shown by the correlation coefficients between *E*_{q} and *Ê*_{q}: for the uncorrected method 0·925; corrected, with the individual latent method 0·927–0·939; and corrected, multispecies method 0·972. This reduction in bias is shown in Fig. 4. For the corrected individual latent method, the reduction in bias is only noticeable for occupied sites (Fig. 4b). Unoccupied sites remain unchanged, because the posterior tends to home in on the prior where the value is far away from the optimum for that species, such that the predicted probability of occupancy is very close to zero for any realistic value of *Ê*_{q}. Thus, away from the species’ optimum, the likelihood is very insensitive to the choice of *Ê*_{q} and the posterior for *Ê*_{q} becomes dominated by the prior. In contrast, the likelihood tends to be sensitive to all values of *Ê*_{q} when using the multispecies method as all plots are close to the optimum for at least one species. This helps to explain how the multispecies method allows data from each species to inform both the estimated functional responses of all other species (Fig. 3 column 3) and the true temperature at each site (Fig. 4c).

### Discussion

We have demonstrated a general solution to include fine-scale variation in SDM. The solution (i) explicitly accounts for uncertainty in environmental measurements; (ii) is scale independent; (iii) is robust to heterogeneity in environmental variation attributed to both environmentally dependent and independent processes; and (iv) can potentially be integrated into a framework including solutions to other issues of SDMs. In addition, we provide a novel suggestion for model parameterisation where ‘*neighbourly advice*’ from co-occurring species strengthens constraints on all parameters through indirect feedback between latent variables and species’ parameters.

#### Regression dilution in SDM and the consequences

Problems associated with errors in predictor variables in SDM have received some attention in the literature previously (e.g. Ashcroft, Chisholm, & French 2009) but, to our knowledge, have not been explicitly linked to the widely understood problems of regression dilution using latent variables until now. SDM is essentially regression, albeit potentially involving rather unusual, biologically inspired functional forms of species’ environmental responses and multiple predictors.

The impact of regression dilution is likely to increase substantially when study systems move outside of the conditions where the initial measurements were made (e.g. Thuiller *et al.* 2004), i.e. when SDMs are required to be predictive. In a changing environment, an incorrectly parameterised model could overestimate or underestimate the range limits in relation to each environmental variable. For purely descriptive models of current distributions, this may be no problem as fine-scale variation is ‘wrapped up’ with species’ functional responses to reproduce current distributions at the scale that models were applied. But, under an altered environment or altered scale, it is important to know the true functional response, especially if the rate of environmental change varies across space (Loarie *et al.* 2009). As we have shown here, it is necessary to correct for regression dilution to extract these true responses.

Regression dilution will then tend to produce modelled species distributions that are more fragmented than real distributions at fine scales (also see Montoya *et al.* 2009). For example, with a single environmental variable, erroneously squashed functional responses (Palmer & Dixon 1990; Frost & Thompson 2000) will tend to produce over-dispersed predictions of species distributions (e.g. Fig. 2). In the presence of local dispersal and local ecological interactions, population and community dynamics can be qualitatively affected by such differences in spatial structure (e.g. Tilman & Kareiva 1997; Dieckmann, Law, & Metz 2000). Therefore, mechanistic SDM methods that assume metapopulation like processes could make various erroneous predictions (e.g. the degree of range deformation, time lags in responses, spatial turnover, extinction thresholds) if they do not correct for regression dilution. And the parameters of those more complex, process models would be affected in unexplored ways.

Regression dilution is one candidate for explaining scale effects in SDM (Palmer & Dixon 1990; Trivedi *et al.* 2008). For instance, Trivedi *et al.* (2008) found 6–7 more species (out of the 10 studied) would go extinct (lose all suitable modelled habitat) in predictions of a fine-grained model compared to those of a coarse-grained model. Superficially, this appears like regression dilution as ‘squashed’ response functions would be expected at the larger scale, producing broader environmental tolerance and so fewer predicted extinctions. However, contrasting results were subsequently found in a similar study by Randin *et al.* (2009), where fewer species faced extinction in the fine-scale model. In fact, 100% of the species losing all suitable habitats under their coarse-grained model actually persisted in their fine-scale model (Randin *et al.* 2009). These two studies find differential effects of large magnitude when comparing outputs from models based on fine-grained (site) and coarse-grained (grid cell) data. Whilst one bears the hallmarks of regression dilution, with the addition of the second study, the explanation does not appear so straightforward. Perhaps, ominously for SDM, which is essentially multiple regression, some discussion in the statistical literature suggests these differential effects might be explained because regression dilution might act to either squash *or sharpen* the estimated responses of different variables in multivariate analyses (Bohrnstedt & Carter 1971; Phillips & Davey Smith 1991).
The effect of regression dilution in individual studies will be determined by the relationships between the random errors of different predictor variables and the methodology used for correction (Bohrnstedt & Carter 1971; Phillips & Davey Smith 1991; Frost & Thompson 2000), the extent of measures and inferred coverage of each variable (Thuiller *et al.* 2004; Barbet-Massin, Thuiller, & Jiguet in press) and the functional forms of ‘true’ and assumed environmental responses used.

Species distribution modelling analyses inevitably cover large extents for which weather stations may sample the environment unevenly. As is frequently the case for the highest altitudes, large extents of environmental variation may be entirely inferred because of the lack of weather station data (e.g. Barbet-Massin, Thuiller, & Jiguet in press). We note that the latent variable approach is flexible enough to cope with even this complex situation. However, these issues of data are in addition to the other sources of observation and process error, which are still wrapped up in these two analyses (Trivedi *et al.* 2008; Randin *et al.* 2009) and those errors, such as biotic interactions, may also have scale-dependent effects (Heikkinen *et al.* 2007). This being the general case in SDM, care is needed when drawing general conclusions about scale effects for individual sources of error when many sources are acting in concert. These considerations point to the need for an integrated toolbox to disentangle the diverse sources of error within current SDM analyses (see below).

#### Correcting using latent variables

We consider our method for correcting regression dilution in SDM to be straightforward. Providing we place SDM within a Bayesian framework (see below), inclusion of latent variables representing the ‘true’ as opposed to ‘measured’ environment is simple – and the values of these latent variables can be sampled using standard computational methods. Furthermore, the latent variable approach is flexible, being applicable to a variety of functional forms for species’ environmental responses. For example, in eqn 3a, the functional response is separate from the description of fine-scale environmental variation. Thus, the functional response can be altered within the same overall approach. Importantly, the functional response could be a process-based model that, for example, needs simulation to equilibrium to predict the probability of species’ occurrence (e.g. Purves *et al.* 2007). In this way, our method for correcting for fine-scale environmental variation could be used in conjunction with various ‘next-generation’ process-based SDM methods. Moreover, there is nothing in the method that restricts latent variables to a single variable per site, and so functional responses could be extended to several ‘true’ environmental variables. In that case, eqn 3a would be extended to include latent variable terms for each extra predictor variable. Latent variable approaches and related methodologies have been employed elsewhere, e.g. where light varies below forest canopies (Lichstein *et al.* 2010), and the methods have broader applications in ecology (e.g. Clark *et al.* 2003; Soubeyrand, Neuvonen, & Penttinen 2008).

To implement our method as presented here, it is necessary to have *a priori* estimates of the degree of error in the predictor variables. For our example, we specified *σ*_{E} beforehand. For a real application of this method, where would we get this information from? First, it is important to recognise that uncertainty in predictor variables need not be known perfectly. Current SDM analyses are one extreme of a spectrum where no uncertainty in predictors is assumed (i.e. *σ*_{E} = 0). At the other extreme would be analyses assuming complete uncertainty in predictors. Placing real SDM analyses at a more realistic point along this spectrum may lead to better predictions in general. Second, it should be possible, in principle, to estimate the degree of error in predictor variables given available supplementary information. For example, comparing weather station data (taken at one point in space) with climate variables reported in gridded re-analysis products would provide a simple direct measure of the degree of within-cell variation. Also, relatively fine-scale elevation data are now available for many regions. As elevation is known to be a major control of climate and soils at local scales, it seems sensible to assume that regions with greater variation in elevation have greater variation in climate and soils (e.g. Daly, Neilson, & Phillips 1994). With knowledge of lapse rates, it might even be possible to set values of *σ*_{E} entirely from elevation data. Third, we note that in principle, with a more complex Bayesian analysis, it might be possible to make inferences about the magnitude of fine-scale variation in some environmental variables purely from the species occurrence or abundance data.
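The two supplementary routes to *σ*_{E} mentioned above can be sketched in a few lines of Python. This is a rough illustration only: the function names, the example values and the rate of temperature change with elevation (a lapse rate of about −0.0065 °C per metre, a commonly quoted approximate figure) are assumptions, not values taken from our analysis.

```python
import numpy as np

def sigma_e_from_stations(station_values, gridded_values):
    """Sample standard deviation of station-minus-grid differences.

    Each station value is paired with the gridded value of the cell
    that contains the station.
    """
    diffs = np.asarray(station_values, float) - np.asarray(gridded_values, float)
    return float(np.std(diffs, ddof=1))

def sigma_e_from_elevation(elevations_m, lapse_rate_per_m):
    """Within-cell temperature spread implied by within-cell elevation spread.

    Assumes temperature varies linearly with elevation at the given rate.
    """
    elev_sd = np.std(np.asarray(elevations_m, float), ddof=1)
    return float(abs(lapse_rate_per_m) * elev_sd)

# Hypothetical paired temperatures (degrees C): stations vs. gridded product
stations = [10.2, 8.7, 12.1, 9.5]
gridded = [9.8, 9.9, 11.0, 10.1]
print(f"sigma_E from stations:  {sigma_e_from_stations(stations, gridded):.2f}")

# Hypothetical fine-scale elevations (m) sampled within one grid cell
elevations = [420.0, 510.0, 655.0, 730.0, 880.0]
print(f"sigma_E from elevation: {sigma_e_from_elevation(elevations, -0.0065):.2f}")
```

Either estimate would only be a starting point, since station differences also include instrument error and a linear elevation effect ignores other local controls on climate.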

#### Neighbourly advice

We found that including multiple species in an SDM analysis with a single set of latent variables dramatically improves the constraints on both latent variables and species parameters. The effect on the latent variable estimates is explained at a simple level by the increased coverage that multiple species might have through space. For analyses using latent variables for individual species (Fig. 3, eqn 4), the latent variable method always increased our knowledge about the true temperature (Fig. 4), but only slightly. This inaccuracy occurs because, for each site, we have only a single absence or presence (*Z*_{q}) to supplement the information we already have on true temperature. Where species parameters predict an absence and an absence is also observed in the data, latent variables will tend to the most likely environmental value, in this case the mean. In contrast, for multiple species in one analysis, we have several absences and presences for each site (*Z*_{jq}), which increases the coverage and improves the estimates of the true temperature, i.e. the posterior distribution of each latent variable value. These improved estimates of true temperature then naturally lead to improved estimates of all species responses given the true temperature, which, in feedback, increase the accuracy of the estimates of the true temperature. Each species in effect informs each other species about the true environment when multiple species have been surveyed. This is why we refer to this method as ‘neighbourly advice’. In the limit of an infinite number of species, neighbourly advice would mean that the true temperature would become known perfectly for every site, at which point the species’ estimated responses would become exactly those estimated if the true temperature were known in the first place. Surprisingly, Fig. 4 shows that even a set of 10 species takes our analysis most of the way towards this ideal scenario.
Finally, it is important to recognise that this issue is separate from the major unsolved problem of how to include direct species interactions in SDM (Elith & Leathwick 2009), but could be integrated with a solution to species interactions.
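The way in which several species jointly constrain a single latent variable can be illustrated with a small grid calculation in Python. This is a toy sketch, not the analysis behind Fig. 4: species parameters are treated as known, the bell-shaped responses, the normal error model and all numbers are assumed, and the presence/absence records are generated by a simple deterministic rule rather than taken from survey data.

```python
import numpy as np

def site_posterior(grid, measured, sigma_e, presences, optima, width=3.0, max_prob=0.9):
    """Posterior weights for one site's true temperature, shared by all species."""
    log_post = -0.5 * ((grid - measured) / sigma_e) ** 2  # assumed normal prior
    for present, optimum in zip(presences, optima):
        # Hypothetical bell-shaped occurrence probability for this species
        p_occ = max_prob * np.exp(-0.5 * ((grid - optimum) / width) ** 2)
        log_post += np.log(p_occ if present else 1.0 - p_occ)
    weights = np.exp(log_post - log_post.max())
    return weights / weights.sum()

def mean_and_sd(grid, weights):
    """Posterior mean and standard deviation on the grid."""
    mean = float((grid * weights).sum())
    sd = float(np.sqrt(((grid - mean) ** 2 * weights).sum()))
    return mean, sd

grid = np.linspace(0.0, 30.0, 3001)
true_temp, measured, sigma_e = 14.0, 12.0, 2.0
optima = np.linspace(5.0, 25.0, 10)  # ten hypothetical species

# Simplified stand-in for survey data: a species is recorded as present where
# its assumed occurrence probability at the true temperature exceeds 0.5.
presences = 0.9 * np.exp(-0.5 * ((true_temp - optima) / 3.0) ** 2) > 0.5

# One species (the fifth) versus all ten species at the same site
mean_one, sd_one = mean_and_sd(
    grid, site_posterior(grid, measured, sigma_e, presences[4:5], optima[4:5]))
mean_all, sd_all = mean_and_sd(
    grid, site_posterior(grid, measured, sigma_e, presences, optima))

print(f"one species: mean {mean_one:.2f}, sd {sd_one:.2f}")
print(f"ten species: mean {mean_all:.2f}, sd {sd_all:.2f}")
```

With ten species the posterior for the site is narrower and centred closer to the true temperature than with one species. In a full analysis the species parameters would be estimated at the same time, which is where the feedback described above arises.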

#### An integrated approach to solving SDM challenges

Despite making progress in understanding fine-scale variation and accounting for the resultant biases, the solution we introduced here addresses only one of several major problems in SDM analysis (e.g. Austin 2002, 2007; Gaston 2003; Guisan & Thuiller 2005; Araújo & Guisan 2006; Araújo & Rahbek 2006; Heikkinen *et al.* 2006; Soberón 2007; Elith & Leathwick 2009). As next-generation SDM analyses will need to include corrections for all of these problems of process and observation error, it is important that a solution to any one problem can be combined with solutions to the others. Placing SDM within a formal Bayesian context could, in our opinion, help to produce these kinds of integrated solutions for SDM. In this case, Bayesian methods offer a simple solution to regression dilution, and similar methods have already been used within SDM, e.g. to fit process-based metapopulation models to current distributions (Purves *et al.* 2007). Within SDM and elsewhere, the Bayesian framework offers a transparent, flexible and general approach that allows analyses to be tailored to the questions and data at hand. There are a host of standard methods for efficient parameter estimation (e.g. MCMC sampling), for incorporating previous biological knowledge (e.g. priors and/or combining several data sets), for comparing alternative models, for propagating parameter uncertainty through model predictions and for averaging model predictions over alternative models. At present, Bayesian methods appear to be an extremely useful toolbox for SDM and for developing the next generation of models.

### Acknowledgements

We thank Ben Calderhead for discussions on the topic of hidden/latent variables. We also thank Mathew Smith, Mark Vanderwel, Wilfried Thuiller, the Editor and an anonymous referee for detailed comments that improved different parts of the manuscript.

### References

- (1958) The Distribution and Abundance of Animals. University Press, Chicago.
- (2006) Five (or so) challenges for species distribution modelling. Journal of Biogeography, 33, 1677–1688.
- (2006) How does climate change affect biodiversity? Science, 313(5792), 1396–1397.
- (2009) Climate change at the landscape scale: predicting fine-grained spatial heterogeneity in warming and potential refugia for vegetation. Global Change Biology, 15, 656–667.
- (2002) Spatial prediction of species distribution: an interface between ecological theory and statistical modelling. Ecological Modelling, 157, 101–118.
- (2007) Species distribution models and ecological theory: a critical assessment and some possible new approaches. Ecological Modelling, 200, 1–19.
- (in press) How much do we overestimate future local extinction rates when restricting the range of occurrence data in climate suitability models? Ecography, 33, 878–886.
- (2009) Linear mixed models for replication data to efficiently allow for covariate measurement error. Statistics in Medicine, 28, 3158–3178.
- (2008) Opening the climate envelope reveals no macroscale associations with climate in European birds. PNAS, 105(39), 14908–14914.
- (1971) Robustness in regression analysis. Sociological Methodology, 3, 118–146.
- (2010) Estimating demographic models for the range dynamics of plant species. Global Ecology and Biogeography, 19, 85–97.
- (2010) Weak climatic associations among British plant distributions. Global Ecology and Biogeography, 19, 831–841.
- (1995) Understanding the Metropolis–Hastings algorithm. American Statistician, 49, 327–335.
- (2005) Why environmental scientists are becoming Bayesians. Ecology Letters, 8, 2–14.
- (2003) Coexistence: how to identify trophic cascades. Ecology, 84(1), 17–31.
- (1994) A statistical-topographic model for mapping climatological precipitation over mountainous terrain. Journal of Applied Meteorology, 33, 140–158.
- (1859) On the Origin of Species by Means of Natural Selection, or the Preservation of Favoured Races in the Struggle for Life, 1st edn. John Murray, London.
- (2000) The Geometry of Ecological Interactions – Simplifying Spatial Complexity. Cambridge studies in adaptive dynamics. University Press, Cambridge.
- (2009) Species distribution models: ecological explanation and prediction across space and time. Annual Review of Ecology, Evolution and Systematics, 40, 677–697.
- (2000) Correcting for regression dilution bias: comparison of methods for a single predictor variable. Journal of the Royal Statistical Society. Series A, 163(2), 173–189.
- (2003) The Structure and Dynamics of Geographic Margins. Oxford University Press, Oxford.
- (1997) Climate change and vegetation. Plant Ecology (ed M.J.Crawley), pp. 582–594. Blackwell Publishing, Oxford.
- (2005) Predicting species distribution: offering more than simple habitat models. Ecology Letters, 9, 993–1009.
- (2004) Bioclimate envelope models, what they detect and what they hide. Global Ecology and Biogeography, 13, 469–471.
- (2006) Methods and uncertainties in bioclimatic envelope modelling under climate change. Progress in Physical Geography, 30, 751–777.
- (2007) Biotic interactions improve prediction of boreal bird distributions at macro-scales. Global Ecology and Biogeography, 16, 754–763.
- (2002) Critical issues for improving predictions. Predicting Species Occurrences: Issues of Accuracy and Scale (eds J.M.Scott, P.J.Heglund, M.L.Morrison, M.G.Raphael, W.A.Wall & F.B.Samson), pp. 7–24. Island Press, Covelo.
- (2010) Unlocking the forest inventory data: relating individual-tree performance to unmeasured environmental factors. Ecological Applications, 20, 684–699.
- (2009) The velocity of climate change. Nature, 462, 1052–1055.
- (2009) Do species distribution models explain spatial structure within tree species ranges? Global Ecology and Biogeography, 18(6), 662–673.
- (1990) Small-scale environmental heterogeneity and the analysis of species distributions along gradients. Journal of Vegetation Science, 1, 57–65.
- (1991) How independent are ‘independent effects’?: relative risk estimation when correlated exposures are measured imprecisely. Journal of Clinical Epidemiology, 44, 1223–1231.
- (2007) Environmental heterogeneity, bird-mediated directed dispersal, and oak woodland dynamics in Mediterranean Spain. Ecological Monographs, 77(1), 77–97.
- (2009) Climate change and plant distribution: local models predict high-elevation persistence. Global Change Biology, 15, 1557–1569.
- (1975) Areography: Geographical Strategies of Species. Pergamon Press, Oxford.
- (1999) Monte Carlo Statistical Methods. Springer-Verlag, New York, New York, USA.
- (2007) Grinnellian and Eltonian niches and geographic distributions of species. Ecology Letters, 10(12), 1115–1123.
- (2009) Niches and distributional areas: concepts, methods, and assumptions. Proceedings of the National Academy of Sciences, 106, 19644–19650.
- (2008) Mechanical-statistical modeling in ecology: from outbreak detections to pest dynamics. Bulletin of Mathematical Biology, 71(2), 318–998.
- (1999) Intraspecific variation in habitat availability among ectothermic animals near their climatic limits and their centres of range. Functional Ecology, 13(Suppl. 1), 55–64.
- (2004) Extinction risk from climate change. Nature, 427, 145–148.
- (2004) Effects of restricting environmental range of data to project current and future species distributions. Ecography, 31, 165–172.
- (1997) Spatial Ecology: The Role of Space in Population Dynamics and Interspecific Interactions. Princeton University Press, Princeton, New Jersey.
- (2008) Spatial scale affects bioclimate model projections of climate change impacts on mountain plants. Global Change Biology, 14, 1089–1103.