Fine-scale environmental variation in species distribution modelling: regression dilution, latent variables and neighbourly advice



Summary

1. Developing the next generation of species distribution modelling (SDM) requires solutions to a number of widely recognised problems. Here, we address the problem of uncertainty in predictor variables arising from fine-scale environmental variation.

2. We explain how this uncertainty may cause scale-dependent ‘regression dilution’, elsewhere a well-understood statistical issue, and explain its consequences for SDM. We then demonstrate a simple, general correction for regression dilution based on Bayesian methods using latent variables. With this correction in place, unbiased estimates of species occupancy vs. the true environment can be retrieved from data on occupancy vs. measured environment, where measured environment is correlated with the true environment, but subject to substantial measurement error.

3. We then show how applying our correction to multiple co-occurring species simultaneously increases the accuracy of parameter estimates for each species, as well as estimates for the true environment at each survey plot – a phenomenon we call ‘neighbourly advice’. With a sufficient number of species, the estimates of the true environment at each plot can become extremely accurate.

4. Our correction for regression dilution could be integrated with models addressing other issues in SDM, e.g. biotic interactions and/or spatial dynamics. We suggest that Bayesian analysis, as employed here to address uncertainty in predictor variables, might offer a flexible toolbox for developing such next-generation species distribution models.

Introduction

Understanding the determinants of species distributions has been a primary interest of ecology since its inception (Darwin 1859; Rapoport 1975). Strong correlations between distributions and physical factors such as climate have long been documented (Andrewartha & Birch 1958; Gaston 2003) and have received renewed recent interest in questioning how species distributions might respond to climate change (e.g. Thomas et al. 2004). This has led to the applied activity of species distribution modelling (SDM), which aims to build explicit quantitative models of the relationship between species distributions and the environment to predict potential future distributions.

However, the expected relationship between species distributions and climate may not always be retrieved by current SDM (Beale, Lennon, & Gimona 2008; Chapman 2010) and the ability of current SDM methods to make plausible predictions is frequently regarded as limited (see: Austin 2002; Gaston 2003; Hampe 2004; Guisan & Thuiller 2005; Araújo & Rahbek 2006; Araújo & Guisan 2006; Heikkinen et al. 2006; Austin 2007; Soberón 2007; Elith & Leathwick 2009). For example, current methods do not account for the differences between observed species distributions (realised niches) and potential distributions (fundamental niches) (Soberón 2007; Soberón & Nakamura 2009) that are generated by process errors such as population dynamics, spatial dynamics and biotic interactions (Guisan & Thuiller 2005; Elith & Leathwick 2009). In addition, there are differences between the reality we wish to model and available data. These observation errors may be derived from sampling biases in data collection (e.g. Cabral & Schurr 2010) and biases introduced by imperfect measurement (Huston 2002; Clark 2005; Araújo & Rahbek 2006; Austin 2007). As there are multiple sources of both observation and process error, selecting and developing a solution to any source of error should not preclude other solutions to other errors.

In this paper, we address a problem inherent to all SDM analyses, namely uncertainty in predictor variables attributed to fine-scale environmental variation. Consider a typical application of SDM where the probability of occurrence for a species is modelled as a function of the predictor variables associated with the survey sites. The predictor variables (such as climate, soil and ‘habitat’ information) are not measured continuously through time and space, but instead are taken from interpolations of weather station data or from gridded climate re-analysis products. Thus, the true value of the predictor variable – the value actually affecting the species biology – has a noisy relationship with the measured (or apparent) value of that predictor variable used in the SDM analysis. Uncertainty in predictor variables can be introduced by both instrument error and spatial scaling (Huston 2002).

Within the coarse scale of cartographic grid cells, there are multiple possible local environments: for example, cooler, northern slopes within grid cells that are warm on average (e.g. Grime 1997) or patches of deep soil in regions of mostly shallow soil. Without accounting for this fine-scale variation, the breadth of tolerance to a predictor may be poorly estimated (Palmer & Dixon 1990). This is because a species with a true requirement for cool temperatures will appear to be tolerant of warm temperatures where it occurs on cool slopes within a grid cell with a warm average temperature. The situation is of course more complex than species simply existing in cooler sites in warm grid cells (and vice versa) because other climate-independent factors also vary at fine scales (e.g. soils, habitat loss, fire frequency; see Thomas et al. 1999).

These and other consequences of ignoring fine-scale variation are related to a well-explored statistical issue called ‘regression dilution’ (or ‘attenuation bias’) (e.g. Frost & Thompson 2000; Bartlett, De Stavola, & Frost 2009). Without correcting for errors in a predictor variable, regression analysis assigns errors in the estimation of that predictor variable to uncertainty in the response variable given the predictor variable. This misappropriation of variation tends to squash the apparent functional responses compared to their true values (Palmer & Dixon 1990; Frost & Thompson 2000). For example, in linear regression, errors in predictor variables decrease the estimated slope and increase the estimated intercept. Within SDM, this kind of regression dilution would flatten the estimated species’ functional responses to environmental variables compared to the true functional response.
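The attenuation of a linear regression slope can be illustrated with a short simulation. The sketch below is not part of the original analysis: it assumes the classical error model (independent noise added to the true predictor), and the slope, sample size and standard deviations are arbitrary illustrative values. Under that assumption, the ordinary least-squares slope is expected to shrink by roughly var(x) / (var(x) + var(error)).

```python
import random


def ols_slope(x, y):
    """Ordinary least-squares slope of y regressed on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


rng = random.Random(1)
true_slope = 2.0             # illustrative value, not from the paper
sd_x, sd_err = 10.0, 10.0    # spread of the true predictor and of its measurement error

x_true = [rng.gauss(50.0, sd_x) for _ in range(20000)]
y = [true_slope * x + rng.gauss(0.0, 5.0) for x in x_true]
# Classical measurement error: the recorded predictor is the true value plus noise.
x_measured = [x + rng.gauss(0.0, sd_err) for x in x_true]

slope_with_true_x = ols_slope(x_true, y)
slope_with_measured_x = ols_slope(x_measured, y)
attenuation = sd_x ** 2 / (sd_x ** 2 + sd_err ** 2)  # expected shrinkage factor (0.5 here)

print(slope_with_true_x, slope_with_measured_x, true_slope * attenuation)
```

With equal variances for the predictor and its error, the slope estimated from the noisy predictor should sit near half of the true slope, while the slope estimated from the true predictor should sit near the true value.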

Crucially, regression dilution can have scale-dependent effects for SDMs. We would expect the errors in predictor variables to differ between studies carried out at different scales [e.g. a fine-scale model with locally measured environmental variables vs. coarse-grained models with gridded climate data (Trivedi et al. 2008)]. This implies that the strength of regression dilution will depend on the scale of an SDM analysis. For this reason alone, models estimated at one scale cannot be expected to apply at other scales (e.g. Frost & Thompson 2000).

In this paper, we first illustrate the problem of regression dilution by carrying out ‘virtual’ SDM on artificially generated data. This enables us to compare the estimated functional responses to known true responses. We then illustrate a practical and general solution to this problem, inspired by the study of growth-light relationships in saplings carried out by Lichstein et al. (2010), which relies on a Bayesian approach to SDM using latent variables (also see Clark 2005). Other methods are available for correcting regression dilution in linear regression and when estimating nonlinear response functions (e.g. Phillips & Davey Smith 1991; Frost & Thompson 2000; Bartlett, De Stavola, & Frost 2009). However, many of these rely on estimating a correction factor from repeated measures or supplementary data and are frequently restricted to simple errors in the predictor variable (Frost & Thompson 2000).

We explore the latent variable approach because it is simple to describe and implement (despite being viewed as an advanced topic); might be integrated with other solutions to process and observation errors within a Bayesian approach (see Discussion); is an adaptable method applicable to simple and complex forms of both response functions and error structures in predictor variables; and explicitly accounts for parameter uncertainty that can subsequently be incorporated readily into model predictions. As we demonstrate, the method can be further enhanced for SDM analyses by fitting multiple co-occurring species simultaneously rather than one by one (see Discussion). Using this ‘neighbourly advice’, the SDM approaches the ideal situation where predictor variables are measured perfectly and continuously through space. In the Discussion, we explore how switching to a Bayesian approach could offer solutions to problems of both process and observation error in SDM, which could be integrated into a unified next-generation set of SDM methods.

Methods and results

Virtual SDM analysis

From the outset, it is important to emphasise that both the regression dilution problem and our solution are not exclusive to the simple assumptions we use to illustrate the problem and to demonstrate a solution. Non-symmetric, bimodal or arbitrarily complex functional responses to predictor variables, or indeed multiple variables, will all face individual versions of this problem and can all use a form of the following solution.

For illustration, we assume a simple Gaussian relationship between the environment and a species, j. Gjq, the probability of occurrence for species j at location q, is a function of the true environment Eq at location q. For simplicity, we refer to E as temperature hereafter and use an arbitrary temperature scale of 1–100. Gjq depends on the (i) optimum temperature of the species φj, (ii) maximum probability of occurrence at the optimum Kj and (iii) the rate at which probability decays away from the optimum σj.

$$G_{jq} = K_j \exp\left[-\frac{(E_q - \phi_j)^2}{2\sigma_j^2}\right]$$ (eqn 1)

Note that variations on this simple model can be used to investigate multiple environmental variables, combinations of different functional forms for a species’ responses, and alternative measures of abundance including count data and continuous measures.

We generate data for a single species by sampling from an underlying true species distribution (eqn 1 with parameters θj, where θj denotes the set of model parameter values for species j). We assume a set of survey sites q where each site has a true temperature Eq derived from a normal distribution with a mean equal to the measured temperature Ēq and a standard deviation at each site equal to σE. For clarity, we make σE shared across all sites, but the methods we present easily extend to cases where σE varies across sites even in complicated ways. We generated a set of grid cells with Ēq values spread evenly at 0·25 increments between 1 and 100. A random true temperature (Eq) given Ēq is drawn for a single site within each grid cell. Species occupancy at site q was determined by eqn 1 given the true environmental condition (Eq) (in contrast, an application of SDM will need to work with Ēq only), and then a random occurrence at each site, Zjq, was made given Gjq (Zjq = 0 implying absence, Zjq = 1 implying presence).
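The data-generation steps described above can be sketched as follows. This is a minimal illustration rather than the original code: it assumes eqn 1 is a Gaussian with 2σj² in the denominator, the function names are hypothetical, and the parameter values are those quoted in the figure captions (Kj = 0.7, φj = 50, σj = 7, σE = 12).

```python
import math
import random


def gaussian_response(e, k, phi, sigma):
    """Probability of occurrence at temperature e (assumed Gaussian form of eqn 1)."""
    return k * math.exp(-((e - phi) ** 2) / (2.0 * sigma ** 2))


def simulate_species(k, phi, sigma, sigma_e, seed=0):
    """Return measured temperatures, true temperatures and presence/absence records."""
    rng = random.Random(seed)
    # One grid cell per measured temperature: 1, 1.25, ..., 100 (397 cells).
    measured = [1.0 + 0.25 * i for i in range(397)]
    # The true temperature at the surveyed site is scattered around the measured value.
    true_temp = [rng.gauss(m, sigma_e) for m in measured]
    # Occupancy is a Bernoulli draw that depends on the TRUE temperature only.
    occupied = [
        1 if rng.random() < gaussian_response(e, k, phi, sigma) else 0
        for e in true_temp
    ]
    return measured, true_temp, occupied


measured, true_temp, occupied = simulate_species(k=0.7, phi=50.0, sigma=7.0, sigma_e=12.0)
print(len(measured), sum(occupied))
```

An SDM analysis would see only `measured` and `occupied`; `true_temp` is kept here so that estimates can later be compared with the known truth.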

A simple visual inspection of this data set makes regression dilution clear (Fig. 1). The response of the species occupancy vs. measured temperature is flattened in comparison with that against true temperature. When viewing occupancy vs. the measured temperature only, unoccupied sites occur more frequently at temperatures away from the species’ true temperature optimum and less frequently at temperatures towards the species’ true optimum. In effect, errors in the estimates of Eq are incorrectly assigned to the species’ environmental response (Fig. 1b). Of course, having data on the true temperature would mean that any difference between the actual environmental response and our modelled response would be because of the sampling error only. For instance, in Fig. 1c, the observed occupancies follow the true pattern of the probability of occupancy, subject to some sampling error. If σE were to increase, each apparent species response would be further diluted away from the true response, as sites that have similar values of the measured temperature Ēq can have more widely varying values of Eq.

Figure 1.

Regression dilution in species distribution modelling. (a) Each point corresponds to one survey plot [hollow, Zjq = 0 (absent); filled, Zjq = 1 (present)], shown against the measured temperature (Ēq, x-axis) and the true temperature (Eq, y-axis). The 95% credibility intervals on the true temperature, given the measured temperature, are shown by the light grey blocks either side of the mean (Ēq). (b) Grey bars show the species proportional occupancy (Gjq) along measured temperature, from observation data binned at integer increments. Black line shows the true relationship of Gjq with temperature. Grey lines show the range of 95% of the observations, which extend into main plot. Note that through sampling error the observed credibility range is not quite symmetric around the actual species optimum (φ = 50). Occupancy patterns are shown in the dark grey bar above, where white stripes indicate occupancy (Zjq = 1). (c) As (b) but proportional occupancy is shown against true temperature. As before, the black line shows the actual values of Gjq, grey bars show occupancy data binned at integer increments and patterns of occupancy are shown in the dark grey bar to the right (see text for further details of the simulations). Parameters for generating the pseudo-data θj = (Kj = 0.7, φj = 50, σj = 7, σE = 12).

Next, we place this illustration on a formal footing by employing a virtual SDM analysis using the presence–absence data (Zjq). Our analysis is based on Bayesian methodology allowing for flexibility of the solution for regression dilution and potential integration with solutions to other errors. To carry out the Bayesian analysis, we need to define the log likelihood of the parameters given the data, ℓ(θ | Z), with respect to the model described by eqn 1.

Uncorrected individual species

We begin by making the traditional but incorrect assumption that the measured temperature is equal to the true temperature (Ēq = Eq) at each location. Within the Bayesian model, the probability of occurrence is calculated for the parameter estimates (θ̂j) using eqn 1. Assuming independence of the observations, the likelihood is then,

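$$\ell(\hat{\theta} \mid Z, \bar{E}) = \sum_{j} \sum_{q} \ln P_{jq}$$ (eqn 2a)
$$P_{jq} = \begin{cases} G_{jq}(\bar{E}_q) & \text{if } Z_{jq} = 1 \\ 1 - G_{jq}(\bar{E}_q) & \text{if } Z_{jq} = 0 \end{cases}$$ (eqn 2b)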

Equation 2a is the sum, over all species j and plots q, of the logarithm of the probability of observed occurrence (present Zjq = 1, or absent Zjq = 0) given the temperature at q. In this case, the true temperature is assumed to be the measured temperature (Eq = Ēq). Equation 2b shows that Pjq = Gjq when a presence is recorded at q (Zjq = 1) and Pjq = 1 − Gjq when the species is absent (Zjq = 0).

Having defined the likelihood, we can use standard methods to estimate the posterior distribution for each parameter given the data. For a Bayesian analysis, we also need to specify priors on parameters. In this case, we used extremely weakly informative, uniform priors with ranges Kj, 0·000001–1; φj, −100 to 200; σj, 1–200. From the posterior, we can then extract, for each parameter, our best estimate of the true value (the posterior mean) as well as credibility intervals. Moreover, it is a simple matter to generate an ensemble of model predictions that incorporates the uncertainty in the parameters (e.g. Fig. 2a). To carry out the analysis, we used a well-tested Metropolis–Hastings (M–H) Markov chain Monte Carlo (MCMC) numerical algorithm (e.g. Chib & Greenberg 1995; Robert & Casella 1999; see Supporting Information) coded in C. There are a variety of ways to achieve this kind of analysis, and it could be coded in other widely available programs such as R or WinBUGS. When carried out in C on a standard PC, the computational demand of this study posed no issues: a very generous period of 200 000 iterations was given to the initial ‘burn in’ to achieve convergence to the posterior distribution, after which we recorded every 100th parameter set from a subsequent 200 000 iterations.
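The estimation step can be sketched with a random-walk Metropolis sampler. This is a minimal illustration, not the authors' C implementation: it assumes the Gaussian form of eqn 1 with 2σj² in the denominator, uses the uniform prior ranges quoted above, and the function names, proposal step sizes, starting values and (much shorter) chain length are hypothetical choices made for the example.

```python
import math
import random


def gaussian_response(e, k, phi, sigma):
    """Assumed Gaussian form of eqn 1."""
    return k * math.exp(-((e - phi) ** 2) / (2.0 * sigma ** 2))


# Uniform prior ranges quoted in the text, in the order K, phi, sigma.
PRIOR_BOUNDS = [(0.000001, 1.0), (-100.0, 200.0), (1.0, 200.0)]


def log_posterior(params, temps, occ):
    """Log likelihood of eqns 2a-b plus flat priors (minus infinity outside the bounds)."""
    for value, (low, high) in zip(params, PRIOR_BOUNDS):
        if not low <= value <= high:
            return -math.inf
    k, phi, sigma = params
    total = 0.0
    for e, z in zip(temps, occ):
        g = gaussian_response(e, k, phi, sigma)
        p = g if z == 1 else 1.0 - g
        total += math.log(max(p, 1e-300))  # guard against log(0) from underflow
    return total


def metropolis(log_target, start, steps, n_iter, rng):
    """Random-walk Metropolis with independent Gaussian proposals per parameter."""
    current = list(start)
    current_lp = log_target(current)
    samples, accepted = [], 0
    for _ in range(n_iter):
        proposal = [c + rng.gauss(0.0, s) for c, s in zip(current, steps)]
        proposal_lp = log_target(proposal)
        delta = proposal_lp - current_lp
        # Symmetric proposal, so accept with probability min(1, exp(delta)).
        if delta >= 0 or rng.random() < math.exp(delta):
            current, current_lp = proposal, proposal_lp
            accepted += 1
        samples.append(list(current))
    return samples, accepted / n_iter


rng = random.Random(42)
temps = [float(t) for t in range(1, 101)]
occ = [1 if rng.random() < gaussian_response(t, 0.7, 50.0, 7.0) else 0 for t in temps]

samples, acceptance = metropolis(
    lambda p: log_posterior(p, temps, occ),
    start=[0.5, 40.0, 10.0], steps=[0.05, 1.0, 1.0], n_iter=3000, rng=rng,
)
kept = samples[1000::10]  # discard burn-in, then thin
phi_mean = sum(s[1] for s in kept) / len(kept)
print(acceptance, phi_mean)
```

Proposals that fall outside the prior ranges receive a log posterior of minus infinity and are always rejected, which is one simple way to implement uniform priors.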

Figure 2.

Correcting for regression dilution in species distribution modelling (SDM). In (a) and (b), the true relationship between the probability of occurrence (Gjq) and the true temperature is shown by the grey line. The relationship predicted by application of virtual SDM is shown as black lines (solid line, mean; dashed lines, 95% credibility intervals). In (a), fine-scale variation was uncorrected and the relationship was estimated given the measured temperature only. In (b), regression dilution was prevented by correcting for fine-scale variation in temperature using latent variables (‘individual latent’). Dark grey insets show occupancy patterns for the ‘true’ relationship of Zjq and Eq, and also the relationship between Zjq and (a) the measured temperature (Ēq) or in (b) the predicted temperatures of the latent variables (Êq), where white stripes indicate occupancy (Zjq = 1). Panels (c–e) compare the estimated parameters for the Gaussian relationships shown in (a) and (b): (c) σj, sigma values, (d) Kj, maxima values, (e) φj, optima values (eqn 1). Grey bars show 95% credibility intervals on the error in individual parameters, and dark line shows the mean estimate. Parameters for generating the pseudo-data: θj = (Kj = 0.7, φj = 50, σj = 7, σE = 12).

The application of this SDM to the pseudo-data reveals the regression dilution effect (Fig. 2a). As expected, the estimated environmental response function has a lower maximum (K̂j) and a wider tolerance (σ̂j) than the true function. The SDM analysis has generated a function that fits the data well in the sense that the relationship between occurrence and measured temperature is recovered. However, the parameter estimates are incorrect, and they do not recover the relationship between occurrence and true temperature. As such, the estimated model is a descriptive model that would not be expected to produce reliable predictions for altered temperatures, or when applied at different scales.
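A short worked calculation suggests why the uncorrected fit is both lower and wider. It is offered as a sketch under two assumptions that are not confirmed by the text: that eqn 1 is a Gaussian with $2\sigma_j^2$ in the denominator, and that the true temperature is normally distributed around the measured temperature $\bar{E}_q$ with standard deviation $\sigma_E$, as in the pseudo-data. Averaging the true response over the unknown true temperature gives the expected occupancy at a given measured temperature:

```latex
\begin{align*}
\mathbb{E}\left[G_{jq} \mid \bar{E}_q\right]
  &= \int K_j \exp\!\left[-\frac{(E-\phi_j)^2}{2\sigma_j^2}\right]
     \frac{1}{\sqrt{2\pi}\,\sigma_E}
     \exp\!\left[-\frac{(E-\bar{E}_q)^2}{2\sigma_E^2}\right]\mathrm{d}E \\
  &= K_j\,\frac{\sigma_j}{\sqrt{\sigma_j^2+\sigma_E^2}}
     \exp\!\left[-\frac{(\bar{E}_q-\phi_j)^2}{2\left(\sigma_j^2+\sigma_E^2\right)}\right]
\end{align*}
% With K_j = 0.7, sigma_j = 7 and sigma_E = 12 (the values in the figure captions):
%   apparent tolerance = sqrt(49 + 144) = sqrt(193), approximately 13.9 (true value 7)
%   apparent maximum   = 0.7 * 7 / sqrt(193), approximately 0.35 (true value 0.7)
```

Under these assumptions the apparent response keeps the same optimum but is again Gaussian, with a maximum reduced by the factor $\sigma_j/\sqrt{\sigma_j^2+\sigma_E^2}$ and a tolerance inflated to $\sqrt{\sigma_j^2+\sigma_E^2}$, which matches the direction of the biases described above.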

Corrected, individual species

Secondly, we use a latent variable approach to correct for the problem of fine-scale environmental variation. We assume knowledge of the degree of uncertainty (σE) in the predictor variables (Eq), i.e. we know the probability distribution for true temperature given the measured temperature at each site (see Discussion). To incorporate the uncertainty in the predictor variables into the Bayesian analysis, we need to integrate over both the uncertainty in the predictor variables and the probability of occurrence (Gjq) given the predictor variables. In principle, we could do this mathematically by deriving a closed-form solution for the integral, or by integrating numerically for each datum every time the likelihood is evaluated. However, both approaches require careful implementation, are slow and tend to be specific to particular problems. Latent variables offer a general, computationally easy and efficient way to reach the same end. In practice, latent variables are simply extra parameters that are treated equally with all other parameters by the sampling algorithm. In the same way that sampling algorithms such as Metropolis–Hastings MCMC explore different values of the model parameters, the algorithms explore values of the latent variables. In this way, the posterior distributions of the model's parameters come to reflect the posterior distributions of the latent variables.

We can use latent variables to address uncertainty in predictor variables by defining for every site q a latent parameter called Êq, representing an estimate of the true value of the environment at site q. We then define the log likelihood in two parts:

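$$\ell(\hat{\theta}_j, \hat{E} \mid Z_j, \bar{E}, \sigma_E) = \sum_{q} \ln P_{jq} + \sum_{q} \ln P(E_q = \hat{E}_q \mid \bar{E}_q, \sigma_E)$$ (eqn 3a)
$$P_{jq} = \begin{cases} G_{jq}(\hat{E}_q) & \text{if } Z_{jq} = 1 \\ 1 - G_{jq}(\hat{E}_q) & \text{if } Z_{jq} = 0 \end{cases}$$ (eqn 3b)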

The left-hand term of eqn 3a is as in eqn 2a with one exception: it depends, for each site q, on the probability of occurrence for species j at site q given that the true temperature is Êq. The right-hand term is new and represents the probability distribution of the Êq values themselves. In this example, we know the true temperature is from a normal distribution with the mean and standard deviation specified by measurement (P(Eq = Êq | Ēq, σE)) (again note that this is a simplified example for illustration purposes and complex forms of environmental variation can be accounted for). Equation 3b accounts for the altered assumption (compare eqn 2b) that the latent variables represent the true temperatures (Eq = Êq).

In this way, knowledge of error in the predictor variables allows us to state the likelihood function and so solve for the parameters given the occurrence data. In practice, the MCMC algorithm will explore a variety of values for each latent variable Êq, with the distribution for each reflecting a compromise between the prior (Ēq, σE) on the one hand, and the occurrence data on the other (Zjq). As latent variables are technically the same as any other parameter, we can examine the posterior means and credibility intervals on each value of Êq, i.e. on the estimated true temperature of each site.
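The compromise between prior and occurrence data can be made concrete with a small sketch. It is an illustration under stated assumptions rather than the original implementation: eqn 1 is taken to be a Gaussian with 2σj² in the denominator, the prior on each latent temperature is normal around the measured temperature, the function names are hypothetical, and the species parameters are held fixed at their true values purely to isolate the behaviour of one latent variable (in a full analysis they would be sampled as well).

```python
import math
import random


def gaussian_response(e, k, phi, sigma):
    """Assumed Gaussian form of eqn 1."""
    return k * math.exp(-((e - phi) ** 2) / (2.0 * sigma ** 2))


def log_normal_pdf(x, mean, sd):
    """Log density of a normal distribution."""
    return -0.5 * math.log(2.0 * math.pi * sd * sd) - (x - mean) ** 2 / (2.0 * sd * sd)


def site_log_term(e_hat, z, e_measured, params, sigma_e):
    """One site's contribution: occurrence term plus the prior on the latent temperature."""
    k, phi, sigma = params
    g = gaussian_response(e_hat, k, phi, sigma)
    p = g if z == 1 else 1.0 - g
    return math.log(max(p, 1e-300)) + log_normal_pdf(e_hat, e_measured, sigma_e)


def log_likelihood_latent(params, latent, occ, measured, sigma_e):
    """Two-part log likelihood summed over sites for a single species."""
    return sum(
        site_log_term(e, z, m, params, sigma_e)
        for e, z, m in zip(latent, occ, measured)
    )


def update_latent(latent, occ, measured, params, sigma_e, step, rng):
    """One Metropolis sweep over the latent temperatures, one site at a time.

    Each latent variable enters only its own site's term, so the acceptance
    ratio for site q needs that term alone.
    """
    for q in range(len(latent)):
        proposal = latent[q] + rng.gauss(0.0, step)
        delta = (site_log_term(proposal, occ[q], measured[q], params, sigma_e)
                 - site_log_term(latent[q], occ[q], measured[q], params, sigma_e))
        if delta >= 0 or rng.random() < math.exp(delta):
            latent[q] = proposal
    return latent


rng = random.Random(3)
params = (0.7, 50.0, 7.0)   # held fixed here for illustration only
latent = [80.0]             # a single occupied site measured at 80
draws = []
for _ in range(5000):
    update_latent(latent, [1], [80.0], params, 12.0, 5.0, rng)
    draws.append(latent[0])
posterior_mean = sum(draws[500:]) / len(draws[500:])
print(posterior_mean)
```

For this occupied site both terms are Gaussian in the latent temperature, so under these assumptions the posterior mean has the closed form (144 × 50 + 49 × 80) / 193 ≈ 57.6: the presence record pulls the estimate from the measured value of 80 towards the species' optimum of 50.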

As shown in Fig. 2b, this approach recovers a close approximation to the true functional relationship between occurrence and true temperature. The response function is no longer ‘squashed’, and all of the true functional response is within the 95% credibility intervals of the model output. These intervals are quite large simply because of the low number of samples we used; in this case, 43 presences (Zjq = 1). Parameter estimates for the latent variable approach are much more accurate than the standard regression approach and the 95% credibility intervals on parameter estimates contain the true values (Fig. 2c–e). As a result of the small number of presences in our pseudo-data, the latent variables themselves are poorly constrained, in the sense that the credible intervals on each Êq are large. This should be expected in this demonstration because there are as many latent variables as there are data. Nonetheless, the MCMC algorithm explores a range of values for each Êq, preventing regression dilution.

Corrected, multiple species

Now, we demonstrate a novel method to better constrain these latent variable estimates by including multiple co-occurring species within a single parameterisation. That is, the parameter set will include parameters for each species for eqn 1 (for n species, θ̂1, …, θ̂n), but only a single latent variable Êq for each site, q. This requires only a small change to the algorithm and the likelihood function.

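$$\ell(\hat{\theta}_1, \ldots, \hat{\theta}_n, \hat{E} \mid Z, \bar{E}, \sigma_E) = \sum_{j=1}^{n} \sum_{q} \ln P_{jq} + \sum_{q} \ln P(E_q = \hat{E}_q \mid \bar{E}_q, \sigma_E)$$ (eqn 4)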

Note there is no change in the prior for each Êq parameter as we assume that species have been surveyed from the same network of plots. We have not changed the assumption that Eq = Êq and so still use eqn 3b.
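The multispecies likelihood can be sketched in a few lines. As before this is an illustration under assumptions, not the original code: eqn 1 is taken to be a Gaussian with 2σj² in the denominator, the prior on each shared latent temperature is normal around the measured temperature, and the function names and the two example species are hypothetical.

```python
import math


def gaussian_response(e, k, phi, sigma):
    """Assumed Gaussian form of eqn 1."""
    return k * math.exp(-((e - phi) ** 2) / (2.0 * sigma ** 2))


def log_normal_pdf(x, mean, sd):
    """Log density of a normal distribution."""
    return -0.5 * math.log(2.0 * math.pi * sd * sd) - (x - mean) ** 2 / (2.0 * sd * sd)


def log_likelihood_multispecies(species_params, latent, occ_by_species, measured, sigma_e):
    """Occurrence terms for every species plus ONE prior term per shared latent temperature.

    species_params: list of (K, phi, sigma) tuples, one per species.
    occ_by_species: list of presence/absence lists, one per species, aligned with sites.
    """
    total = 0.0
    for q, e_hat in enumerate(latent):
        for (k, phi, sigma), occ in zip(species_params, occ_by_species):
            g = gaussian_response(e_hat, k, phi, sigma)
            p = g if occ[q] == 1 else 1.0 - g
            total += math.log(max(p, 1e-300))
        # The prior on the latent temperature is counted once per site, not once per species.
        total += log_normal_pdf(e_hat, measured[q], sigma_e)
    return total


# One site measured at 50: a cool-adapted species is present, a warm-adapted one is absent.
species_params = [(0.7, 30.0, 7.0), (0.7, 70.0, 7.0)]
occ_by_species = [[1], [0]]
measured = [50.0]

ll_cool = log_likelihood_multispecies(species_params, [30.0], occ_by_species, measured, 12.0)
ll_warm = log_likelihood_multispecies(species_params, [70.0], occ_by_species, measured, 12.0)
print(ll_cool, ll_warm)
```

In this toy case the two candidate latent temperatures are equally far from the measured value, so the prior cannot separate them; the combined presence and absence records favour the cooler value, which illustrates how several species can jointly constrain the temperature at a plot.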

To illustrate this method, we generated data for nine further species to accompany the single species investigated from the previous method (see Figs 1 and 2), making a set of 10 species. Each of the further nine species was created with parameters randomly sampled from uniform distributions with ranges Kn, 0·4–0·8; φn, 15–85; σn, 5–20. Note that we assume no direct interactions between species (i.e. interspecific interactions) and also assume that species experience the same true temperature at each site. By using the same data set for the first species across the four methods, we can directly compare the predicted parameter estimates (θ̂) against the actual parameters (θ) and assess any trend to overestimate or underestimate responses using each methodology. This comparison is shown in Fig. 3: (i) SDM uncorrected for fine-scale variation (1st column; eqns 2a–b); (ii) corrected SDM, species responses are modelled individually (2nd column; eqns 3a–b); (iii) corrected SDM, species responses are estimated together (3rd column; eqns 4 and 3b); and (iv) SDM where we use the true temperature (Eq) at each site (4th column; eqns 2a–b).

Figure 3.

 Comparison between species distribution modelling parameter estimates using three modelling approaches for 10 species: uncorrected for fine-scale variation (1st column); ‘individual latent’ correction for fine-scale variation, i.e. latent variable correction included, but analysis implemented individually for each species (2nd column); and ‘multispecies latent’ correction, where all ten species are fitted together using a common set of latent variables (3rd column). For comparison, the 4th column shows results given perfect information about the true temperature (Eq). Rows show estimates for (a) σj, sigma values, (b) Kj, maxima values, (c) φj, optima values (eqn 1). Values shown are the difference between predicted and actual parameter values, hence values = 0 show correct retrieval of parameter. Grey bars show 95% credibility intervals on the difference with difference mean shown as a dark line. The x-axis shows species ID where the species have been ordered by the y values. Note the different scales to y-axis in the first column of a and b where the grey dashed lines indicate the relationship of y-axis scales across a, b and c. Species j=1 (as shown in Figs 1 and 2) is highlighted throughout rows and columns for comparison across the panels.

As expected, the uncorrected method overestimates tolerances (Fig. 3a) and underestimates maximum responses (Fig. 3b) for all species, but can generate a reasonable estimate of species’ optima (Fig. 3c). In contrast, the individual latent approach reduces these biases (Fig. 2; 2nd column of Fig. 3) and the uncertainty is greatly reduced (compare y-axis scales in Fig. 3a,c). Most importantly, although some species parameters have intervals that do not contain the true values, there is no general tendency to over- or underestimate responses. Thus, on average, the corrected species-by-species approach is unbiased even though it can be biased for particular species.

The multispecies latent method produced parameter estimates that far outperform either of these methodologies (3rd column of Fig. 3). All credibility intervals include the actual parameter values (θj) and the parameter estimates are more tightly constrained. Of course, the ideal situation is to know the true temperature, Eq where no regression dilution could occur (4th column Fig. 3). The multispecies latent method approaches this ideal scenario correcting, almost perfectly, for the uncertainty in the true temperature at each site.

Moreover, the latent variables themselves now become better constrained, homing in on the true temperature at each site using only data on the measured temperature. This is crudely shown by the correlation coefficients between the true temperature Eq and its estimate: for the uncorrected method (the measured temperature Ēq), 0·925; corrected with the individual latent method (Êq), 0·927–0·939; and corrected with the multispecies method (Êq), 0·972. This reduction in bias is shown in Fig. 4. For the corrected individual latent method, the reduction in bias is only noticeable for occupied sites (Fig. 4b). Unoccupied sites remain unchanged, because the posterior tends to home in on the prior where the value is far away from the optimum for that species, such that the predicted probability of occupancy is very close to zero for any realistic value of Êq. Thus, away from the species’ optimum, the likelihood is very insensitive to the choice of Êq and the posterior for Êq becomes dominated by the prior. In contrast, the likelihood tends to be sensitive to all values of Êq when using the multispecies method as all plots are close to the optimum for at least one species. This helps to explain how the multispecies method allows data from each species to inform both the estimated functional responses of all other species (Fig. 3 column 3) and the true temperature at each site (Fig. 4c).

Figure 4.

Estimation of the true environment using species distribution modelling (SDM). All plots compare estimated with true temperatures across a set of survey plots: (a) an uncorrected SDM implicitly assumes that the true temperature (Eq) is equal to the measured temperature (Ēq); (b) individual latent correction; (c) multispecies correction. Occupied sites (Zjq = 1) for a single species are shown by filled circles; unoccupied (Zjq = 0) empty circles (species j = 4 was selected for purposes of illustration).

Discussion

We have demonstrated a general solution to include fine-scale variation in SDM. The solution (i) explicitly accounts for uncertainty in environmental measurements; (ii) is scale independent; (iii) is robust to heterogeneity in environmental variation attributed to both environmentally dependent and independent processes; and (iv) can potentially be integrated into a framework including solutions to other issues of SDMs. In addition, we provide a novel suggestion for model parameterisation where ‘neighbourly advice’ from co-occurring species strengthens constraints on all parameters through indirect feedback between latent variables and species’ parameters.

Regression dilution in SDM and the consequences

Problems associated with errors in predictor variables in SDM have received some attention in the literature previously (e.g. Ashcroft, Chisholm, & French 2009) but, to our knowledge, have not been explicitly linked to the widely understood problems of regression dilution using latent variables until now. SDM is essentially regression, albeit potentially involving rather unusual, biologically inspired functional forms of species’ environmental responses and multiple predictors.

The impact of regression dilution is likely to increase substantially when study systems move outside of the conditions where the initial measurements were made (e.g. Thuiller et al. 2004), i.e. when SDMs are required to be predictive. In a changing environment, an incorrectly parameterised model could overestimate or underestimate the range limits in relation to each environmental variable. For purely descriptive models of current distributions, this may be no problem as fine-scale variation is ‘wrapped up’ with species’ functional responses to reproduce current distributions at the scale that models were applied. But, under an altered environment or altered scale, it is important to know the true functional response, especially if the rate of environmental change varies across space (Loarie et al. 2009). As we have shown here, it is necessary to correct for regression dilution to extract these true responses.

Regression dilution will tend to produce modelled species distributions that are more fragmented than real distributions at fine scales (also see Montoya et al. 2009). For example, with a single environmental variable, erroneously squashed functional responses (Palmer & Dixon 1990; Frost & Thompson 2000) will tend to produce over-dispersed predictions of species distributions (e.g. Fig. 2). In the presence of local dispersal and local ecological interactions, population and community dynamics can be qualitatively affected by such differences in spatial structure (e.g. Tilman & Kareiva 1997; Dieckmann, Law, & Metz 2000). Therefore, mechanistic SDM methods that assume metapopulation-like processes could make various erroneous predictions (e.g. the degree of range deformation, time lags in responses, spatial turnover, extinction thresholds) if they do not correct for regression dilution. The parameters of those more complex process models would also be affected in unexplored ways.

Regression dilution is one candidate for explaining scale effects in SDM (Palmer & Dixon 1990; Trivedi et al. 2008). For instance, Trivedi et al. (2008) found 6–7 more species (out of the 10 studied) would go extinct (lose all suitable modelled habitat) in predictions of a fine-grained model compared to those of a coarse-grained model. Superficially, this looks like regression dilution, as ‘squashed’ response functions would be expected at the larger scale, producing broader environmental tolerance and so fewer predicted extinctions. However, contrasting results were subsequently found in a similar study by Randin et al. (2009), where fewer species faced extinction in the fine-scale model. In fact, 100% of the species losing all suitable habitats under their coarse-grained model actually persisted in their fine-scale model (Randin et al. 2009). These two studies find differential effects of large magnitude when comparing outputs from models based on fine-grained (site) and coarse-grained (grid cell) data. Whilst one bears the hallmarks of regression dilution, with the addition of the second study, the explanation does not appear so straightforward. Perhaps ominously for SDM, which is essentially multiple regression, some discussion in the statistical literature suggests these differential effects might be explained because regression dilution might act to either squash or sharpen the estimated responses of different variables in multivariate analyses (Bohrnstedt & Carter 1971; Phillips & Davey Smith 1991). The effect of regression dilution in individual studies will be determined by the relationships between the random errors of different predictor variables and the methodology used for correction (Bohrnstedt & Carter 1971; Phillips & Davey Smith 1991; Frost & Thompson 2000), the extent of measures and inferred coverage of each variable (Thuiller et al. 2004; Barbet-Massin, Thuiller, & Jiguet in press) and the functional forms of ‘true’ and assumed environmental responses used.

Species distribution modelling analyses inevitably cover large extents for which weather stations may sample the environment unevenly. As is frequently the case for the highest altitudes, large extents of environmental variation may be entirely inferred because of the lack of weather station data (e.g. Barbet-Massin, Thuiller, & Jiguet in press). We note that the latent variable approach is flexible enough to cope with even this complex situation. However, these data issues are in addition to the other sources of observation and process error, which are still wrapped up in these two analyses (Trivedi et al. 2008; Randin et al. 2009), and those errors, such as biotic interactions, may also have scale-dependent effects (Heikkinen et al. 2007). This being the general case in SDM, care is needed when drawing general conclusions about scale effects for individual sources of error when many sources are acting in concert. These considerations point to the need for an integrated toolbox to disentangle the diverse sources of error within current SDM analyses (see below).

Correcting using latent variables

We consider our method for correcting regression dilution in SDM to be straightforward. Provided we place SDM within a Bayesian framework (see below), inclusion of latent variables representing the ‘true’ as opposed to ‘measured’ environment is simple – and the values of these latent variables can be sampled using standard computational methods. Furthermore, the latent variable approach is flexible, being applicable to a variety of functional forms for species’ environmental responses. For example, in eqn 3a, the functional response is separate from the description of fine-scale environmental variation. Thus, the functional response can be altered within the same overall approach. Importantly, the functional response could be a process-based model that, for example, needs simulation to equilibrium to predict the probability of species’ occurrence (e.g. Purves et al. 2007). In this way, our method for correcting for fine-scale environmental variation could be used in conjunction with various ‘next-generation’ process-based SDM methods. Moreover, there is nothing in the method that restricts latent variables to a single variable per site, and so functional responses could be extended to several ‘true’ environmental variables. In that case, eqn 3a would be extended to include latent variable terms for each extra predictor variable. Latent variable approaches and related methodologies have been employed elsewhere, e.g. where light varies below forest canopies (Lichstein et al. 2010), and the methods have broader applications in ecology (e.g. Clark et al. 2003; Soubeyrand, Neuvonen, & Penttinen 2008).
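As a rough illustration of how a latent 'true' environment value might be sampled, the sketch below applies a random-walk Metropolis update to a single site with the species parameters held fixed. It is a sketch under stated assumptions, not the implementation used here: the Gaussian-shaped response, the Normal description of fine-scale variation around the measured value, the function names and all numerical values are hypothetical, and the response is not necessarily the form of eqn 3a. A full analysis would also update the species parameters.

```python
import math
import random

def log_occupancy(env, optimum, width, peak=0.9):
    """Log of a hypothetical Gaussian-shaped probability of occurrence."""
    return math.log(peak) - 0.5 * ((env - optimum) / width) ** 2

def log_joint(env_true, env_measured, sigma_e, present, optimum, width):
    """log p(Z | env_true) + log p(env_true | env_measured, sigma_e), up to a constant."""
    log_p = log_occupancy(env_true, optimum, width)
    # Presence contributes log p; absence contributes log(1 - p).
    log_lik = log_p if present else math.log1p(-math.exp(log_p))
    log_prior = -0.5 * ((env_true - env_measured) / sigma_e) ** 2
    return log_lik + log_prior

def sample_latent(env_measured, sigma_e, present, optimum, width,
                  n_iter=20000, step=1.5, seed=1):
    """Random-walk Metropolis draws of the latent true environment at one site."""
    rng = random.Random(seed)
    current = env_measured  # start the chain at the measured value
    current_lp = log_joint(current, env_measured, sigma_e, present, optimum, width)
    draws = []
    for _ in range(n_iter):
        proposal = current + rng.gauss(0.0, step)
        proposal_lp = log_joint(proposal, env_measured, sigma_e, present, optimum, width)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, proposal_lp - current_lp)):
            current, current_lp = proposal, proposal_lp
        draws.append(current)
    return draws[n_iter // 4:]  # discard the first quarter as burn-in

draws = sample_latent(env_measured=10.0, sigma_e=2.0, present=True,
                      optimum=14.0, width=1.5)
print("posterior mean of latent environment:", sum(draws) / len(draws))
```

Because the response enters only through `log_occupancy`, a different functional form could be substituted without changing the sampler, which mirrors the separation between functional response and fine-scale variation described above.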

To implement our method as presented here, it is necessary to have a priori estimates of the degree of error in the predictor variables. For our example, we specified σE beforehand. For a real application of this method, where would we get this information from? First, it is important to recognise that uncertainty in predictor variables need not be known perfectly. Current SDM analyses are one extreme of a spectrum where no uncertainty in predictors is assumed (i.e. σE = 0). At the other extreme would be analyses assuming complete uncertainty in predictors. Placing real SDM analyses at a more realistic point along this spectrum may lead to better predictions in general. Second, it should be possible, in principle, to estimate the degree of error in predictor variables given available supplementary information. For example, comparing weather station data (taken at one point in space) with climate variables reported in gridded re-analysis products would provide a simple direct measure of the degree of within-cell variation. Also, relatively fine-scale elevation data are now available for many regions. As elevation is known to be a major control of climate and soils at local scales, it seems sensible to assume that regions with greater variation in elevation have greater variation in climate and soils (e.g. Daly, Neilson, & Phillips 1994). With knowledge of lapse rates, it might even be possible to set values of σE entirely from elevation data. Third, we note that in principle, with a more complex Bayesian analysis, it might be possible to make inferences about the magnitude of fine-scale variation in some environmental variables purely from the species occurrence or abundance data.
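The two supplementary-information routes to σE could be sketched as follows. Both functions are hypothetical illustrations rather than a tested procedure: the first takes the root-mean-square difference between station values and the gridded values of the cells containing them, and the second multiplies the within-cell standard deviation of elevation by an assumed constant rate of temperature change with elevation (the default of 0.0065 °C per metre is a commonly quoted average and would need checking for any real region). The numbers in the usage lines are made up.

```python
import math
import statistics

def sigma_e_from_stations(station_values, gridded_values):
    """Root-mean-square difference between point (station) values and the
    gridded values of the cells that contain them."""
    if not station_values or len(station_values) != len(gridded_values):
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(s - g) ** 2 for s, g in zip(station_values, gridded_values)]
    return math.sqrt(sum(squared) / len(squared))

def sigma_e_from_elevation(elevations_m, rate_per_m=0.0065):
    """SD of temperature within a cell implied by the SD of elevation,
    assuming temperature changes linearly with elevation at rate_per_m."""
    return abs(rate_per_m) * statistics.pstdev(elevations_m)

# Made-up example values (degrees C and metres), for illustration only.
print(sigma_e_from_stations([10.2, 8.1, 12.5], [9.0, 9.0, 11.0]))
print(sigma_e_from_elevation([100.0, 200.0, 300.0]))
```

Either estimate could then be supplied as the fixed σE in the latent variable model, or used to centre a prior on σE in a more complex analysis.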

Neighbourly advice

We found that including multiple species in an SDM analysis with a single set of latent variables dramatically improves the constraints on both latent variables and species parameters. The effect on the latent variable estimates is explained at a simple level by the increased coverage that multiple species might have through space. For analyses using latent variables for individual species (Fig. 3, eqn 4), the latent variable method always increased our knowledge about the true temperature (Fig. 4) – but only slightly. This inaccuracy occurs because, for each site, we have only a single absence or presence (Zq) to supplement the information we already have on true temperature. Where species parameters predict an absence and an absence is also observed in the data, latent variables will tend to the most likely environmental value, in this case the mean. In contrast, for multiple species in one analysis, we have several absences and presences for each site (Zjq), which increases the coverage and improves the estimates of the latent variables, i.e. the posterior distribution of each latent variable value. But these improved estimates of true temperature then naturally lead to improved estimates of all species responses given the true temperature, which, in feedback, increase the accuracy of the estimates of the true temperature. Each species in effect informs each other species about the true environment when multiple species have been surveyed. This is why we refer to this method as ‘neighbourly advice’. In the limit of an infinite number of species, neighbourly advice would mean that the true temperature would become known perfectly for every site, at which point the species’ estimated responses would become exactly those estimated if the true temperature were known in the first place. Surprisingly, Fig. 4 shows that even a set of 10 species takes our analysis most of the way towards this ideal scenario. Finally, it is important to recognise that this issue is separate from the major unsolved problem of how to include direct species interactions in SDM (Elith & Leathwick 2009), but could be integrated with a solution to species interactions.
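The way in which additional species tighten the posterior of a latent variable can be sketched numerically. The example below is a simplified illustration only: species parameters are treated as known (so the feedback between parameter and latent variable estimates is not represented), each species has a hypothetical Gaussian-shaped response with evenly spaced optima, the true temperature is assumed Normally distributed around the measured value, and the posterior is evaluated on a grid rather than by MCMC. All names and values are assumptions made for the sketch.

```python
import math
import random

def log_occupancy(env, optimum, width=1.5, peak=0.9):
    """Log of a hypothetical Gaussian-shaped probability of occurrence."""
    return math.log(peak) - 0.5 * ((env - optimum) / width) ** 2

def latent_posterior_sd(presences, optima, env_measured, sigma_e, n_grid=401):
    """Posterior SD of the true environment at one site, evaluated on a grid."""
    lo = env_measured - 6.0 * sigma_e
    step = 12.0 * sigma_e / (n_grid - 1)
    grid = [lo + i * step for i in range(n_grid)]
    log_w = []
    for env in grid:
        total = -0.5 * ((env - env_measured) / sigma_e) ** 2  # Normal prior
        for present, optimum in zip(presences, optima):
            log_p = log_occupancy(env, optimum)
            total += log_p if present else math.log1p(-math.exp(log_p))
        log_w.append(total)
    top = max(log_w)
    weights = [math.exp(v - top) for v in log_w]  # rescale to avoid underflow
    norm = sum(weights)
    mean = sum(w * e for w, e in zip(weights, grid)) / norm
    var = sum(w * (e - mean) ** 2 for w, e in zip(weights, grid)) / norm
    return math.sqrt(var)

def mean_posterior_sd(n_species, n_sites=200, env_measured=10.0, sigma_e=2.0, seed=0):
    """Average posterior SD over simulated sites surveyed for n_species species."""
    rng = random.Random(seed)
    optima = [5.0 + 10.0 * (j + 0.5) / n_species for j in range(n_species)]
    total = 0.0
    for _ in range(n_sites):
        env_true = rng.gauss(env_measured, sigma_e)
        presences = [rng.random() < math.exp(log_occupancy(env_true, o)) for o in optima]
        total += latent_posterior_sd(presences, optima, env_measured, sigma_e)
    return total / n_sites

sd_one = mean_posterior_sd(1)
sd_ten = mean_posterior_sd(10)
print("mean posterior SD with 1 species:", sd_one)
print("mean posterior SD with 10 species:", sd_ten)
```

In this simplified setting, a single presence or absence per site narrows the posterior only a little relative to σE, whereas ten presences and absences per site narrow it considerably.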

An integrated approach to solving SDM challenges

Despite making progress in understanding fine-scale variation and accounting for the resultant biases, the solution we introduced here addresses only one of several major problems in SDM analysis (e.g. Austin 2002, 2007; Gaston 2003; Guisan & Thuiller 2005; Araújo & Guisan 2006; Araújo & Rahbek 2006; Heikkinen et al. 2006; Soberón 2007; Elith & Leathwick 2009). As next-generation SDM analyses will need to include corrections for all of these problems of process and observation errors, it is important that a solution to any one problem can be combined with solutions to the others. Placing SDM within a formal Bayesian context could, in our opinion, help to produce these kinds of integrated solutions for SDM. In this case, Bayesian methods have a simple solution to regression dilution and similar methods have already been used within SDM – e.g. to fit process-based metapopulation models to current distributions (Purves et al. 2007). Within SDM and elsewhere, the Bayesian framework offers a transparent, flexible and general approach that allows analyses to be tailored to the questions and data at hand. There are a host of standard methods for efficient parameter estimation (e.g. MCMC sampling), for incorporating previous biological knowledge (e.g. priors and/or combining several data sets), for comparing alternative models, for propagating parameter uncertainty through model predictions and for averaging model predictions over alternative models. At present, Bayesian methods appear to be an extremely useful toolbox for SDM and for developing the next generation of models.


We thank Ben Calderhead for discussions on the topic of hidden/latent variables. We also thank Mathew Smith, Mark Vanderwel, Wilfried Thuiller, the Editor and an anonymous referee for detailed comments that improved different parts of the manuscript.