A new statistical framework for the quantification of covariate associations with species distributions

Authors


Summary

  1. Identifying processes that shape species geographical ranges is a prerequisite for understanding environmental change. Currently, species distribution modelling methods do not offer credible statistical tests of the relative influence of climate factors and typically ignore other processes (e.g. biotic interactions and dispersal limitation).
  2. We use a hierarchical model fitted with Markov Chain Monte Carlo to combine ecologically plausible niche structures using regression splines to describe unimodal but potentially skewed response terms. We apply spatially explicit error terms that account for (and may help identify) missing variables.
  3. Using three example distributions of European bird species, we map model results to show sensitivity to change in each covariate. We show that the overall strength of climatic association differs between species and that each species has considerable spatial variation in both the strength of the climatic association and the sensitivity to climate change.
  4. Our methods are widely applicable to many species distribution modelling problems and enable accurate assessment of the statistical importance of biotic and abiotic influences on distributions.

Introduction

It has long been recognized that many forces shape species distributions (Connell 1961; Gaston 2003). Despite intensive effort, identifying and quantifying the influence of different ecological processes has proved difficult. In part, this difficulty arises from the wide variety of processes, often operating at different spatial scales: availability of suitable habitat is relevant at fine scales (Snaydon 1962); climate appears more important at larger scales (Pearson & Dawson 2003); and at the largest scale, only evolutionary history is important: Did the species evolve locally or has it had the opportunity to disperse into the region (Cotgreave & Harvey 1994)? The different processes that impact distribution have recently been summarized within the ‘biotic, abiotic, movement’ (BAM) framework (Soberón & Peterson 2005; Soberón 2007; Soberón & Nakamura 2009). In reality, some scale-specific factors may reflect not so much processes operating at different scales, but rather the scale at which factors vary and therefore the statistical information available at a given scale (Soberón 2007). Moreover, it is likely that different processes act in different parts of species ranges (Thomas & Lennon 1999; Gaston 2003): for a hypothetical species perhaps the northern limit is determined by temperature, the southern limit arises from competition with a sister species and the eastern limit is caused by an impassable geographical barrier. With biotic, abiotic and essentially contingent geological and evolutionary factors all influencing species distribution, it is unsurprising that identifying the precise cause of presence or absence of any given species in any particular location is challenging. Never has such understanding been more important, as species distribution shifts are currently observed globally (Parmesan & Yohe 2003; Araújo et al. 2005a,b), and successful conservation management depends on a good knowledge of limits on the distribution of target species (Vaughan & Ormerod 2003).

Despite the range of factors that affect the distribution of species, it has become common practice to model large-scale distribution of species (e.g. across a continent) using climate variables alone (e.g. Araújo & New 2007; Araújo & Peterson 2012; Garcia et al. 2012). This practice is justified by the claim that at large scales, climate is the main driver of distribution, supported by the apparently strong match between these models and species distributions (Pearson & Dawson 2003). Such models have become a primary tool for the projection of the impacts of climate change on biodiversity and influence policy at many levels (Pachauri & Reisinger 2008). Recently, however, a debate has centred on whether or not climate really is the dominant process driving large-scale distribution (Beale, Lennon & Gimona 2008, 2009; Araújo, Thuiller & Yoccoz 2009; Peterson et al. 2009; Chapman 2010), confirming an urgent need to develop methods that can rigorously quantify any such association (Hampe 2004). Recognizing these problems, alternative mechanistic approaches have been developed that emphasize the autecology of species strongly; surprisingly accurate models of tree distributions have been built from understanding the impact of climate on a suite of phenological variables (Morin, Viner & Chuine 2008); excellent predictions of animal distributions have been achieved from understanding climate impacts on energetic costs (Kearney et al. 2008). Such models sidestep many criticisms aimed at statistical models, but ultimately, they rely on such detailed understanding of each species that it is impossible to apply them to large numbers of species (Thuiller et al. 2005; Huntley et al. 2008). Similarly, CLIMEX models use a library of climate response functions drawn from the literature on species growth and survival patterns. 
For nearly three decades, CLIMEX modellers have combined inferential and deductive techniques to fit accurate models of the potential distributions of mainly invasive species (Sutherst & Maywald 1985; Sutherst 2013). CLIMEX has also been used to explore the effect of species interactions on distributions, but its time-consuming manual fitting method has precluded application to large numbers of species (Sutherst, Maywald & Bourne 2007).

Hybrid solutions are being developed that link traditional species distribution model analyses to detailed information on dispersal rates, and these have demonstrated improved predictive ability (e.g. Boulangeat, Gravel & Thuiller 2012; Dullinger et al. 2012), but still require detailed ecological knowledge of individual species. Developing methods to incorporate more ‘biology’ into species distribution models has become an important topic (Higgins, O'Hara & Römermann 2012).

Austin (2002) noted that species distribution models require three components: a biological model, a statistical model and a data model. The biological model involves all considerations of the tailoring of the model to an individual species: What biotic and abiotic variables should be included in the model (and equally important, which variables should not be included in the model)? What scale is appropriate for analysis? What forms of relationship between presence and covariates are plausible? The data model concerns the methods by which the data are gathered (e.g. single visits, repeated visits or ad hoc observations, the spatial domain within which data are gathered, and so forth). The statistical model is the element that draws the three together: to be effective it must (i) be statistically valid, that is, be able to detect and accurately characterize non-random relationships with known environmental covariates even in the presence of influential but unknown covariates, appropriate to the data type and structure (including spatial structure), and enabling proper model comparison and the identification of important covariates; (ii) be flexible enough to describe all elements of the biological model but not so flexible that it allows what we know are biologically unrealistic response curves (e.g. it can fit and if required enforce ecologically realistic functional forms to environmental relationships, also accounting for dispersal limitation); (iii) be capable of fitting hierarchical data models. Ideally, any statistical framework for modelling should be easily extensible to incorporate species- and survey-specific requirements.
The identification of these challenges is not new: the three models and many specific examples were described by Austin (2002, 2007), the principle of including more realistic biology in distribution models is established (Higgins, O'Hara & Römermann 2012), and the BAM concept (Soberón 2007; Soberón & Nakamura 2009) emphasizes some of these issues. However, although there is a plethora of species distribution models in current use (see Dormann et al. 2007; Elith & Leathwick 2009 for reviews), none currently meets all requirements. Perhaps most importantly, most methods deal inadequately with the autocorrelation inherent within distribution data. Species distributions are strongly autocorrelated: a distribution map typically shows an aggregated pattern (Legendre 1993). Traditional species distribution modelling methods ignore this, and should residual correlation remain (as is highly likely), common model assumptions will be violated, leading to potentially incorrect inferences about the importance of explanatory variables (Legendre 1993; Lennon 2000; Beale et al. 2010). Here, we describe a new model that meets, at least in part, all the requirements of a good statistical model of distribution and is extendible to a range of further model structures. We show how to identify factors important in determining distribution in any part of the range and quantify the relative importance of climate and other processes.

Methods

As the basis for our modelling approach, we implement a Bayesian hierarchical model in WinBUGS (Lunn et al. 2000), with additional data handling in R v.2.15.1 (R Development Core Team 2012) using package R2WinBUGS (Sturtz, Ligges & Gelman 2006). To facilitate further work in this area, we provide detailed R and WinBUGS code as Methods S1.

Data Model

In species distribution modelling, data models require careful consideration (Austin 2002). Using a hierarchical approach means a variety of data models can be used: statistical models that link observed presence and absence to occupancy through a model of observer effort can easily be built to suit survey design (Royle et al. 2007). Where data are not available on observer effort, or where effort is assumed constant, observed presence and absence can be modelled directly. We provide code for a simple observer effort model that assumes detection probability given presence asymptotically approaches one with increased observer effort. In addition to survey design, the other most important component of the data model is the survey area. Much has been written about appropriate spatial domains for species distribution analysis with evidence presented suggesting that domains can be both too large (e.g. Austin & Meyers 1996) and too small (e.g. Barve et al. 2011), but no general solution allowing the identification of an optimal geographical domain has been found. We do not intend to resolve this problem here (indeed, there may be no general solution), but suggest instead that some of the issues highlighted are best considered not data model issues, but biological model issues. For example, domains are generally considered too large if they include areas where organisms have not been able to disperse, whereas an appropriate biological model would include dispersal and go some way to resolving this issue. Instead, data model issues relating to domain size should consider whether surveys have been sufficiently large to adequately identify relevant distribution limits with sufficient statistical power. This does not require all distribution limits to be sampled: a study seeking to identify only northern distribution limits need not survey regions within the southern portion of the range to adequately identify the limits to northern distributions.
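The observer effort model described above can be illustrated with a minimal sketch. The exponential saturation form and the `rate` parameter are assumptions chosen only because they give a detection probability that approaches one asymptotically; the form used in Methods S1 may differ.

```python
import math


def detection_probability(effort, rate=1.0):
    """Probability of detecting a species that is present, given observer effort.

    Assumed (hypothetical) form: 1 - exp(-rate * effort), which is zero with
    no effort and rises towards one as effort grows.
    """
    if effort < 0:
        raise ValueError("effort must be non-negative")
    return 1.0 - math.exp(-rate * effort)


def observed_presence_probability(occupancy, effort, rate=1.0):
    """Probability of recording a presence: the square is occupied AND the
    species is detected there."""
    return occupancy * detection_probability(effort, rate)
```

With constant or unknown effort this layer would simply be dropped and observed presence modelled directly, as the text notes.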

Biological Model

The biological model we assume builds on the BAM concept (Soberón & Peterson 2005; Soberón 2007; Soberón & Nakamura 2009). This concept identifies three fundamental processes that influence distribution: biotic, abiotic and movement (dispersal) related. Under this framework, species are only able to sustain source populations in regions where both biotic and abiotic conditions are suitable and where they have managed to disperse.

Following the BAM concept, our biological model identifies three components, starting with a consideration of movement. It recognizes that there will be geographical regions where biotic and abiotic conditions are suitable, but the species has been unable to disperse and is therefore absent. Equally, sink populations can occur in locations where biotic and abiotic conditions are not suitable, but where dispersal from source populations maintains a population through immigration. As organisms are limited in their ability to disperse, sink populations are expected to be relatively close to source populations, whilst unoccupied but suitable locations are likely to be relatively distant. We therefore consider the existence of sink populations as a source of systematic statistical noise, with incoming dispersal extending the realized niche of the sink individuals beyond the fundamental niche and thereby shaping our interpretation of identified niche space, rather than seriously impeding model discovery. By contrast, the existence of unoccupied but suitable regions argues for different consideration; we suggest that if maximum dispersal distances are accurately known, this can become a criterion for identifying survey area and be included in the data model. If maximum dispersal distances are not known (as is typically the case), dispersal limitation can lead to residual spatial autocorrelation which needs to be accounted for in the statistical model (Shurin, Cottenie & Hillebrand 2009; Beale et al. 2010).

Biotic factors are often ignored by species distribution modellers, particularly at large spatial scales. Our biological model assumes that biotic interactions may have both positive and negative impacts on species, and that these impacts act in a similar way to those of abiotic processes. It is therefore convenient to consider the two in parallel, because a primary decision is to select the candidate biotic and abiotic factors likely to be important in determining species distribution, and then to decide on the constraints placed on the form of the relationship between species presence and each of the factors. For parameter selection, there can be no substitute for detailed knowledge of the ecology of the species of interest, with no general solution, though in the absence of specific information we urge more, rather than less, selectivity.

Further, we concur with Austin (2002) that abiotic variables are a priori restricted to functional forms with a relatively simple shape (e.g. unimodal or monotonic). This ensures that the niche model cannot identify relationships where, for example, a species tolerates temperatures between 10°C and 15°C, and temperatures between 20°C and 25°C but is unable (or less likely) to survive between 15°C and 20°C. Although realized niches may show more complicated relationships than this, the reasons for such patterns are likely to involve biotic interactions, which form a separate part of our biological model and can have more complicated forms. Despite the unimodal or monotonic restriction, we do not want to preclude the possibility that transitions from tolerable to intolerable climate may occur more or less rapidly at either end of the tolerable range. Thus, steep physiological limits may apply at the hotter end of a temperature variable, but tolerance for low temperatures may decline more gradually; this flexibility is incorporated into our modelling approach.

Statistical Model

The statistical model must meet all the requirements set by the data and biological models (including the BAM concept) as well as ensuring accurate estimation with appropriate confidence intervals. To account for the complexities of real data, we adopt a hierarchical approach that allows straightforward extension should data and biological models change. As a minimum addition to the fixed effects describing known biotic and abiotic covariates, we account for residual spatial autocorrelation with an intrinsic conditional autoregressive (iCAR: Besag, York & Mollié 1991) error structure. This method of accounting for spatial autocorrelation has recently been found to perform well for complex simulated data sets (Beale et al. 2010); adaptation for binary data is straightforward. The error structure, modelled as a spatial random effect, accounts for spatially structured residuals (e.g. missing biotic and abiotic covariates and dispersal processes) not explicitly modelled by the covariate relationships (see Methods S1 for details of the implementation). This spatial error term captures ‘noise’ caused by movement, as well as accounting for autocorrelated errors caused by missing biotic and abiotic covariates. Using a hierarchical approach allows easy extension of this minimal model to occupancy type models, when information on observation processes is known.
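As a rough illustration of what the iCAR error structure implies, the sketch below computes the conditional distribution of one square's spatial effect given its neighbours: under an intrinsic CAR prior this is normal, with mean equal to the average of the neighbouring effects and variance inversely proportional to the precision τ times the number of neighbours. The rook (edge-sharing) neighbourhood on a regular grid and all function names are assumptions for illustration, not the implementation in Methods S1.

```python
def grid_neighbours(n_rows, n_cols):
    """Rook (edge-sharing) neighbours for each square of a regular grid."""
    neighbours = {}
    for r in range(n_rows):
        for c in range(n_cols):
            candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
            neighbours[(r, c)] = [
                (rr, cc) for rr, cc in candidates
                if 0 <= rr < n_rows and 0 <= cc < n_cols
            ]
    return neighbours


def icar_conditional(square, effects, neighbours, tau):
    """Conditional mean and variance of one square's spatial effect.

    Given all other squares, the effect is normal with mean equal to the
    average of its neighbours' effects and variance 1 / (tau * n_neighbours).
    """
    nbrs = neighbours[square]
    mean = sum(effects[n] for n in nbrs) / len(nbrs)
    variance = 1.0 / (tau * len(nbrs))
    return mean, variance
```

Because each effect is pulled towards its neighbours' average, the term absorbs smooth, spatially structured residual pattern, which is why it can stand in for dispersal effects and missing covariates.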

To model biotic and abiotic covariates (the B and A of the BAM concept) with the appropriate degree of flexibility, we fitted each covariate as a smooth term, represented by a penalized regression spline with two knots (Eilers & Marx 1996). By changing the number of knots in the spline terms, more or less flexibility can be allowed if justified (presence may show non-monotonic relationships with biotic variables, Austin 2002); our Bayesian implementation follows that of Crainiceanu, Ruppert & Wand (2005), whereby for each knot a basis function is calculated, and these functions are then included as additive terms in a linear model.
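The idea of one basis function per knot, entered as additive terms in a linear model, can be sketched as follows. The quantile-based knot placement and the cubic radial basis |x − knot|³ are assumptions for illustration; a full low-rank spline implementation typically also rescales these columns, a step omitted here, and the basis in Methods S1 may differ.

```python
def choose_knots(values, n_knots=2):
    """Place knots at equally spaced sample quantiles (an assumed convention)."""
    ordered = sorted(values)
    knots = []
    for k in range(1, n_knots + 1):
        position = k * (len(ordered) - 1) / (n_knots + 1)
        lower = int(position)
        upper = min(lower + 1, len(ordered) - 1)
        weight = position - lower
        knots.append(ordered[lower] * (1 - weight) + ordered[upper] * weight)
    return knots


def spline_bases(x, knots):
    """One basis function per knot: the cubic radial term |x - knot|**3."""
    return [abs(x - knot) ** 3 for knot in knots]


def smooth_term(x, knots, coefficients):
    """Additive contribution of one covariate to the linear predictor."""
    return sum(b * z for b, z in zip(coefficients, spline_bases(x, knots)))
```

With two knots each covariate contributes two columns and two coefficients, which keeps the fitted response simple; adding knots would allow more flexible shapes where these are biologically justified.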

Although the biological model must be the primary concern when choosing variables to include in the statistical model, additional statistical factors must be considered too: strongly correlated covariates are hard to tease apart statistically; the number of covariates considered is important and so on. These statistical restrictions mean that each variable used is better considered as representative of a package of correlated variables, rather than just a single variable. This is particularly true for climate variables often used in species distribution models, and in some cases, this is explicit: measures of water balance integrate rainfall, evaporative potential (calculated from solar insolation and temperature) and soil parameters. In other cases, it is not explicit, but tacitly assumed: growing degree days is as much a proxy for alternative temperature-related variables as a direct estimate of growing degree days per se. Other covariates also form ‘covariate packages’: altitude, or latitude and longitude are closely correlated with a set of climate variables, for example, but special care must be taken when using models including these fixed relationships to make statements about climate sensitivity, or when projecting to the future. Although care must be taken when using models including latitude and similar variables, there can be good statistical reasons to include such variables. For example, iCAR models assume that the spatially explicit random effect is stationary (i.e. spatially invariant and without strong spatial trends) across the domain. This is often unrealistic, but stationarity can in theory be improved by fitting spatial coordinates as explanatory variables (Beale et al. 2010), allowing accurate estimation of credible intervals for the covariates of interest (i.e. notionally correct Type I error rates). 
This is likely to generate a conservative test of significance (Type II error rates may be high) when covariates of interest are strongly correlated with the coordinates, as is often the case with climate covariates. Comparing results of models with and without coordinates can therefore be informative: covariates with narrow credible intervals in both models are strongly supported, and covariates that decline in importance in models which add coordinates may be genuinely unimportant (especially when the covariates changing in notional importance are not strongly correlated with latitude or longitude), but, more pragmatically, may simply reflect an alternative climate proxy to the coordinates themselves. Hence, when assessing the importance of a covariate, we use models that include latitude and longitude, but when assessing climatic sensitivity, we use models without latitude and longitude. Empirically, for our example analyses, median parameter estimates were largely unchanged by the inclusion or exclusion of coordinates (see Fig. 1 and Figs S1–S4).

Figure 1.

Marginal effect plots for Melodious Warbler: (a) seasonality, (b) length of the growing season, (c) water availability and (d) winter cold. Smaller plots (e–j) show equivalent parameter estimates when latitude (g) and longitude (j) are included. Grey-shaded areas are the 95% credible intervals; the solid line is the median estimate of 10 000 Markov Chain Monte Carlo iterations. Above and below the figures are rug plots indicating the percentile distribution of raw data from the observed data set, split (above) for squares with presence and (below) for squares with absence. Areas where the grey shading does not cross the dashed 0·5 probability line when latitude and longitude are included can be considered significant.

Our basic hierarchical model therefore consists of two parts: a restricted set of smooth terms for the niche model (potentially incorporating both biotic and abiotic variables) and an iCAR component for spatially correlated errors (which incorporates movement generating population sinks, as well as missing covariates). The explicit links with the BAM concept are simple: biotic (B) and abiotic (A) components can be described as fixed effects (and can be treated individually or combined to give total abiotic or biotic components), and movement (M) is captured within the spatial errors, alongside residual biotic and abiotic effects not directly measured. Thus, our model allows minimum estimates of B and A to be calculated, and M to be incorporated: it is important to note that these are minimum estimates of B and A, and when there are strong spatial effects, the values are highly dependent on the choice of covariates. The sum of the fixed effects and iCAR gives the logit-transformed probability of occurrence (P) for each square (e.g. Equation (eqn 1) where ‘Rain’ and ‘Sun’ are two illustrative examples of appropriate abiotic covariates: biotic covariates could be incorporated in exactly the same way).

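$$\operatorname{logit}(p_i) = b_0 + b_1\,\mathrm{Rain1}_i + b_2\,\mathrm{Rain2}_i + b_3\,\mathrm{Sun1}_i + b_4\,\mathrm{Sun2}_i + \mathrm{SE}_i \qquad \text{(eqn 1)}$$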

where p_i is the probability of occurrence in square i, b_0 is the intercept, b_1 and b_2 are the estimated parameters for rainfall, Rain1_i and Rain2_i are the two spline bases for observed rainfall in square i, b_3 and b_4 are the estimated parameters for sunshine, Sun1_i and Sun2_i are the two spline bases for observed sunshine in square i, and SE_i is the estimated spatially explicit error for square i that also includes spatially unstructured (i.e. white noise) errors. In our examples, we used four climate covariates and therefore estimated nine or 11 fixed-effect parameters (intercept plus two parameters for each climate variable and optionally one each for latitude and longitude) and the spatially explicit error. With the exception of the precision parameter τ in the iCAR model, we specified diffuse priors for each parameter (mean = 0, variance = 1000). We monitored parameter estimates for all covariates and for the iCAR structure in each square.
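The two-covariate example can be written as a short sketch: the fixed effects and the spatial error are summed on the logit scale and back-transformed to a probability. Function and argument names are illustrative assumptions, and the coefficient values in any call would come from the fitted model rather than from this code.

```python
import math


def inv_logit(x):
    """Inverse logit, arranged to avoid overflow for large negative x."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def linear_predictor(b, rain_bases, sun_bases, spatial_error):
    """Logit-scale predictor for one square.

    b is [b0, b1, b2, b3, b4]; rain_bases and sun_bases are the
    (basis 1, basis 2) pairs for that square.
    """
    return (b[0]
            + b[1] * rain_bases[0] + b[2] * rain_bases[1]
            + b[3] * sun_bases[0] + b[4] * sun_bases[1]
            + spatial_error)


def occurrence_probability(b, rain_bases, sun_bases, spatial_error):
    """Probability of occurrence in one square."""
    return inv_logit(linear_predictor(b, rain_bases, sun_bases, spatial_error))
```

Biotic covariates would enter in the same way, as further pairs of basis columns with their own coefficients.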

The results from application of this model can be plotted and analysed to reveal useful patterns, which we illustrate below. First, we can assess the sensitivity of the model to changes in each covariate by differentiation. In practice, analytic derivatives of the GAM models are complex, so a numerical solution is used, adding a small number (in the examples, we use 0·1, representing a change of c. 1·5% of the range of each raw covariate) to the raw covariate value (Equation (eqn 2)). The corresponding values of the two spline bases are then calculated (Equation (eqn 3)), and the parameter estimates from the model are used to calculate a new probability of presence given the change in the one covariate (Equation (eqn 4)). The difference between the estimated probability calculated with the adjusted covariate and that calculated with the original covariate reflects the sensitivity of the model to changes in this covariate (Equations (eqn 2) to (eqn 5) illustrate the process for estimating sensitivity to a change in Rain from Equation (eqn 1)).

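$$\mathrm{New\_Raw\_Rain}_i = \mathrm{Raw\_Rain}_i + \Delta\mathrm{Rain} \qquad \text{(eqn 2)}$$
$$(\mathrm{New\_Rain1}_i,\ \mathrm{New\_Rain2}_i) = f(\mathrm{New\_Raw\_Rain}_i) \qquad \text{(eqn 3)}$$
$$\operatorname{logit}(\mathrm{New\_Prob}_i) = b_0 + b_1\,\mathrm{New\_Rain1}_i + b_2\,\mathrm{New\_Rain2}_i + b_3\,\mathrm{Sun1}_i + b_4\,\mathrm{Sun2}_i + \mathrm{SE}_i \qquad \text{(eqn 4)}$$
$$\mathrm{Sensitivity}_i = \mathrm{New\_Prob}_i - \mathrm{Prob}_i \qquad \text{(eqn 5)}$$

where Raw_Rain_i is the value of the raw, untransformed rainfall variable for square i; ΔRain is the small change in rainfall that is being applied; New_Raw_Rain_i is the adjusted raw value; New_Rain1_i and New_Rain2_i are the two spline bases calculated by transformation (ƒ) from the raw rainfall data plus a small amount; Equation (eqn 4) has the same parameters as Equation (eqn 1) and gives the new probability New_Prob_i; Prob_i is the inverse logit (inv.logit)-transformed probability from Equation (eqn 1); and Sensitivity_i is the resulting change in probability.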
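The numerical differentiation described above amounts to a one-sided finite difference on the probability scale. The sketch below assumes the spline transformation is supplied as a function (`bases`) and that all other terms of the linear predictor are held fixed; these names are hypothetical and chosen only for illustration.

```python
import math


def inv_logit(x):
    """Inverse logit transform."""
    return 1.0 / (1.0 + math.exp(-x))


def sensitivity(raw_value, delta, bases, coefs, other_terms):
    """Change in occurrence probability when one raw covariate shifts by delta.

    bases       : function mapping a raw covariate value to its spline bases
    coefs       : coefficients for those bases
    other_terms : sum of everything else in the linear predictor
                  (intercept, other covariates, spatial error), held fixed
    """
    def probability(value):
        smooth = sum(b * z for b, z in zip(coefs, bases(value)))
        return inv_logit(other_terms + smooth)

    return probability(raw_value + delta) - probability(raw_value)
```

Evaluating this for every square and each covariate in turn gives the values that can be mapped as sensitivity surfaces.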

As Equation (eqn 1) describes a model that has a linear predictor combining covariate effects and an error term, it can be separated into constituent parts and used to investigate the relative influence of individual and combined covariates and the error term in each square. For example, from Equation (eqn 1), if the modelled probability of occurrence for a square is 0·95, it is straightforward to calculate which of the b parameters when multiplied by their covariate value or the error term are responsible for this high value. In practice, the logit transform ensures that probabilities close to 1 or 0 may be explained equally well by a number of components: several different components may result in a high score, but the logistic function is ‘flat’ at the extremes. Equation (eqn 1) can therefore be split:

display math(eqn 6)
display math(eqn 7)

where Clim_Comp_i is the climate component, and SE_Comp_i is the contribution of the spatially explicit error term in square i. If the biological model is intended to include biotic interactions explicitly, this component can in addition be calculated exactly as in Equation (eqn 6). An index of the effect of the climate and error terms can therefore be generated as the absolute difference between probabilities estimated from Equation (eqn 1), and the probabilities estimated from Equations (eqn 6) and (eqn 7). For example, if the difference between values calculated from Equations (eqn 1) and (eqn 6) is large but the difference between values calculated from Equations (eqn 1) and (eqn 7) is small, it is clear that climate is not strongly influential. In general, the ratio of the covariate component to the spatial error term component can be used as an overall assessment of covariate impact (Equation (eqn 8), a statistic bounded between 0 and 1 that is appropriate for ranking, but inappropriate for subsequent analysis using methods that assume normality). We note here that our methodology is not restricted to use with climate covariates: other kinds of covariate information could be used equally well, allowing partial quantification of the components of the BAM model.

display math(eqn 8)
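One way the decomposition might be computed is sketched below. This is an assumption-laden illustration, not the published equations: it assumes each reduced predictor keeps the intercept and drops the other component, and it assumes the bounded ratio takes the form climate effect / (climate effect + error effect). All names are hypothetical.

```python
import math


def inv_logit(x):
    """Inverse logit transform."""
    return 1.0 / (1.0 + math.exp(-x))


def component_effects(intercept, climate_sum, spatial_error):
    """Indices of climate and spatial-error influence for one square.

    Assumption: each reduced predictor keeps the intercept and drops the
    other component; the source equations may differ.
    """
    full = inv_logit(intercept + climate_sum + spatial_error)
    climate_only = inv_logit(intercept + climate_sum)
    error_only = inv_logit(intercept + spatial_error)
    # A large gap between the full and climate-only predictions means the
    # spatial error term is doing most of the work.
    error_effect = abs(full - climate_only)
    # A large gap between the full and error-only predictions means the
    # climate covariates are doing most of the work.
    climate_effect = abs(full - error_only)
    return climate_effect, error_effect


def climate_share(climate_effect, error_effect):
    """Assumed 0-1 ratio: climate effect as a share of both effects."""
    total = climate_effect + error_effect
    if total == 0:
        return float("nan")  # undefined when neither component has any effect
    return climate_effect / total
```

Under these assumptions a value near 1 indicates a square where the prediction is driven mainly by climate, and a value near 0 one where the spatial error dominates.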

Moreover, as this ratio effectively corrects for differences in model fit and overall prevalence by making comparisons only within the components of the model itself, we can use its mean value across all squares as a measure of covariate impact to compare different species. This ratio cannot wholly identify the relative importance of biotic and abiotic processes (two components of the BAM model), as some ostensibly abiotic parameters are likely to be proxies for biotic variables: rainfall may in fact only influence distribution of a bird through altering the habitat, for example. The original model is fitted using Markov Chain Monte Carlo (MCMC), so all derived variables can be calculated for each MCMC iteration, providing samples from their posterior distributions and hence allowing appropriate assessment of variability and the calculation of 95% credible intervals as illustrated below (the accuracy of this method was assessed using randomization tests similar to those of Chapman (2010) and is described in Methods S1).
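Summarizing a derived variable from its per-iteration MCMC values can be sketched as follows. The equal-tailed percentile interval and the linear-interpolation percentile rule are assumptions; other interval definitions (e.g. highest posterior density) would give slightly different limits.

```python
def percentile(sorted_values, q):
    """Linear-interpolation percentile (q in [0, 1]) of pre-sorted values."""
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    weight = position - lower
    return sorted_values[lower] * (1 - weight) + sorted_values[upper] * weight


def posterior_summary(samples):
    """Median and equal-tailed 95% credible interval of a derived quantity."""
    ordered = sorted(samples)
    return {
        "median": percentile(ordered, 0.5),
        "lower": percentile(ordered, 0.025),
        "upper": percentile(ordered, 0.975),
    }


def excludes(summary, reference=0.0):
    """True when the 95% interval lies wholly on one side of the reference."""
    return summary["lower"] > reference or summary["upper"] < reference
```

For example, applying `posterior_summary` to the per-iteration values of a sensitivity or ratio in one square gives the median and interval for that square, and `excludes` flags squares whose interval does not overlap the chosen reference value.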

Examples

Data

To illustrate our methodology, we chose three European bird species with contrasting distributions: Melodious Warbler, Hippolais polyglotta, a migrant species found mainly in the Iberian peninsula, with well-known competitive relationships with a sister taxon H. icterina to the east (Secondi et al. 2003); Temminck's Stint, Calidris temminckii, a migrant species breeding on high Arctic tundra; and the Grey Heron, Ardea cinerea, a resident and partial migrant found widely across Europe and further afield. We combined the probable, possible and confirmed records of presence within 50 km × 50 km squares taken from the European Breeding Bird Atlas west of 30°E and excluding Svalbard and the Azores (Hagemeijer & Blair 2002). We used four climate variables often used in this type of analysis and reported as likely to have either direct physiological impacts on birds or indirect effects mediated by impact on vegetation (Araújo et al. 2005a,b; Huntley et al. 2008). We used (i) the mean minimum temperature of the coldest month, (ii) growing season degree days (expressed as the temperature sum above 5°C), (iii) the water availability (the ratio of actual evapotranspiration to potential evapotranspiration) and (iv) seasonality (the coefficient of variation in mean monthly temperature in K). Mean monthly temperature data from 1961 to 1990 were available at 0·5° resolution (CRU_CL_1.0: New, Hulme & Jones 2000); evapotranspiration data were similarly available (data set GNV183: Ahn & Tateishi 1994) and were projected onto the 50 km bird locations using ordinary Kriging assuming an exponential spatial structure (Beale, Lennon & Gimona 2008). All covariates were centred and scaled to have mean of zero and variance one. These large-scale (50 km × 50 km) bird data may traditionally be expected to show strong climate-driven signals, with climate often considered the primary driver of European bird distribution at such scales (Huntley et al. 2008).
In this example, we assume that our four climate covariates are the primary abiotic drivers within the BAM framework: this is a vast simplification to enable a relatively simple demonstration of our model, and the results demonstrated here should therefore be seen as illustrative only.

Application

Models for all three species converged within 1000 MCMC iterations and provided estimates of occurrence probability broadly matching the pattern of observed presence (Fig. 2). (Methods S1 describes an additional simulation test used to assess MCMC convergence and the accuracy of our assessment of variable importance.) Focussing on the Melodious Warbler (equivalent plots for other species are available as Figs S1, S3, S5–S10), we first examine the marginal effect plot for the four climate covariates, including the additional variance of the intercept estimates to calculate the 95% credible interval (Fig. 1). This is equivalent to plotting the intercept plus covariate effect of a typical marginal effect plot, but centred on the estimated intercept. This shows that this species prefers areas of intermediate seasonality and avoids areas with high seasonality and low minimum winter temperatures (note that, as the species is a summer visitor to Europe, this impact cannot be direct but reflects either winter temperature being a ‘climate package’ highly correlated with other temperature variables that may have direct effects, or indirect impacts on habitat). We found that including latitude and longitude left the median results essentially unchanged, but increased the width of the credible intervals (Fig. 1).

Figure 2.

European distributions of the three example species and the modelled probability of presence: Melodious Warbler observed (a) and modelled (b) distribution; Temminck's Stint observed (c) and modelled (d); Grey Heron observed (e) and modelled (f) distribution. Squares with observations are indicated in blue, absence in yellow. For model results, the size of the blue squares is proportional to the probability of presence.

To visualize the effects of the covariates spatially, the individual contributions (i.e. the parameter estimate multiplied by the covariate value) can be combined with the variance from the intercept (ensuring correct variance estimates of the 95% CI) and plotted geographically. Moreover, because the Bayesian model provides a distribution of parameter estimates rather than a single mean estimate, squares where the distribution of any one component is consistently different from zero can be identified. For the Melodious Warbler, we find (Fig. 3) that the areas with suitable seasonality are widespread (indeed, more widespread than the actual distribution of the species, Fig. 2a) in southern and central Europe, whilst the very highly seasonal areas that are avoided are similar to the areas with low minimum winter temperatures in northern Scandinavia.
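
As a minimal sketch of how such squares might be flagged, assume that posterior draws of one covariate's contribution on the logit scale are available for each square. The function names and toy draws below are hypothetical and are not taken from the supporting R/WinBUGS code; the median and 95% credible interval are simply read from the empirical quantiles of the draws.

```python
import math


def quantile(sorted_draws, q):
    """Linear-interpolation quantile of an already sorted list of draws."""
    pos = q * (len(sorted_draws) - 1)
    lo = math.floor(pos)
    hi = math.ceil(pos)
    frac = pos - lo
    return sorted_draws[lo] * (1 - frac) + sorted_draws[hi] * frac


def summarise_contribution(draws, level=0.95):
    """Median, credible interval and a flag for intervals that exclude zero."""
    s = sorted(draws)
    tail = (1 - level) / 2
    lower = quantile(s, tail)
    upper = quantile(s, 1 - tail)
    return {
        "median": quantile(s, 0.5),
        "lower": lower,
        "upper": upper,
        "excludes_zero": lower > 0 or upper < 0,
    }


# Hypothetical posterior draws (logit scale) for two grid squares.
square_a = [0.5 + 0.01 * i for i in range(101)]   # consistently positive
square_b = [-1.0 + 0.02 * i for i in range(101)]  # straddles zero

summary_a = summarise_contribution(square_a)
summary_b = summarise_contribution(square_b)
```

In a map such as Fig. 3, a square like `square_a` would be drawn with a solid symbol and one like `square_b` with an open symbol.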

Figure 3.

Geographical projection of the marginal effect plots for Melodious Warbler distribution in Europe: (a) seasonality, (b) length of the growing season, (c) water availability and (d) winter cold. Median estimate is indicated by the size of the symbol, positive and negative values on the logit scale identified with light and dark shading. Regions where the 95% credible intervals do not overlap 0 on the logit scale are indicated with solid symbols (note that the very small symbols always appear solid, but never are). Note that in (c) no large, positive values are present.

Marginal effect maps such as this indicate areas of suitable climate, but the models can also show where changes in climate variables are most likely to drive changes in distribution by calculating the sensitivity of the probability estimate in each square to small changes in each climate variable. As is expected, the sensitivity plots reflect the marginal effect plots in showing the important influence of the same parameters (Fig. 4), but also highlight the areas around the range edges where changes in different components of climate are likely to have most effect: a change in winter temperature may result in changes in occupancy around the southern coasts of Europe, whilst changes in seasonality may affect distribution at the western periphery. Similarly, the distribution limit running approximately along the north-eastern French border (from Fig. 2) is unlikely to be directly affected by change in any climate variable; other factors must explain this limit.
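
The sensitivity calculation can be illustrated with a central finite difference on the probability scale. This is a sketch under stated assumptions: the quadratic `response` function is a hypothetical stand-in for the fitted unimodal spline, and `other_terms` lumps together everything else in the linear predictor for a square.

```python
import math


def inv_logit(eta):
    """Map a logit-scale value to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def response(x, optimum=0.0, width=1.0, height=2.0):
    """Hypothetical unimodal response on the logit scale (not the fitted spline)."""
    return height - ((x - optimum) / width) ** 2


def sensitivity(x, other_terms=0.0, delta=1e-4):
    """Change in occurrence probability per unit change in the covariate at x."""
    p_up = inv_logit(response(x + delta) + other_terms)
    p_down = inv_logit(response(x - delta) + other_terms)
    return (p_up - p_down) / (2 * delta)


at_optimum = sensitivity(0.0)   # centre of the niche
on_flank = sensitivity(1.5)     # near the edge of suitable conditions
far_outside = sensitivity(4.0)  # well outside the niche
```

Under these assumptions the sensitivity is close to zero both at the niche optimum and far outside the niche, and largest in magnitude on the flanks, which is consistent with the sensitivity maps highlighting range edges.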

Figure 4.

Mapped sensitivity of Melodious Warbler distribution in Europe to change in individual climate covariates: (a) seasonality, (b) length of the growing season, (c) water availability and (d) winter cold.

These individual maps, however, do not show everything: whilst individual climate components may not appear important in any one region, the effects of all climatic components are important in combination. We can visualize the overall effect of all covariates geographically, by computing the linear sum of all covariate components (logit probabilities from Equation (eqn 6)) and again identifying squares where the linear sum is consistently far from zero. We can then colour squares by blending shades according to the contribution towards the linear sum of each individual climate component: in this example, red (for winter temperature) and yellow (for seasonality) blend to give a predominantly orange/brown tone (Fig. 5a). This shows large areas of southern and central Europe are climatically suitable for this species as both important components of the climatic niche are suitable in this region (both seasonality and minimum temperature are intermediate). The same two factors, high seasonality and low winter temperatures, become unsuitable in the far north and west and explain absence in these areas. Similar maps could obviously be created for the sum of any combination of biotic or abiotic covariates, allowing a visual interpretation of the geography of the BAM concept, bearing in mind that these maps are only a minimum estimate of either biotic or abiotic processes, limited by the covariates fitted, and with additional biotic or abiotic residuals incorporated in the spatial effect.
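
One simple way to produce such blended colours is to weight each covariate's base colour by its share of the total absolute contribution. The blending rule and RGB values below are assumptions made for illustration, since the exact rule is not specified here; the colour assignments follow the caption of Fig. 5.

```python
# Base colours (RGB) following the Fig. 5 caption; exact values are assumed.
COLOURS = {
    "seasonality": (255, 255, 0),   # yellow
    "growing_season": (0, 128, 0),  # green
    "water": (0, 0, 255),           # blue
    "winter_cold": (255, 0, 0),     # red
}


def linear_sum(contributions):
    """Overall climate effect for a square: sum of logit-scale contributions."""
    return sum(contributions.values())


def blend(contributions):
    """Weight each base colour by its share of the total absolute contribution."""
    total = sum(abs(v) for v in contributions.values())
    if total == 0:
        return (128, 128, 128)  # neutral grey when no covariate contributes
    rgb = [0.0, 0.0, 0.0]
    for name, value in contributions.items():
        weight = abs(value) / total
        for i, channel in enumerate(COLOURS[name]):
            rgb[i] += weight * channel
    return tuple(round(c) for c in rgb)


# Hypothetical square where seasonality and winter cold contribute equally.
square = {"seasonality": 1.0, "growing_season": 0.0, "water": 0.0, "winter_cold": 1.0}
colour = blend(square)
overall = linear_sum(square)
```

With equal contributions from seasonality and winter cold, the blend falls midway between yellow and red, giving an orange tone of the kind described for Fig. 5a.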

Figure 5.

Strength of climatic (a) and other (b) effects on the modelled distribution of the Melodious Warbler. Colours in (a) indicate the individual covariate contribution to the overall climatic effect by blending yellow (seasonality), green (growing season), blue (water availability) and red (winter cold) such that the orange/brown tone indicates contributions from both winter temperature and seasonality. Darker shades indicate lower than average probability of presence, paler shades indicate higher than average, with the size representing the overall median climate effect and solid symbols identifying regions where the credible interval does not overlap zero. Square sizes in (b) indicate the size of the error term, colours indicate the positive (grey) or negative (pink) influence of the error term on overall modelled probability with darker colours indicating credible intervals do not overlap zero. Thus, dark grey identifies regions with significantly higher probability of presence than indicated by climate variables alone. (c) and (d) are equivalent plots for Temminck's Stint, and (e) and (f) for Grey Heron. Note especially the relatively small random effect contribution in (d) and the apparent political boundary effect in (f).

Comparing this map (Fig. 5a) with the map of actual presence (Fig. 2a) suggests that large areas of apparently climatically suitable territory are unoccupied. Plotting the spatially explicit error term (Fig. 5b) shows that there is very strong congruence between the error term and actual presence, suggesting factors other than the measured climate variables are important drivers of distribution. This is further evidenced by plotting the map with squares sized according to one minus the absolute difference between probabilities estimated from the full model (Equation (eqn 1)) and those estimated from the climate components alone (Equation (eqn 6); Fig. 6), so that large sizes represent large climate effects. It is self-evident that the measured climate variables must be suitable within the Iberian range, and it is not surprising that climate is unsuitable in the far north-east. More interesting is the evidence that other factors explain most of the actual range limit, running from the Netherlands, through Germany and down to the Balkans and Greece (cf. Fig. 4). In fact, this is no surprise: direct competition with the Icterine Warbler is a well-known factor limiting distribution (Secondi et al. 2003), whilst the Eastern Olivaceous Warbler H. pallida occupies many of the suitable regions of south-eastern Europe (Fig. 6 inset). If we had included the distributions of other Hippolais warblers as an additional (biotic) covariate, we would have been able to quantify the relative effect of this compared to the measured climate and other abiotic covariates, allowing some quantification of the BAM concept.
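
The quantity used to size the squares in Fig. 6 can be sketched as follows. The sketch assumes that the full linear predictor is the sum of an intercept, the climate components and the spatial error term; the function and argument names are hypothetical, and the actual forms are those of Equations (eqn 1) and (eqn 6).

```python
import math


def inv_logit(eta):
    """Map a logit-scale value to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def climate_share(climate_logit, error_term, intercept=0.0):
    """One minus the absolute difference between full and climate-only probabilities."""
    p_full = inv_logit(intercept + climate_logit + error_term)
    p_climate = inv_logit(intercept + climate_logit)
    return 1.0 - abs(p_full - p_climate)


# Hypothetical squares: climate is suitable in both, but the error term differs.
well_explained = climate_share(climate_logit=2.0, error_term=-0.5)
poorly_explained = climate_share(climate_logit=2.0, error_term=-4.0)
```

A value near one indicates a square where climate alone reproduces the modelled probability; smaller values indicate squares where the error term, and hence unmeasured factors, does most of the work.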

Figure 6.

The proportion of the overall modelled probability of presence for Melodious Warbler that can be explained by climate variables alone. Size of squares indicates (one minus) the difference between probabilities estimated from the climate covariate components of the model alone and that of the full model, colours are blended as in Fig. 5 to indicate the contributions of individual climate covariates: yellow (seasonality), green (length of growing season), blue (water availability) and red (winter cold). Inset shows the distribution of known competitor Hippolais icterina (red and purple squares) and H. pallida (blue and purple squares). Note that climate alone is capable of explaining presence in Iberia and absence in north-eastern Europe, but other factors must explain absence in the band running north-west to south-east, where contact with competitors occurs.

For other species, different effects are obvious: the distribution of Temminck's Stint (Fig. 5c,d) appears to show strong climate effects (though absence from Iceland requires a different explanation, perhaps simply geographical isolation: Beale, Lennon & Gimona 2009). The pattern for the Grey Heron (Fig. 5e,f) is particularly interesting, as the mapped error terms identify political units: Spain, Italy, Iceland and Greece are particularly obvious countries where the low probability of finding the birds matches the low values of the spatial error term. This pattern is perhaps best explained not in terms of biology, but by the historical persecution of the species in some of these countries, from which the species is only now recovering (Lorrilliere, Boisteau & Robert 2010). Indeed, it is useful to note that when modelling individual species distributions, plotting the spatially explicit error term may allow identification of biologically important covariates not initially included in the model (the distribution of competitors, or regions where human disturbance is particularly important), which, after judicious ecological investigation, may later be explicitly incorporated into a new model, resulting in an overall improvement. Of course, if there is spatial variation in observer effort and no occupancy model is included within the data model, the random effect may also include bias in detection probability – although this is unlikely to be the explanation for the large and obvious Grey Heron where persecution is known to impact distribution (Lorrilliere, Boisteau & Robert 2010).

So far, these analyses have focussed on the influence of different factors on individual squares, but by averaging various properties of the model over all squares, we can estimate the overall importance of the different components of the model (Equations (eqn 6), (eqn 7), (eqn 8)). In fact, we find that different species show differing degrees of climatic determination (from Equation (eqn 8), Melodious Warbler = 0·277 (95% CI = 0·110, 0·532), Temminck's Stint = 0·422 (0·047, 0·583), Grey Heron = 0·098 (0·003, 0·775): Table S1): it is clear that climate exerts a relatively strong influence on the distribution of Temminck's Stint, a more moderate influence on the distribution of Melodious Warbler and little influence on Grey Heron – all as expected from these species' autecology.

Discussion

Valuable ecological insights can be obtained from our method, which, uniquely among species distribution modelling methods, allows rigorous statistical modelling and visualization of the spatial effects of both known and unknown factors on the distribution of species. We found that our method passes two further tests: it can accurately identify uninformative covariates (because the true degree of statistical association can be measured), and parameter estimates are not strongly affected by variations in the covariate set. Moreover, a strength of the hierarchical modelling approach is that it is relatively straightforward to extend our basic model to more sophisticated cases: if, for example, data are available representing observer effort, an observation/detection process can be incorporated to generate an occupancy model (e.g. Royle et al. 2007), more complicated niche models could be built to include interactions between covariates or habitat variables and other factors, or models of data from two time periods or two spatial scales can be incorporated to measure correlates of change in distribution (we recently used a related method, extended with an occupancy model and models of colonization and extinction to analyse changes in distribution of Tanzanian bird species: Beale et al. 2013). Our method also allows the quantification of the importance of measured covariates compared with unknown spatial effects. Such partitioning has always been possible with any spatially explicit method (and has been widely used in analysis of species composition data sets since Borcard, Legendre & Drapeau 1992), but has not previously been explored in the context of species distribution models. Methods for measuring variable importance in regression models (including RandomForests) have a long history in the statistical literature (reviewed by Grömping 2006), and most suffer from a range of problems related to collinearity that will likely influence our estimate here too. 
However, no standard solution has been previously proposed that is suitable for this type of random effects model: our measure is a simple and intuitive approximation that will likely develop further in time. When used with adequate covariates, our method allows calculation of the relative importance of any fitted biotic and abiotic components at a range of spatial scales, allowing quantification of part of the BAM concept. Indeed, with the exception of calculations involving the spatial error term, many more standard distribution models could be explored in similar ways to those outlined here: the advantage of our method lies in the imposition of an ecologically plausible model, the statistical defensibility and the improved model assessment that comes from appropriate modelling of spatial autocorrelation.

Whilst the improvements in distribution modelling using these methods are potentially substantial, we do not consider our method the final word in species distribution modelling. First, and most importantly, as with all statistical models, our method does no more than identify an empirical association or correlation. Whilst correlations can be extremely valuable, particularly when large-scale analyses are undertaken and experiments are impossible, they are still unable to attribute causation. Thus, it remains important for ecologists to identify and test potential mechanisms that underlie correlations. Secondly, these methods remain focused on single species in isolation and do not deal explicitly with species interactions, although the error term can help identify these. Methods combining current approaches to multispecies analysis (Milns, Beale & Smith 2010) with the spatially explicit statistical method described here are highly desirable: our method is extendible to a small number of multispecies models through hierarchical linkage of two or more simultaneous distribution models (MacKenzie, Bailey & Nichols 2004), but whilst this would allow a much fuller analysis of the biotic component of the BAM concept, this approach would be computationally challenging with more than a few species. Finally, our approach, like any, is limited by the appropriateness of the set of covariates used. Variables considered useful for bioclimatic niche modelling have evolved using methods with considerable statistical shortcomings, and this may have resulted in the suboptimal selection of appropriate climatic factors. The approach described here allows this to be addressed as covariate credibility is statistically assessed and it may be necessary to, at least initially, search more widely to identify optimal climate covariates. 
Furthermore, in common with all current methods, our assessment of sensitivity focuses on individual climate variables, whereas global climate change will alter multiple climate variables in ways that are unlikely to preserve the current correlations between different climate components (Jackson et al. 2009). Further work will be necessary to allow the impacts of such changes in cross-correlation to be assessed.

Other details of the modelling method can be seen not so much as limitations, but issues that need consideration during the analysis. Most of these involve interpretation of the spatially explicit error term. As we showed in the examples, plotting and investigating the error term can highlight patterns indicating important but not explicitly modelled processes. In a detailed single species study, it may be possible to identify some candidate processes underlying the pattern, and these can be incorporated as additional covariates in an improved model, although it is important to avoid multiple testing problems (Anderson, Burnham & Thompson 2000). For example, it might be wise to incorporate a factor identifying countries where Grey Heron have been heavily persecuted. Doing this should improve the precision with which all parameters are estimated, and a new model would likely show a very different error term structure, reflecting other, less obvious spatial patterns. It should certainly not be assumed that estimating an error term removes the need to include other factors within the model if important variables can be identified: the ability of the error term to ‘absorb’ unexplained spatial structure is finite. A further elaboration of our approach could involve a different form of spatially structured error term that allows for discontinuities in the degree of smoothing between neighbouring squares around sharp geographical or political boundaries (Brewer & Nolan 2007; Reich & Hodges 2008).

Strong patterns in the spatially explicit error term also make future projections from the model challenging. In our examples, we fitted models that are essentially bioclimate envelope models (cf. Beale, Lennon & Gimona 2008). Such models have been widely used in combination with output of general circulation models to predict the distribution of species following climate change (Araújo et al. 2004; Araújo & New 2007; Garcia et al. 2012), but for our models, this would require careful thought about what to do with the error term component in the future projection. In the case of Temminck's Stint, where the error term is relatively small throughout the spatial domain, this is unimportant. If, as with the Grey Heron, the error term primarily reflects anthropogenic activities that are unlikely to be affected by climate change, then perhaps the best approach would be to leave the error term for each square unchanged when making a projection. However, in the case of the Melodious Warbler, where the species is apparently absent from many climatically suitable regions probably because of competition, the appropriate response is less clear because climate change may simultaneously impact the distribution of these other species too. This difficulty of projection is a challenge, but we believe it reflects a fundamental difficulty in using correlative models to forecast climate-induced distribution shifts. This stands in contrast with the apparent ease with which projections have been made to date using statistically naïve niche models. Moreover, because we use a Bayesian approach to parameter estimation, we can easily present uncertainties involved in any projection – an issue prominent in recent climate impacts literature (Araújo & New 2007; Beale & Lennon 2012). 
Additionally, our maps of sensitivity to individual covariates, whilst not actually projections of future distribution, can certainly be used to predict regions where distribution changes are most (and least) likely to occur, allowing verification of our models. For example, we found that the distribution of Grey Heron was least strongly explained by climate overall (Table S1), but that climate warming may allow an inland colonization of Norway and southern Sweden, and may also lead to colonization of more areas in the central European Alps (Fig. S7). Such predictions could be useful not only to verify model accuracy but to identify species and locations where impacts of climate change are likely to be first identified. They may also identify regions where mitigation methods (both to prevent unwanted declines and unwelcome increases in invasive species) are likely to be of greatest use.
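
For a case such as the Grey Heron, the option of carrying the present-day error term forward unchanged, together with the posterior uncertainty, could be sketched for a single square as below. The draws are invented for illustration and the names are hypothetical; pairing climate and error draws from the same MCMC iteration is assumed so that any posterior correlation between them is preserved.

```python
import math
import statistics


def inv_logit(eta):
    """Map a logit-scale value to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def project_square(climate_draws_future, error_draws_current):
    """Posterior median and 95% interval of projected probability for one square.

    The spatial error term is held at its present-day posterior draws.
    """
    probs = [
        inv_logit(c + e)
        for c, e in zip(climate_draws_future, error_draws_current)
    ]
    cuts = statistics.quantiles(probs, n=40, method="inclusive")
    return statistics.median(probs), cuts[0], cuts[-1]  # 2.5% and 97.5% points


# Hypothetical draws: climate becomes suitable, the error term stays negative.
climate_future = [1.0 + 0.01 * i for i in range(-50, 51)]
error_now = [-2.0 + 0.005 * i for i in range(-50, 51)]

median, lower, upper = project_square(climate_future, error_now)
```

Reporting the interval alongside the median makes explicit how much of the projection rests on the assumption that the error term does not change.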

In conclusion, this type of model, with ecologically plausible niche forms and appropriate accounting for spatial autocorrelation, can be seen as a partial implementation of the BAM concept and advances most methods for species distribution modelling, addressing as it does a number of the important statistical and ecological shortcomings of current methods. We anticipate that use of this and similar models will allow ecologists to address afresh important questions concerning the patterns and structure of species distributions.

Acknowledgements

We thank the European Bird Census Council for access to European breeding bird distributions. D. Elston, S. Albon, T. Peterson and two anonymous referees helped improve the manuscript. This project was funded by the Rural and Environment Science and Analytical Services Division of the Scottish Government and by the European Union through a Marie Curie fellowship to CMB.

Data accessibility

Species distributions: (Hagemeijer & Blair 2002), contact EBCC for access: info@ebcc.info. R/WinBUGS code: uploaded as online supporting information.
