Identifying appropriate spatial scales of predictors in species distribution models with the random forest algorithm




  1. Including predictors in species distribution models at inappropriate spatial scales can decrease the variance explained, add residual spatial autocorrelation (RSA) and lead to the wrong conclusions. Some studies have measured predictors within different buffer sizes (scales) around sample locations, regressed each predictor against the response at each scale and selected the scale with the best model fit as the appropriate scale for this predictor. However, a predictor can influence a species at several scales or show several scales with good model fit due to a bias caused by RSA. This makes the evaluation of all scales with good model fit necessary. With potentially several scales per predictor and multiple predictors to evaluate, the number of predictors can be large relative to the number of data points, potentially impeding variable selection with traditional statistical techniques, such as logistic regression.
  2. We trialled a variable selection process using the random forest algorithm, which allows the simultaneous evaluation of several scales of multiple predictors. Using simulated responses, we compared the performance of models resulting from this approach with models using the known predictors at arbitrary and at the known spatial scales. We also apply the proposed approach to a real data set of curlew (Numenius arquata).
  3. AIC, AUC and Nagelkerke's pseudo R2 of the models resulting from the proposed variable selection were often very similar to the models with the known predictors at known spatial scales. Only two of nine models required the addition of spatial eigenvectors to account for RSA. Arbitrary scale models always required the addition of spatial eigenvectors. 75% (50–100%) of the known predictors were selected at scales similar to the known scale (within 3 km). In the curlew model, predictors at large, medium and small spatial scales were selected, suggesting that for appropriate landscape-scale models multiple scales need to be evaluated.
  4. The proposed approach selected several of the correct predictors at appropriate spatial scales out of 544 possible predictors. Thus, it facilitates the evaluation of multiple spatial scales of multiple predictors against each other in landscape-scale models.


The distribution of species can depend on processes occurring at multiple spatial scales (Wiens 1989; Cunningham & Johnson 2006; Thogmartin & Knutson 2007). The occupancy of a breeding patch or nest site can depend on factors that influence the quality of specific nest sites (Martin 1998), areas immediately around nest sites (e.g. territories) or areas utilised outside the immediate area (Dallimer et al. 2012). Occupancy can additionally depend on factors that indicate the quality of breeding areas [e.g. ‘public information’ such as breeding densities (Doligez et al. 2004)] or that influence population size (Takada & Miyashita 2004). Thus, occupancy of a patch can depend on processes at multiple spatial scales, some of which can be large.

Multiple-scale species distribution models can increase predictive ability relative to single-scale approaches (Cunningham & Johnson 2006). However, getting the scale of predictors wrong can decrease the variance explained, add residual spatial autocorrelation (RSA), bias regression coefficients and therefore lead to the wrong conclusions (de Knegt et al. 2010). Accounting for spatial autocorrelation [with techniques such as simultaneous autoregression (see e.g. Dormann et al. 2007)] may not remove the parameter estimate bias originating from the misspecification of spatial scales of predictor variables (de Knegt et al. 2010). This leads to a conundrum: including the appropriate predictors at the appropriate scales is important. Not including them, or including the correct predictors at the wrong scales, can lead to the wrong conclusions, but how do we select appropriate scales when they are typically not known? [As species distribution models (SDM) frequently model variables correlated with process variables, we use the term ‘appropriate scale’ rather than ‘scale of processes’.]

One approach to this problem is to first measure the potential predictors within different buffer sizes (henceforth called scales) around sample locations and then to regress each predictor against the response for each scale (e.g. Steffan-Dewenter et al. 2002; Holland, Bert & Fahrig 2004; Gray, Phan & Long 2010). The assumption is that the model with the best fit identifies the most predictive, and therefore, the single most meaningful, spatial scale. The best fit is typically measured by the highest correlation coefficient or lowest AIC (Akaike information criterion) (Burnham & Anderson 2004). For an example with our data see Fig. 1a.

Figure 1.

Akaike information criterion for logistic regressions of curlew presence or absence with (a) area of flat land and (b) area of unfertile soil, at buffer sizes (scales) between 250 m and 10 km around sample locations. The log of the area visible from the transect was included as an offset and the number of hiking groups on the transect as a fixed factor. This approach of measuring predictors within different scales, and regressing the predictor against the response for each scale has been used to identify a single scale assumed to be the most meaningful scale of the predictor. In example (a) only one scale (arrow) is identified: 2500 m. In example (b) two scales with good model fit exist: 250 m and 10 km (arrows).

However, assessing each predictor separately, and selecting a single best scale, can lead to overall problems in model specification. If models contain only single predictors, they may leave out other important variables, increasing residual variance and leading to a bias in the statistical inference if the response and predictor are both spatially autocorrelated (which is often the case in ecological data sets) (Legendre 1993), as the correlation coefficient and AIC can be affected (Lennon 2000; Hoeting et al. 2006). Therefore, the magnitude of the difference in fit between scales may be biased (see also Schooley 2006). The bias will be affected by the strength of the spatial autocorrelation in the predictor (Lennon 2000), which is highly likely to vary with scale. Therefore, the scale with the best fit may not necessarily represent the most appropriate scale for the predictor. Indeed, there may not be a single appropriate scale as a species can be influenced by a predictor at multiple scales (Thogmartin & Knutson 2007).

One solution therefore would be to evaluate all scales of all predictors with moderate levels of statistical support against each other, rather than to rely on selecting only the ‘best’ one, where the estimate of what is best may be biased. However, this raises two problems: firstly, the number of possible predictors can often be large (van Horne 2002). In a landscape, a single factor (such as habitat type) could be a predictor in several ways at each spatial scale: for example, the presence or the proportion or the size of patches of one of several habitat types or the isolation or connectivity of these patches could influence the distribution of a species (Saab 1999; van Langevelde 2000; Cunningham & Johnson 2006; Hodgson et al. 2011). The number of predictors can therefore quickly become large relative to the number of data points, a situation to which traditional statistical techniques such as linear or logistic regression are not robust (Strobl et al. 2007). Secondly, predictors in ecological data sets are often highly correlated, which can lead to problems in variable selection (Graham 2003). Both problems are potentially exacerbated when the above repeated regressions identify several scales with good model fit per predictor (for an example see Fig. 1b).

Studies conducted at different spatial scales can lead to different conclusions (a variable influencing a response at one scale may not do so at another or influence the response in a different way) (Wiens 1989; Hamer & Hill 2000). In ecology, studies using arbitrary spatial scales proliferate (Wheatley & Johnson 2009). Therefore, a methodology that can identify appropriate spatial scales of predictors despite the difficulties outlined above would further our understanding of landscape-scale processes and result in more accurate predictions of species distributions.

In this study, we investigate whether the random forest algorithm, a machine learning method based on regression or classification trees (Breiman 2001; Liaw & Wiener 2002) can circumvent this ‘combinatorial explosion’ of ‘predictor × scale’. Random forest is robust to situations in which the number of data points is small compared with the number of predictors (Strobl et al. 2007) and has successfully been used for variable selection, for example, in microarray studies (which also often have a small number of data points, a high number of predictors and high correlations between predictors) (Archer & Kimes 2008; Nicodemus et al. 2010).

We first use simulated responses to assess whether combining regressions of single predictors with variable selection by random forest can (a) result in accurate predictions and (b) select the correct predictors at appropriate spatial scales. Then, we apply the same approach to a real data set of a wader species in an upland area of the UK. For both the simulated and real data sets, we examined 12 factors, four of which (elevation, aspect, slope, soil) were further divided into categories to examine whether the quantity of a category (e.g. of low elevation) at a given scale influenced the distribution of the species or simulated response. Examination of these variables at up to 15 scales each resulted in 544 potentially highly correlated predictors to select between.

Materials and methods

Study area and predictor variables

The study was carried out in the Yorkshire Dales National Park (see Fig. S1a), an upland area in northern England, UK. Predictor variables for the simulation were the same as were chosen for the case study on curlew (Numenius arquata).

Variables that directly influence a species distribution, such as food availability or predation risk, are often difficult to record particularly for landscape-scale studies. Instead, we used habitat variables that were more readily available for large areas and were likely to be associated with the more direct, but unmeasured predictors. We sought to describe food availability (through possible associations with soil type, elevation, slope, aspect, rainfall), microclimatic conditions (through elevation, aspect, rainfall), habitat structure (through livestock numbers), disturbance (using settlements, paths/roads) or perceived predation risk (using settlements, field walls, viewshed) (for a more detailed description of the possible associations between the factors we sought to describe and the predictors used see Appendix S2).

Elevation was obtained as a Digital Terrain Model (DTM) at 50 m resolution (50 × 50 m) (Land-Form PANORAMA downloaded from the EDINA Digimap OS service; © Crown Copyright/database right 1993. An Ordnance Survey/EDINA supplied service). Aspect and slope were calculated from the DTM. Soil data were extracted from the simplified version of the National Soil Map data set [NATMAP soilscapes; 1:125000 vector map (NSRI 2011)] after conversion into a 50 m resolution raster (NATMAP soilscapes © Cranfield University (NSRI) and for the Controller of HMSO 2009). The area of settlement (houses, gardens, barns etc.) was identified from the Ordnance Survey MasterMap. Paths/roads (including railways) outside settlements were collated from the MasterMap and data on public rights of way held by the Yorkshire Dales National Park Authority (YDNPA). ‘Obstructing features’ in the MasterMap (line features which were mainly walls) were used as field walls.

Average annual rainfall for the period 1961–90 was obtained at 5 km resolution (Met Office). Livestock numbers (sheep and cattle) in 2004 (AGCensus data downloaded from the EDINA Digimap OS service; © Crown Copyright/database right 2009. An Ordnance Survey/EDINA supplied service) were obtained at 2 km resolution.

Curlew were repeatedly surveyed from sections of public right of way (henceforth called transects) in 244 observation units (see ‘Survey data’ below), and two transect-specific predictors were included. Due to the hilly environment, parts of the observation units were invisible from the transect line, potentially decreasing the probability of recording curlew. The visible area per observation unit was recorded approximately on a field map (scale: ca. 1:8000) and later digitized and calculated. The number of walking groups (a potential source of disturbance) was recorded during each transect survey and the mean across all repeat surveys computed.

Curlew are highly mobile birds, and we expected that the quantity of a predictor (e.g. of low elevation) within a scale would be more important in explaining the distribution of the species than the size of patches of the predictor or the connectivity between patches. We therefore restricted our analysis to the quantity of predictors within scales. For elevation, aspect, slope and soil type this necessitated creating categories. As we had no a priori knowledge on the grouping of categories [we hypothesised for example that the more wind-exposed west-facing areas (Met Office 2010) may be more thermoregulatorily unfavourable than south-facing areas, but did not know in which category to place south-west-facing areas], we initially created a fine division of categories for elevation (0–200, 200–300, 300–400, 400–500, 500–600, 600–850 m), aspect (flat, north, north-east, east, south-east, south, south-west, west, north-west) and slope (0–2, 2–5, 5–10, 10–15, 15–25, 25–60°). In the curlew model, for adjacent categories (e.g. south and south-east) with good model fit in repeated regressions (see below) at the same spatial scale, we plotted the component smooth function of a generalized additive model at the scale of the linear predictor, that is, the smoothed relationship between the response and the predictor. If the plots were similar, the adjacent categories were grouped. Soil data were grouped according to the attributes of the data (NSRI 2011): by texture (peat or loam), fertility and lime status (lime-rich, very low, low, moderate or high fertility; very low and low were further grouped into unfertile soils, and moderate and high into fertile soils) and drainage (well drained, impeded drainage or wet soil; the latter two were further grouped into moist soils).

For each pixel in a raster map (50 m resolution) of the study area, we calculated the area of each elevation, aspect, slope and soil category as well as the area of settlements within circular buffers of radii (i.e. the scales) 0·25, 0·5, 0·75, 1, 1·5, 2, 2·5, 3, 4, 5, 6, 7, 8, 9 and 10 km, the length of paths/roads at 250, 500, 750, 1000, 1500 and 2000 m and the length of walls at 250, 500, 750, 1000 and 1500 m. For each observation unit, values of each of these predictors at each scale were extracted for the centroid of the visible area. Viewshed (the area visible from a location) was calculated for four random points (which had to be at least 120 m apart) per observation unit at the scales 250, 500, 750 and 1000 m and averaged per observation unit and scale. Henceforth, a predictor considered at a specific scale will be called a scale-specific predictor. Considering the eight factors (aspect, slope, elevation, soil, settlements, paths/roads, walls and viewshed) at multiple spatial scales resulted in 540 scale-specific predictors. Together with two single-scale predictors (rainfall and livestock numbers, extracted for the centroid of the visible area of each observation unit) and the two transect-specific predictors, the total number of predictors was 544.
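The predictor count can be checked with a short tally. The sketch below only restates the numbers of categories and scales given in the text (the 13 soil categories are those reported in the results); the dictionary layout and variable names are illustrative and not part of the original analysis.

```python
# Categories per factor (1 for an uncategorised factor) and number of scales,
# as stated in the text. Names are illustrative only.
FACTORS = {
    "aspect": (9, 15),
    "slope": (6, 15),
    "elevation": (6, 15),
    "soil": (13, 15),
    "settlements": (1, 15),
    "paths/roads": (1, 6),
    "walls": (1, 5),
    "viewshed": (1, 4),
}
SINGLE_SCALE = ["rainfall", "livestock numbers"]
TRANSECT_SPECIFIC = ["visible area", "walking groups"]

# Every category of a factor is measured at every scale of that factor.
scale_specific = sum(cats * scales for cats, scales in FACTORS.values())
total = scale_specific + len(SINGLE_SCALE) + len(TRANSECT_SPECIFIC)

print(scale_specific, total)  # 540 544
```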

Simulated responses

We ran two sets of simulations differing in the maximum scale of predictors (5 and 10 km, respectively). For both sets, we created simulated responses based on two, four or six scale-specific predictors (henceforth called true predictors) and used arbitrary coefficient values for each predictor. For both six predictor scenarios and for the four predictor scenarios with maximum scale of 10 km, we created two simulated responses each, differing in the magnitude of the coefficients (see Table 1 and Table S1 for their values).

Table 1. Akaike information criterion, AUC and Nagelkerke's R2 for models with a simulated response against the true scale-specific variables, the selected scale-specific variables and the true variables at arbitrary scales (500 m). The simulated responses were based on two, four or six true scale-specific variables (Var), and the scales of the variables were either restricted to a maximum of 5 or 10 km. When spatial eigenvectors were added, results are presented as: before/after addition of spatial eigenvectors

| Max scale | Var | Measure | Lower coefficients: True | Lower coefficients: Selected | Lower coefficients: Arbitrary scale | Higher coefficients: True | Higher coefficients: Selected | Higher coefficients: Arbitrary scale |
|---|---|---|---|---|---|---|---|---|
| 10 km | 2 | AIC | 309·72 | 309·28 | 328·34/320·5 | | | |
| 10 km | 2 | Nagelkerke's R2 | 0·17 | 0·17 | 0·08/0·14 | | | |
| 10 km | 4 | AIC | 271·47 | 286·24 | 326·32/308·13 | 207·33 | 201·39 | 319·91/219·72 |
| 10 km | 4 | Nagelkerke's R2 | 0·36 | 0·28 | 0·11/0·22 | 0·58 | 0·61 | 0·14/0·58 |
| 10 km | 6 | AIC | 158·34 | 211·63/179·81 | 244·25/198·97 | 89·21 | 84·79 | 237·69/125·27 |
| 10 km | 6 | Nagelkerke's R2 | 0·73 | 0·58/0·69 | 0·48/0·65 | 0·88 | 0·89 | 0·5/0·83 |
| 5 km | 2 | AIC | 276·42 | 282·28 | 309·81/294·58 | | | |
| 5 km | 2 | Nagelkerke's R2 | 0·32 | 0·33 | 0·18/0·25 | | | |
| 5 km | 4 | AIC | 228·42 | 231·12 | 270·8/257·9 | | | |
| 5 km | 4 | Nagelkerke's R2 | 0·52 | 0·5 | 0·36/0·43 | | | |
| 5 km | 6 | AIC | 169·51 | 215·92/190·09 | 249·73/199·17 | 146·89 | 151·52 | 229·82/186·41 |
| 5 km | 6 | Nagelkerke's R2 | 0·7 | 0·57/0·66 | 0·46/0·66 | 0·76 | 0·75 | 0·52/0·68 |

Coefficients used to calculate the linear predictor: two true variables: 0·7, −0·7; four true variables: lower coefficients: 1·2, 0·7, −0·7, −1·2; higher coefficients: 1·8, 1·2, −1·2, −1·8; six true variables: lower coefficients: 1·8, 1·2, 0·7, −0·7, −1·2, −1·8; higher coefficients: 1·8, 1·5, 1·2, −1·2, −1·5, −1·8.

To create the simulated responses, we preselected the spatial scales for each scenario. For the simulations with the maximum scale of 10 km, with two predictors, we selected 1 and 9 km, for four predictors, we used 0·5, 3, 7 and 9 km, and for six predictors, we used 0·25, 2, 4, 6, 8 and 10 km. For the simulations with maximum 5 km scale, with two predictors, we used 0·25 and 5 km, for four predictors, we used 0·25, 1·5, 3 and 5 km and for six predictors 0·25, 1, 2, 3, 4, and 5 km. We ordered all ‘types’ of predictors (i.e. flat: 1, north-facing: 2, etc.), then permuted their order.

We chose the first ‘type’ of the permuted order for the smallest preselected scale of each scenario. For each subsequent scale, we permuted the order of the remaining ‘types’ again. From the permuted order, we chose the first ‘type’ with |Spearman ρ| ≤ 0·3 to each previously selected predictor for the scale.
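The selection of true predictors can be sketched as follows. This is a minimal illustration under stated assumptions, not the original R code: the function names, the `data` dictionary keyed by (type, scale) and the tie-averaged rank correlation are all hypothetical, and we read the rule as requiring a candidate at the current scale to have |Spearman ρ| ≤ 0·3 with every scale-specific predictor chosen so far.

```python
import random


def ranks(values):
    """Average ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = mean_rank
        i = j + 1
    return r


def spearman(x, y):
    """Spearman rho as the Pearson correlation of ranks (non-constant input assumed)."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


def pick_true_predictors(data, types, scales, rng, max_rho=0.3):
    """data[(type, scale)] holds one value per observation unit (hypothetical layout)."""
    remaining = list(types)
    chosen = []
    for scale in sorted(scales):
        rng.shuffle(remaining)  # permute the remaining types anew for each scale
        for t in remaining:
            candidate = data[(t, scale)]
            if all(abs(spearman(candidate, data[c])) <= max_rho for c in chosen):
                chosen.append((t, scale))
                remaining.remove(t)
                break
        else:
            raise ValueError(f"no admissible type at scale {scale}")
    return chosen
```

For the smallest scale the list of chosen predictors is empty, so the first type of the permuted order is always accepted, as described above.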

We standardized each true scale-specific predictor to a mean of zero and a standard deviation of one and calculated the linear predictor $g(x_i)$ for each of our 244 transect centroids $i$ as $g(x_i) = \alpha + \beta_1 x_{1i} + \dots + \beta_n x_{ni}$, where $x$ are the true scale-specific predictors, $\beta$ the coefficients, $n$ the number of true predictors and the intercept $\alpha = 0$. We calculated the logistic model for the probability $P$ of the response being 1 as

$$P = \frac{e^{g(x_i)}}{1 + e^{g(x_i)}}$$

and drew a value from a binomial distribution with probability P for each of the 244 observation units.
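A minimal sketch of this simulation step is given below, assuming the standard logistic link and a sample standard deviation (n − 1 denominator) for standardization. The function names and the list-of-columns input are illustrative, not taken from the original code.

```python
import math
import random


def standardize(values):
    """Centre to mean zero and scale to unit sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]


def simulate_response(predictors, betas, rng, alpha=0.0):
    """predictors: one list of values per true scale-specific predictor.

    Returns the simulated 0/1 responses and the underlying probabilities.
    """
    cols = [standardize(col) for col in predictors]
    responses, probs = [], []
    for i in range(len(cols[0])):
        g = alpha + sum(b * col[i] for b, col in zip(betas, cols))  # linear predictor
        p = 1.0 / (1.0 + math.exp(-g))  # same as exp(g) / (1 + exp(g))
        probs.append(p)
        responses.append(1 if rng.random() < p else 0)  # one Bernoulli draw per unit
    return responses, probs
```

With two predictor columns of length 244 and the coefficients 0·7 and −0·7 from Table 1, `simulate_response([x1, x2], [0.7, -0.7], random.Random(1))` would return one simulated presence/absence value per observation unit.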

Survey data

To demonstrate the proposed approach on a real data set, we used survey data of a bird species, the curlew. Due to the size of the national park (ca. 1770 km2) and the limitations of available survey time, the survey was focused on areas below 500 m in elevation. Curlew were surveyed from 61 transects of 2 km length during 2008 (Fig. S1b). Three sets of repeated surveys between 2nd–30th April, 3rd–22nd May and 2nd June–1st July were carried out. Plots of the number of curlew records with distance from transects suggested reduced detectability beyond 200 m (see Fig. S2). Only observations within this distance were used. The area surveyed from each transect was divided into four observation units of approximately equal length to enable the study of environmental variation at this and larger scales. The observation units (244) roughly approximated the reported size of core areas used during nesting and chick rearing of the species (see Appendix S1).

Records that were unlikely to be of breeding birds, such as flocks, were omitted (see Appendix S1). Each observation unit was recorded as having the species present when at least one individual was recorded during at least one survey (see Appendix S1, also for further details of the selection of transect locations and survey methodology).

Scale-specific models

Following previous approaches (e.g. Steffan-Dewenter et al. 2002; Holland, Bert & Fahrig 2004; Gray, Phan & Long 2010), we (a) measured predictors within different scales around sample locations and (b) carried out repeated regressions of the response with each predictor at each scale. Then for each predictor, we (c) used AIC to select all scales with good model fit and (d) used random forest to rank and select important scale-specific predictors (R code available on request).

Repeated regressions

For factors which we considered at several spatial scales, we carried out repeated regressions first. The purpose of the repeated regressions was not to identify the important scale-specific predictors, but to omit some of the unimportant scale-specific predictors. We carried out logistic regressions of presence or absence of each simulated or real response for each predictor and scale. To account for potential transect-specific influences, the number of walking groups was added as a fixed factor and the log of the visible area included as an offset. We plotted AIC against scale for each predictor. To avoid the selection of too many irrelevant scales, only those scales were selected for which AIC was (a) at least two less than the AIC of the model with the transect-specific variables only, (b) less than the AIC of the next smaller and larger scale and (c) less than the AIC of the second closest smaller and larger scale. Thus, all local minima with a difference in AIC of at least two to the model with transect-specific predictors were selected, but if the plot of AIC against scale showed two local minima with only one data point between them, only the one with the smaller AIC was selected. Condition (c) was dropped for the smallest and largest scale as the trend in AIC outside the considered range was unknown. Thus, it was possible to retain none, one or several spatial scales per predictor.
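The selection rule for scales can be sketched as a small function. This is one possible reading of conditions (a) to (c), not the original R code: at the smallest and largest scale only the single existing immediate neighbour is compared, and at interior scales any first or second neighbour that exists is compared. The function name, arguments and the AIC values in the example are invented for illustration.

```python
def select_scales(scales, aic, aic_baseline, min_gain=2.0):
    """Return the scales whose AIC is a qualifying local minimum.

    scales       -- buffer sizes in increasing order
    aic          -- AIC of the single-predictor model at each scale
    aic_baseline -- AIC of the model with transect-specific variables only
    """
    n = len(scales)
    selected = []
    for i in range(n):
        # (a) at least `min_gain` better than the baseline model
        if aic[i] > aic_baseline - min_gain:
            continue
        # (b) immediate neighbours; (c) second neighbours, dropped at both ends
        neighbours = [i - 1, i + 1]
        if 0 < i < n - 1:
            neighbours += [i - 2, i + 2]
        if all(aic[i] < aic[j] for j in neighbours if 0 <= j < n):
            selected.append(scales[i])
    return selected


# Invented AIC curve with good fit at both ends of the range
scales_km = [0.25, 0.5, 0.75, 1, 1.5, 2, 2.5, 3, 4, 5, 6, 7, 8, 9, 10]
aic_curve = [300, 303, 305, 306, 307, 308, 308.5, 309, 308, 307, 306, 305, 304, 303, 301]
print(select_scales(scales_km, aic_curve, aic_baseline=310))  # [0.25, 10]
```

In the second situation described above, two local minima separated by a single scale, only the lower one passes condition (c): `select_scales([1, 2, 3, 4, 5], [310, 300, 304, 301, 310], 320)` returns `[2]`.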

Random forest variable selection

Random forest (see Appendix S3) is an ensemble classifier which builds many classification or regression trees from random subsets of the data and aggregates the results (Breiman 2001; Liaw & Wiener 2002). Random forest is increasingly being used in ecology (e.g. Chapman et al. 2010; Bradter et al. 2011) due to its good performance as a classifier (Cutler et al. 2007). The use of random forest for variable selection is less common in ecology although several variable selection procedures have been proposed (e.g. Sandri & Zuccolotto 2005; Díaz-Uriarte & Alvarez de Andrés 2006; Genuer, Poggi & Tuleau-Malot 2010).

We used the variable selection procedure proposed by Genuer, Poggi & Tuleau-Malot (2010), which the authors have shown to be capable of selecting many, but not all, of the true predictors. It is based on the unscaled permutation importance that is calculated by permuting each predictor in turn and (for classification) using the difference in prediction error (OOB error) before and after permutation as a measure of variable importance (Liaw & Wiener 2002; Strobl et al. 2008). To calculate the OOB error, a training data set is created by sampling with replacement from 2/3 of the data for each classification tree in the ensemble. Each tree is then used to predict the other 1/3 (‘out of bag’) of the data. Classification is by aggregating over all trees, and the OOB error is computed as the proportion of times that the predicted class is not the same as the true class (Breiman 2001; Liaw & Wiener 2002).
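The idea behind the permutation importance can be illustrated with a simplified sketch. In random forest the comparison is made per tree on its out-of-bag observations and then aggregated; the version below instead uses a single, already fitted classifier and one evaluation set, which is an assumption made purely to keep the example short. `predict`, `X` and `y` are hypothetical names.

```python
import random


def error_rate(predict, X, y):
    """Proportion of rows for which the predicted class differs from the true class."""
    return sum(predict(row) != label for row, label in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, col, rng):
    """Increase in prediction error after permuting predictor `col`.

    X is a list of rows (lists); the original data are left unchanged.
    """
    baseline = error_rate(predict, X, y)
    shuffled = [row[col] for row in X]
    rng.shuffle(shuffled)  # breaks the link between this predictor and the response
    X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
    return error_rate(predict, X_perm, y) - baseline
```

A predictor that the classifier does not use has an importance of zero, whereas permuting an informative predictor increases the error.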

Due to the randomness in the algorithm (see Appendix S3), it is desirable to average measures such as the permutation importance over several repetitions of the same model (Nicodemus & Malley 2009). The unscaled permutation importance was unbiased when there was no relationship between the response and the predictors (Nicodemus et al. 2010). However, the permutation importance of noncausal predictors that were correlated with causal predictors was inflated (Strobl et al. 2008; Nicodemus et al. 2010).

The approach proposed by Genuer, Poggi & Tuleau-Malot (2010) to identify a set of predictors suitable for model interpretation consists of: (a) ranking all predictors using the unscaled permutation importance (averaged over 50 repetitions); (b) discarding noise variables by fitting a regression tree to the curve of the plot of standard deviations of importance measures ordered by their mean importance. Variables with a mean importance less than the smallest predicted value of the regression tree model are discarded; (c) computing OOB error for models (averaged over 50 repetitions) starting with a model with the most important variable and adding predictors sequentially in the order of their ranking (henceforth referred to as nested models); (d) selecting the model with the smallest OOB error and augmenting it by the standard deviation of the 50 repetitions; (e) selecting the nested model with an OOB error less than this and containing the fewest predictors.
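Steps (c) to (e) can be sketched as follows, assuming the OOB errors of the nested models have already been computed for each repetition. The function name and the input layout are illustrative. The comparison uses "less than or equal" rather than the strict "less than" of the text, so that the minimum-error model itself always qualifies when the standard deviation is zero; that small deviation is ours.

```python
from statistics import mean, stdev


def select_nested_model(oob_runs):
    """oob_runs[k] lists the OOB errors (one per repetition) of the nested
    model containing the k + 1 top-ranked predictors.

    Returns the number of predictors in the selected model and the threshold used.
    """
    means = [mean(runs) for runs in oob_runs]
    best = min(range(len(means)), key=means.__getitem__)  # (d) smallest mean OOB error
    threshold = means[best] + stdev(oob_runs[best])       # (d) augmented by its SD
    for k, m in enumerate(means):                         # (e) fewest predictors first
        if m <= threshold:
            return k + 1, threshold


# Invented OOB errors for five nested models, two repetitions each
runs = [[0.40, 0.42], [0.30, 0.32], [0.25, 0.27], [0.24, 0.26], [0.26, 0.28]]
print(select_nested_model(runs)[0])  # 3
```

In this invented example the four-predictor model has the smallest mean error, but the three-predictor model lies within one standard deviation of it and is therefore preferred.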

As expected due to the bias towards predictors correlated with causal predictors and as shown by Genuer, Poggi & Tuleau-Malot (2010), this set may contain noncausal predictors that are correlated with causal predictors. For prediction, Genuer, Poggi & Tuleau-Malot (2010) proposed further variable selection to arrive at a smaller set of variables. We base our study on the approach outlined above. For the parameters to be specified in random forest [the number of trees built in the forest (ntree) and the number of predictors available at each node split (mtry); see Appendix S3], we used the values suggested by Genuer, Poggi & Tuleau-Malot (2010): ntree = 2000 and mtry = p/2, where p is the number of predictors, to calculate the permutation importance, and default values to calculate the OOB error.

We used repeated regressions before variable selection with random forest to reduce neighbouring scales with good model fit to a single scale (the local AIC minimum). Without this preselection, the bias of the permutation importance towards predictors correlated with important predictors (see above) would be expected to lead to predictors at many neighbouring scales in the final set as these are often highly correlated (unpublished results confirmed this). This preselection reduced computing time (computing time for our simulated data was ca. three hours without and ca. 15 min with repeated regressions on a 1·87 GHz processor with 8 GB of RAM).

Performance assessment

For each simulated response, we computed three logistic regressions against: (a) the true scale-specific predictors, (b) the selected scale-specific predictors and (c) the true predictors, but at an arbitrary selected scale of 500 m. Predictors were standardized to a mean of zero and standard deviation of one. To all models other than those with the true scale-specific predictors, we applied Moran eigenvector filtering (Dray, Legendre & Peres-Neto 2006; Griffith & Peres-Neto 2006). If necessary, spatial eigenvectors were added until RSA was no longer significant at the 0·05 level. An analysis of deviance between the model with and without spatial eigenvectors was significant at the 0·05 level in each case, confirming that the model with spatial eigenvectors provided the better model fit.

For each simulated response, we compared AIC, AUC (area under the receiver operating characteristic curve) and Nagelkerke's pseudo R2 of the model with the selected scale-specific variables and of the arbitrary scale model to those of the true model. AUC values range from 0·5 to 1. A value of 0·5 indicates that the model discriminates between cases no better than random, whereas AUC values of 0·5–0·7 indicate poor, 0·7–0·9 moderate and > 0·9 high discriminative ability (Franklin 2009). Nagelkerke's pseudo R2 is a measure of the usefulness of the predictors in explaining the response and was calculated as

$$R^2 = \frac{1 - \exp\left((D - D_{null})/n\right)}{1 - \exp\left(-D_{null}/n\right)}$$

where D is the deviance of the model in question, Dnull is the deviance of the null model and n is the sample size (Faraway 2006). (Nonlinearity and interactions need not be explicitly specified in random forest in contrast to logistic or linear regression. We added each scale-specific predictor selected by random forest as a main term into the logistic regressions as no interactions were used to generate the simulated responses).
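For illustration, the statistic can be computed directly from the two deviances. The sketch assumes the deviance-based form of Nagelkerke's pseudo R2 given by Faraway (2006), that is (1 − exp((D − Dnull)/n)) / (1 − exp(−Dnull/n)); the function name is ours.

```python
import math


def nagelkerke_r2(deviance, null_deviance, n):
    """Nagelkerke's pseudo R2 from model deviance, null deviance and sample size."""
    numerator = 1.0 - math.exp((deviance - null_deviance) / n)
    denominator = 1.0 - math.exp(-null_deviance / n)  # largest attainable numerator
    return numerator / denominator
```

A model no better than the null model (D = Dnull) gives 0, a model with zero deviance gives 1, and values in between increase as the deviance decreases.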

To assess whether the selection procedure identified some or all of the true variables, we present the true and selected scale-specific variables for each simulated response and the scale-specific predictor in the nested model set with the highest |Spearman ρ| to each true scale-specific predictor.


All map manipulations were carried out in ArcGIS 9.2 (ESRI, Redlands, CA, USA). Random points were created using Hawth's Tools 3.27. Statistical analysis was carried out in R 2.14.1 (R Development Core Team 2011) using the packages randomForest (Liaw & Wiener 2002), spdep (Bivand, Pebesma & Gómez-Rubio 2008) for Moran eigenvector filtering, mgcv (Wood 2006) for component smooth function plots and Deducer (Fellows 2010) for AUC values.


Results

We considered the variables aspect (nine categories), slope (six categories), elevation (six categories), soil type (13 categories), settlements, paths/roads, walls and viewshed at up to 15 scales each. For simulated responses, between 21 and 63 scale-specific variables were carried forward from the repeated regressions. Together with the four single-scale variables, they were entered into a random forest variable selection.

In the nine simulations, 75% of true predictors (50–100%) were selected at the true or similar scale (within 3 km) (see Table S1). The predictors with the highest regression coefficients (assigned to the largest and smallest scale) were always identified. For 11 of the 12 unidentified true predictors, the highest correlation to any variable in the nested model set was to itself at the true or similar scale.

An additional 16 untrue predictors were selected. These were correlated with one or more true scale-specific predictors. Their highest |Spearman ρ| to a true scale-specific predictor was on average 0·61 (range 0·39–0·92). For 75% of the untrue predictors selected, the highest |Spearman ρ| was to a true predictor identified in the selected set at the same or similar scale (Table S1).

The logistic regressions resulting from the proposed approach required the addition of spatial eigenvectors only for simulated responses based on six predictors and lower coefficients (Table 1). For six of nine simulations, AIC, AUC and Nagelkerke's pseudo R2 were very similar between the model with true and selected scale-specific variables. This was less so for the three models for which only 50% of true scale-specific predictors were identified at similar scales (the two models based on six predictors and lower coefficients and the model based on four predictors, lower coefficients and maximum scale of 10 km).

The arbitrary scale models always required the addition of spatial eigenvectors. Even then, AIC, AUC and Nagelkerke's pseudo R2 were often less similar to the true model than the models resulting from the proposed approach (and spatial eigenvectors if necessary) (Table 1).

For the curlew model, the following category groupings were made: north-east, east, south-east and south-facing slopes at 10 km were grouped into south-easterly slopes; south-west, west and north-west-facing slopes at 10 km were grouped into westerly slopes; and all slopes greater than 10° at 10 km were grouped into steep slopes. After grouping, 40 scale-specific predictors were carried forward from the repeated regressions. Eighteen predictors were contained in the final set. Following Nicodemus et al. (2010), we interpreted these as a set containing groups of correlated and possibly partially redundant predictors. Inspection of the correlations between the predictors (see Table S2) identified two groups: one describing rainfall and topography (aspect/slope/elevation) mainly at large spatial scales (7–10 km) and one describing soil characteristics/low elevation areas/settlements at small spatial scales (0·25–1 km). Flat and east-facing areas at a medium scale (2·5 km) and the area visible from the transect were also selected. The average OOB error (for 50 repetitions) was 23% (see Fig. S3 for a map of predicted curlew distribution). For comparison with the models for simulated responses, a logistic regression with each selected predictor added as a main term had an AUC of 0·87 and a Nagelkerke's pseudo r² of 0·51, and required no spatial eigenvectors.


Our data analysis approach was designed to give explicit attention to the selection of appropriate spatial scales of predictors in SDM. As random forest is robust to situations with many predictors relative to data points, it enables several scales of several predictors to be evaluated against each other in situations in which there are too many predictors (even after preselection) for more traditional statistical techniques such as regression to be used. The simulations showed that, out of 544 predictors, the proposed approach identified 50–100% of the true predictors at an appropriate scale. The addition of spatial eigenvectors to account for unexplained spatial structure in the models was only required when few of the true scale-specific predictors were identified. When a high proportion of the true predictors were identified, AIC, AUC and Nagelkerke's pseudo r² of the models resulting from variable selection were very similar to those of the true model. Consequently, the predictive capabilities of these models were similar to those of the true model, and the models allowed a considerable understanding of the variables underlying the spatial pattern.

Models using an arbitrary spatial scale performed poorly without the addition of spatial eigenvectors. Adding spatial eigenvectors improved their performance. However, interpreting a model with the correct scale-specific predictors is usually easier than inferring underlying processes from the spatial structure of the spatial eigenvectors. Hence, based on our simulations, the proposed methodology offers a considerable improvement in understanding the drivers of spatial patterns at multiple scales.

SDMs are correlative approaches, and the identified scale-specific predictors may not be causal but merely correlated with causal predictors. Causal predictors are often difficult to measure over large areas, but are often correlated with other, more easily collected measures (e.g. temperature variation along an altitudinal range would be time-consuming to collect from many sample sites, but can often be replaced by widely available measures of elevation and aspect). Such substitute predictors can therefore be useful for prediction and can be used to generate hypotheses about causation for future assessment.

Our applied model on curlew selected predictors at large, medium and small scales. While the selected predictors and their scales may not necessarily be the causal predictors and scales, the relatively large difference between the smallest and largest selected scale suggests that processes at multiple scales determine the distribution of curlew. This study therefore adds weight to a growing number of studies which, when considering multiple spatial scales, find that several of them emerge as important (e.g. Cunningham & Johnson 2006). If, as seems the case, processes occurring at multiple scales co-determine spatial patterning in ecology, this should receive much greater weight in ecological thinking, despite the difficulties of fitting models. This study illustrates some methodological approaches that can help address some of the problems of model fitting.

True predictors were typically identified either at the true scale or at a similar one. As the correlations between measurements of a predictor at similar scales tend to be high, using a similar scale should still result in appropriate models. Additionally, processes at the landscape scale will often involve a range of similar scales rather than only one specific scale. For example, during the breeding season, curlew may fly to foraging areas, which are at a range of distances (rather than a single distance) from their nest site (Robson & Percival 2002).


In some of our simulations (apparently those with a higher number of true predictors, lower coefficients and/or larger maximum spatial scales), fewer of the true predictors were identified at appropriate scales. Frequently, predictors that were correlated with already selected true predictors were contained in the selected set instead (see Table S1). This probably results from the inflation of the permutation importance of predictors not important in themselves, but correlated with important predictors (see 'Materials and methods'). Adding several highly correlated predictors can increase OOB error (Genuer, Poggi & Tuleau-Malot 2010). We suspect that the decrease in OOB error from weaker (relative to other predictors in the model) true predictors may not always offset the increase in OOB error from uninformative predictors with inflated importance if too many of the latter are added before the former in nested models. This may lead to weaker true predictors being ranked below the threshold. If so, removing or reducing this bias in the permutation importance would probably have allowed our approach to identify a higher proportion of true scale-specific variables in more scenarios. An alternative, less biased variable importance measure, the conditional variable importance, has been suggested (Strobl et al. 2008), but is computationally expensive. For models of the size presented in this study, for example, the averaging of importance values required greater computational resources than we had available.

Despite this, the proposed approach has the ability to generate hypotheses of possible associations between a species’ distribution and predictors and can greatly reduce the number of ‘predictor x scale’ combinations for further consideration.

While the simulations have shown that our approach is promising, they could only cover a few potential situations. For example, we have not studied the performance of the proposed approach for unbalanced samples or its ability to identify nonlinear scale-specific relationships or interactions. Also, we have not studied the effect of a missing scale-specific predictor on the result of variable selection (an issue that, as pointed out by Dormann et al. (2007), has also received little attention in other approaches).


The simulations presented in this study showed that the proposed variable selection approach is a promising tool that can result in models with high capabilities for prediction and model interpretation. Although the true scale-specific predictors were not always all identified, the identification of at least the stronger predictors at appropriate scales out of many ‘predictor x scale’ combinations will aid our understanding of landscape-scale influences on species distributions. If the bias of the permutation measure towards predictors correlated with important predictors in the current implementation of random forest can be removed or reduced, it is possible that the proportion of predictors that can be identified at appropriate spatial scales will be high for most situations.


The study was funded by the YDNPA and the Nigel Bertram Charitable Trust. The OS MasterMap data were used under YDNPA OS Licence Number 1000237402007. We thank the Associate Editor for valuable comments on an earlier version of this manuscript and Richard Gunton for stimulating discussions. Two anonymous referees provided constructive comments on an earlier version of this manuscript.