## Introduction

The distribution of species can depend on processes occurring at multiple spatial scales (Wiens 1989; Cunningham & Johnson 2006; Thogmartin & Knutson 2007). The occupancy of a breeding patch or nest site can depend on factors that influence the quality of specific nest sites (Martin 1998), areas immediately around nest sites (e.g. territories) or areas utilised outside the immediate area (Dallimer *et al*. 2012). Occupancy can additionally depend on factors that indicate the quality of breeding areas [e.g. ‘public information’ such as breeding densities (Doligez *et al*. 2004)] or that influence population size (Takada & Miyashita 2004). Thus, occupancy of a patch can depend on factors acting at multiple spatial scales, some of which can be large.

Multiple-scale species distribution models can increase predictive ability relative to single-scale approaches (Cunningham & Johnson 2006). However, misspecifying the scale of predictors can decrease the variance explained, add residual spatial autocorrelation (RSA), bias regression coefficients and therefore lead to the wrong conclusions (de Knegt *et al*. 2010). Accounting for spatial autocorrelation [with techniques such as simultaneous autoregression (see e.g. Dormann *et al*. 2007)] may not remove the parameter estimate bias originating from the misspecification of spatial scales of predictor variables (de Knegt *et al*. 2010). This leads to a conundrum: including the appropriate predictors at the appropriate scales is important, and omitting them or including the correct predictors at the wrong scales can lead to the wrong conclusions, but how do we select appropriate scales when they are typically not known? (As species distribution models (SDM) frequently model variables correlated with process variables, we use the term ‘appropriate scale’ rather than ‘scale of processes’.)

One approach to this problem is to first measure the potential predictors within different buffer sizes (henceforth called scales) around sample locations and then to regress each predictor against the response for each scale (e.g. Steffan-Dewenter *et al*. 2002; Holland, Bert & Fahrig 2004; Gray, Phan & Long 2010). The assumption is that the model with the best fit identifies the most predictive, and therefore the single most meaningful, spatial scale. The best fit is typically measured by the highest correlation coefficient or the lowest AIC (Akaike information criterion) (Burnham & Anderson 2004). For an example with our data see Fig. 1a.
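The scale-by-scale regression described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of any of the cited studies: the buffer radii in `SCALES_M`, the simulated data and all function names are hypothetical, and a Gaussian simple linear regression stands in for whatever model (e.g. logistic regression for occupancy) a real analysis would use. The simulated response depends on the predictor at the 500 m scale only, and nested buffers are made to correlate with each other, as they would in a real landscape.

```python
import math
import random

SCALES_M = [100, 250, 500, 1000]  # hypothetical buffer radii in metres


def gaussian_aic(x, y):
    """AIC of a simple linear regression y ~ a + b*x with Gaussian errors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    k = 3  # intercept, slope and residual variance
    # -2 * log-likelihood at the maximum-likelihood estimates, plus 2k
    return n * (math.log(2 * math.pi * rss / n) + 1) + 2 * k


def simulate(n=200, seed=1):
    """Toy data: one predictor measured in nested buffers at n sites."""
    rng = random.Random(seed)
    by_scale = {s: [] for s in SCALES_M}
    y = []
    for _ in range(n):
        cover = 0.0
        for s in SCALES_M:
            cover += rng.gauss(0, 1)  # each larger buffer adds to the smaller one
            by_scale[s].append(cover)
        # Assumption: the response is driven by the 500 m scale only
        y.append(2.0 * by_scale[500][-1] + rng.gauss(0, 0.5))
    return by_scale, y


def scan_scales(by_scale, y):
    """Fit one regression per scale; return the lowest-AIC scale and all AICs."""
    aic = {s: gaussian_aic(x, y) for s, x in by_scale.items()}
    return min(aic, key=aic.get), aic


by_scale, y = simulate()
best, aic = scan_scales(by_scale, y)
for s in SCALES_M:
    print(f"{s:>5} m  AIC = {aic[s]:8.1f}  dAIC = {aic[s] - aic[best]:7.1f}")
```

Because the simulated buffers are nested, neighbouring scales also fit reasonably well; only the difference in AIC separates them, which is the quantity whose reliability is at issue when data are spatially autocorrelated.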

However, assessing each predictor separately, and selecting a single best scale, can lead to overall problems in model specification. Models containing only a single predictor may omit other important variables, which increases residual variance. If the response and predictor are both spatially autocorrelated, which is often the case in ecological data sets (Legendre 1993), this can bias statistical inference, because both the correlation coefficient and AIC can be affected (Lennon 2000; Hoeting *et al*. 2006). Therefore, the magnitude of the difference in fit between scales may be biased (see also Schooley 2006). The bias will be affected by the strength of the spatial autocorrelation in the predictor (Lennon 2000), which is highly likely to vary with scale. Therefore, the scale with the best fit may not necessarily represent the most appropriate scale for the predictor. Indeed, there may not be a single appropriate scale, as a species can be influenced by a predictor at multiple scales (Thogmartin & Knutson 2007).

One solution therefore would be to evaluate all scales of all predictors with moderate levels of statistical support against each other, rather than to rely on selecting only the ‘best’ one, where the estimate of what is best may be biased. However, this raises two problems. Firstly, the number of possible predictors can often be large (van Horne 2002). In a landscape, a single factor (such as habitat type) could be a predictor in several ways at each spatial scale: for example, the presence, proportion or size of patches of one of several habitat types, or the isolation or connectivity of these patches, could influence the distribution of a species (Saab 1999; van Langevelde 2000; Cunningham & Johnson 2006; Hodgson *et al*. 2011). The number of predictors can therefore quickly become large relative to the number of data points, a situation to which traditional statistical techniques such as linear or logistic regression are not robust (Strobl *et al*. 2007). Secondly, predictors in ecological data sets are often highly correlated, which can lead to problems in variable selection (Graham 2003). Both problems are potentially exacerbated when the above repeated regressions identify several scales with good model fit per predictor (for an example see Fig. 1b).

Studies conducted at different spatial scales can lead to different conclusions (a variable influencing a response at one scale may not do so at another or influence the response in a different way) (Wiens 1989; Hamer & Hill 2000). In ecology, studies using arbitrary spatial scales proliferate (Wheatley & Johnson 2009). Therefore, a methodology that can identify appropriate spatial scales of predictors despite the difficulties outlined above would further our understanding of landscape-scale processes and result in more accurate predictions of species distributions.

In this study, we investigate whether the random forest algorithm, a machine learning method based on regression or classification trees (Breiman 2001; Liaw & Wiener 2002), can circumvent this ‘combinatorial explosion’ of ‘predictor × scale’. Random forest is robust to situations in which the number of data points is small compared with the number of predictors (Strobl *et al*. 2007) and has successfully been used for variable selection, for example, in microarray studies (which also often have a small number of data points, a high number of predictors and high correlations between predictors) (Archer & Kimes 2008; Nicodemus *et al*. 2010).
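To make the idea of ranking ‘predictor × scale’ combinations with a random forest concrete, the following is a deliberately small, pure-Python sketch. It is not the implementation of Liaw & Wiener (2002) and not the analysis used in this study: the predictor names, the simulated data, the tree depth, the number of trees and the use of summed variance reduction as the importance measure are all assumptions made for illustration. Each tree is grown on a bootstrap sample, each split considers only a random subset of `mtry` predictors, and the variance reduction of every split is credited to the predictor that produced it.

```python
import random

# Hypothetical predictors: one factor at four nested scales plus two noise variables
NAMES = ["cover_100m", "cover_250m", "cover_500m", "cover_1000m", "noise_a", "noise_b"]


def sum_sq(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(X, y, rows, features):
    """Return (gain, feature, threshold) of the best split among `features`."""
    parent = sum_sq([y[i] for i in rows])
    best = (0.0, None, None)
    for f in features:
        values = sorted(set(X[i][f] for i in rows))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y[i] for i in rows if X[i][f] <= t]
            right = [y[i] for i in rows if X[i][f] > t]
            gain = parent - sum_sq(left) - sum_sq(right)
            if gain > best[0]:
                best = (gain, f, t)
    return best


def grow(X, y, rows, depth, mtry, rng, importance):
    """Recursively split `rows`, crediting each split's gain to its predictor."""
    if depth == 0 or len(rows) < 4:
        return
    features = rng.sample(range(len(X[0])), mtry)  # random predictor subset
    gain, f, t = best_split(X, y, rows, features)
    if f is None:
        return
    importance[f] += gain
    grow(X, y, [i for i in rows if X[i][f] <= t], depth - 1, mtry, rng, importance)
    grow(X, y, [i for i in rows if X[i][f] > t], depth - 1, mtry, rng, importance)


def forest_importance(X, y, n_trees=30, depth=2, mtry=None, seed=0):
    """Normalised variance-reduction importance per predictor over all trees."""
    rng = random.Random(seed)
    p = len(X[0])
    mtry = mtry or max(1, p // 3)
    importance = [0.0] * p
    for _ in range(n_trees):
        rows = [rng.randrange(len(y)) for _ in range(len(y))]  # bootstrap sample
        grow(X, y, rows, depth, mtry, rng, importance)
    total = sum(importance) or 1.0
    return [v / total for v in importance]


def simulate(n=80, seed=1):
    """Toy data in which only the 500 m scale drives the response (assumption)."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        cover, row = 0.0, []
        for _ in range(4):
            cover += rng.gauss(0, 1)  # nested buffers are correlated
            row.append(cover)
        row += [rng.gauss(0, 1), rng.gauss(0, 1)]
        X.append(row)
        y.append(2.0 * row[2] + rng.gauss(0, 0.5))
    return X, y


X, y = simulate()
for name, score in zip(NAMES, forest_importance(X, y)):
    print(f"{name:<12} {score:.3f}")
```

In this toy setting the unrelated noise variables should receive little importance, whereas scales adjacent to the true one tend to share importance with it because they are correlated. Whether such a ranking recovers the appropriate scale in realistic data is the question this study addresses.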

We first use simulated responses to assess whether combining regressions of single predictors with variable selection by random forest can (a) result in accurate predictions and (b) select the correct predictors at appropriate spatial scales. Then, we apply the same approach to a real data set of a wader species in an upland area of the UK. For both the simulated and real data sets, we examined 12 factors, four of which (elevation, aspect, slope, soil) were further divided into categories to examine whether the quantity of a category (e.g. of low elevation) at a given scale influenced the distribution of the species or simulated response. Examination of these variables at up to 15 scales each resulted in 544 potentially highly correlated predictors to select among.