### Introduction


Methodological advances in species distribution modelling have been rapid (Guisan & Zimmermann 2000; Scott *et al*. 2002). The practical and intellectual benefits of obtaining well-tested models for species’ distributions are numerous: forecasting species’ range shifts under climate change (Thomas *et al*. 2004) and invasion by introduced species (Peterson 2003; Drake & Bossenbroek 2004), testing evolutionary hypotheses (Graham *et al*. 2004), identifying reservoirs for disease (Peterson *et al*. 2002), and planning for conservation in a dynamic landscape (Ferrier 2002). Nevertheless, modelling species’ niches is complicated by conceptual and technical difficulties and by data limitations (Guisan & Thuiller 2005). Recent advances in machine-learning techniques for statistical pattern recognition might be used to overcome many of these obstacles, which generally stem from assumptions about the statistical distribution of the data or from restrictive parametric modelling paradigms. We studied the accuracy and reliability of ecological niche models built with support vector machines (SVM) for estimating the support of a statistical distribution (Schölkopf *et al*. 2001; Tax 2001; Tax & Duin 2004). We show that the SVM framework performs comparably to, or better than, other methods with only moderate amounts of data, while avoiding common problems and limitations.

The most common obstacles to conventional parametric and non-parametric statistical methods for modelling species’ distributions are: (i) autocorrelated observations resulting from the inherent spatial distribution of ecological systems, spatial autocorrelation in species’ actual distributions, and haphazard rather than designed sampling; and (ii) observations only of species’ occurrences without complementary observations of species’ absences. Autocorrelated observations result in inflated *P*-values for hypothesis testing when modelling techniques are based on parametric statistics, and have the potential to introduce bias in estimated models. One approach to this problem in a parametric setting is to add to a generalized linear model (GLM; e.g. logistic model) terms to model the spatial correlation (Augustin, Mugglestone & Buckland 1996; He, Zhou & Zhu 2003). Other studies have taken a similar approach with semi-parametric regression techniques, such as generalized additive models (GAM; Leathwick & Austin 2001). However, these methods place further demands on already sparse data and extrapolate poorly.
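
The autologistic approach of Augustin, Mugglestone & Buckland (1996) augments a logistic model with an autocovariate, essentially the mean occurrence among neighbouring sites, so that residual spatial correlation is absorbed by the linear predictor. A minimal sketch of that term, assuming a simple fixed-radius unweighted neighbourhood rather than the published weighting scheme:

```python
import numpy as np

def autocovariate(coords, presence, radius):
    """Mean presence among neighbours within `radius` of each site.

    A simplified autocovariate in the spirit of Augustin et al. (1996);
    the fixed-radius, unweighted neighbourhood is an assumption here.
    """
    coords = np.asarray(coords, float)
    presence = np.asarray(presence, float)
    # Pairwise distances between all sampling sites
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    neigh = ((d > 0) & (d <= radius)).astype(float)  # exclude the site itself
    counts = neigh.sum(axis=1)
    # Sites with no neighbours get an autocovariate of 0
    return np.where(counts > 0, neigh @ presence / np.maximum(counts, 1.0), 0.0)
```

The resulting vector is appended to the environmental predictors before fitting the GLM, which is the source of the extra demand on already sparse data noted above.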

Strictly speaking, the second obstacle, lack of data confirming species’ absences, renders modelling approaches based on classification/discrimination impossible (Robertson, Caithness & Villet 2001; Hirzel *et al*. 2002). Previous studies have sought to overcome this problem by simulating observations of species’ absences (sometimes called pseudo-absences) from data domains in which there are no observations of species’ occurrences (Engler, Guisan & Rechsteiner 2004). While remarkably robust models have been developed using this approach (Anderson, Lew & Peterson 2003), a method that does not rely on such heuristics would be useful. Further, it is not clear that these procedures can be used in a setting that is not already information rich, where background knowledge of species’ ecologies can guide modelling heuristics (Anderson, Lew & Peterson 2003), although these are precisely the cases where species distribution models are most useful, for instance for forecasting species invasions or range shifts from climate change. Finally, classification models fitted to simulated data are generally ecologically uninformative or cumbersome to interpret (Keating & Cherry 2004). The aim of this study was to introduce a technique that overcomes these obstacles.
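
For concreteness, a pseudo-absence heuristic of the kind discussed above can be sketched as follows; the rejection rule (discard background points within a fixed distance, in environmental space, of any presence) is one simple variant, not the exact procedure of the cited studies:

```python
import numpy as np

def pseudo_absences(presence_env, env_pool, n, min_dist, seed=0):
    """Sample n pseudo-absence points from background environments that lie
    farther than `min_dist` (Euclidean, in environmental space) from every
    observed presence. The rejection rule is an illustrative simplification."""
    rng = np.random.default_rng(seed)
    pres = np.asarray(presence_env, float)
    pool = np.asarray(env_pool, float)
    # Distance from each background point to its nearest presence
    d = np.linalg.norm(pool[:, None, :] - pres[None, :, :], axis=-1)
    eligible = pool[d.min(axis=1) > min_dist]
    take = min(n, len(eligible))
    return eligible[rng.choice(len(eligible), size=take, replace=False)]
```

Both `min_dist` and the choice of background pool are modelling heuristics of exactly the kind the text argues require background ecological knowledge.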

A promising alternative to conventional classification-based species distribution models is to use methods designed for modelling one type of data only (Robertson, Caithness & Villet 2001; Hirzel *et al*. 2002; Brotons *et al*. 2004; Phillips, Dudík & Schapire 2004). Many such techniques may be found in the literature on statistical pattern recognition, where a frequent goal is to separate statistical outliers from observations drawn from a high-dimensional distribution (Schölkopf *et al*. 2001; Tax 2001; Tax & Duin 2004). Indeed, rather than estimating the full probability distribution, in such situations it may be simpler (and more robust) to model just the support of the distribution, the set of points where the (unknown) probability density is greater than zero (Schölkopf *et al*. 2001). Sometimes support estimation is called one-class classification (Tax 2001). While many different methods for estimating statistical distributions might be optimized for one-class classification (Tax 2001; Tax & Duin 2004), methods based on SVM have been particularly successful in applications where data represent a large set of variables (Tax 2001, table 4·2; Tax & Duin 2004). SVM use a functional relationship known as a kernel to map data onto a new hyperspace in which complicated patterns can be more simply represented (Müller *et al*. 2001). The choice of kernel is typically based on theoretical properties, while any kernel parameters are optimized using computational techniques such as cross-validation. Because SVM are not based on characteristics of statistical distributions there is no theoretical requirement for observed data to be independent, thereby overcoming the problem of autocorrelated observations, although model performance will be affected by how well the observed data represent the range of environmental variables. 
Further, SVM are more stable, require less model tuning, and have fewer parameters than other computational optimization methods such as neural networks (Lusk, Guthery & DeMaso 2002). Finally, computational complexity is minimal and standard algorithms can be used for optimization. Thus, implementation is straightforward in familiar scientific computing environments such as R (http://www.r-project.org/, accessed 16 February 2006) and MATLAB (Mathworks Inc., Natick, MA). In contrast to genetic algorithms (Stockwell & Peters 1999; Drake & Bossenbroek 2004), the solution is deterministic, resulting in both faster computation and repeatable results. Thus, the potential gains from using support vector machines for ecological niche modelling are great, including reliable and accurate forecasting, feasible computation and a high level of ecological interpretability (Guo, Kelly & Graham 2005).
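
As a concrete sketch of the one-class approach described above, the `OneClassSVM` estimator in scikit-learn implements the Schölkopf *et al*. (2001) support-estimation formulation; the toy environmental data and variable names below are illustrative only:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(42)
# Hypothetical presence-only data: (temperature, rainfall) at 200 occupied sites
X = rng.normal(loc=[15.0, 800.0], scale=[2.0, 50.0], size=(200, 2))
mu, sd = X.mean(axis=0), X.std(axis=0)
Xs = (X - mu) / sd  # standardize: RBF kernels are sensitive to variable scale

# nu bounds the fraction of training points treated as outliers, i.e. it is
# roughly the target false-negative rate on the training data
model = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(Xs)

def is_habitat(env):
    """+1 if the environment lies inside the estimated support, else -1."""
    return model.predict((np.atleast_2d(env) - mu) / sd)
```

No absences, real or simulated, enter the fit, and no distributional assumptions are made about the presence observations.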

### Results


Overall, we found that methods 1 and 2 performed similarly over the different measures and were superior to method 3. However, consistent models were obtained more often with methods 1 and 3 than with method 2, so that method 1 most reliably provided the best results. We obtained consistent models with method 2 for 58 of 106 species (54·7%). In contrast, consistent models were obtained with method 1 for 87 species (82·1%) and with method 3 for 80 species (75·5%). Logistic regression showed that, for all three methods, the likelihood of obtaining a consistent model for any given species increased significantly with the number of observations in the data set (method 1, *P* < 0·0001; method 2, *P* < 0·0001; method 3, *P* = 0·0001; see Figure S2 in the Supplementary material).

Error rates and summary performance criteria were also affected by the number of observations with which we trained the model (Fig. 2; see Figures S3–S8 in the Supplementary material). We computed Spearman rank-order correlations between sample size and each measure of accuracy except *f*_{1} (which, in one case, was undefined), by pre-processing method, using only species for which we obtained consistent models. For the error rates, precision and recall, we consistently found highly significant relationships (see Table S4 in the Supplementary material). Surprisingly, AUC was not significantly correlated with sample size.

To see whether performance differed significantly among protocols, for each measure of performance we used a two-way ANOVA with pre-processing method as a fixed effect and individual species’ identities as a random effect, using only species for which we obtained consistent models. Using species identity as a factor accounted both for the effect of sample size (which was shown to affect performance significantly) and for differences among species in their ability to be modelled with the observed environmental variables. Not surprisingly, species identity had a significant effect on each measure of performance (*P* < 0·0001). There was no evidence for an effect of modelling method on recall (*P* = 0·248) or false-negative rate (*P* = 0·248); the false-negative rate was the target of optimization and so was expected to be approximately the same across methods. Modelling method did have a significant effect on false-positive rate (*P* < 0·0001), precision (*P* < 0·0001), *f*_{1} (*P* < 0·0001) and AUC (*P* < 0·0001). The group mean performances for each pre-processing method across species showed that, where consistent models could be obtained, methods 1 and 2 typically performed similarly and were superior to method 3 (Fig. 2).
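
For reference, the error rates and summary criteria compared above follow the usual confusion-matrix definitions (the function below is our sketch, not code from the study):

```python
def performance(tp, fp, fn, tn):
    """Standard performance criteria from confusion-matrix counts:
    true/false positives (tp, fp) and false/true negatives (fn, tn)."""
    recall = tp / (tp + fn)               # = 1 - false-negative rate
    precision = tp / (tp + fp)
    fnr = fn / (tp + fn)                  # false-negative rate
    fpr = fp / (fp + tn)                  # false-positive rate
    f1 = 2 * precision * recall / (precision + recall)
    return {"recall": recall, "precision": precision,
            "fnr": fnr, "fpr": fpr, "f1": f1}
```

Because recall equals one minus the false-negative rate, those two measures necessarily yield identical *P*-values in the analysis above.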

To see whether accuracy was driven by the idiosyncratic response of different sampling locations, i.e. whether some locations were consistently unpredictable while others were consistently predictable, we compared the predictions (method 1) for each species at each sampling location in the testing data with known occurrence, and performed a two-way ANOVA with species and sampling site as factors. Both effects were highly significant (*P* < 0·0001) but explained only a small portion of the overall variation (*R*^{2} = 0·203).

### Discussion


We used a class of recently developed machine-learning algorithms (support vector machines) to model species’ distributions using only data concerning species’ occurrences. This method assumes only that observations reasonably represent the range of habitable environments; in particular, independence is not assumed. Thus, where data are available only concerning species presence these methods are theoretically superior to classification/discrimination techniques (Hirzel *et al*. 2002). We note that SVM can also be used when there are observations of both habitat and non-habitat (i.e. confirmed absence of species) and indeed can be optimized to make use of as many observations of absence and presence as are available, without requiring balanced observations. We emphasize that the SVM models the support of the statistical distribution of environments from which the species presence observations are drawn, an environmental hyperspace. Thus, the interpretation of the SVM model as an ecological niche is consistent with the classical definition of a niche as a multidimensional environmental space (Hutchinson 1957). Logistic regression (Keating & Cherry 2004), MAXENT (Phillips, Dudík & Schapire 2004), ENFA (Hirzel *et al*. 2002) and other models based on probability densities (Robertson, Caithness & Villet 2001) represent the relative frequency of habitat use and are therefore more closely related to the idea of resource utilization or resource selection (Schoener 1989; Boyce *et al*. 2002; Keating & Cherry 2004).

Of course, theoretical warrant for using support vector machines to model habitat would be unimportant if models performed poorly in independent validation. We used independent observations of species’ presences and absences to estimate model accuracy. Summary measures of model performance were generally high. For instance, using our best procedure (method 1) the average AUC obtained was 0·79. For comparison, an analysis of 30 bird species in the Catalan region (Brotons *et al*. 2004) obtained an average AUC of 0·74 on independent data for a model fit using ENFA with only observations of species’ presences (Hirzel *et al*. 2002), and 0·82 for logistic regression fit to both species’ presences and absences. Zaniewski, Lehmann & Overton (2002) used generalized additive models (GAM) with a logistic link function and binomial distribution fit to both presences and absences of 43 fern species sampled at 19 875 plots in New Zealand to obtain an average AUC of 0·86. Thus, our results are comparable with published results obtained with data for only species presence and data comprising both presence and absence.
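
The AUC values compared above are computed from continuous model scores on independent presence/absence test data; a small self-contained illustration with invented scores, using scikit-learn's implementation:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical independent test set: 1 = observed presence, 0 = observed absence
y_true = np.array([1, 1, 1, 0, 0, 0])
# Continuous habitat-suitability scores, e.g. an SVM decision function
scores = np.array([0.9, 0.8, 0.4, 0.6, 0.2, 0.1])

# AUC is the probability that a randomly chosen presence outscores a randomly
# chosen absence: here 8 of the 9 presence/absence pairs are correctly ordered
auc = roc_auc_score(y_true, scores)  # 8/9, roughly 0.89
```

This threshold-free construction is why AUC is comparable across studies that used different classification cut-offs.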

We also studied pre-processing approaches that might be taken to increase model performance. Method 1 used no pre-processing or data reduction. Method 2 pre-processed training data using the technique of *k*-whitening. Method 3 relied on a restricted data set in which highly correlated variables were removed from the model training data set. We found that when consistent models could be obtained, method 1 resulted in models with the highest recall and lowest false-negative rate. In contrast, method 2 resulted in models with the highest precision and lowest false-positive rate. Methods 1 and 2 performed similarly as evaluated by the summary measures *f*_{1} and AUC. In comparison, method 3 performed poorly overall. Consistent models were obtained much more frequently using method 1 than using method 2. Thus, method 1 appears to be the most reliable method in general. Finally, the poor performance of method 3 relative to methods 1 and 2 indicates that useful information can be gained by adding more environmental variables, even if they are highly correlated.
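
Since the details of the *k*-whitening step are only summarized here, one plausible reading is projection onto the first *k* principal components followed by rescaling each score to unit variance; a sketch under that assumption:

```python
import numpy as np

def k_whiten(X, k):
    """Whiten X on its first k principal components (our reading of
    'k-whitening'; the study's exact procedure may differ)."""
    X = np.asarray(X, float)
    Xc = X - X.mean(axis=0)                    # centre the variables
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt[:k].T                          # scores on first k components
    return Z / (s[:k] / np.sqrt(len(X) - 1))   # unit variance per component
```

After this transform the retained components are uncorrelated with unit variance, so any information in the discarded components is unavailable to the model.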

Finally, we studied how model performance depends on the sample size of the training data set. For 106 species with all three methods, we were almost always able to identify a consistent model when the model training data set contained at least 40 observations, which we suggest is the minimum number of observations with which models should be trained in practice. Not unexpectedly, measures of accuracy, such as error rates and precision, were also related to sample size (see Table S4 in the Supplementary material). Minimum sample sizes for modelling, and heuristics about how sample size should scale with the number of environmental variables, are important topics for research. Curiously, when all species were considered together, AUC was not significantly related to sample size, although the lowest observed AUC scores were always obtained for species represented by fewer than 30 observations (see Figures S6–S8 in the Supplementary material). These results are promising and indicate that accurate models can often be obtained with relatively modest data sets. Indeed, models obtained for species with only 40–50 observations routinely performed as well as models for species represented by more than 100 observations. Only precision seemed to increase continuously over the entire range (see Figures S3–S5 in the Supplementary material). These results are comparable with those for GARP, another machine-learning algorithm, which on average attains near-maximal accuracy with 50 observations (Stockwell & Peterson 2002). In contrast, obtaining similar accuracy with logistic regression required 100 observations (Stockwell & Peterson 2002).

An important unanswered question is how many environmental variables are required to predict accurately species’ potential distributions, whether with support vector machines or any other technique. In our study, the method with the greatest number of variables (method 1) and no pre-processing provided the best results. An underlying worry is that the higher dimensionality of this method leads to a model that is overfit and would generalize poorly. We overcame this obstacle by fixing a target error rate and tuning models using cross-validation, which estimates the generalization error directly. Thus, our comparison of the different methods was designed to create the fairest comparison: each method was optimized to achieve (approximately) the same generalization error. Three lines of evidence point to success at achieving this fair comparison. First, we failed to detect an effect of modelling method on the false-negative rate when the model was tested with independent data. Thus, the true generalization error was consistent across methods. Secondly, if correlation among environmental variables had led to overfitting, method 2 would have performed best as the algorithm would only have been trained on the information contained in the first few principal components of the data. Finally, in image-recognition experiments (classifying digital images of hand-written numerals) Tax & Müller (2004) found that the optimized model was sometimes not complex enough when non-target observations (i.e. species’ absences) were too close to the training data. Therefore, if anything, there is reason to suspect that our models are underfit rather than overfit. Indeed, simply including more environmental variables, rather than developing more sophisticated ways of reducing dimensionality, might result in the greatest improvements to accuracy. 
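
The tuning procedure described above, fixing a target error rate and estimating generalization error directly by cross-validation, can be sketched as follows; the candidate grid, `nu` value and function names are our illustrative choices, not the study's settings:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import OneClassSVM

def cv_fnr(X, gamma, nu=0.05, n_splits=5, seed=0):
    """Cross-validated false-negative rate: the fraction of held-out
    presence points rejected by a model fit on the remaining folds."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    errs = []
    for train, test in kf.split(X):
        m = OneClassSVM(kernel="rbf", nu=nu, gamma=gamma).fit(X[train])
        errs.append(np.mean(m.predict(X[test]) == -1))
    return float(np.mean(errs))

def pick_gamma(X, target_fnr=0.05, gammas=(0.01, 0.1, 1.0, 10.0)):
    """Choose the kernel width whose estimated generalization
    (false-negative) error is closest to the fixed target."""
    return min(gammas, key=lambda g: abs(cv_fnr(X, g) - target_fnr))
```

Because every candidate is held to the same cross-validated error target, adding variables cannot silently inflate apparent accuracy through overfitting.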
Our analysis was limited by the availability of relevant, systematically collected data that had been geo-referenced to the particular sampling sites where our species’ distribution records were collected. Future studies could certainly include many more variables, as the additional computational cost would be minor. Indeed, we suggest that the low computational complexity of the SVM approach is one of its primary strengths. This aspect could be exploited in many ways that await development. Some obvious possibilities are that different ‘submodels’ obtained from subsets of the data corresponding to classes of environmental variables (biotic vs. abiotic, chemical vs. physical, etc.) could be compared to explore how these differently affect species’ distributions; modelling could be embedded in a non-parametric bootstrap to obtain confidence bounds on the estimated distributions; and resampling schemes could be devised to test hypotheses about niche differentiation, partitioning and competitive exclusion or facilitation. These possibilities, together with the relatively strong performance already shown by this approach, should motivate further research, resulting in both methodological improvements and applications in many areas.

Species distribution modelling is a part of many ecological applications, including forecasting species invasions, devising protocols for biodiversity monitoring, designing nature reserves and planning for habitat conservation, managing vector-borne and environmentally mediated disease, and cultivating renewable resources (e.g. aquaculture and timber). Finally, species distribution modelling is often a fundamental component of projects aimed at understanding the consequences of anthropogenic climate change, such as the MODIPLANT initiative that generated the data used in this study. As SVM are stable algorithms that can deal with large sets of predictors at once, they may prove particularly useful in this arena. In conclusion, these results support the continued use of SVM for ecological niche modelling. Where data are available concerning only species’ presences and not species’ absences, support vector machines are theoretically superior to classification techniques that rely on simulation of pseudo-absence data. We have shown that support vector machines are also comparable with such models when validated by independent observations of both species presence and absence.

### Supporting Information


**Table S2.** Predictor variables used to model ecological niches.

**Table S3.** Pearson correlations for environmental variables used to model ecological niches ranked by the absolute value of the correlation coefficient.

**Table S4.** Spearman rank-order correlations between performance criteria and sample size for each modelling protocol.

**Figure S1.** Rescaled histograms of nine predictor variables used to model ecological niches.

**Figure S2.** Frequency with which consistent models could be obtained as a function of the number of observations in the data set using (A) Method 1, (B) Method 2, (C) Method 3. Red line represents best-fit logistic regression (*P*<0.0001 for panels A and B; *P*=0.0001 for panel C).

**Figure S3.** Performance of Method 1 (support vector machine without *k*-whitening using nine environmental variables) as a function of the number of observations in the dataset used for model fitting. Circles represent models that were optimized by the consistency criterion. Crosses represent models for which no consistent model could be obtained (performance evaluations are based on the simplest model). Performance criteria are (A) false negative rate, (B) false positive rate, (C) precision, and (D) recall.

**Figure S4.** Performance of Method 2 (support vector machine using *k*-whitening with nine environmental variables) as a function of the number of observations in the dataset used for model fitting. Circles represent models that were optimized by the consistency criterion. Crosses represent models for which no consistent model could be obtained (performance evaluations are based on the simplest model). Performance criteria are (A) false negative rate, (B) false positive rate, (C) precision, and (D) recall.

**Figure S5.** Performance of Method 3 (support vector machine without *k*-whitening using four environmental variables) as a function of the number of observations in the dataset used for model fitting. Circles represent models that were optimized by the consistency criterion. Crosses represent models for which no consistent model could be obtained (performance evaluations are based on the simplest model). Performance criteria are (A) false negative rate, (B) false positive rate, (C) precision, and (D) recall.

**Figure S6.** Summary measures of performance for Method 1 (support vector machine without *k*-whitening using nine environmental variables) as a function of the number of observations in the dataset used for model fitting. Circles represent models that were optimized by the consistency criterion. Crosses represent models for which no consistent model could be obtained (performance evaluations are based on the simplest model). Performance criteria are (top) summary performance criterion *f*_{1}, and (bottom) area under the receiver-operating characteristic curve (AUC).

**Figure S7.** Summary measures of performance for Method 2 (support vector machine using *k*-whitening with nine environmental variables) as a function of the number of observations in the dataset used for model fitting. Circles represent models that were optimized by the consistency criterion. Crosses represent models for which no consistent model could be obtained (performance evaluations are based on the simplest model). Performance criteria are (top) summary performance criterion *f*_{1}, and (bottom) area under the receiver-operating characteristic curve (AUC).

**Figure S8.** Summary measures of performance for Method 3 (support vector machine without *k*-whitening using four environmental variables) as a function of the number of observations in the dataset used for model fitting. Circles represent models that were optimized by the consistency criterion. Crosses represent models for which no consistent model could be obtained (performance evaluations are based on the simplest model). Performance criteria are (top) summary performance criterion *f*_{1}, and (bottom) area under the receiver-operating characteristic curve (AUC).

Please note: Wiley Blackwell is not responsible for the content or functionality of any supporting information supplied by the authors. Any queries (other than missing content) should be directed to the corresponding author for the article.