Aim To test statistical models used to predict species distributions under different shapes of occurrence–environment relationship. We addressed three questions: (1) Is there a statistical technique that has a consistently higher predictive ability than others for all kinds of relationships? (2) How does species prevalence influence the relative performance of models? (3) When an automated stepwise selection procedure is used, does it improve predictive modelling, and are the relevant variables being selected?
Location We used environmental data from a real landscape, the state of California, and simulated species distributions within this landscape.
Methods Eighteen artificial species were generated, which varied in their occurrence response to the environmental gradients considered (random, linear, Gaussian, threshold or mixed), in the interaction of those factors (no interaction vs. multiplicative), and on their prevalence (50% vs. 5%). The landscape was then randomly sampled with a large (n = 2000) or small (n = 150) sample size, and the predictive ability of each statistical approach was assessed by comparing the true and predicted distributions using five different indexes of performance (area under the receiver-operator characteristic curve, Kappa, correlation between true and predictive probability of occurrence, sensitivity and specificity). We compared generalized additive models (GAM) with and without flexible degrees of freedom, logistic regressions (general linear models, GLM) with and without variable selection, classification trees, and the genetic algorithm for rule-set production (GARP).
Results Species with threshold and mixed responses, additive environmental effects, and high prevalence generated better predictions than did other species for all statistical models. In general, GAM outperforms all other strategies, although differences with GLM are usually not significant. The two variable-selection strategies presented here did not discriminate successfully between truly causal factors and correlated environmental variables.
Main conclusions Based on our analyses, we recommend the use of GAM or GLM over classification trees or GARP, and the specification of any suspected interaction terms between predictors. An expert-based variable selection procedure was preferable to the automated procedures used here. Finally, for low-prevalence species, variability in model performance is both very high and sample-dependent. This suggests that distribution models for species with low prevalence can be improved through targeted sampling.