**Journal of Biogeography**

# Using virtual species to study species distributions and model performance

Correspondence: Christine N. Meynard, INRA, UMR CBGP (INRA/IRD/Cirad/Montpellier SupAgro), Campus International de Baillarguet, CS 30016, FR-34988 Montferrier-sur-Lez Cedex, France.

## Abstract

Simulations of virtual species (i.e. species for which the environment–occupancy relationships are known) are increasingly being used to test the effects of different aspects of modelling and sampling strategy on performance of species distribution models (SDMs). Here we discuss an important step of the simulation process: the translation of simulated probabilities of occurrence into patterns of presence and absence. Often a threshold strategy is used to generate virtual occurrences, where presence always occurs above a specific simulated probability value and never below. This procedure effectively translates any shape of simulated species response into a threshold one and eliminates any stochasticity from the species occupancy pattern. We argue that a probabilistic approach should be preferred instead because the threshold response can be treated as a particular case within this framework. This also allows one to address questions relating to the shape of functional responses and avoids convergence issues with some of the most common SDMs. Furthermore, threshold-based virtual species studies generate over-optimistic performance measures that lack classification error or incorporate error from a mixture of sampling and modelling choices. Incorrect use of a threshold approach can have significant consequences for the practising biogeographer. For example, low model performance may be interpreted as due to sample bias or poor model choice, rather than being related to fundamental biological responses to environmental gradients. We exemplify these shortcomings with a case study where we compare results from threshold and probabilistic simulation approaches.

## Setting the scene

Species distribution models (SDMs) represent a key modelling tool in biogeography and macroecology. They have been used to predict species spatial distributions under current and future climatic conditions, in native as well as introduced areas (see reviews in Guisan & Thuiller, 2005; Elith & Leathwick, 2009; Franklin, 2009). However, testing SDMs – either for their accuracy or for potential modelling artefacts due to sample bias – is often impossible with real world data: the large spatial scales involved and the difficulty of developing adequate controls for model tests using classical experimentation are among the principal constraints. The use of simulations can overcome some of these limitations (Hirzel *et al*., 2001; Meynard & Quinn, 2007; Elith & Graham, 2009). Simulating species distributions using known, artificially determined species responses to environmental gradients is attractive because species and sample prevalence (i.e. the frequency of the species over its entire distribution range versus within the sample, respectively), the shape of the functional responses to environmental gradients, and numerous other species and sampling characteristics can be controlled (Hirzel *et al*., 2001; Meynard & Quinn, 2007; Elith & Graham, 2009), allowing one to perform ‘virtual experiments’ at large spatial scales (Zurell *et al*., 2010).

Since Hirzel *et al*. (2001) coined the term ‘virtual species’ for this type of simulation study, many others have used similar approaches to test different aspects of the implementation of SDMs. For example, Real *et al*. (2006) proposed a correction on model output when using a sample whose prevalence is different from the species prevalence, and Albert & Thuiller (2008) tested a different correction strategy for this type of sample bias. Virtual species have also been used to identify the effects of using background data (i.e. a random sample of overall environmental conditions) as pseudo-absences (i.e. sites that are presumed to represent true absences) when presence-only data are available (Ward *et al*., 2009; Li *et al*., 2011; Lobo & Tognelli, 2011; Barbet-Massin *et al*., 2012), to study the impact of different functional responses to the environment on model performance (Meynard & Quinn, 2007; Elith & Graham, 2009; Santika & Hutchinson, 2009; Meynard & Kaplan, 2012) and to assess the effects of sample prevalence on model performance (Jiménez-Valverde *et al*., 2009; Santika, 2011). Among the most recent papers using a virtual species approach, there are at least two published in this journal: Bombi & D'Amen (2012) test the effects of downscaling distribution maps and Peterson (2011) tests niche-identity and niche-similarity analyses (Warren *et al*., 2008).

Because of this growing body of literature, we believe it is important to clarify some of the assumptions and methodology underlying this approach. The simulation of virtual species usually involves three steps: (1) simulation of a functional response, often represented by a probability of occurrence or a habitat suitability index that varies along one or several environmental gradients; (2) translation of these probabilities of occurrence into a presence–absence map; and (3) sampling of simulated data to test SDMs under different sampling and/or model conditions. Here, we highlight several issues that may affect the generality and interpretation of virtual species studies, focusing on the second step, for which two opposing approaches have been used. We then exemplify these points through a case study.

## Threshold versus probabilistic simulation approaches

In numerous simulation studies, a fixed threshold value is used to convert the simulated probabilities of occurrence of the virtual species (Fig. 1a) into a presence–absence map (e.g. Hirzel *et al*., 2001; Real *et al*., 2006; Jiménez-Valverde & Lobo, 2007; Albert & Thuiller, 2008; Jiménez-Valverde *et al*., 2009; Santika & Hutchinson, 2009; Peterson, 2011; Bombi & D'Amen, 2012). This ‘threshold approach’ is indeed an interesting extreme case where the species is invariably present on one side of an environmental gradient and absent on the other (Fig. 1b). By contrast, under a ‘probabilistic approach’, presence–absence is a random process linked to probabilities of occurrence that may respond gradually to environmental variables and, therefore, differ from 0 or 1. Using this approach, a probability of occurrence of 0.5 will lead, on average, to 5 occupancies out of every 10 sites with identical environmental conditions (e.g. Meynard & Quinn, 2007; Elith & Graham, 2009; Li *et al*., 2011; Santika, 2011). The implementation of the probabilistic approach is fairly simple. For example, in R (R Development Core Team, 2011) a single command, *rbinom*, can be used to generate random presence–absences with a specified probability of success at each trial. Repeated realizations of the presence–absence landscape derived from a single probability of occupancy map will differ, each providing a statistically valid realization of the true species distribution map (Fig. 1c,d).
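The contrast between the two conversions can be sketched in a few lines. The text names R's *rbinom* for the random draws; the sketch below uses Python's standard `random` module as a stand-in, and the function names, the 0.5 cutoff and the probability values are illustrative assumptions rather than anything taken from the studies cited.

```python
import random

def threshold_map(probabilities, cutoff=0.5):
    """Threshold approach: present (1) wherever probability >= cutoff."""
    return [1 if p >= cutoff else 0 for p in probabilities]

def probabilistic_map(probabilities, rng):
    """Probabilistic approach: one independent Bernoulli draw per site."""
    return [1 if rng.random() < p else 0 for p in probabilities]

# Hypothetical probabilities of occurrence along a gradient (illustrative values).
probs = [0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0]

rng = random.Random(42)  # seeded only so that the sketch is reproducible
realization_1 = probabilistic_map(probs, rng)
realization_2 = probabilistic_map(probs, rng)

print("threshold:    ", threshold_map(probs))
print("realization 1:", realization_1)
print("realization 2:", realization_2)

# Long-run check: a site with p = 0.5 is occupied in about half of the draws.
n_draws = 10_000
occupied = sum(probabilistic_map([0.5] * n_draws, rng))
print("fraction occupied at p = 0.5:", occupied / n_draws)
```

The threshold map is identical every time it is computed, whereas each call to `probabilistic_map` yields a new realization; only sites with a probability of exactly 0 or 1 are fixed across realizations.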

There are at least five reasons why the use of a threshold approach may be problematic. First, ecological theory supports the idea of dynamic occupancy patterns in space and time with processes of colonization and extinction having important influences at different scales (Hanski, 1999; Holyoak *et al*., 2005). Despite this, static, near-threshold responses to environmental gradients may be observed in some real-world datasets. For example, atlas presence–absence data may approach a threshold response if data pooling over long time spans and/or over large spatial scales eliminates variability in occupancy patterns. However, threshold responses are unlikely to be generally appropriate for all datasets. In particular, fine spatial-scale datasets collected over short time-scales represent a situation for which dynamic patterns of occupancy are important. Fortunately, one is not obliged to choose between threshold and probabilistic approaches: threshold responses emerge naturally in the probabilistic approach when probability of occupancy is taken to change rapidly from 0 to 1 over a very narrow range of environmental conditions. As the threshold approach does not represent the most general case scenario, its use requires a discussion of the motivation and consequences of such a choice.

Second, the use of a threshold approach will sometimes give an incorrect or incomplete answer to the question being asked. For example, Santika & Hutchinson (2009), studying the effects of different functional responses on model performance, simulated species using the threshold approach. This effectively converted any gradual response into a threshold one (or two consecutive opposing thresholds in the case of a simulated bell-shaped probability of occurrence). Not surprisingly, they concluded that choice of modelling technique was more important than the shape of the species' response, but that including a quadratic term (which effectively allows for two consecutive thresholds rather than just one) improved predictions when the response was bell-shaped. The use of a threshold approach to generate species occupancies in other scenarios (e.g. Hirzel *et al*., 2001; Bombi & D'Amen, 2012) also resulted in no significant effects of the species functional response type. By contrast, Meynard & Quinn (2007) and Elith & Graham (2009), both using a probabilistic approach, found important differences according to the shape of the species' responses. Therefore, the use of a threshold approach is generally incompatible with robust assessment of the impact of functional response type on model performance.

Third, the capacity of models to discriminate between presence and absence when a probabilistic approach is used is fundamentally lower than for the threshold approach (Meynard & Quinn, 2007; Elith & Graham, 2009; Meynard & Kaplan, 2012). This point can be illustrated with a simple example. If a species is always present above a certain environmental state (e.g. 20 °C maximum annual temperature) and always absent below, then models can perfectly predict presence and absence provided that the correct forcing variable(s) can be identified (but see the case study below for an example of how, even when not using the true forcing variables, predictions can be almost perfect). In contrast, suppose that the species' probability of occurrence increases gradually over a finite range of environmental conditions (e.g. as maximum temperature goes from 20 to 26 °C). A large number of sites will have probabilities of occurrence intermediate between 0 (always absent) and 1 (always present) (Fig. 1a). As is evident from carefully comparing two realizations of presence–absence patterns for the same probability of occurrence map (Fig. 1c versus Fig. 1d), even the best model will only be able to predict the probability of occurrence, not the actual occupancy pattern (e.g. Elith & Graham, 2009; Meynard & Kaplan, 2012). Including dispersal, colonization and extinction rules in a mechanistic model may increase our success in predicting specific occupancy patterns, but it is unlikely that processes such as dispersal will ever be fully described mechanistically. As such, this issue of ‘low’ performance values cannot be avoided and proper interpretation of discrimination statistics needs to consider a probabilistic response.

Fourth, many standard statistical modelling techniques based on the use of a logistic curve [e.g. generalized linear models (GLMs) and generalized additive models (GAMs)] may be inappropriate for modelling threshold responses (Venables & Ripley, 2002, p. 198). The slope of the logistic curve at the point of inflection (i.e. the point at which the slope of the curve is maximal) is infinite for a threshold response, often preventing proper model convergence. Alternative algorithms may provide partial solutions to this problem (e.g. Venables & Ripley, 2002, p. 445), but even these fail under specific scenarios. In these cases, model fitting problems may obscure the true species responses to environment that one aims to study. Systematically checking model convergence should be central to any simulation study.
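The convergence problem can be made concrete with a small numerical check. The sketch below is not the fitting procedure of any of the cited studies: it assumes a one-predictor logistic model with the inflection point fixed at the true threshold, uses hypothetical gradient values, and simply evaluates the log-likelihood for ever steeper slopes.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(1 / (1 + exp(-z)))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def log_likelihood(slope, x, y, midpoint=0.0):
    """Bernoulli log-likelihood of a one-predictor logistic model."""
    total = 0.0
    for xi, yi in zip(x, y):
        z = slope * (xi - midpoint)
        total += log_sigmoid(z) if yi == 1 else log_sigmoid(-z)
    return total

# Hypothetical gradient values; presence is a pure threshold at x = 0.
x = [-3.0, -2.0, -1.0, -0.5, 0.5, 1.0, 2.0, 3.0]
y = [1 if xi > 0 else 0 for xi in x]

slopes = [1, 10, 100, 1000]
values = [log_likelihood(s, x, y) for s in slopes]
for s, v in zip(slopes, values):
    print(f"slope = {s:>4}: log-likelihood = {v:.6g}")
```

Because presences and absences are perfectly separated, the log-likelihood keeps rising towards 0 as the slope grows, so no finite maximum-likelihood slope exists and an iterative fitting routine has nothing to converge to.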

The fifth point has to do with the possibility of repeating simulations. As the threshold approach eliminates all variability in the species response (Fig. 1b), any simulation of the presence–absence pattern will result in exactly the same occupancy map. Iterating this process has no value. Some threshold simulation studies have varied the sub-sample used to test models, or artificially manipulated sample prevalence to be different from species prevalence in order to generate variability in results (see case study below). By contrast, under the probabilistic approach, any new iteration will result in a different pattern of occupancy, with the degree of variability among realizations of the presence–absence landscape being related to the level of gradualism in the species response. The whole simulation and modelling process can therefore be repeated under exactly the same circumstances. This opens up the possibility of simultaneously assessing the consequences of various sources of error in presence–absence observations (e.g. species or environmental variability, as well as observational error), and separating the effects of species prevalence and sample bias.

## A case study: Jiménez-Valverde & Lobo (2007)

The difference between a threshold and a probabilistic approach is best exemplified via a case study. We replicated here a study by Jiménez-Valverde & Lobo (2007), and compared results of their threshold-based simulations with those from a probabilistic approach to species occupancy, complementing both with analytical results. Because our emphasis is on the second step of the simulation process, we only varied this particular stage. Before proceeding to results, which show that some of the most important conclusions from a threshold approach do not hold under the more general probabilistic approach, we briefly introduce the motivation and methodology of the original study.

### Motivation and methodology

Statistical models used as SDMs yield a continuous range of values of predicted probability of occurrence, which is usually converted a posteriori into a categorical prediction of presence or absence (Liu *et al*., 2005; Jiménez-Valverde & Lobo, 2007). A threshold in the predicted probability of occurrence above which the species is most likely to be present is usually chosen in order to generate presence–absence predictions, but several methods to choose such a threshold have been proposed (Liu *et al*., 2005). Jiménez-Valverde & Lobo (2007) used virtual species seeded onto the real European climatic environment to study the relationship between sample prevalence, model performance measures and the different methods used to determine this arbitrary threshold.

In the first step of the simulation process, i.e. the simulation of a functional response, four climatic variables (monthly precipitation, precipitation during the warmest quarter, monthly maximum and minimum temperature) were used, just as in Jiménez-Valverde & Lobo (2007). No effort to generate a particular shape in the functional response was carried out at this stage, but the original variables were Box–Cox transformed to make them normally distributed (data kindly provided by Jorge Lobo, Museo Nacional de Ciencias Naturales, Spain).

In the second step of Jiménez-Valverde & Lobo (2007), i.e. the translation of these gradients into a presence–absence pattern, the virtual species was defined as present when environmental conditions were within the mean ± a single, fixed fraction of the standard deviation of the four environmental gradients. Instead of following this threshold approach, we examined virtual species with gradually varying probabilistic patterns of occupancy. The simulated probability of occurrence of the virtual species was set by two symmetric and opposing logistic functions in the same four environmental variables. The inflection points of the logistic curves occurred at ± a single fixed fraction of the standard deviation of each variable. When the slope of the two opposing logistic curves is intermediate, the two logistic curves combined simulate a symmetric bell-shaped response centred on the mean of each variable. On the contrary, when the slope is very steep, the logistic curves generate threshold-like responses equivalent to those inherent to Jiménez-Valverde & Lobo's (2007) strategy. Here we denote α as the inverse of the logistic curve slope, so that smaller values of α correspond to steeper curves. Different slopes were used ranging from a threshold environmental response (α = 0) to gradual responses (α = 0.5). The logistic inflection points were shifted so that theoretical species prevalence was as in Jiménez-Valverde & Lobo (2007), i.e. 0.17.
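A response of this kind can be sketched for a single environmental variable. The exact functional form is not given in the text, so the product of a rising and a falling logistic curve used below is an assumption, as are the function names and the parameter `k` (the fraction of the standard deviation at which the inflection points sit). The sketch does not reproduce the combination across four variables or the shift of the inflection points used to reach a prevalence of 0.17.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def bell_response(x, mean, sd, k, alpha):
    """Product of a rising and a falling logistic curve (assumed form).

    Inflection points sit at mean - k*sd and mean + k*sd. alpha is the
    inverse slope, so alpha = 0 is treated as a hard threshold.
    """
    lower, upper = mean - k * sd, mean + k * sd
    if alpha == 0:
        return 1.0 if lower <= x <= upper else 0.0
    return sigmoid((x - lower) / alpha) * sigmoid((upper - x) / alpha)

# Hypothetical standardised gradient (mean 0, sd 1) with k = 1.
for alpha in (0.0, 0.05, 0.5):
    row = [round(bell_response(x, 0.0, 1.0, 1.0, alpha), 3)
           for x in (-2.0, -1.0, 0.0, 1.0, 2.0)]
    print(f"alpha = {alpha}: {row}")
```

With a small positive α the curve is already close to the hard threshold obtained at α = 0, which is the sense in which the threshold response is a particular case of the probabilistic one.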

We further complemented simulations with an analytical approach based on that described in Meynard & Kaplan (2012). The principle involves using probability theory to calculate theoretical maximum discrimination ability (in terms of presence and absence) of a model that would predict perfectly well the true probability of occurrence of the virtual species. This analytical procedure was expanded here to incorporate sample bias due to consistent, random over-sampling or under-sampling of presences (see Appendix S1 in Supporting Information for details).
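The principle can be illustrated with expected counts. The sketch below is not the derivation in Appendix S1 or in Meynard & Kaplan (2012); it assumes a large, unbiased sample, a model that reproduces the true probability of occurrence exactly, and a presence call whenever that probability reaches a cutoff. The probability values and the function name are hypothetical.

```python
def expected_discrimination(probabilities, cutoff):
    """Expected sensitivity and specificity of a model that predicts the true
    probability of occurrence exactly and calls presence when p >= cutoff."""
    expected_presences = sum(probabilities)
    expected_absences = sum(1.0 - p for p in probabilities)
    true_positives = sum(p for p in probabilities if p >= cutoff)
    true_negatives = sum(1.0 - p for p in probabilities if p < cutoff)
    return (true_positives / expected_presences,
            true_negatives / expected_absences)

# Hypothetical landscapes of ten sites each.
threshold_species = [0.0] * 5 + [1.0] * 5
gradual_species = [0.05, 0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9, 0.95]

print("threshold species:", expected_discrimination(threshold_species, 0.5))
print("gradual species:  ", expected_discrimination(gradual_species, 0.5))
```

For the threshold species both statistics equal 1, whereas for the gradual species they work out to about 0.79 even though the model is, by construction, perfect.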

The third step of the simulation involves sampling the virtual occurrences to test for the best threshold strategy to predict occurrences. At this step, we followed Jiménez-Valverde & Lobo (2007): from a unique virtual species, we sampled different numbers of presences and absences. The combination of nine levels of presences (*n* = 91, 456, 911, 4557, 9114, 22,784, 45,572, 68,358, 91,144) and nine levels of absences (*n* = 91, 456, 911, 4557, 9114, 22,784, 45,572, 68,358, 91,144) results in 81 different samples which varied simultaneously in sample size, sample prevalence and sample prevalence bias (i.e. how different the sample prevalence was from true species prevalence). While the original environmental variables were directly used to simulate the virtual species, a principal components analysis (PCA) calculated from the four original environmental variables was used to build the predicted probabilities of occurrence. In other words, the variables used to simulate the species were different from (but correlated with) those used to build the statistical models. Jiménez-Valverde & Lobo (2007) argue that this is a more realistic scenario than using the true predictors directly, because this is what ecologists most often do. We used the default GLM function in the statistical package R (R Development Core Team, 2011) with the first two PCA axes (representing 73% of the total variance in the original four variables) as predictors and including all polynomial terms up to third order for each predictor. The alternative GLM fitting function from Venables & Ripley (2002, p. 445), referred to here as *logitreg*, was also used to identify problems of model convergence (see scripts in Appendix S2). Unlike in Jiménez-Valverde & Lobo (2007), stepwise regression was not implemented because it is not readily available for the alternative *logitreg* (Venables & Ripley, 2002).
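The sampling design can be enumerated directly from the nine levels listed above. In the sketch below the dictionary keys are illustrative names, and expressing prevalence bias as a simple difference from the theoretical prevalence of 0.17 is an assumption, since the text only describes it as how different the two values are.

```python
from itertools import product

# Nine levels used for both presences and absences, as listed in the text.
levels = [91, 456, 911, 4557, 9114, 22784, 45572, 68358, 91144]
SPECIES_PREVALENCE = 0.17  # theoretical species prevalence stated in the text

designs = []
for n_presences, n_absences in product(levels, repeat=2):
    size = n_presences + n_absences
    prevalence = n_presences / size
    designs.append({
        "presences": n_presences,
        "absences": n_absences,
        "size": size,
        "sample_prevalence": prevalence,
        # Assumed definition: signed difference from species prevalence.
        "prevalence_bias": prevalence - SPECIES_PREVALENCE,
    })

print("number of samples:", len(designs))
print("lowest sample prevalence:", min(d["sample_prevalence"] for d in designs))
print("highest sample prevalence:", max(d["sample_prevalence"] for d in designs))
```

Laying the design out this way makes it plain that sample size, sample prevalence and prevalence bias all change together from one sample to the next, which is why their effects are hard to separate.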

Once the probability of occurrence of the virtual species has been modelled, it needs to be translated into presence–absence predictions. Four threshold selection criteria described in Jiménez-Valverde & Lobo (2007) were used: (1) the arbitrary threshold of 0.5 (0.5T), (2) maximization of the Kappa statistic (KMT), (3) minimization of the difference between specificity (i.e. success rate at predicting absences) and sensitivity (i.e. success rate at predicting presences) (MDT), and (4) maximization of the sum of specificity and sensitivity (MST) (see Jiménez-Valverde & Lobo, 2007, for details).
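The four criteria can be sketched as a search over candidate cutoffs. This is a minimal illustration, not the scripts of Appendix S2: the grid of candidates from 0.01 to 0.99, the function names and the toy data are assumptions, and ties are resolved by taking the lowest candidate.

```python
def confusion(observed, predicted, cutoff):
    """Return (tp, fp, tn, fn) when presence is called for predicted >= cutoff."""
    tp = fp = tn = fn = 0
    for obs, prob in zip(observed, predicted):
        call = prob >= cutoff
        if obs and call:
            tp += 1
        elif obs:
            fn += 1
        elif call:
            fp += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def sens_spec(tp, fp, tn, fn):
    """Sensitivity and specificity; assumes both classes occur in the sample."""
    return tp / (tp + fn), tn / (tn + fp)

def kappa(tp, fp, tn, fn):
    """Cohen's kappa from a 2 x 2 confusion matrix."""
    n = tp + fp + tn + fn
    agree = (tp + tn) / n
    chance = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n ** 2
    return (agree - chance) / (1 - chance) if chance < 1 else 0.0

def select_thresholds(observed, predicted):
    """Cutoffs chosen by the 0.5T, KMT, MDT and MST criteria."""
    candidates = [i / 100 for i in range(1, 100)]
    tables = {c: confusion(observed, predicted, c) for c in candidates}

    def diff_ss(c):
        sensitivity, specificity = sens_spec(*tables[c])
        return abs(sensitivity - specificity)

    return {
        "0.5T": 0.5,
        "KMT": max(candidates, key=lambda c: kappa(*tables[c])),
        "MDT": min(candidates, key=diff_ss),
        "MST": max(candidates, key=lambda c: sum(sens_spec(*tables[c]))),
    }

# Hypothetical observations and model predictions for ten sites.
observed = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
predicted = [0.05, 0.1, 0.15, 0.2, 0.3, 0.35, 0.4, 0.45, 0.6, 0.8]
chosen = select_thresholds(observed, predicted)
print(chosen)
```

Because the three data-driven criteria optimize different quantities, they need not agree with each other or with the fixed cutoff of 0.5, particularly when presences are rare in the sample.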

Combining these simulations with analytical results allows us to compare the performance of SDMs for the following scenarios: (1) the best possible model without sample prevalence bias; (2) the best possible model with sample prevalence bias using the true predictor variables; (3) the best possible model with sample prevalence bias and using predictors that are different from the true causal variables; and (4) simulation results, which will incorporate effects of sample prevalence bias, sample size, use of correlated (but not true causal) predictors and model convergence issues. We summarize below several results that exemplify the points made in the previous section.

### Case study results

The overall comparison of predictive performance between the KMT and MST threshold for a probabilistic approach reveals some important differences with respect to Jiménez-Valverde & Lobo's (2007) conclusions, exemplifying why it is important to consider the more general probabilistic framework in simulation studies (point 1). For example, we can confirm that values of specificity and sensitivity for the MST threshold are more stable than those for KMT (Fig. 2b,e versus 2c,f) throughout the range of sample prevalence. However, for species with a gradual environmental response, either sensitivity or specificity, but not both, is higher for KMT than MST, with sample prevalence determining which of the two will be favoured when maximizing the Kappa statistic. Ultimately, the best threshold strategy to use for a given real-world case depends on the objectives of the study (i.e. do we prefer to be wrong with respect to presences or to absences, or balance error in estimating both?) and on information regarding the sample bias (i.e. is sample prevalence similar to, larger than or smaller than the real species prevalence?). Without such information, one cannot determine whether KMT or MST is superior for a given performance statistic. Consequently, one of the main conclusions of the original study, namely that MST and MDT are generally better than KMT, does not hold in the more general probabilistic simulation framework.

Results also corroborate the fact that converting all simulated responses into threshold ones (point 2) would have provided the wrong answer if one were interested in the effects of the shape of the functional response on model performance: results do differ when comparing threshold versus probabilistic approaches (Figs 2 & 3). Sample bias leads to model estimates of probability of occurrence that are also biased in the same direction as the sample (Fig. 3a), with the level of bias increasing with the gradualism in species environmental response. The use of an optimal threshold strategy such as MST, KMT or MDT reduces the effect of sample bias on estimated prevalence (Fig. 3b,c), particularly for MST (Fig. 3c). Nevertheless, this reduction does not reproduce the true species prevalence, and more sophisticated techniques are necessary to correct the probability of occupancy for sample bias (Ward *et al*., 2009; Li *et al*., 2011).
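The direction of this bias follows from Bayes' rule when presences and absences are retained in the sample at different random rates. The sketch below is a generic illustration of that rule, not the analytical procedure of Appendix S1; the sampling rates and the function name are hypothetical.

```python
def biased_probability(p, presence_rate, absence_rate):
    """Probability of presence among sampled sites when presences and
    absences are retained at different random sampling rates (Bayes' rule)."""
    kept_presences = p * presence_rate
    kept_absences = (1.0 - p) * absence_rate
    return kept_presences / (kept_presences + kept_absences)

# Hypothetical rates: presences are five times more likely to be sampled.
for p in (0.0, 0.1, 0.17, 0.5, 0.9, 1.0):
    print(p, round(biased_probability(p, 0.5, 0.1), 3))
```

Probabilities of exactly 0 or 1 are left unchanged while intermediate values are pulled towards the over-sampled class, which is consistent with the bias growing as the species response becomes more gradual.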

Discrimination ability for model outputs is lower when a probabilistic simulation approach is used (point 3). For a threshold response, theoretical predictions of SDM discrimination simply reproduce the maximum value for the statistics, i.e. 1 for sensitivity and specificity, meaning that discrimination between presences and absences is perfect (solid curves in Fig. 2a,d). The combined effects of sample bias, building the SDM from environmental factors that are different from the true forcing variables (dash-dotted curves in Fig. 2) and using a finite sample size (points resulting from simulations in Fig. 2) produce more interesting non-trivial results, where sensitivity and specificity are still very high but may deviate from perfect predictions, for extreme values of sample prevalence (Fig. 2). This contrasts with results for a virtual species with a gradual probabilistic response (Fig. 2b,c,e,f). In this case, discrimination statistics based on simulations (shown as dots), a hypothetical optimal model reproducing the true probability of occupancy on a prevalence-biased sample (solid curves) and a hypothetical optimal model considering sample prevalence bias and PCA predictors (dash-dotted curves) all closely agree and are < 1. When sample prevalence matches species prevalence (vertical dashed line), theoretical and simulation results match those from a hypothetical model reproducing the true probability of occupancy on an unbiased sample (horizontal dashed line), but otherwise may be higher or lower than these optimal predictions.

Although simulation results based on the two different GLM fitting mechanisms in general agree with theoretical results and with each other, a number of cases of disagreement due to convergence problems (point 4) are clearly evident (black dots not overlaid by grey dots in Fig. 2). Samples with very low prevalence result in predictions of the species being totally absent in the whole landscape, and therefore sensitivity is 0 and specificity is 1. As expected, convergence problems occur more regularly and for a wider range of sample prevalence for virtual species with a threshold environmental response, but are also present at low sample prevalence in results for species with a gradual response.

Finally, the lack of iterations in the threshold simulation process complicates separating uncertainty in results due to sample size, predictor variables and environmental variability (point 5). However, notice that complementing these simulations with analytical results provides considerable information, even without iterations in the virtual species realizations.

## Concluding remarks

For the practising biogeographer, this case study shows how the threshold simulation approach results in over-optimistic estimates of discrimination ability. Results do not include any classification error, and are more likely to be influenced by convergence issues. Finally, the combination of the simulation and sample strategy makes it difficult to separate the effects of sample size, sample bias and sample prevalence.

In general, it is difficult to anticipate what specific problems or limitations using a threshold approach impose on a given study using virtual species. However, given that there are important differences in discrimination ability and separation of causal factors between threshold and probabilistic approaches, we advocate a thorough examination of results based on threshold approaches. For example, there have been two recent studies looking at the effects of pseudo-absences from a simulation perspective (Lobo & Tognelli, 2011; Barbet-Massin *et al*., 2012). They both support the idea that a large number of pseudo-absences should be taken at random from all available background environments, and that pseudo-absences and presences should be weighted equally. However, both studies used the threshold simulation approach. It remains to be seen whether or not these conclusions continue to be valid when gradual probabilistic environmental responses are considered. Ward *et al*. (2009) and Li *et al*. (2011) used analytical and probabilistic simulation approaches to test particular correction strategies for pseudo-absences. They showed that taking into account true species prevalence is necessary, although different algorithmic implementations may help reduce the bias introduced in predicted probabilities of occurrence when sample prevalence is different from species prevalence. We suspect that using a probabilistic approach will increase the impact of sample prevalence bias on results. We therefore advocate for caution regarding the wide application of the threshold simulation approach, and for the use of the more general probabilistic approach in combination with analytical results in future virtual species studies.

Finally, and beyond the use of threshold versus probabilistic simulation approaches, we have given an example of how comparisons between simulations and analytical results are extremely valuable for separating out fundamental behaviour of models and performance measures from sampling artefacts: analytical approaches usually represent situations where sample size is large, whereas simulations can help separate the effects of small sample size or data bias. We therefore feel that the use of virtual species in SDM studies in general may benefit from greater use of these analytical approaches as a complement to simulations.

## Acknowledgements

We thank Jorge Lobo for sending us the data used in Jiménez-Valverde & Lobo (2007), and for clarifying some parts of their methodology, allowing us to reproduce their study, and Alberto Jiménez-Valverde and two anonymous referees for constructive comments on the manuscript. This work was possible thanks to funding from an INRA AAP-SPE project no. 470338 awarded to C.N.M., and the MORSE project ANR 11 CEPL 006 01.