Effects of simulated observation errors on the performance of species distribution models

Species distribution information is essential under increasing global changes, and models can be used to acquire such information but they can be affected by different errors/bias. Here, we evaluated the degree to which errors in species data (false presences–absences) affect model predictions and how this is reflected in commonly used evaluation metrics.


| INTRODUC TI ON
As biodiversity and ecosystems are under growing pressure by global changes, we need to urgently increase our understanding of, and associated capacity to model, the main factors driving changes in the distributions of species, assemblages and ecosystems (Dawson, Jackson, House, Prentice, & Mace, 2011). Species distribution models (SDMs; Guisan, Thuiller, & Zimmermann, 2017) allow modelling the distribution of species and their assemblages at different spatial and temporal scales (D'Amen, Rahbek, Zimmermann, & Guisan, 2017;. SDMs statistically correlate species observations (presence-absence or presence-only) with environmental data (Guisan & Thuiller, 2005) and are commonly evaluated by assessing their predictive performance and accuracy (Peterson et al., 2011).
The most used metric is, by far, the area under the receiver operating characteristic curve (AUC-ROC; Fourcade, Besnard, & Secondi, 2018). It is calculated by plotting a model's sensitivity against its false-positive rate at all possible thresholds (Hanley & McNeil, 1982), measuring the model's performance in discriminating between species presences and absences (Lobo, Jimenez-Valverde, & Real, 2008).
Alternative metrics have also been proposed, mainly due to the known limitations of the AUC (e.g., dependence on the calibration area, ignores spatial distribution of errors, relies on the ranking of sensitivity/specificity across thresholds and ignores the probability values given by a model or equally weights omission/commission errors; Lobo et al., 2008;Peterson, Papes, & Soberon, 2008;Jiménez-Valverde, 2012;Jiménez-Valverde, Acevedo, Barbosa, Lobo, & Real, 2013). The most common alternatives are Cohen's Kappa (Kappa;Cohen, 1960) and the true skill statistic (TSS; Allouche, Tsoar, & Kadmon, 2006). Kappa corrects the overall accuracy of model predictions by the accuracy expected to occur by chance while TSS corrects Kappa's dependency on prevalence (see Table 1 for more information). Moreover, SDMs can contain uncertainty from various sources (reviewed by e.g., Barry & Elith, 2006;Beale & Lennon, 2012), including errors associated with species data (e.g., unavailable absence data, small or insufficient sample sizes, unexplored geographical bias or spatial errors; e.g., Fielding & Bell, 1997;Pearce & Ferrier, 2000;Jenkins, Powell, Bass, & Pimm, 2003), environmental variables (e.g., missing important ones; Mod, Scherrer, Luoto, & Guisan, 2016) or modelling techniques (e.g., Guisan, Zimmermann, et al., 2007;Thibaud, Petitpierre, Broennimann, Davison, & Guisan, 2014). One problem commonly affecting SDMs concerns the inability to separate potentially false and true species' absences obtained through field surveys (Lahoz-Monfort, Guillera-Arroita, & Wintle, 2014) leading to underestimation of species occupancy (i.e., when occupied sites are misclassified as unoccupied; Guillera-Arroita, Ridout, & Morgan, 2010), incorrect inference about species distributions or inaccurate predictions (Lahoz-Monfort et al., 2014). The wrongly recorded absences (false absences) in presence-absence datasets or the omission of presences in presence-only models can then lead to predictions that will reflect where the species is more or less likely to be detected instead of the locations where if should occur or not (Kéry, 2011;Lahoz-Monfort et al., 2014). This means that one would eventually model what is called the "apparent distribution" and not the true distribution (Kéry, 2011). Additionally, some environmental relationships that are important to explain species occurrence and distribution might be wrongly identified or completely missed when false absences/presences are recorded (Kéry, 2011). The effect of detection errors on model performance is likely to depend on the modelling techniques used as those differ in their ability to fit complex response curves (i.e., species-environment relationships; Guisan, Zimmermann, et al., 2007;Merow et al., 2014).

K E Y W O R D S
artificial data, AUC, ecological niche models, evaluation metric, habitat suitability models, Kappa, model fit, predictive accuracy, TSS, uncertainty TA B L E 1 Detailed information about the evaluation metrics used to assess the predictive performance of SDMs (adapted from Liu et al., 2005 andAllouche et al., 2006), a is true positives (or presences), b is false positives (or presences), c is false negatives (or absences), d is true negatives (or absences), n (=a + b + c + d) is the total number of sites. Sensitivity is the probability that the model will correctly classify a presence (a/a + c). Specificity is the probability that the model will correctly classify an absence (d/b + d)  Liu et al. (2005) Several of these issues have received considerable attention in recent years, providing information to improve survey designs, proposing approaches to account for imperfect detection and evaluating the impacts of non-detection of species in models of individual species (Gu & Swihart, 2004;Guillera-Arroita et al., 2010;MacKenzie et al., 2002). However, the majority of the studies focusing on uncertainties in SDMs used real species observations, putting a limit to proper assessment of model accuracy because the complete distribution and the species-environment relationships cannot be entirely known and may result from factors that cannot be controlled. A way to avoid these limitations is to use artificial data (Austin, Belbin, Meyers, Doherty, & Luoto, 2006) in a virtual F I G U R E 1 Workflow of the analytical steps followed in the study.
Step 1-We started by creating binary distribution maps for 100 virtual species from models based on real species' data (using either generalized linear models (GLM), boosted regression trees (BRT) or random forests (RF) as modelling techniques and the receiver operating characteristic (ROC), true skill statistic (TSS) or KAPPA as thresholding techniques).
Step 3-To each of the sampled datasets, errors were added according to six different levels (0%-training data without error added, the control; 10%, 20%, 30%, 40% and 50%-training data with error added) and two different types of error (errors added to presences, creating false negatives or added to absences, creating false positives).
Step 4-Each occurrence dataset was used to create single species distribution models (probability and binary maps), using three different modelling techniques (GLM, BRT and RF).
Step 5-The predictions for each species were then evaluated with three evaluation approaches: model fit probability (MFp), predictive success probability (PSp) and predictive success binary (PSb), using different metrics: maximized Kappa (MaxKappa), maximized TSS (MaxTSS) and Somers'D (rescaled measure of AUC) for MFp and PSp; Kappa and TSS for PSb ecologist approach (see Zurell et al., 2010 for a review), where all the information necessary for a study can always be obtained in a fully artificial or semi-artificial world, allowing complete or at least partial control on the data and models being tested (Austin et al., 2006). In one of the first application to SDMs, Hirzel, Helfer, and Metral (2001) created virtual species to test different habitat suitability methods and their predictive power under different scenarios. Virtual species have been used to test different ecological models and assumptions, to test different approaches to sample species data (Hirzel & Guisan, 2002), to downscale coarse-grain data into high-resolution predictions (Bombi & D'Amen, 2012) or to measure the relative effect of different factors affecting predictions (Thibaud et al., 2014).
In this study, we take a virtual ecologist approach, using 100 evolution and conservation, this paper provides an essential analysis of the potential effects of errors in species data on SDM reliability and on the interpretation of common evaluation metrics.

| Analytical framework
We implemented a virtual ecologist approach (see Figure 1), based initially on real data in a real landscape (i.e., which can also be considered as a semivirtual study; Albert et al., 2010) in the western Swiss Alps (a priority research area; https://rechalpvd.unil.ch), covering approximately 700 km 2 . We defined the distributions of virtual species based on predictions of models fitted on real data in this study area to keep ecological realism (see Step 1 below). The approach consisted of five steps:

| Step 1. Creating virtual species
From a set of real species data (previously sampled in the study area), we generated 100 virtual species, by fitting SDMs (initial SDMs in Figure 1) using presence-absence data against five environmental predictors: summer mean monthly temperatures (2-19°C), sum of winter precipitation (65-282 mm), annual sum of potential solar radiation (KJ), slope (°) and topographic position (unitless; indicating ridges and valleys; see Supporting Information Appendix S1).
The models were fitted using generalized linear models (GLMs; McCullagh & Nelder, 1989), random forests (RFs; Breiman, 2001)  In this study, all initial environmental and species data were available at a 25 m resolution. In real-world studies, the spatial resolution can have an important influence on model predictions, with diverging results being observed between small-and large-scale studies (e.g., Meyer & Thuiller, 2006;Mertes & Jetz, 2018;Record et al., 2018), or when changing resolution or extent (e.g., Thuiller, Brotons, Araujo, & Lavorel, 2004;Guisan, Graham, Elith, & Huettmann, 2007). This can, for instance, result from the scale dependency of the environmental predictors (Vicente et al., 2014) and spatial stochastic effects at smaller spatial scales (Scherrer et al., 2018;Steinmann, Eggenberg, Wohlgemuth, Linder, & Zimmermann, 2011). As a result, the distribution of real species cannot usually be fully explained by the abiotic predictors, as dispersal and biotic factors also play a role and interact with scale (Soberon & Nakamura, 2009). Here, we avoid this problem by using a virtual species approach, with the same predictors being used to create the species and fit their distribution models, and therefore, the initial species distributions are fully explained by the chosen predictors at the study scale (extent and resolution). This approach guaranteed that the virtual species showed realistic response curves for our landscape resulting in realistic species assemblages.
In theory, the resolution should thus not matter in our study and should not affect our findings. All models were run in R software version 3.3.3 (R Core Team, 2017), using biomod2 default settings (Thuiller, Lafourcade, Engler, & Araujo, 2009), as in most published studies.

| Step 2. Sampling observations from virtual species maps
Virtual species presence-absence data were randomly sampled to create training datasets of different sizes (100-400-1,600) using two sampling designs: (a) selection of species data with equal number of presences-absences (equal prevalence; "EqualPrev") and (b) selection of species data taking into account a presence-absence ratio reflecting the true prevalence of the species (species true prevalence; "TruePrev"). These datasets served as control (0% error) to establish a baseline of the potential (re-)sampling bias for the different sampling schemes, modelling techniques and species (Step 2, Figure 1).

| Step 3. Addition of errors to the training data
For each sampling design and size, errors were randomly added (using software R) to the training data according to six different levels (i.e., "levels of error"; 0% (no error added), 10%, 20%, 30%, 40% and 50%). These errors were added either to presences only (creating false negatives [FN] by changing presences to absences), or to absences only (creating false positives [FP] by changing absences to presences; Step 3, Figure 1).

| Step 4. Modelling procedure
The control dataset (without error) and all the datasets with errors added were used to create SDMs (final SDMs in Step 4, Figure 1) using the same environmental predictors and modelling techniques employed to initially create the virtual species. This ensures that-without error added (i.e., controls)-the models can potentially replicate perfectly the distributions of our species, since all information that initially defined these distributions is available (i.e., same predictors), and the response curves could be fitted perfectly (i.e., if using the same technique). The only factors that can affect the performance are therefore the sample size, the change of modelling technique, the threshold method and the errors added, which we can untangle through the control and the known full distribution. In other words, having SDM predictions for our control and degraded datasets allowed us to distinguish decreases in model performance only caused by resampling (using the control dataset), the thresholding effect and from the effects caused by the errors added to the presences (FN) and/or the absences (FP).

| Step 5. Evaluating predictions across species and levels of errors
Finally, we evaluated all predictions built for each sample size, sampling design, modelling technique and threshold approach by measuring model fit on probability (MFp) at sampled sites and predictive success for probabilistic (PSp) and binary predictions (PSb) across the whole area (i.e., evaluation approaches; see description below), using five widely used agreement/evaluation metrics (for more information see Table 1 and Liu et al., 2005)  However, models fitted using virtual species created by BRTs/RFs F I G U R E 2 Evaluation values of control model (using training data without errors added; 0%) for "EqualPrev" (left column; a, c and e) and "TruePrev" (right column; b, d and f) sampling designs, with measured MFp, PSp and PSb for virtual species (n = 100), created using generalized linear models (GLM), boosted regression trees (BRT) or random forests (RF; initial SDMs) and with different sample sizes (100-400-1,600). Model fit probability (MFp) and predictive success probability (PSp) were measured using maximized Kappa (MaxKappa; yellow), maximized TSS (MaxTSS; green) and Somers'D (blue), while predictive success binary (PSb) was measured using Kappa (gold) and true skill statistic (TSS; light green). For each sample size, three sets of three box plots are displayed, corresponding to models fitted (final SDMs) using either GLMs (solid plots), BRTs (dashed plots) or RFs (dotted plots) and evaluated with corresponding metrics. The same applies to PSb, but only two box plots are displayed in each of the three model sets, corresponding to the two metrics used presented a wider range of values, with model performance being worse than when species were created by GLMs.

| Effects on model evaluation of adding errors to the training data
As the patterns observed across sample sizes were similar, we only report results on the intermediate sample size (i.e., 400; but see F I G U R E 3 Observed difference of measured model fit probability (MFp), predictive success probability (PSp) and predictive success binary (PSb) between control (training data without errors added; 0%-sampled data) and degraded data (training data with errors added) models, under the sampling design EqualPrev and sample size 400, for virtual species created using GLM (generalized linear models). Errors were added to the occurrence dataset, creating either false positives (errors added only to absences; left column; a, c and e) or false negatives (errors added only to presences; right column; b, d and f). MFp and PSp were measured using maximized Kappa (MaxKappa; yellow), maximized TSS (MaxTSS; green) and Somers'D (blue), while PSb was measured using Kappa (gold) and true skill statistic (TSS; light green). For each level of error, three sets with three plots are observed, corresponding to models fitted using either GLMs (solid plots), BRTs (dashed plots) or RFs (dotted plots). For PSb, only two plots are present in each of the three sets Supporting Information Appendix S2-S3 for complete results on "EqualPrev" and "TruePrev" sampling designs, respectively). The effect of error added decreased with sample size, with more accurate models being observed at higher sample sizes (i.e., difference between control and degraded models was smaller).
Regardless of the evaluation approach, as errors were increasingly added to training data, evaluation values decreased when compared with the control models ( Figure 3). This decrease in model performance was more pronounced in model fit probability and predictive success binary (MFp/PSb). Still, models whose performance F I G U R E 4 Observed difference of measured model fit probability (MFp), predictive success probability (PSp) and predictive success binary (PSb) between control (training data without errors added; 0%-sampled data) and degraded data (training data with errors added) models, under the sampling design TruePrev and sample size 400, for virtual species created using GLM (generalized linear models). Errors were added to the occurrence dataset, creating either false positives (errors added only to absences; left column; a, c and e) or false negatives (errors added only to presences; right column; b, d and f). MFp and PSp were measured using maximized Kappa (MaxKappa; yellow), maximized TSS (MaxTSS; green) and Somers'D (blue), while PSb was measured using Kappa (gold) and true skill statistic (TSS; light green). For each level of error, three sets with three plots are observed, corresponding to models fitted using either GLMs (solid plots), BRTs (dashed plots) or RFs (dotted plots). For PSb, only two plots are present in each of the three sets decreased the most in each approach depended on the modelling technique used (Figures 3 and 4). As a result, random forests (RF) presented higher model performance when MFp/PSp were measured and generalized linear models (GLM) when PSb was measured.
In general, the creation of false positives (FP; Figure 3 The results obtained with sampling design "TruePrev" (Figure 4 and Supporting Information Appendix S3) did not differ from those previously described in "EqualPrev" (Figure 3  After increasingly degrading the data and when MFp/PSp were measured, models fitted by GLMs (Figures 3 and 4) presented the highest decrease when compared with control models. On the opposite side, models fitted by RFs (Figures 3 and 4) were the least affected by the addition of degraded data. Still, when measuring PSb, the decrease in model performance was higher for models fitted by RFs and more stable for models fitted by GLMs (especially as the errors added increased).
We performed an additional evaluation approach, predictive success on calibration data (PSc), not providing the results here since it is a subset of PSb and accordingly yielded similar patterns (but see Supporting Information Appendix S5).

| D ISCUSS I ON
We used a virtual ecologist approach with artificial species data to evaluate the degree to which errors in presences/absence data (see Graham, Ferrier, Huettman, Moritz, & Peterson, 2004;Guillera-Arroita et al., 2010;Tyre et al., 2003 for examples of causes like false-negative errors or imperfect detection, taxonomic inaccuracies or biases in the spatial coverage of data) can affect SDM predictions and assess the reliability of currently used evaluations metrics.
By using artificial data, we prevented limitations of real-world data (most previous studies used real species data from surveys, herbaria or museums; e.g., Hernandez, Graham, Master, & Albert, 2006;Osborne & Leitao, 2009;Mitchell, Monk, & Laurenson, 2017) Randin et al., 2006). Fourth, the creation of false positives had a stronger effect in decreasing model performance than the creation of false negatives. We discuss these findings below.

| Confirming the effect of sample size and controlling for it
We found that the effects of degraded data consistently decreased with sample size, showing sample size as an important factor affecting model performance. This relationship between model performance and sample size is well known (e.g., Stockwell & Peterson, 2002;Wisz et al., 2008;Thibaud et al., 2014;Mitchell et al., 2017).
It can be partially explained by the fact that with greater number of presence/absence data, a more complete (broader) information about the occupied environmental space will likely be available. This improves parameter definition, leading to more accurate predictions (Carroll & Pearson, 1998). Our results could be useful since we showed that accurate models (i.e., when all the metrics show high evaluation values) could be generated even when substantial levels of errors (>30%) are present in the training data (if a large enough sample is collected and the adequate modelling techniques are used).

| Importance of contrasting model fit and predictive success
Different conclusions about model performance can be inferred depending on how model performance is measured (i.e., our different evaluation approaches). To our knowledge, this is the first study to formally test and compare the outputs of these evaluation approaches to assess model predictions. This was possible through the use of virtual species allowing us to simultaneously assess how well models reproduced the partially degraded training data ( ing a decrease in evaluation values between model fit and independent evaluation (e.g., Randin et al., 2006;transferability test, where General Additive Models (GAMs) fit better than GLMs but predict worse to independent data).

| How do evaluation metrics reflect model performance?
A consistent pattern was identified, with models evaluated by Somers'D (rescaled AUC) always yielding the highest evaluation values, usually followed by MaxTSS and MaxKappa, or TSS and Kappa (for probabilistic and binary predictions, respectively). Within the same model, Somers'D values had very small differences when compared with the control model (even with errors >30%). Somers'D (rescaled AUC; from −1 to 1) was used instead of the widely used AUC to allow direct comparisons to the other evaluation metrics, as they all range between −1 and +1, being interpreted roughly in a same way as correlation coefficients. This means that when considering Somers'D (or AUC, with even higher evaluation values, concentrated between 0.5 and 1), all models evaluated in this study would be considered at least fair (based on thresholds proposed by Swets (1988, note nr 11); i.e., models with AUC values above 0.7 are considered "useful for some purpose," while models with AUC > 0.9 are considered as being "of rather high accuracy"). However, when evaluated by the other metrics, a large amount of these models would be considered poor or not different than random. Therefore, concluding whether a model is good, fair or poor partly depends on the evaluation metric used and not only on model performance. In particular, our results suggest a strong tendency of Somers'D (i.e., AUC) to yield over-optimistic evaluations. We also observed, although in a lesser measure, a tendency of TSS (resp. MaxTSS) to yield over-optimistic values, whereas Kappa (resp. MaxKappa) proved to better reflect the level of errors added to the training data. These results are supported by recent findings showing that AUC/TSS are not the most efficient metrics to assess model performance (being over-optimistic or unrealistic) and that these could be classified as having good performance even when "dummy" data (e.g., pseudopredictors derived from paintings; Fourcade et al., 2018) or wrong information (e.g., locational uncertainty; Graham et al., 2008;Mitchell et al., 2017) was used. As a result, many models could be considered as satisfactory despite generating partially wrong spatial predictions. Our results confirm these previous criticisms and show how important it is to take into account these drawbacks in future uses of AUC (or Somers'D)-and to a lesser extent of TSS-to assess model performance. Some suggested approaches might be to assess the spatial predictions when comparing models (Mitchell et al., 2017;Randin et al., 2006) or accounting for the most relevant section of the ROC curve ; assuming that true absences and independent data exist). However, as noted by Fourcade et al. (2018), this "perfect" data are usually unavailable and detailed screening of ROC plots can be difficult when modelling multiple species. Therefore, the use of AUC needs to be considered with great care in future studies and the interpretation scales (Araujo et al., 2005;Swets, 1988) used to assign a level of model performance to its values need to be revisited. We believe it is probably more effective and productive to investigate new ways/methods to correctly evaluate model performance and predictions, with the use of artificial data being a useful tool to completely assess the value of these new methods.

| How do different modelling techniques deal with the degraded training data?
The contrasted results of predictions with high model fit ( to the modified training data). So, models with simpler response curves (like GLM) tend to better manage errors when present in the species data, resulting in better predictions to independent data (see e.g., Randin et al., 2006 when compared to GAMs), and better fit to ecological theory (Austin, 2002(Austin, , 2007. More complex methods (here RF/BRT) seem to overfit the degraded data, maintaining a good/high model fit (MFp) but at the cost of a poorer/low predictive success (PSb; see also Merow et al., 2014).

| How do different types of errors affect models and metrics?
The creation of false positives (FP) had a stronger negative effect on model performance than when false negatives (FN) were created.
This is especially true when species had the same number of presences-absences ("EqualPrev"), not being obvious when sampling true prevalence ("TruePrev"), possibly due to the characteristic low prevalence of some species. False positives had a stronger negative effect because presences are expected to be on average more informative. They generally occur in a unimodal and limited way along environmental gradients, contributing to a fairly clear signal that can be captured in a model. On the other hand, absences are usually less informative since they can span entire environmental gradients and thus be found for example, on both sides of the mode of a species' occurrences (i.e., would need a bimodal response to be captured).
Depending on the species, absences can still hold a signal in some cases (e.g., low elevations for alpine plants), but it is likely to be on average much weaker than that of presences. We can think of the creation of false negatives (in "EqualPrev") the same way as one uses pseudo-absences (i.e., when real absences are not available), setting the weights of those pseudo-absences to 0.5 (therefore ensuring equal prevalence). As the addition of errors to presences/absences decreased model performance in both cases, it is important to account for imperfect detection in models (see Guélat & Kéry, 2018; Lahoz-Monfort et al., 2014 for recommendations).

| Conclusions, recommendations and perspectives
Our study showed that much can be learned by using artificial data where truth is known, especially by contrasting model fit and predictive success, and modellers would gain much by using these virtual approaches more systematically in the future in complement to real data. Several important findings emerged specifically from this study: 1. The effect of errors added to species data decreased with increasing sample size.
2. Different conclusions about model performance can be inferred depending on how it is measured (i.e., which metrics and which data): 3. Models with high evaluation values can be obtained even with high levels of error artificially added to the training data; 4. The classical classification or interpretation of a model as excellent, good, fair or poor strongly depends on the metric used to evaluate it and thus can be misleading; 5. Evaluation metrics matter: We identified AUC as a particularly over-optimistic metric and in a lesser measure MaxTSS (TSS for binary predictions), with often high evaluation values produced even with high levels of errors in the training data, thus not necessarily translating a good predictive success; therefore, we recommend the use of MaxKappa.
6. Modelling techniques were differently affected by added error, with some delivering better measures of predictive success (GLMs here) and others delivering better model fit (RFs).
7. The creation of false positives had a stronger effect on the measured evaluation approaches than the creation of false negatives.
A particularly important finding in our study is thus the need to seriously reconsider the current use of AUC (here rescaled, Somers'D) and its scale of interpretation. We advise caution when models are solely evaluated with this metric (and to a lesser extent by TSS and MaxTSS) as the interpretation of their quality, reliability and transferability might be too optimistic and lead to biased conclusions. The incorrect interpretation of how good/accurate a model is might have serious consequences if not considered. For example, the prioritization of specific areas for conservation can be wrong if the models used for that prioritization are over-optimistic or biased. The same can be said if invasive species prevention/eradication efforts are occurring, with an over-optimistic prediction possibly leading to management being directed to areas where those efforts are unnecessary. Taking into account previous and current studies, the most appropriate measure might be to completely cease to use AUC and instead focus on more effective evaluation metrics. Based on our results, we recommend using MaxKappa (resp. Kappa) if one wants a metric that better reflects the actual level of errors in the predictions. As it is usually preferable to evaluate models using spatially independent data James, Witten, Hastie, & Tibshirani, 2013), our results suggest that techniques that are better at reproducing ecological theory (Austin et al., 2006), like GLMs here, tend to show a better overall behaviour for modelling species distributions. However, additional modelling techniques (e.g., as found in Elith et al., 2006) should also be tested to determine the most suitable ones. Additionally, effort should be put in minimizing false-positive rates when collecting training data (e.g., improving species identification or detectability).
Finally, research using a virtual ecologist approach could also be employed to further develop more reliable evaluation metrics that could be properly tested in a "controlled environment." In a general manner, a more systematic use of artificial data bears the potential to improve methodological developments considerably in future ecological and evolutionary research.