Assessing the reliability of species distribution projections in climate change research

Aim: Forecasting changes in species distributions under future scenarios is one of the most prolific areas of application for species distribution models (SDMs). However, no consensus yet exists on the reliability of such models for drawing conclusions on species distribution responses to changing climate. In this study we provide an overview of common modelling practices in the field and assess the reliability of model predictions using a virtual species approach.

Location: Global

Methods: We first provide an overview of common modelling practices in the field by reviewing the papers published in the last 5 years. Then, we use a virtual species approach and three commonly applied SDM algorithms (GLM, MaxEnt and Random Forest) to assess the estimated (cross-validated) and actual predictive performance of models parameterized with different modelling settings and violations of modelling assumptions.

Results: Our literature review shows that most papers that model species distributions under climate change rely on single models (65%) and small samples (< 50 presence points, 62%), use presence-only data (85%), and binarize models' outputs to estimate range shift, contraction or expansion (74%). Our virtual species approach reveals that the estimated predictive performance tends to be over-optimistic compared to the real predictive performance. Further, the binarization of predicted probabilities of presence considerably reduces models' predictive ability. Sample size is one of the main predictors of real accuracy, but has little influence on estimated accuracy. Finally, the inclusion of irrelevant predictors and the violation of modelling assumptions increase estimated accuracy but decrease the real accuracy of model projections, leading to biased estimates of range contraction and expansion.
Main conclusions: Our study calls for extreme caution in the application and interpretation of SDMs in the context of biodiversity conservation and climate change research, especially when modelling large numbers of species, where species-specific model settings become impracticable.


Introduction
Recently, a number of experts have delineated a set of best practices, and shown that many relationships are wrong (Fourcade et al. 2018, Warren et al. 2020). […] Relationships need to be realistic in order to make meaningful predictions to different conditions. However, although guidelines for transferability have been provided (Sequeira et al. 2018), it is common to validate models on present data and assume they perform equally well for future predictions. While a number of studies have discussed and tested the influence of multiple sources of uncertainty on the predictive accuracy of SDM predictions under present conditions (e.g. Wenger […] MaxEnt and Random Forest). Our approach allows validating model predictions against the virtual "reality", therefore estimating true model predictive accuracy. We generated 50 virtual species distributions, fitted SDMs under different conditions, and assessed the discrimination ability of present and future model predictions against the real distribution. We also compared this predictive ability with that estimated using a cross-validation (split-sample) approach, which is the most common way of assessing model discrimination accuracy in SDM studies. We systematically assessed the combined effect of 1) the number of presence points (i.e. sample size), 2) the geographic extent over which background points are drawn, 3) the number of biologically relevant (i.e. true niche axes) and irrelevant predictors (i.e. spurious correlates), 4) the species prevalence (proportion of the study area occupied by the species), 5) the sample prevalence (proportion of presences over background points), 6) the proportion of niche filling (the degree to which the species is at equilibrium with the environment), and 7) the spatial bias in presence points. We then assessed model predictions using two common discrimination metrics: the AUC and the TSS.
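The two discrimination metrics named above can be computed directly from presence/absence labels and continuous suitability scores. The sketch below is a minimal, illustrative implementation (not the authors' code): AUC as the rank-based probability that a random presence outscores a random absence, and TSS as sensitivity plus specificity minus one at a given binarization threshold.

```python
import numpy as np

def auc(y_true, scores):
    """Rank-based AUC: probability that a random presence scores
    higher than a random absence (ties count half)."""
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

def tss(y_true, scores, threshold):
    """True Skill Statistic: sensitivity + specificity - 1 after
    binarizing the scores at the given threshold."""
    y_true = np.asarray(y_true)
    pred = np.asarray(scores) >= threshold
    sensitivity = pred[y_true == 1].mean()   # true positive rate
    specificity = (~pred)[y_true == 0].mean()  # true negative rate
    return sensitivity + specificity - 1
```

A perfectly discriminating model gives AUC = 1 and, at a well-placed threshold, TSS = 1; a model no better than random gives AUC ≈ 0.5 and TSS ≈ 0.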
We generated 50 virtual species using the 'virtualspecies' R package (Leroy et al. 2016). For each virtual species, we first determined the study area by generating a random extent between 3 and 10 decimal degrees (~330-1100 km) in both longitude and latitude, centered around a random location on the globe (Fig. 1). We then selected 6 random bioclimatic variables and sampled their values within the extent using 100 random points. We used the mean and standard deviation estimated for the 6 bioclimatic variables to generate the niche tolerance of the virtual species.

We then projected the niche within the study area for present and future conditions, and defined the occupied area using a threshold sampled randomly between the 0.2 and 0.8 quantiles of the suitability values in the study area. The threshold represents the value above which the species can survive and is assumed to be present for the validation of the distribution models (see section 2.5). Note that the virtual species can potentially be present outside this study area in environmentally analogous conditions, but we assume that the species is either limited by dispersal, absent because of biotic interactions, or its presence outside the study area is simply unknown to the modeller.

For each virtual species, we fitted species distribution models using different cross-combinations of the settings presented in Table 1. To assess the influence of sample size, we sampled random presence points (10, 25, 50, 100, 250, 500 and 1000 points) within the distribution area of the species. Presences were sampled randomly and not as a function of niche suitability ([…] Hastings 2018), as observation probability is often also a function of other factors such as vegetation structure or human presence.
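The niche-generation step described above can be sketched as follows. This is an illustrative simplification, not the paper's actual workflow: it assumes independent Gaussian responses per bioclimatic variable (the 'virtualspecies' package offers richer response shapes), and random numbers stand in for the real climate grids.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in climate grid: n_cells x 6 values for 6 bioclimatic variables
climate = rng.normal(size=(10_000, 6))

# Niche defined by a mean and tolerance (sd) per variable, estimated from
# 100 random points within the extent (as in the text)
sample = climate[rng.choice(len(climate), 100, replace=False)]
mu, sd = sample.mean(axis=0), sample.std(axis=0)

# Gaussian response per variable, multiplied across variables, rescaled to [0, 1]
suitability = np.exp(-0.5 * ((climate - mu) / sd) ** 2).prod(axis=1)
suitability /= suitability.max()

# Occupied area: cells above a threshold drawn randomly between the 0.2
# and 0.8 quantiles of the suitability values in the study area
threshold = np.quantile(suitability, rng.uniform(0.2, 0.8))
presence = suitability >= threshold
```

The `presence` mask plays the role of the virtual "reality" against which model predictions are later validated.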
Then, to assess the influence of the geographic extent, we fitted a minimum convex polygon (MCP) around presence areas and generated a buffer, expressed as a percentage increase of the MCP, which delimited the geographic extent within which the background points were sampled (0%, 100%, 500%, 5,000%, 50,000%; the latter often resulting in the entire continent). We set the number of background points depending on the number of presence points and the level of sample prevalence. We used three sample prevalence values: 0.01, 0.1 and 1. Not all background points, however, could always be sampled, depending on the selected geographic extent (i.e. insufficient number of cells), leading to variable sample prevalence values.

In each model, we used a total number of predictor variables between 3 and 12. To assess the influence of biologically relevant and irrelevant predictors of species presence, we sampled none, 3, or 6 relevant bioclimatic predictors (those describing the true species niche), and none, 3 or 6 irrelevant bioclimatic predictors (not describing the niche) from the other 13 bioclimatic variables (Table 1). Combinations yielding 0 predictor variables were not considered. We tested collinearity […]

We also considered violations of two important assumptions underlying SDMs that are common in real study cases: non-equilibrium with the environment (niche filling) and non-random sampling of presence points that results in a bias along an environmental gradient. Decreasing proportions of niche filling were simulated by only sampling presence points below a given quantile (0.33, 0.66, 1) of the human footprint index values within the study area (Table 1). This mimics a scenario where a species is potentially present (given climatic conditions) and yet absent because of human impact. Note that species may be in disequilibrium with the environment for different reasons (e.g.
biotic interactions, dispersal limitations), but the result would be similar. For simplicity, we restrict our analyses to the case where species are not at equilibrium because of human impact. Environmental bias was simulated by randomly sampling one of the biologically relevant bioclimatic predictors used in the distribution model and sampling presences only below a given quantile (0.33, 0.66, 1) of the distribution of environmental values (Table 1). This represents the situation where the species has only been observed under certain conditions (i.e. sampling bias correlates with environmental gradients), therefore potentially biasing the estimation of the species niche. When no biologically relevant variable was included, an irrelevant predictor was selected instead.
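Both violations reduce to the same mechanics: restrict the pool of candidate presence cells to those whose covariate (human footprint for niche filling, a bioclimatic predictor for environmental bias) falls at or below a chosen quantile, then sample presences from that truncated pool. A minimal sketch, illustrative rather than the authors' code:

```python
import numpy as np

def truncated_presence_sample(cell_ids, covariate, n, quantile, rng):
    """Sample n presence cells only where the covariate falls at or below
    the given quantile of its distribution; quantile = 1 reproduces
    unbiased (unrestricted) sampling."""
    cell_ids, covariate = np.asarray(cell_ids), np.asarray(covariate)
    cutoff = np.quantile(covariate, quantile)
    eligible = cell_ids[covariate <= cutoff]
    return rng.choice(eligible, size=n, replace=False)
```

With quantile = 0.33, for example, presences are confined to the third of cells with the lowest covariate values, mimicking a species absent from high-impact areas or observed only under part of the environmental gradient.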
The full set of combinations of settings in Table 1 […]

Among 250 papers reviewed, 92 included correlative species distribution models projected to different times, and were therefore deemed relevant for our scope (Table S1). Based on our sample and using a bootstrapping approach, we estimated that the total number of papers published between 2015 and 2019 that matched this criterion is 1194–1665 (95% CI), indicating that we sampled approximately between 5.5 and 7.7% of the total (Appendix 1).

Most of the papers inspected included models fitted on relatively small sample sizes (N < 50; Fig. 2a), with only 18.4% including minimum samples larger than 50 and 16.1% not reporting the sample size used. More than 50% of the papers included all bioclimatic variables with no biological justification (Fig. 2b). Among these papers, in ~50% of the cases the authors reduced the number of variables using automated approaches based on correlations or best fit to the data. A smaller number of studies selected variables a priori, some of which did not provide a justification for this choice (7.8%). Most of the studies used a single model (Fig. 2c,d), with MaxEnt being the most common algorithm used (78.3%), followed by linear models (GLM = 30.4%; GAM = 26.1%) and machine learning models (RF = 27.2%; GBM = 20.7%) (Fig. 2c). The majority of studies did not include real absences but used pseudo-absences, background data, or presence-only methods (i.e. climatic envelopes; 84.8%; Fig. 2e). A large proportion of papers using pseudo-absences or background points did not report the area of sampling (48.7%), while others used a variety of different approaches, the most common being sampling randomly across the pre-defined study area (Fig. 2f). Finally, most studies binarized the continuous probability outputs based on discrimination metrics (e.g.
max TSS or equal sensitivity and specificity; 73.9%), and almost one quarter of the studies […]

The three algorithms showed a consistent pattern across the two scenarios and accuracy metrics. The predictive accuracy of models' predictions estimated by cross-validation was consistently above the typically accepted performance thresholds (AUC = 0.7, TSS = 0.5; Landis and Koch 1977, Swets 1988), and higher than the true predictive accuracy for present predictions, future predictions, and contraction and expansion areas (Fig. 3). However, the accuracy of binary predictions (TSS) was substantially lower than that measured for continuous predictive outputs (AUC), suggesting that the binarization of relative probabilities of presence decreases models' predictive ability considerably (Fig. 3). Models' predictions for the future and for contraction and expansion areas showed lower predictive performance (Fig. 3a), especially when binarized (Fig. 3b).

Under optimal modelling settings (e.g. large sample size, relevant predictors, no violation of assumptions regarding niche filling and unbiased sampling), models performed relatively well according to AUC (Fig. S1a), but poorly when considering binary outputs (Fig. S1c). By contrast, under poor modelling settings and conditions (small sample size, irrelevant predictors, violation of the main assumptions), the estimated predictive abilities remained high, but the true predictive abilities dropped considerably, especially when predictions were binarized into presence-absence (Fig. S1b,d). TSS and AUC were highly correlated, but while high TSS always corresponded to high AUC, the opposite was not always true (Fig. S2).
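The "max TSS" thresholding mentioned in the literature review, and implicated here in the accuracy loss from binarization, simply scans candidate thresholds and keeps the one that maximizes TSS on the evaluation data. A minimal sketch (illustrative only):

```python
import numpy as np

def max_tss_threshold(y_true, scores):
    """Return the binarization threshold maximizing TSS
    (sensitivity + specificity - 1), and the TSS achieved."""
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    best_t, best_tss = 0.5, -1.0
    for t in np.unique(scores):  # every observed score is a candidate cut
        pred = scores >= t
        sens = pred[y_true == 1].mean()
        spec = (~pred)[y_true == 0].mean()
        if sens + spec - 1 > best_tss:
            best_t, best_tss = t, sens + spec - 1
    return best_t, best_tss
```

Because the threshold is tuned on the evaluation data itself, the resulting binary maps can look better in cross-validation than they are when transferred to new conditions, which is consistent with the gap between estimated and true TSS reported above.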
The importance and effect of different factors on the estimated predictive accuracy by cross-validation […] The predictors of estimated predictive accuracy were species prevalence, the environmental gradient sampled (inverse of environmental bias), and the geographic extent of background point sampling (Fig. 4). Additionally, sample prevalence was important for Random Forest, and the number of presence points for GLM (Fig. 4). Predictive accuracy decreased with increasing species prevalence and decreasing environmental gradient sampled (i.e. it increased with environmental bias), and increased with increasing geographic extent sampled (Fig. S3-S8). The number of presence points had a positive effect when fitting GLM and MaxEnt models, but had little effect when using Random Forests. Sample prevalence had a positive effect in Random Forests, and a weakly negative effect in the other two models. The number of relevant and irrelevant predictors had a weak but positive effect regardless of the model (Fig. S3-S8).

3.4 Determinants of true predictive ability

The true predictive accuracy (i.e. measured against the virtual reality) of the models for the present was mostly affected by species prevalence, the number of presences, and the environmental gradient sampled. The geographic extent was also important when fitting MaxEnt models (Fig. 4). Both the number of presence points and the environmental gradient sampled had a positive influence on predictive accuracy, the geographic extent had a weak positive effect, and the species prevalence a negative effect (Fig. S3-S8).
The number of biologically relevant and irrelevant predictors, and niche filling, were relevant for present predictions, but became especially influential for the predictive accuracy of models projected into the future, with the number of relevant predictors and niche filling increasing predictive performance, and the number of irrelevant predictors decreasing predictive performance (Fig. S3-S8). An important predictor of the predictive accuracy of future projections was the degree of environmental similarity between the present and future environmental conditions of the study area (Fig. 4, Fig. S3-S8).

Species with high prevalence were more likely to expand and less likely to contract their range. However, a number of additional factors contributed to these estimates (Fig. S9-S12), such as the number of biologically relevant and irrelevant predictors, which showed a positive effect on contraction and expansion estimates in GLM and MaxEnt, and a negative effect on contraction areas in Random Forest (Fig. S10-S12). The environmental similarity between present and future conditions had a negative effect on contraction and expansion areas, but showed non-linearity for contraction areas estimated by GLM and Random Forest models. Violations of the equilibrium and random sampling assumptions also contributed to increased range contraction and expansion estimates (Fig. S10-S12).
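Range contraction and expansion estimates of the kind reported above are derived by comparing binarized present and future predictions cell by cell. A minimal sketch of that bookkeeping (illustrative, not the authors' code):

```python
import numpy as np

def range_change(present, future):
    """Contraction, expansion and stable counts from binarized
    present/future predictions: cells predicted present now but absent
    in the future count as contraction; the reverse as expansion."""
    present = np.asarray(present, dtype=bool)
    future = np.asarray(future, dtype=bool)
    contraction = (present & ~future).sum()
    expansion = (~present & future).sum()
    stable = (present & future).sum()
    return contraction, expansion, stable
```

Because both inputs are thresholded maps, any bias introduced at the binarization step propagates directly into these contraction and expansion estimates.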

Discussion
In this paper we report on common practices in SDM and use this information to assess the effects of these practices on the predictive accuracy of SDMs, and thus on the reliability of future climate-induced range shifts. Our literature review points out that a large part of papers that model species […] show low discrimination ability, whereas under non-optimal settings, predictions may not be better than random. Ultimately, our results suggest that irrespective of the estimated performance, we may be unable to make meaningful future predictions for many species, and even when we can, binarization of models' outputs should be avoided. Based on our results, we elaborate in the following paragraphs on guidelines and recommendations for good modelling practices when fitting SDMs.

An important driver of the variation in model performance is species prevalence (Leroy et al. 2018). […] in the context of species distribution models, as they are defined based on the geographic extent being sampled. Species prevalence over the geographic extent is something we are unaware of in real study cases, and will always be an unknown factor that affects our predictive ability (Leroy et al. 2018). In this sense, we should aim to optimize other model settings that can be controlled, such as the choice of predictors, the sample prevalence, or a biologically plausible geographic extent.

Accuracy metrics can fool us easily and should not be used uncritically to assess the reliability of a model, especially considering that they can provide higher estimates in sub-optimal conditions, as we have shown here (Fig. S1). Some of the problems discussed above arise from the binarization of probabilistic model outputs into suitable and unsuitable areas (e.g. to determine the area of range contraction or expansion) based on a threshold. In fact, the true AUC tends to perform better than the true TSS (Fig.
S1-S2), and the estimated AUC has similar values to those of the true AUC under optimal conditions, whereas the estimated TSS is consistently higher than the true TSS, thus overestimating accuracy. Studies using MaxEnt typically only show AUC values (the standard output of the software), even though model predictions are binarized. Here we show that while AUC is, as expected, highly correlated with TSS, high AUC can correspond to low TSS (Fig. S2). The problem arises from the fact that even when the model performs well, the threshold that maximizes discriminatory ability on the training dataset may not discriminate true presence/absence well, especially under different environmental conditions. Additionally, classical cross-validation is performed using a split-sample approach, but a better and more informative option is to cross-validate on spatially independent samples ([…]). Our results support this recommendation, and actually indicate that binary outputs should never be considered or used to quantify changes in distribution areas. Alternative approaches to summarize the results should be considered, such as looking at trends in predicted probabilities per area.

4.8 Additional sources of uncertainty to be considered

In this study we evaluated the sensitivity of SDM predictions to a number of modelling settings and common violations of SDM assumptions. However, there are additional factors that we did not consider that can further contribute to making predictions less reliable. These include the spatial accuracy of data points in relation to the resolution used (Graham et al.
2008), the taxonomic accuracy of the data points (i.e., species confused with others, especially in citizen science data), and the ambiguous taxonomy of the species, which may lead to merging data for different species or, vice versa, not considering part of the distribution of a species ([…]).

Our study indicates that our ability to predict future species distributions is low on average, and can be low to the point of not being meaningful when conditions are far from optimal, especially when models' predictions are binarized. Hence, SDM-based climate change forecasting must adhere to the highest standards, must be clearly described (Zurell et al. 2020), and the estimated accuracy of models should be interpreted with extreme care, as should the results, especially in relation to the quantification of range shifts, contraction and expansion, and the identification of areas that will be lost or gained. These considerations are also valid (and perhaps more problematic considering the wide temporal window and static niche assumption) in the case of hind-casting to paleoclimates, which is now common in studies focused on refugia and phylogeography (e.g. Svenning et al. 2011). Future research may focus on developing novel approaches to improve, synthesize and communicate SDM projections.

[Fig. 1 flowchart: select continent and study area and sample climatic variables; derive species niche; niche suitability (pre…]