Testing the ability of species distribution models to infer variable importance

Identifying biophysical factors that define species’ niches and influence geographical ranges is a fundamental pursuit of ecology. Frequently, models of species’ distributions or niches are used to infer the importance of range- and niche-defining variables. However, very few—if any—studies examine how reliably distribution and niche models can be used for inference. Here we use a simulation approach to understand the conditions under which species distribution models reliably measure variable importance. Using a set of scenarios of increasing complexity, we explore how well models can be used to 1) discriminate between variables that vary in importance and 2) calibrate the effect of variables relative to an “omniscient” model used to simulate the species. Variable importance was assessed using a sensitivity test in which each predictor was permuted in turn. Importance was inferred by comparing model performance between permuted and unpermuted predictions and by calculating the correlation between permuted and unpermuted predictions. Of five metrics of importance (correlation statistic and AUC each calculated with presences/absences or presences/background sites, plus the Continuous Boyce Index), only the Continuous Boyce Index was capable of indicating absolute (versus relative) variable importance. In simple scenarios with one influential environmental variable with a linear spatial gradient and one uninfluential randomly-distributed variable, models were unable to discriminate reliably between variables under conditions that are typically challenging (low sample size, high prevalence, small spatial extent, coarse spatial data resolution with low spatial autocorrelation, and high collinearity between variables). 
In more complex scenarios with two influential environmental variables, models successfully discriminated between variables when they acted unequally, but overestimated the importance of the stronger variable and underestimated the importance of the weaker variable. When variables had equal influence, models underestimated importance when niche breadth was narrow. Generalized additive models and Maxent had better discrimination accuracy than boosted regression trees. Our work demonstrates that permutation tests can reliably discriminate between variables with different levels of influence but cannot accurately measure the magnitude of influence. The frequency with which distribution and niche models are used to identify influential variables begs further research into methods for assessing variable importance.


Introduction
What environmental factors determine species' ranges and environmental tolerances? The answer to this question remains elusive for most species on Earth despite being crucial for addressing long-standing issues in theoretical ecology (e.g., Francis […]). […] preclude studying range limits across wide geographic scales for many species. Alternatively, environmental limits can be inferred using sensitivity analysis of geographic range and niche models. These ecological niche models (ENMs) and species distribution models (SDMs) are constructed by correlating observations of occurrence with data on environmental conditions at occupied sites.

Niche and distribution models can be evaluated along several dimensions, one of which is the ability to identify the most important variables and estimate their influence accurately (Araújo et al. […]). Given the frequency with which SDMs and ENMs are used in this regard and the impact these studies have on our understanding of the natural world, there is a pressing need to understand the conditions under which they can identify influential variables. However, we are aware of no studies that evaluate how well SDMs and ENMs actually measure variable importance (i.e., model "realism" sensu Araújo et al. 2019).

The ability of an ENM or SDM to infer variable importance will likely be affected by many of the same factors that influence model accuracy, such as sample size, study region extent, species' prevalence, spatial autocorrelation in predictors, resolution of environmental data, correlations between variables on the landscape, and interactions between variables in how they define the niche. Although the effect of these factors on variable importance has received some attention (e.g., Menke […]). The virtual species approach also enables us to isolate specific aspects of study design and the modeling situation that otherwise would be difficult or impossible to change in a real-world setting.
In the following section we provide a brief overview of some aspects of modeling practice that influence the ability of models to measure variable importance. We then present results from 10 "experiments" in which we manipulate practices and common ecological conditions we hypothesized would affect the ability of models to measure variable importance. Finally, we end with a discussion emphasizing future directions. Compared to the abundance of studies assessing the predictive accuracy of models of niches and distributions (e.g., Guisan et al. 1999 […]), the paucity of research on the efficacy of these same models for identifying important variables is a striking oversight.

Variable selection, niches, and distributions
The present work builds upon, yet is distinct from, the critical problem of selecting ecologically […] assumed to be of no importance with zero uncertainty. Thus, both practices are highly relevant to measuring variable importance. However, our focus here is on measuring the influence of variables that are provided to a model, whether or not they "survive" the model selection process or are otherwise downweighted by the model parameterization routine.

The present work also intrinsically relates to, yet is generally unaffected by, the distinction between niche and distribution models (Peterson et al. 2011). Even though we use the term "landscape" to refer to the space from which training and test samples are obtained, our simulations are typically constructed such that this space could represent accessible environmental (niche) space or accessible geographic space and thereby yield models of the niche or distribution, respectively. Obviously, a key focus for further research is distinguishing under what conditions models can be used to infer the importance of niche-limiting versus distribution-limiting variables (Warren et al. bioRxiv).

Variable importance
We assume a variable is "important" if predictions from a model are highly sensitive to changes in values of the predictor (Smith & Santos in prep.). Sensitivity can be summarized using a test statistic such as a metric of model performance (e.g., the area under the receiver operating characteristic curve, or AUC). Robust tests of variable importance should have a high degree of discrimination (qualitatively differentiating variables with different influence) and calibration, the latter of which can be divided into "accuracy" (the central tendency of importance should be similar to the central tendency of importance from a model created with perfect knowledge of a species' requirements) and "precision" (variation around the central tendency matches that of a perfect model).

In addition, ideal test statistics should be bounded within a known range […]
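The permutation procedure described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: importance is read from the correlation between a model's unpermuted predictions and its predictions after each predictor is permuted in turn, and the "model" is a stand-in logistic response to the first predictor only (mirroring the single-TRUE-variable scenario, with β = 2).

```python
# Minimal sketch of permutation importance (not the authors' code).
# Lower correlation between unpermuted and permuted predictions means
# the permuted predictor is more important.
import numpy as np

rng = np.random.default_rng(42)

def predict(X):
    # Stand-in "model": logistic response to the first predictor only (beta = 2)
    return 1.0 / (1.0 + np.exp(-2.0 * X[:, 0]))

def permutation_importance(X, n_reps=30):
    """Correlation between unpermuted and permuted predictions, averaged
    over n_reps permutations (cf. the paper's 30 repeats)."""
    base = predict(X)
    importance = {}
    for j in range(X.shape[1]):
        cors = []
        for _ in range(n_reps):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])
            cors.append(np.corrcoef(base, predict(Xp))[0, 1])
        importance[j] = float(np.mean(cors))
    return importance

X = rng.uniform(-1, 1, size=(1000, 2))  # column 0 = TRUE, column 1 = FALSE
imp = permutation_importance(X)
```

Permuting the TRUE column destroys the signal (correlation near 0), while permuting the FALSE column leaves predictions unchanged (correlation of 1), which is the discrimination pattern sought in Experiment 1.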

General approach
To simplify scenarios and help isolate factors affecting assessment of variable importance, unless otherwise noted, we assume an ideal modeling situation: the species is in statistical equilibrium with its environment; sampling is unbiased with perfect detectability; environmental variables are measured without error and the species interacts with the environment at the spatial and temporal scales at which the environmental data are represented; biotic interactions and disturbances do not influence occupancy of the range or niche; and there is no process-based spatial autocorrelation in the distribution of the species.

The lack of process-based autocorrelation means that there is no hidden process in the "generative" model that would mislead a correlative model attempting to estimate the generative model from data. For example, the simulations include no source/sink dynamics that would cause an otherwise unsuitable site to appear suitable simply because it is populated (Pulliam 2000). […] AUCbg, and CORbg) or 200 absences (for AUCpa and CORpa). We generated 100 sets of presences, absences, and background sites for each combination of factors we wished to manipulate (e.g., differences in prevalence, magnitude of correlation between variables, etc.). Models were trained and tested using each of the 100 data sets. For any given iteration, each algorithm was trained and tested against the same set of calibration and evaluation sites. Permutation tests were repeated 30 times apiece and then averaged for a given treatment (e.g., a specific level of prevalence), set of training and test sites, and algorithm. Landscapes were either square or circular, depending on the scenario.
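The sampling design described above can be sketched as follows. This is a hedged illustration, not the authors' code; the background-site count is an assumed value, and the generative model is the logistic response to a linear TRUE gradient used in the first scenario.

```python
# Hedged sketch (not the authors' code) of drawing presences, absences, and
# background sites from a known probability-of-occurrence surface.
import numpy as np

rng = np.random.default_rng(1)

n_cells = 1024 * 1024                          # square landscape, 1024 cells/side
true_var = np.linspace(-1, 1, n_cells)         # TRUE: linear gradient
p_occ = 1.0 / (1.0 + np.exp(-2.0 * true_var))  # logistic response, beta = 2

occupied = rng.random(n_cells) < p_occ         # probabilistic realization of the range

presences  = rng.choice(np.flatnonzero(occupied),  200, replace=False)
absences   = rng.choice(np.flatnonzero(~occupied), 200, replace=False)
background = rng.choice(n_cells, 10_000, replace=False)  # drawn irrespective of occupancy
```

Because occupancy is a probabilistic draw from the true surface, repeating this sampling 100 times yields the iteration-to-iteration variability in test statistics described for the OMNI benchmark below.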
To provide an objective benchmark against which to judge the results from the three SDM algorithms, we used an "omniscient" (OMNI) model, which uses the same generative model and parameter values used to create the species' distribution (Meynard et al. in press). We evaluated the OMNI models with the same set of test sites used to evaluate the SDMs. Since test sites are drawn probabilistically from the underlying probability distribution describing the species' actual range, there is variability in the test statistics across iterations using OMNI models even though their predictions of the probability of occurrence are deterministic and thus always "correct." Variability in test statistics of OMNI thus reflects only variation in test sites across the 100 data iterations. Variability in test statistics of the other algorithms reflects variation in test and training sites across the same 100 iterations plus differences in the structural forms of the models. For shorthand, we use "SDM" to refer to GAMs, MAX, and BRTs together (not including OMNI).

We adopted a reductionist approach to isolate factors that affect assessment of variable importance by conducting simulation experiments focusing on one aspect of interest at a time. Thus, our first scenario begins with the simplest possible test of methods for assessing variable importance: (1) a species' range is determined by a single "TRUE" variable, but the SDMs are presented with data on this variable plus an uncorrelated "FALSE" variable that has no effect on the distribution. The second experiment (2) explores the effects of training sample size (number of presences). We then investigate the effects of spatial scaling on measuring variable importance by varying: (3) the prevalence of the species; (4) study region extent; and (5) the spatial resolution and autocorrelation of environmental data.
In the next experiment, we explore the outcomes of (6) presenting models with a FALSE variable that is correlated with a TRUE variable. We then consider cases where the species' distribution is shaped by two factors in experiments where (7) […]

In this scenario, the probability of occurrence of the species has a logistic response to a single TRUE variable:

P(occurrence) = 1 / (1 + exp(−β × TRUE))    (Eq. 1)

where β represents the strength of the response (we used β = 2). The TRUE variable has a linear gradient in geographic space, ranging from -1 to 1 across a square landscape 1024 cells on a side (Fig. 1). Given this arrangement, the generative function (Eq. 1) creates a probability of occurrence with an inflection point midway across the landscape (Fig. 1 […]) and an uninfluential FALSE variable that was "mistakenly" presented to the model. Models successfully discriminated between the two variables (no overlap between green and red bars, indicated by asterisks). If a variable is important, then permuting its values will lower model performance, here measured using the Continuous Boyce Index (CBI). "Control" values represent the CBI of the unpermuted model. Bars represent the inner 95th quantile of the distribution of CBI across 100 data iterations of training and test sites. Horizontal lines within each bar represent the median value across the 100 iterations. The OMNI ("omniscient") model serves as a benchmark against which to assess the ability of the other models.

When presenting results, we plot the median and the inner 95% of the values of the test statistics across the 100 data iterations. Since we are utilizing simulations, we know a priori that treatments […] which can be discerned by eye. We reiterate that variation in the responses represents variability arising from random sampling of training (GAMs, MAX, and BRTs) and test sites (all algorithms) across 100 models (i.e., 100 OMNI models, GAMs, Maxent models, or BRTs).
Thus, they represent the range of variation expected across a set of data samples. In real-world situations there is typically just one set of samples. Thus, if we were to implement the procedures for a real species with just one set of data, the outcome would fall somewhere within the range of variation displayed here. Hence, even if a model cannot reliably discriminate between variables in all 100 data iterations in our scenarios, it may still be able to do so for some of them. However, in real-world situations the expected distribution of the test statistic will be unknown, so it will not be possible to know whether the test statistic robustly indicates the importance of a variable. Hence, we define successful discrimination between variables as a complete lack of overlap between the distributions of CBI when the TRUE and FALSE variables are permuted in turn. We define successful calibration by comparing the distributions of the test statistic from the SDM and OMNI: the difference between the upper 97.5th and 2.5th percentiles of the SDM's distribution of CBI had to be within ±10% of the same range of OMNI's distribution of CBI, and the SDM's median CBI had to be within the innermost 20th-percentile range of OMNI's distribution of CBI (i.e., between the 40th and 60th percentiles). For a given case (e.g., the set of test statistics for a control model plus the two sets of permuted predictions), each distribution had to be successfully calibrated relative to OMNI for the case as a whole to be considered successfully calibrated. We found no cases in which CBI, CORpa, or CORbg had successful calibration accuracy, although AUCpa and AUCbg sometimes yielded well-calibrated outcomes when the species was modeled with GAMs or MAX (Appendices 1 and 2).

Results for the OMNI model serve as a benchmark for the other SDMs, so we discuss them first. CBI for OMNI control was slightly <1 (inner 95% quantile range: 0.77 to 0.95; Fig. 2). Even though OMNI correctly recreates the generative model, maximum CBI for OMNI is <1 because of the probabilistic assignation of presence/background sites. Thus, the scenario is in theory not too challenging for SDMs to model accurately. Compared to OMNI, MAX had a comparable median CBI (high accuracy) and a comparable distribution of CBI (high precision). GAMs also had high accuracy but exhibited lower precision. BRTs had the lowest accuracy and precision.

All algorithms had high discriminatory capacity (the ability to qualitatively differentiate between TRUE and FALSE variables). In each case, permuting TRUE yielded CBI values centered on 0, which is indicative of a model that performs no better than random. When TRUE was permuted, the SDMs had slightly less precision (greater variability) than OMNI. When FALSE was permuted, only MAX was precise (same distribution as OMNI). BRTs also slightly overestimated the importance of FALSE (lower accuracy indicated by a lower median CBI compared to OMNI; Fig. 2). BRTs are highly flexible, "local" learners (Elith et al. 2008), which makes them more likely to use information from the FALSE variable. This was evident from examining the BRT-specific measure of variable importance, which is the number of times a variable appears in a model times the deviance reduction associated with its use, standardized to sum to 1 (Elith et al. 2008 […]).

[…] sample sizes (Fig. 3c). BRTs were unable to reliably discriminate TRUE from FALSE and had trouble converging when sample size was ≤32. In summary, we found that small sample size challenged the ability of SDMs to differentiate influential from uninfluential variables. In our scenario all SDMs could reliably discriminate between TRUE and FALSE when sample size was ≥64, although we expect that this is a lower limit for real-world applications. Hereafter we use 200 occurrences for model training.
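The results above are summarized with the Continuous Boyce Index. As a hedged illustration, a simplified Boyce index can be computed as the Spearman correlation between the predicted-to-expected (P/E) ratio of a moving suitability window and the window's midpoint; the window width and count below are illustrative assumptions, and this is not the authors' implementation.

```python
# Simplified sketch of the Continuous Boyce Index (CBI), not the authors' code.
import numpy as np

def boyce_index(pres_suit, bg_suit, n_windows=101, width=0.1):
    mids, pe = [], []
    for a in np.linspace(0, 1 - width, n_windows):
        b = a + width
        p = np.mean((pres_suit >= a) & (pres_suit <= b))  # predicted frequency
        e = np.mean((bg_suit >= a) & (bg_suit <= b))      # expected frequency
        if e > 0:                                         # skip empty classes
            mids.append((a + b) / 2)
            pe.append(p / e)
    # Spearman correlation computed as the Pearson correlation of ranks
    ranks_m = np.argsort(np.argsort(mids))
    ranks_pe = np.argsort(np.argsort(pe))
    return float(np.corrcoef(ranks_m, ranks_pe)[0, 1])

rng = np.random.default_rng(0)
bg = rng.random(5000)           # background suitabilities (uniform)
good = rng.beta(3, 1, 2000)     # presences concentrated at high suitability
cbi = boyce_index(good, bg)     # close to +1 for a well-calibrated model
```

A model whose presences fall disproportionately in high-suitability classes yields a CBI near +1; predictions no better than random yield a CBI near 0, which is why permuting an influential variable drives CBI toward 0 in the experiments.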
Spatial scale: Prevalence, extent, and spatial resolution

Experiment 3: Prevalence
Here we explore the effects of prevalence (mean probability of occurrence; i.e., range size) on assessments of variable importance. As in the previous scenarios, we used a square landscape with a TRUE variable with a linear gradient over the range [-1, 1] and a randomly distributed FALSE variable, also with the range [-1, 1]. The species responds to TRUE as per a logistic function:

P(occurrence) = 1 / (1 + exp(−β1 × (TRUE − β11)))    (Eq. 2)

The generative function includes an offset parameter β11, which shifts the range across the landscape, thereby altering prevalence (Fig. 4). This allows us to alter prevalence while not changing study region extent, which is almost impossible in real-world situations. We chose values of β11 that altered prevalence in 9 steps across a range spanning ~0.05 to 0.95. The strength of the species' response to TRUE, β1, was set equal to 2 as in Experiments 1 and 2. […]

[…] GAMs and MAX did not notably decline at large extents. We believe this is due to GAMs and MAX actually underfitting the data, wherein areas of very low probability of occurrence are predicted to be disproportionately higher in suitability, thereby mistakenly predicting that improbable presences belong to the set of more probable presences. Like OMNI, the SDM algorithms were also unable to discriminate between TRUE and FALSE at small extents. Discriminatory capacity was unreliable when the range of TRUE was ≤0.5 (GAMs) or ≤1 (MAX and BRTs; Fig. 7). GAMs and BRTs often did not converge at small extents. In summary, small spatial extents that span shallow environmental gradients (Fig. 6) engender little variation in the probability of presence. As a result, models have great difficulty differentiating TRUE from FALSE variables.
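The role of the offset parameter β11 in the generative function can be illustrated numerically: shifting the inflection point across (or off) the landscape changes prevalence while the extent and gradient stay fixed. The β11 values below are illustrative, not the nine values used in the experiment.

```python
# Hedged illustration of how the offset beta11 shifts the species' range
# across a fixed-extent landscape, changing prevalence (not the authors' code).
import numpy as np

true_var = np.linspace(-1, 1, 1024 * 1024)  # linear gradient, fixed extent

def prevalence(beta11, beta1=2.0):
    p = 1.0 / (1.0 + np.exp(-beta1 * (true_var - beta11)))
    return float(p.mean())

low, mid, high = prevalence(1.5), prevalence(0.0), prevalence(-1.5)
# Shifting the inflection point toward (or past) a landscape edge pushes
# prevalence toward 0 or 1 without altering study region extent.
```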

Experiment 5: Spatial resolution and spatial autocorrelation of environmental data
In this experiment we explore the effects of spatial autocorrelation and spatial resolution (grain size) of environmental variables on estimates of predictor importance. As before, we use a square landscape with a TRUE variable with a linear gradient and a FALSE variable with random values. We assumed the species responded to the environment at a scale represented by dividing the landscape into 1024 cells on a side (the "native" resolution). As before, we constructed the TRUE variable so it had a linear gradient spanning the landscape. When unperturbed, this linear gradient has a high degree of spatial autocorrelation because cells close to one another have similar values. Thus, we manipulated spatial autocorrelation by randomly swapping a set proportion of cell values (no cells, or one third, two thirds, or all of the cells). Swapping values diminished spatial autocorrelation in TRUE because cells with dissimilar values were more likely to be close to one another (Fig. 8). (By design, FALSE had a level of spatial autocorrelation no different from random.)

To explore the effects of spatial resolution, we applied bilinear interpolation to resample cells to a finer resolution, so that the landscape had 16,384 cells on a side, or to a coarser resolution, with 64 cells on a side. Resampling to a finer resolution caused spatial autocorrelation in TRUE and FALSE to increase slightly because smaller cells on the edge of large cells were spatially averaged with fine cells on the edge of adjacent large cells (Fig. 8). Resampling to a coarser resolution also caused spatial autocorrelation in TRUE and FALSE to increase because aggregation of smaller cells into larger ones "smooths" over the values of the smaller cells. This effect was especially pronounced when a large proportion of coarse cells had been swapped (Fig. 8).
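The cell-swapping manipulation can be sketched as follows. This is a hedged stand-in, not the authors' code: autocorrelation is summarized by the lag-1 neighbor correlation along rows rather than by a formal measure, and the landscape is shrunk for speed.

```python
# Hedged sketch of swapping a proportion of cell values to lower spatial
# autocorrelation in a gradient landscape (not the authors' code).
import numpy as np

rng = np.random.default_rng(7)

def swap_cells(grid, proportion):
    flat = grid.ravel().copy()
    idx = rng.choice(flat.size, int(proportion * flat.size), replace=False)
    flat[idx] = rng.permutation(flat[idx])  # shuffle values among chosen cells
    return flat.reshape(grid.shape)

def lag1_corr(grid):
    # correlation between each cell and its right-hand neighbor
    return float(np.corrcoef(grid[:, :-1].ravel(), grid[:, 1:].ravel())[0, 1])

side = 128  # small landscape for illustration (the paper uses 1024)
gradient = np.tile(np.linspace(-1, 1, side), (side, 1))  # TRUE: linear gradient

acs = [lag1_corr(swap_cells(gradient, p)) for p in (0, 1/3, 2/3, 1)]
# Autocorrelation declines monotonically as more cells are swapped.
```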
The species' probability of presence was generated using the landscape at its native resolution (possibly with cells swapped). Training and test sites were then selected using the landscape at the native resolution but assigned cell values based on the landscape at the (possibly) resampled resolution. This recreates a realistic situation in which the occurrences representing a species' response to the environment are indicative of the scale of the true response, but the environmental data used to predict the response are available at a (potentially) different resolution. We created 12 scenarios representing all possible combinations of grain size (cell size of 1/64th, 1/1024th, and 1/16,384th of the landscape's linear dimension) and spatial autocorrelation (swapping no cells, or one third, two thirds, or all of the cells; Fig. 8).

Figure 8. Landscapes for testing the effects of grain size (spatial resolution) of data and spatial autocorrelation in Experiment 5. The species was assumed to respond to the environment at a "native" grain size of 1/1024th of the linear dimension of the landscape. The environmental data were resampled to a resolution of 1/16,384th (smaller cells) or 1/64th (larger cells) of the landscape's linear dimension. Spatial autocorrelation in the environment mediates the effect of spatial resolution, so we varied autocorrelation for each variable by randomly swapping cell values among none, 1/3, 2/3, or all of the cells. Increasing the proportion of cells swapped reduces spatial autocorrelation in the TRUE variable because cells of similar value are less likely to be close to one another. The left three columns represent the TRUE variable and the right three columns the FALSE variable. Each TRUE was paired with the FALSE with the same level of spatial autocorrelation and grain (e.g., the TRUE in the upper left was paired with the FALSE to the immediate right of it).
Values range from -1 (brown) to 1 (blue).

In this experiment, the "true" importance of TRUE and FALSE is indicated by results from OMNI at the "native" resolution of 1/1024th (middle column in any panel in Fig. 9). Results for OMNI at the other resolutions represent the outcome that would be obtainable if a modeler had perfect knowledge of the species' response to the environment but only had data available at finer or coarser resolutions. At the native resolution, results for OMNI and the SDMs were comparable, although BRTs tended to have lower accuracy and precision compared to GAMs and MAX (Fig. 9c). Regardless of the level of spatial autocorrelation, downscaling to the finer resolution of 1/16,384th (left column in each panel of Fig. 9) yielded results very similar to those at the native resolution. Similarly, upscaling to 1/64th resolution when spatial autocorrelation was highest (proportion of cells swapped = 0; upper right of any panel in Fig. 9) was very similar to modeling at the native resolution. However, reducing the level of spatial autocorrelation (going from upper right to lower right in any panel in Fig. 9) decreased the performance of OMNI controls. When spatial autocorrelation was lowest, OMNI control performance was on average no different from random. Permuting either TRUE or FALSE with OMNI reduced CBI to ~0 in both cases, meaning that OMNI could not reliably discriminate between the variables in this situation. Likewise, none of the SDMs dependably discriminated TRUE from FALSE when spatial resolution was coarser than the scale at which the species perceives the environment and spatial autocorrelation was low (proportion swapped 2/3 or 1). Tellingly, very few GAM and BRT models converged under these circumstances.
In summary, we found that downscaling did not affect the ability of SDMs to measure variable importance, and this was not affected by the level of spatial autocorrelation in the environment. Splitting cells into smaller units has a negligible effect on cell values, so we did not expect a great difference between results at the native and downscaled resolutions. Similarly, upscaling when spatial autocorrelation was high also did not challenge models. In this situation, the species […]

Experiment 6: Collinearity between environmental variables
In this experiment we explore the effects of model misspecification when a FALSE predictor is collinear (correlated) with a TRUE predictor. As before, the species has a logistic response (Eq. 1) to TRUE, which has a linear gradient across the landscape. FALSE also has a linear trend, which is rotated relative to the gradient in TRUE to affect the correlation between the variables. To ensure no change in the univariate frequencies of TRUE and FALSE with rotation, we used a circular landscape (Fig. 10). The two variables are uncorrelated when their relative rotation is 90° (Fig. 10c), and the correlation is positive or negative when the rotation is less than or more than 90°, respectively (Fig. 10). […] (Fig. 11b and c). Overall, GAMs were more robust to collinearity and model misspecification than MAX, which was in turn slightly more robust than BRTs. There was no strong trend in differences between positive and negative correlation (cf. Mela […]).

[…] Niche width in T1 and T2 is determined by σ1 and σ2. Importantly, decreasing σi increases the degree to which a variable restricts the distribution, meaning that σi and variable importance are inversely related. Changing σ1 and σ2 thus also alters the distribution of the species (Fig. 12b). We explored all possible combinations of the values 0.1, 0.3, and 0.5 for σ1 and σ2. Altering niche width has the unfortunate side effect of changing prevalence. Based on results from Experiment 3, we do not expect to see substantial biases until prevalence is ≥~0.5 (Fig. 5), but in none of the current scenarios did prevalence rise above 0.570 (which occurs when niche width is widest, at σ1 = σ2 = 0.5). We used a circular landscape and allowed T1 and T2 to vary in a linear fashion across the landscape. […] Larger values of σi connote broader niche width and thus less restriction by the respective variable. As a result, reducing σi causes the variable to restrict the distribution more (i.e., the variable is more important).
In Experiment 7 we varied niche width while using a 90° rotation between T1 and T2 so they were uncorrelated (panel b). In Experiment 8 we varied the correlation between T1 and T2 from -0.71 to 0 to 0.72 by rotating T2 relative to T1 on the landscape (panels a and c). The range of values for the species' maps is [0, 1], with darker/greener shades indicating greater probabilities of occurrence. There is no interaction between T1 and T2 in Experiments 7 through 9 (ρ = 0).

Figure 13. In Experiments 7 through 10, GAMs underestimate importance when variables act equally to shape the niche and niche width is narrow or intermediate. When variables act unequally, GAMs overestimate the importance of the more influential variable (narrow niche width, low σi) and underestimate the importance of the less influential variable (broad niche width, high σi). Aside from these considerations, neither correlation between predictors (r) nor niche covariance (ρ) has strong effects on estimates of importance. Each panel corresponds to a particular landscape (correlation between predictors) and niche topology (niche covariance structure). Within a panel, subplots contain bar plots of control and permuted predictions from OMNI and GAMs. Smaller niche width corresponds to more restriction by the respective variable (and thus more importance), so permuting the variable should yield lower CBI. In each panel, a subplot for a given value of σ1 and σ2 can be interpreted as explained in Box 1. Asterisks in the lower left of a subplot indicate that both OMNI and GAMs successfully discriminate between T1 and T2.

Figure 14. In Experiments 7 through 10, MAX underestimates importance when variables act equally to shape the niche and niche width is narrow or intermediate. When variables act unequally, MAX overestimates the importance of the more influential variable (narrow niche width, low σi) and underestimates the importance of the less influential variable (broad niche width, high σi).
Aside from these considerations, neither correlation between predictors (r) nor niche covariance (ρ) has strong effects on estimates of importance. Each panel corresponds to a particular landscape (correlation between predictors) and niche topology (niche covariance structure). Within a panel, subplots contain bar plots of control and permuted predictions from OMNI and MAX. Smaller niche width corresponds to more restriction by the respective variable (and thus more importance), so permuting the variable should yield lower CBI. In each panel, a subplot for a given value of σ1 and σ2 can be interpreted as explained in Box 1. Asterisks in the lower left of a subplot indicate that both OMNI and MAX successfully discriminate between T1 and T2.

Figure 15. In Experiments 7 through 10, BRTs underestimate importance when variables act equally to shape the niche and niche width is narrow or intermediate. When variables act unequally, BRTs overestimate the importance of the more influential variable (narrow niche width, low σi) and underestimate the importance of the less influential variable (broad niche width, high σi). Aside from these considerations, neither correlation between predictors (r) nor niche covariance (ρ) has strong effects on estimates of importance. Each panel corresponds to a particular landscape (correlation between predictors) and niche topology (niche covariance structure). Within a panel, subplots contain bar plots of control and permuted predictions from OMNI and BRTs. Smaller niche width corresponds to more restriction by the respective variable (and thus more importance), so permuting the variable should yield lower CBI. In each panel, a subplot for a given value of σ1 and σ2 can be interpreted as explained in Box 1. Asterisks in the lower left of a subplot indicate that both OMNI and BRTs successfully discriminate between T1 and T2.
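Experiments 7 through 10 generate the species from a bivariate Gaussian niche in T1 and T2. The authors' exact generative function (Eq. 3) is not reproduced in this section, so the sketch below is an assumed standard bivariate Gaussian, with optimum at T1 = T2 = 0, niche widths σ1 and σ2, and niche covariance ρ; it matches the qualitative behavior described (smaller σ means a narrower niche and stronger restriction of the distribution).

```python
# Hedged sketch of a bivariate Gaussian niche like that used in Experiments
# 7-10; the functional form and optimum location are assumptions, not Eq. 3.
import numpy as np

def p_occurrence(t1, t2, sigma1, sigma2, rho=0.0):
    z = (t1**2 / sigma1**2
         - 2 * rho * t1 * t2 / (sigma1 * sigma2)
         + t2**2 / sigma2**2)
    return np.exp(-z / (2 * (1 - rho**2)))

t = np.linspace(-1, 1, 201)
T1, T2 = np.meshgrid(t, t)

# Smaller sigma -> narrower niche -> stronger restriction of the distribution
# (and hence greater variable importance); prevalence falls accordingly.
narrow = p_occurrence(T1, T2, 0.1, 0.1).mean()
broad = p_occurrence(T1, T2, 0.5, 0.5).mean()
```

Setting ρ ≠ 0 makes the variables interact in shaping the niche (Experiments 9-10), while ρ = 0 gives the additive case discussed below.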
In the current experiment we examine scenarios in which there is no correlation (r = 0) between predictors and the variables interact additively to shape the niche (ρ = 0; middle panel in each of Figs. 13-15). To structure the discussion, we first focus on cases where T1 and T2 have equal influence on the niche (σ1 = σ2, displayed along the upper right-to-lower left diagonal of Figs. 13e, 14e, and 15e). Along this sequence, niche breadth decreases from broad (σ1 = σ2 = 0.5) to moderate (σ1 = σ2 = 0.3) to narrow (σ1 = σ2 = 0.1). OMNI control had high performance across all combinations of niche width in T1 and T2 (median CBI ranging from 0.89 to 0.93). GAM and MAX controls performed comparably to OMNI controls across the range of niche breadths. BRTs were also comparable to OMNI control for narrow and moderate niche breadths, but had low precision when niche width was widest (σ1 = σ2 = 0.5).

Continuing our focus on cases with symmetrical niche breadth (σ1 = σ2; Figs. 13e, 14e, and 15e), we recall that narrower breadth increases the restraint imposed by the variables. Thus, permuting a variable for which niche breadth is narrow should reduce CBI more than when breadth is large. The SDMs did not accurately indicate the importance of the variable when niche width was moderate or narrow (σ = 0.1 or 0.3). In fact, reducing niche width often reduced the estimates of variable importance. We can see this for any of the SDMs in Figs. 13e, 14e, or 15e by going along the diagonal from the upper right to the lower left, along which the niche changes from broad to narrow. Permuting T1 along this series reduces median CBI for the OMNI model from 0.68 to 0.63 to 0.46, respectively. However, none of the SDMs displayed comparable trends: for GAMs permuted CBI was 0.52, 0.90, and 0.90, respectively; for MAX 0.68, 0.76, and 0.64; and for BRTs 0.62, 0.89, and 0.83.

(Permuting T2 yields similar discrepancies.) Compared to OMNI treatment models, the SDMs always underestimated the importance of T1 and T2 when niche breadth was moderate or narrow. Despite the failure to accurately estimate importance, MAX had the highest accuracy (median values closest to the respective median values of OMNI) and precision (lowest variability).

Now we turn our attention to cases with asymmetrical influence by T1 and T2 (σ1 ≠ σ2). Again, OMNI control had reliable performance in these cases, and permuting T1 or T2 caused CBI to decrease monotonically as niche breadth in the respective variable became narrower. All three SDMs successfully discriminated between T1 and T2 (i.e., there was no overlap between the distributions of permuted CBI for T1 and T2). However, all three SDMs tended to overestimate the importance of the more restrictive variable while underestimating the importance of the less restrictive variable. Again, MAX had greater relative accuracy and precision than GAMs and BRTs. Estimates of importance tended to be more uncertain for the more influential variable compared to the less influential variable.

In summary, SDMs successfully discriminated between two variables of unequal influence.

Experiment 8: Two correlated, non-interacting variables

In this experiment the niche remained bivariate Gaussian with additive effects of T1 and T2 (Eq. 3), so we constrain our discussion to panels in the middle row of Figs. 13 through 15. The results for OMNI and the SDMs were generally qualitatively the same as in Experiment 7, in which variables were uncorrelated on the landscape. As before, when niche breadth was equal and either moderate or narrow, the SDMs underestimated the importance of the two variables. When niche breadth was equal and broad, the SDMs accurately estimated variable importance, albeit with low precision.
When niche width was unequal, the SDMs reliably discriminated between the variables, but overestimated the importance of the more influential variable and underestimated the importance of the less influential variable. Again, MAX had greater accuracy and precision than GAMs and BRTs. In summary, collinearity between variables had little qualitative effect on the SDMs' discriminatory capacity and did not worsen calibration capacity.

Experiment 9: Two uncorrelated, interacting variables
Here we modify the bivariate niche described in Eq. 3 by considering interactions between variables in how they determine environmental preferences (probability of occurrence), a phenomenon we call "niche covariance." The modified model is bivariate Gaussian and includes a term ρ specifying the sign and strength of niche covariance between the variables:

P(presence) = exp{-[(T1 - μ1)²/σ1² - 2ρ(T1 - μ1)(T2 - μ2)/(σ1σ2) + (T2 - μ2)²/σ2²] / [2(1 - ρ²)]}

Again, σ1 and σ2 represent the degree of limitation imposed by T1 and T2, the effect of which is mediated by the magnitude of ρ.

In this experiment we restrict our investigation to cases in which there is no correlation between the variables on the landscape (r = 0). Varying the covariance parameter ρ alters the orientation of the species' distribution in niche space and on the landscape even when variables are uncorrelated. Negative covariance (ρ < 0; Fig. 16a) causes an increase in one variable and a decrease in the other to increase the probability of presence. Positive covariance (ρ > 0; Fig. 16c) causes an increase in both variables or a decrease in both variables to increase the probability of presence. Lack of niche covariance (ρ = 0; Fig. 16b) causes variables to shape the niche only additively, so favorable values of one variable can offset unfavorable values of another.

Figure 16. Range maps for Experiments 9 and 10 in which two TRUE variables (T1 and T2) interact to define the niche. In all cases shown here T1 and T2 were distributed orthogonally to one another on the landscape, so were uncorrelated even though they interact to shape the niche (cf. Fig. TRUEvsFALSEa and c). Parameter ρ determines the interaction between T1 and T2 and was varied from -0.5 to 0 to 0.5. The degree of limitation imposed by T1 and T2 is determined by σ1 and σ2 plus the magnitude and sign of ρ. Panel (b) is the same as Fig. 12b but repeated here to aid visual comparison. The predicted probability of presence ranges from 0 (white) to 1 (dark green).
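Assuming the standard bivariate Gaussian form for the covariance-modified niche (the function and parameter names below are illustrative placeholders, not the simulation code used in this study), the effect of ρ can be sketched as:

```python
import math

def prob_presence(t1, t2, mu1=0.0, mu2=0.0, s1=0.3, s2=0.3, rho=0.0):
    """Probability of presence under a bivariate Gaussian niche with a
    covariance term rho linking the two variables T1 and T2."""
    z = ((t1 - mu1) ** 2 / s1 ** 2
         - 2 * rho * (t1 - mu1) * (t2 - mu2) / (s1 * s2)
         + (t2 - mu2) ** 2 / s2 ** 2)
    return math.exp(-z / (2 * (1 - rho ** 2)))

# With negative covariance (rho < 0), an increase in one variable paired with
# a decrease in the other yields a higher probability of presence than an
# equal increase in both variables.
p_opposite = prob_presence(0.2, -0.2, rho=-0.5)  # T1 up, T2 down
p_together = prob_presence(0.2, 0.2, rho=-0.5)   # both up
```

With ρ = 0 the exponent separates into independent terms in T1 and T2, recovering the additive niche of Eq. 3, in which favorable values of one variable can offset unfavorable values of the other.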
Results for this experiment are displayed across the middle panels of Figs. 13-15. OMNI and the SDMs responded in a manner qualitatively similar to that in Experiments 7 and 8. Namely, SDMs reliably discriminated between variables when they had unequal influence, but otherwise had low calibration accuracy.

Experiment 10: Niche covariance with collinear variables
This experiment combines the scenarios in Experiments 7 through 9, varying niche breadth, correlation between predictors, and niche covariance (corner panels of Figs. 13-15). Again, we found that OMNI control and the SDM controls usually performed well and had reliable discriminatory accuracy, but estimated variable importance poorly when niche breadth was equal. When variables acted unequally, the SDMs underestimated the importance of the weaker variable and overestimated that of the stronger variable.

Results using other test statistics
Test statistics other than CBI can be used with the permute-after-calibration test to measure variable importance, including AUC calculated with presences and absences (AUCpa) or presences and background sites (AUCbg), and the correlation between control and treatment predictions calculated with presences and absences (CORpa) or with presences and background sites (CORbg). In contrast to CBI, however, these statistics can only be used to compare the relative importance of variables within the same scenario (i.e., same species in the same region and time period).
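The correlation-based statistics can be sketched as follows (an illustrative placeholder, not the study's implementation): COR compares control and permuted (treatment) predictions at the same evaluation sites, with low correlation implying high importance of the permuted variable.

```python
import numpy as np

def cor_statistic(pred_control, pred_permuted):
    """Correlation between control (unpermuted) and treatment (permuted)
    predictions at the same evaluation sites: presences plus absences for
    CORpa, or presences plus background sites for CORbg. A value near 1
    means permutation barely changed the predictions, implying the
    permuted variable had little influence on the model."""
    return float(np.corrcoef(pred_control, pred_permuted)[0, 1])
```

Because correlation is unitless and relative to the control predictions, a COR value for one model cannot be compared directly against a benchmark of absolute importance, only against COR values for other variables in the same scenario.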

Discussion
In this study we used a reductionist approach to test the ability of species distribution models and ecological niche models to measure the influence of variables (results summarized in Table 1). In general, situations that are challenging for generating accurate models are also challenging for generating well-calibrated measures of variable importance (i.e., estimates that have the same central tendency and distribution as an "omniscient" model with the same form used to simulate the species). Challenging conditions include cases with small sample size, high prevalence, small spatial extent (low environmental variability), and environmental data that are coarser than the perceptual scale of the species when fine-scale variation in the environment is high. We also found that the permute-after-calibration test (Breiman 2001) for SDMs cannot produce well-calibrated estimates of importance when variables act simultaneously to shape the distribution. However, the tests were often able to discriminate between variables acting unequally even when they had low calibration accuracy. In summary, the ability to measure variable importance reliably was affected by aspects related to 1) study design, 2) spatial scale, 3) the landscape, and 4) the niche. We structure the discussion of results around these four themes.
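The permute-after-calibration test referred to above can be sketched as a short procedure (a minimal illustration with placeholder names; the study itself evaluated permuted predictions with CBI, AUC, and COR rather than this toy correlation metric):

```python
import numpy as np

rng = np.random.default_rng(42)

def permutation_importance(predict, X, y, metric, n_iter=20):
    """Permute each predictor in turn and record the drop in performance
    relative to the unpermuted (control) prediction. A larger drop implies
    a more important variable."""
    control = metric(y, predict(X))
    drops = {}
    for j in range(X.shape[1]):
        vals = []
        for _ in range(n_iter):
            Xp = X.copy()
            rng.shuffle(Xp[:, j])  # break the association between variable j and y
            vals.append(control - metric(y, predict(Xp)))
        drops[j] = float(np.mean(vals))
    return drops

# Toy example: the "model" responds only to the first variable.
X = rng.normal(size=(500, 2))
y = X[:, 0]
corr = lambda obs, pred: np.corrcoef(obs, pred)[0, 1]
imp = permutation_importance(lambda Z: Z[:, 0], X, y, corr)
# imp[0] is large (permuting the influential variable destroys performance);
# imp[1] is ~0 (the model ignores the uninfluential variable).
```

Note that, as in the experiments above, the permuted metric measures how much the fitted model relies on a variable, which need not equal the variable's true influence on the simulated species.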

Study design and variable importance
Differences between algorithms.-The primary distinction between SDM and ENM algorithms is the manner in which they attempt to delineate the multi-dimensional environmental space inhabited by a species (Guisan & Zimmermann 2000). Algorithms differ in their ability to fit smooth (GAMs) versus "step-like" (BRTs) functions, or a mixture thereof (MAX). Algorithms also differ in their ability to capture interactions between variables (MAX: two-way; BRTs: potentially highly complex). As a result, the manner in which an algorithm estimates the niche will affect its capacity to measure variable importance.

We found that in simple scenarios with one influential and one uninfluential variable (Experiments 1 through 6), GAMs often had better discrimination and somewhat better calibration than MAX or BRTs, even though they sometimes did not converge under challenging situations (e.g., low sample size; Fig. 3a and b). In scenarios with two influential variables (Experiments 7 through 10), the three SDMs had similar discrimination accuracy but MAX had somewhat better calibration capacity (mean values closer to mean values of the OMNI model and lower variability around the central tendency). BRTs had the poorest discrimination and calibration accuracy across all scenarios despite the use of standard parameterization procedures to generate well-calibrated models (Appendix 5). Troublingly, some algorithms (especially BRTs) invariably used information from the uninfluential variable. This is particularly concerning because in many cases these same models (when using unpermuted variables) often yielded values of CBI that were still within the range indicative of "reliable" performance (i.e., were still seemingly accurate).

The SDMs estimated accurate response functions under the simplest situation (Fig. 2). However, small sample size (Fig. 3) or insufficient differentiation between occurrences and background sites (Fig. 6) quickly eroded models' ability to estimate the real functional response. In some cases, even the omniscient model was not able to effectively differentiate between influential and uninfluential variables (e.g., high prevalence in Fig. 5). The omniscient model does not use calibration data, so in this case the failure of the model is due solely to lack of differentiation between the presence and background sites (or presence and absence sites) used to evaluate the model. SDMs are even more challenged than the omniscient model since they must estimate the response to the environment from potentially inadequate data and then assess importance using similarly inadequate data.

Clearly, more research needs to be conducted on how well different SDM algorithms assess variable importance. Additionally, a small but growing number of methods that do not rely on SDMs can be used for assessing the effects of variables on distributions (Greenberg et al.).

Spatial scale and variable importance

Using cells with coarse resolution generally increases prevalence because a species need only occur in a portion of a cell to be scored as "present" in the entire cell. As a result, it is extremely difficult to disentangle how scale affects the ability of models to measure variable importance.

We used simulations to assess the independent effects of prevalence, study region extent, spatial resolution of predictors, and spatial autocorrelation in environmental variables in a manner that isolates the effects of each. We found that models were most reliable when prevalence was <0.5 (Fig. 5) and when study region extent was "intermediate" or "large" relative to the size of the geographic gradient across which the species responded to its environment (Fig. 7).
We also found that models performed reliably when trained on data with a spatial resolution different from the perceptual scale of the species, except when the spatial resolution of predictors was coarse and spatial autocorrelation in the environment was low (Fig. 9).

As noted, nearly all of these aspects of scale are under the direct control of the modeler, or at least directly affected by the modeler's decisions, but nearly all are tightly interdependent. Thus, our recommendations need to be interpreted with regard to how decisions over one aspect of scale affect the others. We recommend using study regions that encompass the entire species' range plus enough extra area to ensure the species resides in less than half the region. Current best practices in distribution modeling recommend that the region from which background/absence sites are drawn should represent only the area that is "accessible" to the species through dispersal. However, the accessible area may or may not be large enough to ensure the species occupies less than half the area. Thus, modelers may be faced with a difficult decision. We expect that the "accessible area" rule will be more applicable when the goal of a study is to delineate a niche and/or predict its potential future geographic distribution, whereas our "less than half" rule will be more applicable when the goal is to measure variable importance. We also recommend delineating study regions that are large enough to cover broad environmental gradients, or at least selecting predictors that display sufficient variation across the study region. Smaller extents and predictors with low variability across samples do not offer sufficient variation to reliably differentiate occurrences from background/absences and therefore confound attempts to measure variable importance (Fig. 7).
Finally, we recommend using environmental data at a resolution that matches the scale of perception of the species.

other variables. Likewise, we did not explore how niche width interacts with