Top ten hazards to avoid when modeling species distributions: a didactic guide of assumptions, problems, and recommendations

Species distribution models, also known as ecological niche models or habitat suitability models, have become commonplace for addressing fundamental and applied biodiversity questions. Although the field has progressed rapidly regarding theory and implementation, key assumptions are still frequently violated and recommendations inadvertently overlooked. This leads to poor models being published and used in real-world applications. In a structured, didactic treatment, we summarize what in our view constitute the ten most problematic issues, or hazards, negatively affecting implementation of correlative approaches to species distribution modeling (specifically those that model suitability by comparing the environments of a species’ occurrence records with those of a background or pseudoabsence sample). For each hazard, we state relevant assumptions, detail problems that arise when violating them, and convey straightforward existing recommendations. We also discuss five major outstanding questions of active current research. We hope this contribution will promote more rigorous implementation of these valuable models and stimulate further advancements.

Species distribution models (SDMs) are frequently applied to issues of importance to society, from agriculture and public health to biodiversity management and conservation, including the cross-cutting effects of climate change (Guisan et al. 2014, Willis et al. 2015, Johnson et al. 2019). However, key principles and procedures frequently remain underappreciated or misapplied (Morales et al. 2017, Araújo et al. 2019, Andrade et al. 2020). This can lead to unreliable models and erroneous interpretations being published or used as inputs for additional downstream analyses, which increasingly include valuable studies that address applied real-world problems (Brown et al. 2016, Fordham et al. 2018, Briscoe et al. 2019, Reid et al. 2019, Zurell et al. 2020a, Tuia et al. 2022).
This situation stems from several interrelated factors. First, SDMs have seen explosive use and development since around the year 2000, creating an enormous literature (Araújo et al. 2019, Feng et al. 2019a). Second, semi-automated software makes analyses quick and easy to implement, which can lure users into building models without considering the underlying principles carefully (Joppa et al. 2013, Merow et al. 2014, Escobar and Craft 2016, Sillero and Barbosa 2021). Especially in the era of 'big data', this can be exacerbated in studies that model hundreds of species via workflows that employ automated code-based pipelines, which is often needed to achieve broad assessments of biodiversity (Brown et al. 2015, Gomes et al. 2019, Merow et al. 2022). Finally, SDMs commonly form part of interdisciplinary studies (e.g. combining demographic, molecular, or epidemiological approaches; Perktaş et al. 2017, Guevara et al. 2018a, Bonfim et al. 2019, González-Serna et al. 2019), where the research team may not include an SDM expert. As a consequence of these factors, researchers often struggle to follow recommendations from the massive, disparate literature that has accumulated for individual aspects of modeling. Indeed, a quantitative analysis of randomly selected SDM studies with an applied biodiversity focus published between 1995 and 2015 found that 46% of them were deficient regarding at least one key issue (Araújo et al. 2019). Perhaps most disturbingly, the evaluation of model assumptions saw a decrease in quality over that time period, suggesting that enormous growth in the use of these techniques unfortunately had been accompanied by lessened consideration of the field's conceptual foundations.
To help rectify this situation, here we highlight what in our view constitute the ten most problematic hazards negatively affecting SDM implementation - each with existing solutions. To varying degrees, these issues have been addressed elsewhere, although often independently (Araújo and Peterson 2012, Jiménez-Valverde et al. 2013, Cooper and Soberón 2018, Yates et al. 2018, Cobos et al. 2019a, Qiao et al. 2019, Warren et al. 2020). Fortunately, some synthetic books (Franklin 2010, Peterson et al. 2011, Guisan et al. 2017), shorter reviews (Elith et al. 2011, Anderson 2012, Merow et al. 2014, Jarnevich et al. 2015, Beery et al. 2021, Sillero et al. 2021, Franklin 2023), and proposed standards for modeling and reporting exist (Araújo et al. 2019, Feng et al. 2019a, Sofaer et al. 2019, Zurell et al. 2020b, Fitzpatrick et al. 2021). Nevertheless, the book-level syntheses are long and include mathematical and statistical formalizations, making it difficult for some readers to extract key information and deterring others from attempting to do so. On the other hand, the shorter reviews each consider only a few key issues and sometimes are highly technical, requiring several papers to cover key topics and again limiting the audience. Finally, although comprehensive in breadth, the proposed standards do not go into great depth of explanation, emphasizing key principles and recommendations more than guiding the reader's understanding of the problems that can occur if they are not followed. Complementary to these resources, here we present a concise, structured treatment of ten hazardous, frequently misunderstood issues (and ways that they interact), avoiding mathematical formulations to produce a didactic guide for readers with a wide range of exposure to these techniques. Additionally, we include a glossary of some important terms (Table 1), which appear in italic at first mention. We hope that this contribution will help a broad set of researchers reach a deeper understanding of these issues and prove useful to them as authors, reviewers, and editors. We also provide ample citations, constituting a rich resource for readers who want to delve deeper into the literature.
In this guide, we consider the most common implementation of correlative SDMs, where absence data do not exist and the researcher aims to use records of the species' presence (presence-only data) to model the environmental conditions (and geographic areas) suitable for it. Specifically, we focus on models that compare environmental conditions for records of the species' presence versus those of the background (the study region or a sample of it) or pseudoabsences (places within the study region that lack records of the species; Anderson 2012, Fig. 1). All the issues we cover hold true regardless of modeling algorithm; additionally, many are also applicable to implementations that exclusively use presence data (Booth et al. 2014), incorporate information regarding absences (Guisan et al. 2017), or model more than one species simultaneously (Pollock et al. 2014, Poggiato et al. 2021). Also, the issues we present (or an equivalent analog) are relevant to integrated modeling approaches that combine distinct data types (e.g. presence-only, presence-absence, and abundance information; Fletcher et al. 2019, Miller et al. 2019, Isaac et al. 2020, Kays et al. 2022a).
We order the ten key hazards according to the step in which they arise (data gathering and processing, model building, model evaluation, and model interpretation; i.e. not according to importance or frequency in the literature; Table 2, Fig. 2). For each hazard, we state relevant assumptions, detail problems that arise when violating them, and summarize existing recommendations. In most entries, we explain one or more key interactions with other specific hazards, using the syntax 'interaction with Hazard X'; in contrast, we denote simple references to another hazard by the wording 'Hazard Y'. While we strive for general explanations, we also provide some examples for particular algorithms (especially the commonly used maximum entropy approach MaxEnt; Phillips et al. 2006). Given the increasing number of studies modeling many species individually via automated pipelines (Brown et al. 2015, Gomes et al. 2019), for each hazard we note the feasibility and challenges of implementing the recommendations in such approaches. In addition to the ten key hazards, we also discuss a few other important topics that researchers should be aware of, yet for which solutions and recommendations remain challenging and require additional research.

Hazard 1: overlooking errors in occurrence datasets
Online biodiversity portals (e.g. GBIF; eBird; Robertson et al. 2014) often aggregate information from many data providers, affording invaluable access to species occurrence data. However, such data sources are often misused by researchers who employ records of insufficient quality (Newbold 2010, Maldonado et al. 2015).
Assumptions. Input data are free of substantial errors regarding taxonomic identification and georeference (e.g. latitude and longitude coordinates), with spatial and temporal uncertainty smaller than the resolution of the environmental predictors (Lozier et al. 2009, Aubry et al. 2017, Araújo et al. 2019).
Problem. Despite increasing attention to data quality and uncertainty in data portals (Anderson et al. 2020), errors of taxonomic identification and georeferencing remain common, and high spatial uncertainty may always persist for many older records (Newbold 2010, Maldonado et al. 2015, Serra-Diaz et al. 2017). Depending on the type and magnitude of these errors, the spatial resolution of the environmental predictors, and the heterogeneity of the landscape, substantial distortions and inflations of inferred environmental associations and suitable areas may occur (Romero et al. 2014, Costa et al. 2015, Aubry et al. 2017, Gábor et al. 2020a). Unfortunately, species identifications may be incorrect, and aggregators seldom capture updates from recent taxonomic publications (Anderson et al. 2020). Additionally, the vast majority of records lack information regarding the uncertainty of identification and georeference (and unit conversions to decimal degrees often falsely imply high precision and accuracy).
Recommendations. Models should be built using occurrence data with correct taxonomic identifications and georeferences whose uncertainties do not affect the results greatly (e.g. only increase noise minimally and do not bias the model). Some consortia of data providers have determined coordinate uncertainty, which percolates up to aggregators, so that researchers can use only those records accurate enough for the aims of a given study (e.g. MaNIS; Stein and Wieczorek 2004). Focused endeavors to standardize and clean data for given taxonomic groups and regions also provide much better information than otherwise available (e.g. BIEN and BioModelos; Maitner et al. 2018, Velásquez-Tibatá et al. 2019). When these situations do not exist, automated methods for data cleaning can catch many but certainly not all errors (e.g. identifying unit conversion errors, and spatial and environmental outliers; García-Roselló et al. 2014, Naimi et al. 2014, Robertson et al. 2016, Zizka et al. 2019). In cases where the uncertainties of records are not known, researchers can conduct qualitative or quantitative sensitivity analyses and discuss the likely effects on model output (Gábor et al. 2020a). Additionally, expert knowledge remains critical in interpreting the outputs of both the data-cleaning algorithms and resulting models (e.g. 'outliers' may represent rarely sampled sites and provide valuable information; Jiménez-Valverde et al. 2011, Soley-Guardia et al. 2014). All of these investments in data cleaning improve the quality of models, perhaps especially those generated via automated analyses for projects modeling many species (Velásquez-Tibatá et al. 2019).
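Many of these basic checks are easy to script. As a minimal illustrative sketch (in Python with pandas; the column names 'lat', 'lon', and 'bio1' are hypothetical), the function below flags a few of the error types mentioned above; it is no substitute for dedicated tools like those cited, nor for expert review:

```python
import pandas as pd

def flag_suspect_records(df, lat="lat", lon="lon"):
    """Flag common georeferencing problems in an occurrence table.

    A minimal sketch only: dedicated tools (e.g. CoordinateCleaner;
    Zizka et al. 2019) apply many more tests, and none replace
    taxonomic vetting by experts.
    """
    flags = pd.DataFrame(index=df.index)
    # Coordinates outside valid ranges often indicate unit-conversion
    # or swapped latitude/longitude errors
    flags["bad_range"] = (df[lat].abs() > 90) | (df[lon].abs() > 180)
    # (0, 0) coordinates frequently stand in for missing georeferences
    flags["zero_zero"] = (df[lat] == 0) & (df[lon] == 0)
    # Exact coordinate duplicates add no information and can distort models
    flags["duplicate"] = df.duplicated(subset=[lat, lon])
    # Coarse environmental outliers: > 3 SD from the mean of a predictor
    # already extracted to the table (column name assumed)
    if "bio1" in df.columns:
        z = (df["bio1"] - df["bio1"].mean()) / df["bio1"].std()
        flags["env_outlier"] = z.abs() > 3
    return flags

# Usage sketch: inspect flagged rows by hand rather than dropping them
# blindly; apparent 'outliers' may be rarely sampled, valid sites.
# records = pd.read_csv("occurrences.csv")
# suspect = flag_suspect_records(records).any(axis=1)
```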

Hazard 2: disregarding biases inherent to biological sampling
Aggregated biodiversity data combine records that generally were collected by opportunistic rather than stratified or random sampling (Beck et al. 2014, Isaac and Pocock 2015, Daru et al. 2018).
Assumptions. Sampling effort has been homogeneous across the study region, so that no bias exists in the conditions inferred to represent the species' environmental associations (Castellanos et al. 2019, Vollering et al. 2019, Taylor et al. 2020).
Problem. Opportunistic sampling tends to be geographically biased towards accessible areas or habitats of particular interest (Daru et al. 2018, Tarli et al. 2018). Such geographic bias typically translates into environmental bias as well, with some conditions inhabited by the species being artificially overrepresented (Yackulic et al. 2013, Monsarrat et al. 2019). If not corrected for, certain environmental combinations will erroneously be identified as indicating higher suitability for the species, yielding a biased model (Syfert et al. 2013, Ranc et al. 2017, Vollering et al. 2019). Importantly, inappropriate (yet commonly applied) evaluation approaches fail to detect this bias, which pervades both training and validation datasets (interaction with Hazard 8; Fig. 2).
Recommendations. Completely eliminating the false signal resulting from sampling bias remains challenging; however, several approaches may reduce its effects considerably (Franklin 2023). One theoretically sound approach corrects for sampling bias by contrasting the focal species' occurrences against a sample of the environment characterized by the same bias, via occurrence records of other species detected with the same techniques (i.e. 'target-group background'; Anderson 2003, Phillips et al. 2009, Merow et al. 2016). This approach can be modified by concentrating background sampling around occurrences of the target-group species in proportion to their density ('weighted locality approach', Anderson 2003; sometimes called 'background thickening'; Ranc et al. 2017, Støa et al. 2018, Vollering et al. 2019, Barber et al. 2022). Alternatively, a practical approach that can improve performance is to apply a spatial or environmental filter that reduces the number of clustered occurrences ('thinning'), likely reducing the effects of sampling bias (Boria et al. 2014, Varela et al. 2014). Nevertheless, choosing an appropriate filtering distance that retains the niche signal while removing the bias remains challenging (Fourcade et al. 2014, Aiello-Lammens et al. 2015, Castellanos et al. 2019, Gábor et al. 2020b) but can be tuned empirically (Soley-Guardia et al. 2019). For projects implementing automated modeling of many species, either using target-group background correction (with experts defining suitable groupings based on relevant sampling techniques) or filtering approaches should be feasible.
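To make the 'thinning' idea concrete, the sketch below implements a greedy spatial filter under a simplifying assumption (planar coordinates in kilometers; real implementations such as spThin work with latitude/longitude and great-circle distances):

```python
import numpy as np

def thin_occurrences(coords, min_dist_km):
    """Greedy spatial thinning: keep a record only if it lies at least
    min_dist_km from every record already kept.

    coords: (n, 2) array of projected x/y coordinates in kilometers
    (a simplifying assumption; packages such as spThin handle
    geographic coordinates and great-circle distances directly).
    Returns indices of the retained records.
    """
    kept = []
    for i, pt in enumerate(coords):
        if all(np.linalg.norm(pt - coords[j]) >= min_dist_km for j in kept):
            kept.append(i)
    return np.array(kept, dtype=int)

# Usage: idx = thin_occurrences(xy, min_dist_km=10.0)
# The filter distance itself can be tuned empirically
# (Soley-Guardia et al. 2019) rather than chosen arbitrarily.
```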

Hazard 3: using spatially or temporally inconsistent proxy environmental variables
Although proxy predictor variables can provide strong predictive ability within a limited study region, their correlation with driving factors usually varies in other places or time periods (Dormann et al. 2013).
Assumptions. The variables (at a given spatial and temporal resolution; Hazard 1) incorporated into a model affect the species' distribution directly or are highly and consistently correlated with those that do (Anderson 2013, Lira-Noriega et al. 2013).
Problem. If this is not the case, predictions can be inaccurate - especially when occurrence datasets are small, study regions are overly large (interaction with Hazard 6; Fig. 2; because correlations between proxy and driving variables are less likely to be consistent across space), or models undergo transfer across space or time for the needs of the study. Elevation constitutes a classic example of a variable that frequently violates this assumption because its correlation with the presumed driving variable (temperature) differs across latitudes and over time periods (Austin 2002, Elith and Leathwick 2009). However, commonly used climatic variables also can suffer changes in correlation with missing driving factors (e.g. soil type, groundwater availability; Dormann et al. 2013, Mesgaran et al. 2014, Soley-Guardia et al. 2014, Feng et al. 2019b). For projects modeling many species using automated pipelines, researchers can select environmental variables in batches (e.g. for groups of species with similar natural histories) rather than using the same set for all of them, although that still may not achieve models as good as those possible with species-specific decisions.
Table 2. Overview of the ten most problematic hazards in species distribution modeling covered in this paper. The columns indicate: the corresponding step of modeling; the name of the hazard; the nature of the hazard; its relevance for predictions (whether it affects the model in the training region, under transferal, or both); interactions with other hazards (an asterisk indicates those where the interaction is explained in the main text only in the entry for the other hazard involved); and three selected references. All hazards correspond to important concepts in the field, but for their nature we note 'conceptual' for those with especially close ties to ecological theory. While Hazards 3 and 9 apply primarily to model transfer, they may also affect predictions within the time and space in which the model was trained (see main text).
Tuning tools (e.g. 'Wallace EcoMod'; Kass et al. 2018, 2023) have made this practical via automation, at least for certain algorithms. Despite increasing computational time, their ease of automation makes tuning exercises a feasible element for projects modeling many species (Valavi et al. 2022); nevertheless, as for all modeling efforts, expert inspection of the response curves and geographic prediction of the selected model to assess ecological realism remains beneficial and wise (Hazard 9; Guevara et al. 2018b, Velásquez-Tibatá et al. 2019).

Hazard 5: reducing predictor variables irrespective of their information content
With the goal of minimizing variable correlations and/or overfitting, researchers often reduce the number of predictors prior to model building irrespective of their information content (i.e. their predictive power or ability to inform regarding a species' distribution), for example via pre-determined cutoffs in correlation analyses (Breiman 2001, Feng et al. 2019b, Sillero et al. 2021). This approach stems from otherwise sound statistical practices in regression modeling, often also aimed at identifying the contributions of particular variables and obtaining simple, explanatory models that facilitate interpretation and hypothesis-testing (Dormann et al. 2013, Farrell et al. 2019, Feng et al. 2019b).
Assumptions. Modeling with fewer or less correlated variables yields simpler and better models (Breiman 2001, Elith et al. 2008, Evans et al. 2013) that are not fitted to nuances in the data.
Problem. Removing predictor variables a priori without consideration of their information content may arbitrarily discard informative data (reducing the predictive ability of the model) and does not directly address the problem of overfitting (Breiman 2001, Olden et al. 2008, Braunisch et al. 2013).
Recommendations. After identifying candidate variables based on their biological relevance (Hazard 3), approaches that consider information content to limit complexity (instead of via correlations a priori) can reduce overfitting and lead to improved models (Cobos et al. 2019a, Farrell et al. 2019). Many algorithms have options for controlling the degree of fitting to the sample by penalizing complex models (e.g. those with higher numbers of variables and more complex responses to them). Approaches for limiting complexity include regularization procedures like lasso and ridge regression in GLMs and MaxEnt, and pruning in regression trees (Hastie et al. 2009, Dicko et al. 2014, Guisan et al. 2017, Valavi et al. 2022). Values of the penalties for higher complexity (e.g. the regularization multiplier in MaxEnt; interaction with Hazard 4; Fig. 2) can be tuned using evaluation procedures that detect overfitting (Hazard 8). Increasing the penalization for complexity tends to lead to zero contribution for variables with low or no information content (by themselves or in combination with correlated variables that are more informative), in effect excluding them from the final model (interaction with Hazard 4; Fig. 2; Phillips et al. 2017, Farrell et al. 2019, Valavi et al. 2022). Because of this, algorithms that implement variable selection indirectly through penalties for complexity tend to yield models without highly correlated variables, which also simplifies interpretation of the response curves for those retained by the algorithm (Feng et al. 2019b, Morente-López et al. 2022). Such approaches dovetail with projects that automate model-building for many species, as long as appropriate evaluation procedures are employed (Hazard 8).
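To illustrate the principle outside MaxEnt, the sketch below tunes a lasso (L1) penalty for a presence-background model with scikit-learn; it is a simplified example, not a full SDM workflow, and assumes presences are coded 1, background points 0, and that cross-validation folds (ideally spatial blocks; Hazard 8) are supplied by the user:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

def tune_lasso_sdm(X, y, cv_folds, C_grid=(0.01, 0.1, 1.0, 10.0)):
    """Tune an L1 (lasso) penalty for a presence-background model.

    X: predictor matrix; y: 1 = presence, 0 = background;
    cv_folds: list of (train_idx, val_idx) index-array pairs, each
    validation fold assumed to hold at least one presence.
    Returns the penalty value with the lowest mean validation
    omission-style error (misclassified withheld presences).
    """
    X = StandardScaler().fit_transform(X)
    scores = {}
    for C in C_grid:  # smaller C = stronger penalty = simpler model
        errs = []
        for train_idx, val_idx in cv_folds:
            m = LogisticRegression(penalty="l1", solver="liblinear", C=C)
            m.fit(X[train_idx], y[train_idx])
            pres = val_idx[y[val_idx] == 1]  # withheld presences
            errs.append(1 - m.predict(X[pres]).mean())
        scores[C] = np.mean(errs)
    return min(scores, key=scores.get), scores

# Under strong penalties, coefficients of uninformative (or redundant
# correlated) variables shrink to exactly zero, in effect performing
# variable selection without discarding predictors a priori.
```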

Hazard 6: using an overly large study region
Often, researchers use an overly large study region for model building and/or evaluation (Araújo and Peterson 2012, Liang et al. 2018). Sometimes such a region is selected with the misconception that it will yield an accurate depiction of the species' range (rather than the areas suitable for it), but probably most often because a map of suitability is desired in that full area. For example, researchers often seek to find suitable areas with sparse sampling, identify likely areas of spread for invasive species, or characterize potential overlap where related species may occur (uses that all require estimates of suitability).
Assumptions. Within the study region for model training, the species occurs in spatial (or at least environmental) equilibrium. In the first scenario, the species inhabits all suitable areas (i.e. spatial equilibrium; Peterson et al. 2011, Guisan et al. 2017). Because of this, its distribution also corresponds to environmental equilibrium: the places occupied by the species accurately reflect the environments suitable for it. In the second scenario, although the species does not occur in all suitable areas, it inhabits an environmentally representative (unbiased) subset of them (i.e. environmental equilibrium). Hence, it fulfills the noise assumptions (Anderson 2013), whereby factors related to dispersal, establishment, and persistence of populations; biotic interactions; and human actions do not cause the species to occupy an environmentally biased subset of the areas suitable for it, but rather only add statistical noise.
Problem. The chances of violating the spatial-equilibrium or environmental-equilibrium (noise) assumptions increase with the extent of the study region (Saupe et al. 2012).

Projects modeling many species likely will employ automated decision-making (e.g. buffering occurrence records by a given distance or using biogeographic regions). Although species-specific choices likely would approximate assumptions more closely and produce superior models, making these decisions in batches (e.g. for species with similar natural histories and inhabiting regions with comparable levels of environmental heterogeneity) may prove more reasonable than a single choice for all species.

Hazard 7: misinterpreting metrics of model performance
A particularly widespread hazard in studies employing presence-background or presence-pseudoabsence data is evaluating performance using metrics designed for analyses with reliable absence information (e.g. AUC, Kappa, true skill statistic), under the misconception that they represent statistically unbiased, absolute measures of performance (Lobo et al. 2008, Leroy et al. 2018). Unfortunately, with great frequency researchers interpret these metrics in the same manner as when true absence data are used for their calculation (e.g. predictions with AUC values > 0.5 being better than random and those approaching 1.0 being a requisite for or indicative of a good model), or as comparable among predictions for different species or study regions (Jiménez-Valverde et al. 2013, Liang et al. 2018, Fernandes et al. 2019).
Assumptions. Error rates for omission (false negatives) and commission (false positives) are both accurately calculated (Lobo et al. 2008). For commission error and metrics that include it, comparisons across species or regions assume that the ratio of suitable to unsuitable environments remains constant (Jiménez-Valverde et al. 2013, Bohl et al. 2019).
Problem. When absence data are not available, commission errors estimated from a background or pseudoabsence sample suffer from (an unknown) statistical bias. Great numbers of map pixels typically exist within areas that are suitable for the species, but the vast majority of them are not documented by the limited sample of presence records (a problem exacerbated by overly large study regions, where factors other than environmental suitability preclude the species' presence; interaction with Hazard 6; Fig. 2). Because of this, the estimate of commission error is strongly inflated (Anderson et al. 2003, Araújo and Peterson 2012, Leroy et al. 2018, Saupe et al. 2018). Therefore, when calculated using background or pseudoabsence data, commission error and metrics derived from it are biased indicators of performance (Radosavljevic and Anderson 2014, Fourcade et al. 2018, Liang et al. 2018), penalizing models that correctly predict suitability beyond the species' documented occurrences and favoring those that do not (Araújo and Peterson 2012, Jiménez-Valverde 2012). Critically, it follows that because of this bias, such metrics do not represent values of absolute performance, but instead relative ones valid for comparisons only with other analyses for the same species and study region (Lobo et al. 2008, VanDerWal et al. 2009, Jiménez-Valverde et al. 2013).
Additionally, this bias undermines the use of these metrics in identifying an optimal threshold for converting model output into a binary prediction of suitable vs unsuitable conditions (e.g. via the sum of sensitivity and specificity; Liu et al. 2013).
Recommendations. Adjustments for differential weighting of omission and commission errors have been proposed (Peterson et al. 2008), but appropriate weights for given study systems remain elusive. Instead, metrics that gauge performance without estimating commission error are valid and more appropriate for this type of data, both for assessing model performance and for use in threshold selection. For example, rates of omission at a given threshold are commonly used and comparable across species (e.g. validation omission rate based on a threshold corresponding to the minimum training presence value; Peterson et al. 2011), at least when the effects of sampling bias have been ameliorated (Hazard 2). In situations where metrics based on both omission and commission can provide useful information (e.g. assessing discrimination from the background for the same species and study region in tuning exercises; Soley-Guardia et al. 2016, Galante et al. 2018), they are best interpreted against a null expectation specific to the system. Fortunately, null-model approaches now exist to determine statistical significance and effect size (Raes and ter Steege 2007, Bohl et al. 2019; including an implementation in 'ENMeval'; Kass et al. 2021). These approaches increase computational time greatly but do not represent analytical barriers for projects that automate modeling of many species.
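As a concrete example, the omission rate at the minimum-training-presence threshold mentioned above reduces to a few lines of code (a sketch; the array names are hypothetical):

```python
import numpy as np

def mtp_omission_rate(train_suit, val_suit):
    """Omission rate of withheld presences at the minimum training
    presence (MTP) threshold.

    train_suit: suitability predicted at training presences
    val_suit:   suitability predicted at validation presences
    The MTP threshold is the lowest suitability at any training
    presence; validation presences falling below it are omitted.
    """
    threshold = np.min(train_suit)
    return float(np.mean(val_suit < threshold))

# E.g. mtp_omission_rate(np.array([0.4, 0.6, 0.7]),
#                        np.array([0.3, 0.5, 0.9]))  # -> 0.333...
# Note that no commission estimate (and thus no background sample)
# enters the calculation.
```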

Hazard 8: evaluating models with random splits of the data
Most SDM studies lack truly independent occurrence data for testing (e.g. independently collected from a random or stratified survey design). Because of this, researchers commonly conduct model evaluation by partitioning available presence (and background or pseudoabsence) data into subsets, with some being used for model training and the rest for validation (Hastie et al. 2009, Zurell et al. 2020b). Over the past two decades, a particularly widespread partitioning approach has employed random division of the data (e.g. cross-validation with random division into k folds, or groups, one of which is excluded from training in each round to serve instead for evaluation; Naimi and Araújo 2016, Roberts et al. 2017).
Assumptions. Training and validation datasets created using random splits are independent from one another.
Problem. Random splits of occurrence data do not provide statistical independence; instead, data points of the training and validation subsets often end up lying in geographic proximity (especially if sampling is spatially clustered; interaction with Hazard 2; Fig. 2). Due to pervasive spatial autocorrelation of the environment, geographically proximal records share similar or identical conditions; furthermore, because training and validation subsets are highly non-independent, they both reflect the same biases of the overall sample (Jiménez-Valverde et al. 2011, Wenger and Olden 2012, Roberts et al. 2017). Hence, undesired fitting to biases in the training data (interaction with Hazard 2; Fig. 2), which is more likely in overly complex models (interaction with Hazard 4), will go undetected - yielding inflated measures of performance via an especially misleading interaction among these three hazards (Veloz 2009, Anderson 2012, Hijmans 2012, Fourcade et al. 2018).
Recommendations. Splitting data non-randomly increases independence and reduces spatial correlation among subsets (e.g. using spatial or environmental blocks; Radosavljevic and Anderson 2014, Roberts et al. 2017). Such data-partitioning schemes make it possible to detect fitting to biases in the sample (interaction with Hazard 2; Fig. 2). Hence, they can be applied to estimate optimal model complexity appropriately in tuning exercises, rather than relying on default settings that often lead to overfitting (interaction with Hazard 4; Fig. 2). Fortunately, evaluation procedures with automated non-random (e.g. spatial) partitioning exist ('BlockCV', Valavi et al. 2019; 'ENMeval', Kass et al. 2021; 'ENMTML', Andrade et al. 2020; 'kuenm', Cobos et al. 2019b; 'SDMtoolbox', Brown et al. 2017; 'Wallace EcoMod', Kass et al. 2018, 2023). Notably, this approach often faces the additional challenge of requiring extrapolation of response curves into non-analog environments (entailing additional caveats, including changes in the correlation structure of the variables; interaction with Hazard 9; Fig. 2). However, this can be regarded as an advantage when the ultimate use will require model transfer to another place or time (because spatial blocks allow assessment of the model under transfer; Roberts et al. 2017, Soley-Guardia et al. 2019). Non-random splits for cross-validation exercises are feasible for automated analyses modeling many species, but they increase the need for expert inspection of the response curves and geographic prediction of the selected model.
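A deliberately simple illustration of the idea (one-dimensional geographic banding rather than the full two-dimensional blocking offered by the packages above) might look like this:

```python
import numpy as np

def longitude_blocks(lons, k=4):
    """Assign records to k spatial folds by slicing longitude into
    k equal-frequency bands -- a deliberately simple stand-in for
    the spatial blocking implemented in packages such as blockCV
    (Valavi et al. 2019) or ENMeval (Kass et al. 2021)."""
    interior_edges = np.quantile(lons, np.linspace(0, 1, k + 1))[1:-1]
    # np.digitize against the k-1 interior edges yields fold ids 0..k-1
    return np.digitize(lons, interior_edges)

# Usage: folds = longitude_blocks(records["lon"].to_numpy(), k=4)
# Train on three bands, validate on the withheld band, and rotate so
# each band serves once for validation; block size ideally reflects
# the spatial autocorrelation range of the predictors.
```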

Hazard 9: misunderstanding or overlooking the effects of extrapolation
When the ultimate use of a model requires its transfer to another place or time, the new conditions typically contain at least some non-analog environments (beyond those found in the training study region). This makes extrapolation of the modeled response curves necessary to make a prediction (Anderson 2013; see slightly different use of terminology in Owens et al. 2013, Qiao et al. 2019). Frequently, however, researchers confuse this need to extrapolate with the particular manner in which extrapolation is accomplished.
Assumptions. Two fundamental methods exist for extrapolating beyond the minimum and maximum environmental values found for any variable within the training dataset (composed of the sample of presences and the background or pseudoabsence information). One assumes that the modeled response to a given variable continues its trend unconstrained (e.g. achieved by including cubic and quadratic terms in GLMs, or disabling 'clamping' in MaxEnt); the other assumes that the response remains fixed at the last value of the training dataset (e.g. using linear or constant splines in GLMs, 'clamping' in MaxEnt, or in all uses of classification and regression trees, including random forests and boosted regression trees; Elith and Graham 2009, Elith et al. 2010, Anderson 2013). Which method is more appropriate, and how much the corresponding estimates of suitability differ, both depend on several factors. These include the degree of environmental novelty in the new area or time (i.e. how different the conditions are compared with those used for model training) and the tendency and height of the response curve at its point of truncation (e.g. increasing or decreasing; nearing suitability limits or not; Fitzpatrick and Hargrove 2009, Anderson 2013, Guevara et al. 2018b).
Problem. Confusing the need to extrapolate with the manner in which it is accomplished can result in not only incorrectly documenting this important aspect of modeling but also ignoring its effects on predicted suitability and biological inferences (Guevara et al. 2018a, Feng et al. 2019a). While this applies primarily to model transfer, it also can affect predictions within the time and space where the model was built (when the background or pseudoabsence sample does not contain the full set of environments of the study region; Guevara et al. 2018b).
Recommendations. Several analytical approaches now exist to provide a clearer understanding and documentation of extrapolation. Together, they allow for flagging areas with high uncertainty due to extrapolation, which can help researchers identify regions where drawing any inference from the suitability values may be unadvisable (Owens et al. 2013, Franklin 2023). For example, MaxEnt can calculate the degree of environmental novelty across space ('MESS' and 'MoD' analyses) as well as how much the prediction depends on the particular extrapolation method ('clamping analysis'; Elith et al. 2010). Such analyses are best interpreted in conjunction with inspection of the response curve for each environmental variable included in the final model, to assess its plausibility in non-analog conditions (Guevara et al. 2018b).
Additionally, the model's robustness to extrapolation can be estimated with evaluation schemes that include transfer, such as spatial blocks (Roberts et al. 2017, Soley-Guardia et al. 2019; Hazard 8). Fortunately, tools for tackling aspects of extrapolation continue to be developed (Mesgaran et al. 2014, Bartley et al. 2019, Cobos et al. 2019b, Andrade et al. 2020). These now include the ability to make separate decisions (whether or not to constrain the response) for each tail of every environmental variable (for example, depending on whether the response curve is increasing or decreasing at the point of truncation; Anderson 2013, Guevara et al. 2018b, Kass et al. 2021). Additionally, researchers can compare results among techniques that extrapolate only via a fixed response (e.g. classification and regression trees), others that do so unconstrained (e.g. some implementations of GLMs), and those with the ability to do either (e.g. MaxEnt and contrasting ways to employ GLMs) - and then characterize the associated uncertainty (Elith and Graham 2009, Araújo et al. 2019). Decisions regarding environmental extrapolation can be automated but represent a challenge for analyses modeling many species, where expert inspection of response curves, maps of predicted suitability, and uncertainty due to extrapolation remains advisable.
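The two extrapolation behaviors, together with a simple flag for non-analog cells in the spirit of MESS-type analyses, can be sketched as follows (the array layout is assumed; this is an illustration, not a replacement for the cited tools):

```python
import numpy as np

def clamp_and_flag(X_new, X_train):
    """Clamp projection data to the training range of each variable
    and flag non-analog cells.

    X_new:   (n_cells, n_vars) predictor values where the model will
             be projected (e.g. another region or time period)
    X_train: (n_obs, n_vars) values used in training (presences plus
             background or pseudoabsences)
    Returns (X_clamped, novel), where novel[i] is True if any variable
    in cell i falls outside its training minimum or maximum.
    """
    lo, hi = X_train.min(axis=0), X_train.max(axis=0)
    X_clamped = np.clip(X_new, lo, hi)
    novel = ((X_new < lo) | (X_new > hi)).any(axis=1)
    return X_clamped, novel

# Predicting with X_clamped mimics 'clamping' (response held fixed
# beyond truncation); predicting with X_new lets responses continue
# unconstrained. Mapping `novel` highlights cells where either choice
# is an extrapolation and suitability values deserve extra caution.
```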

Hazard 10: comparing model outputs across different species or places
Without additional data or assumptions, algorithms that use presence data and a background or pseudoabsence sample (i.e. without true absence information) do not provide probability of presence but rather an output that represents relative suitability across map pixels for a given species (Phillips and Elith 2013, Yackulic et al. 2013). Nevertheless, direct comparisons of such outputs across species or geographic regions are frequently desired (e.g. to assess competition or niche overlap; Peterson 2011, Gutiérrez et al. 2014). Only within a study region where the species' distribution is in equilibrium with environmental suitability (interaction with Hazard 6; Fig. 2) can relative suitability values be interpreted as related to probability of presence (by an unknown function, which is presumably monotonic although not necessarily linear).
For techniques with theoretical links to population ecology (e.g. MaxEnt, GLMs, and GAMs; Hastie and Fithian 2013, Phillips and Elith 2013), such outputs of relative suitability can be transformed to yield absolute probability of presence, allowing direct comparisons across species and geographic regions. This requires an explicit rescaling function, obtained from additional information such as estimates of occupancy or abundance.

Assumptions. The rescaling function is correctly specified; alternatively, any unknown scaling parameters are still equivalent among the entities involved (e.g. for ecologically similar species), rendering output values that are consistently incorrect yet comparable in absolute terms.

Problem. If these assumptions are violated (or no theoretical rescaling to probability of presence exists for the algorithm being used), suitability can only be interpreted as relative across the study area of a given analysis, and the values of the predictions of different models should not be directly compared across species or regions (Fithian and Hastie 2013, Merow and Silander 2014, Guillera-Arroita et al. 2015).
Recommendations. In modeling techniques with theoretical links to population ecology (e.g. MaxEnt, GLMs, and GAMs), available estimates of occupancy or abundance can be used to set the scaling parameters rather than relying on default values (Guillera-Arroita et al. 2014, Phillips et al. 2017). By extension, if scaling parameters are unknown but can be justified as likely to be similar across species, the resulting values can be interpreted across models in absolute terms (but as comparable suitability scores, not as probability of presence; see also a modification of logistic regression aimed at allowing comparisons across species; Real et al. 2006). In all other cases, model output should be interpreted as only relative estimates of suitability across map pixels of the area of analysis (specific to that given species and region), whether or not values have been rescaled (Soley-Guardia et al. 2016). For any of these three situations, researchers can enhance interpretations by calculating the species' prevalence separately across different levels of suitability (although necessary information regarding detectability often remains lacking; Anderson 2023). In projects automating the production of models for many species, researchers likely will make assumptions regarding scaling parameters for sets of species with similar natural histories (to yield values for comparison across species or regions as suitability scores, not probability of presence).

Outstanding questions
In addition to the ten commonly misunderstood hazards discussed above (each with existing recommendations), other topics remain challenging and require additional research. Various such issues exist (e.g. spatial and temporal resolutions and correlations among environmental variables; standardization of methods to increase comparability of models across species and studies; emerging approaches for modeling with big data; Araújo et al. 2019, Franklin 2023). Below, we cover five of them, important topics currently being investigated and each related in various ways to the ten hazards covered above: model uncertainty; model complexity; biotic interactors; interactions among suitability drivers; and intraspecific niche variation. Although definitive resolutions for these issues do not yet exist, researchers should be able to appreciate them and comment upon how they may affect conclusions (Araújo et al. 2019, Sofaer et al. 2019, Zurell et al. 2020b).

Quantifying model uncertainty
Although most SDM studies rely on conclusions from a single final estimate of suitability, the field lacks a unified approach to quantify, partition, and map the uncertainty that arises from the many factors that can affect predictions (Beale and Lennon 2012, Peterson et al. 2018, Yates et al. 2018, Araújo et al. 2019). To characterize such uncertainty, researchers usually have focused on different sources by combining models built with various splits of the data (Soley-Guardia et al. 2019), under different parameterizations of the same algorithm (Breiner et al. 2015, Boria et al. 2017), and using different algorithms ('ensemble' or 'consensus' approach; Araújo and New 2007, Meller et al. 2014, Hao et al. 2020). The latter involves difficult challenges, including how to combine different output formats (Marmion et al. 2009, Sillero 2011, Andrade et al. 2020) and how to ensure an equally sound implementation of each algorithm (Barry and Elith 2006, Jarnevich et al. 2015, Hao et al. 2019, 2020). Additionally, the degree of uncertainty (or outright bias) introduced by errors in taxonomy and georeferences almost always remains unknown, and the answer surely differs among taxonomic groups and geographic regions (Beale and Lennon 2012, Costa et al. 2015, Gábor et al. 2020a).
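One simple building block for such characterizations is summarizing agreement across a stack of predictions pixel by pixel; the sketch below assumes the predictions have already been placed on a comparable scale, which is itself non-trivial (Hazard 10):

```python
import numpy as np

def ensemble_summary(pred_stack):
    """Per-pixel mean and spread across an ensemble of predictions.

    pred_stack: (n_models, n_rows, n_cols) array of suitability maps
    from different data splits, settings, or algorithms -- assumed
    here to share a common scale, which in practice requires care
    (e.g. rescaling or ranking before stacking).
    """
    mean_map = np.nanmean(pred_stack, axis=0)  # central tendency
    sd_map = np.nanstd(pred_stack, axis=0)     # disagreement map
    # Coefficient of variation highlights relative uncertainty
    cv_map = np.where(mean_map > 0, sd_map / mean_map, np.nan)
    return mean_map, sd_map, cv_map

# Mapping sd_map (or cv_map) alongside mean_map flags regions where
# the ensemble members disagree and inferences deserve extra caution.
```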
Simulation studies with virtual species can be informative for determining how sensitive different approaches are to particular aspects of the modeling process (Meynard et al. 2019), but resolution of this issue and corresponding recommendations likely will depend on aspects of the system.

Reaching consensus on optimal model complexity
No consensus exists on the best criteria for approximating optimal model complexity (Peterson et al. 2011, Merow et al. 2014, Warren et al. 2014, 2020). A commonly used metric that penalizes complexity via the number of parameters incorporated into the model (AICc) remains statistically imperfect for some algorithms (e.g. MaxEnt, because the degrees of freedom cannot be calculated exactly; Warren and Seifert 2011, Warren et al. 2014, Galante et al. 2018, Velasco and González-Salazar 2019). Additionally, for measures of performance calculated using partitioned validation data or fully withheld testing data, the particular metric employed can lead to selection of different models as optimal, and no single one has been demonstrated as reliable and sufficient on its own (Hirzel et al. 2006, Boria et al. 2017, Bohl et al. 2019, Jiménez and Soberón 2020). Disagreement even remains about whether simpler models are always desirable (García-Callejas and Araújo 2016, Yates et al. 2018, Coelho et al. 2019), and few studies have investigated how complexity correlates with predictive ability (Galante et al. 2018, Velasco and González-Salazar 2019). Future studies comparing model performance according to diverse criteria (Norberg et al. 2019) should help clarify this issue.
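For reference, the criterion itself is straightforward to compute once a log-likelihood and a parameter count are in hand; the sticking point for MaxEnt is precisely that the effective parameter count is not exactly known (a sketch):

```python
def aicc(log_likelihood, k, n):
    """Small-sample corrected Akaike information criterion.

    k: number of model parameters -- for MaxEnt this is often
    approximated by the count of non-zero coefficients, which is
    why AICc remains statistically imperfect for that algorithm;
    n: number of observations. Requires n > k + 1.
    """
    aic = 2 * k - 2 * log_likelihood
    return aic + (2 * k * (k + 1)) / (n - k - 1)

# E.g. aicc(log_likelihood=-120.5, k=6, n=40)
# = 253.0 + 84/33 ≈ 255.5; lower values indicate a better
# complexity-accuracy trade-off among candidate models.
```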

Integrating biotic interactors
Most SDM implementations rely on the Eltonian noise hypothesis: that biotic interactions occurring at a fine spatial resolution in local communities do not alter the abiotic signal retrieved from the coarser-grain environmental data used across large spatial extents (Soberón and Nakamura 2009, Lira-Noriega et al. 2013). Nevertheless, key biotic interactions often carry their effects to resolutions and extents relevant for SDMs (Wisz et al. 2013). In these cases, SDMs can benefit from incorporating data regarding biotic interactors (assuming that the nature of their effects remains stationary across the study region; Sanín and Anderson 2018, Fern et al. 2019, Kass et al. 2020). However, for the correlative models considered here (comparing presence data with background or pseudoabsence information), only biotic interactors with unidirectional effects on the focal species should be considered as predictor variables (the interactor affects the distribution of the focal species but not vice versa; Soberón 2007, 2010, Anderson 2017). Other interactors can be informative during post-processing of model output (Peers et al. 2013, Gutiérrez et al. 2014) or via classes of models that incorporate population demography (Zurell 2017). Even joint species distribution models that can fit environmental responses of multiple species simultaneously still assume stationarity of the effects of relevant biotic interactions (Pollock et al. 2014, Poggiato et al. 2021). Clearly, incorporating biotic interactions remains difficult, and some unresolved challenges include: accounting for their statistical interaction with abiotic predictors and historical contingencies such as past extinctions and dispersal barriers (Warton et al. 2015, Dormann et al. 2018, Brown and Carnaval 2019, Early and Keith 2019, Franklin 2023); computational limitations for considering inputs from adjacent populations in local population-dynamic models (Zurell et al. 2020a); and extrapolating to non-analog biotic contexts (Williams and Jackson 2007, Jaeschke et al. 2012).

Accounting for interactions among suitability drivers
Although SDM predictions often represent the combination of independent responses estimated for each predictor variable, suitability is likely driven by non-additive effects (Merow et al. 2014, Golding and Purse 2016). Accounting for such effects can be particularly important when extrapolating into novel environmental combinations (Zurell et al. 2012, Mesgaran et al. 2014, Feng et al. 2019b). Realistic inclusion of such statistical interactions is currently possible for only a few modeling algorithms (e.g. multivariate adaptive regression splines, 'MARS', Leathwick et al. 2006; Gaussian processes, Golding and Purse 2016). Others provide partial solutions to the problem, for instance allowing simple pairwise fixed interactions (e.g. MaxEnt's product features; Merow et al. 2013, Phillips et al. 2017), or providing great flexibility but without much specification control by the user (random forests and boosted regression trees; Elith et al. 2008, Merow et al. 2014). Simulation studies are needed to assess the importance of accounting for such interactions and to determine which particular approaches are most suited to the task under what circumstances (Golding and Purse 2016).
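For algorithms that accept a user-specified design matrix (e.g. GLMs), pairwise statistical interactions analogous in spirit to MaxEnt's product features can be added explicitly; a sketch with scikit-learn (the predictor values are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two hypothetical standardized predictors, e.g. temperature and
# precipitation, for five localities
X = np.array([[0.1, -1.2], [0.8, 0.3], [-0.5, 0.9],
              [1.1, -0.4], [-0.2, 0.2]])

# interaction_only=True adds the pairwise products (x1*x2) without
# squared terms; include_bias=False drops the constant column
expander = PolynomialFeatures(degree=2, interaction_only=True,
                              include_bias=False)
X_int = expander.fit_transform(X)  # columns: x1, x2, x1*x2

# X_int can now feed a GLM, letting complexity penalties decide
# whether the product term carries real signal.
```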

Incorporating intraspecific niche variation
SDMs assume equal environmental associations for individuals across populations (i.e. niche conservatism over space), which may closely approximate reality for many taxa and commonly used predictor variables (Peterson 2011). However, in an increasing number of studies, genetically determined differences have been documented for factors related to distributional limits among populations (Pelini et al. 2009, Fournier-Level et al. 2011, Morente-López et al. 2022, Franklin 2023). Whereas the problem of violating this assumption of stationarity in SDMs has been addressed in studies of invasive species (van Boheemen et al. 2019, Pili et al. 2020), phylogeography (Costa et al. 2002), and climate change (Fitzpatrick and Keller 2015, Moran and Ormond 2015, Martin et al. 2020), resolving it fully involves multiple conceptual and practical challenges (Smith et al. 2019). These challenges include: 1) detecting deviations from niche homogeneity when data regarding adaptation and plasticity are not available (i.e. poor prediction in spatial transfers could stem from niche variability or from various methodological factors, including overfit models; Peterson and Holt 2003, Brown and Carnaval 2019); 2) building models with population-specific data (i.e. separate models for evolutionarily distinct populations vs single models integrating data on functional differences across the range; Fitzpatrick and Keller 2015, Hällfors et al. 2016, Thorson et al. 2016); and 3) identifying appropriate means for transferring such models across space and time (e.g. involving estimation of distributional limits for particular populations and the effect of any future intermixing; Prates et al. 2016, Martin et al. 2020).

Closing remarks
With many topics in species distribution modeling now well understood and others advancing rapidly, researchers can take advantage of over two decades of impressive progress to make defensible and useful models (Araújo et al. 2019, Sofaer et al. 2019, Zurell et al. 2020b). In contrast, not addressing adequately any of the ten hazards detailed here can hinder progress in basic science - including the understanding of fundamental biological processes - as well as lead to inefficient or ineffective use of resources and counterproductive decision-making for important applications in areas as varied as invasive agricultural pests, human zoonotic diseases, and the effects of climate change on biodiversity (Guillera-Arroita et al. 2015, Morales et al. 2017, Tuia et al. 2022). Attention to key principles holds particular relevance as researchers increasingly embark on projects modeling species en masse via automated pipelines, where quality can suffer without expert supervision and insights. Importantly, we note several direct interactions between pairs of hazards, and in one case among three of them (Fig. 2, Table 2). Therefore, because of both their individual and collective effects, we encourage researchers to follow - and reviewers and editors to call for heeding - existing recommendations for these and other critical issues. As we mention, comprehensive treatments exist to guide research and promote quality (and replicability) across all aspects of modeling species distributions (Araújo et al. 2019, Zurell et al. 2020b). In closing, we hope that researchers will think about all of these topics carefully for successful implementations and continue to develop novel approaches that better approximate reality.

Figure 1. Schematic overview of how a species distribution model is built. (a) Records of a species' presence (white circles) within a study region (rectangle) are gathered, along with relevant environmental variables. Environmental variables typically consist of gridded GIS layers describing abiotic conditions of the study region (e.g. temperature, soil type; colors for each grid cell depict different values of the variable). (b) These data are used as input for any of the several available algorithms, which characterize suitability as a function of environmental variables. The model distinguishes the environmental conditions more frequently associated with presence of the species versus those from a comparison dataset (either absences or, more commonly, background or pseudoabsence information across the study region). (c) The model characterizes environmental suitability for the species in environmental space (warmer colors indicating increasingly suitable conditions). (d) Typically, the model is then applied to geographic space, indicating spatial patterns of suitability across the study region. Finally, the model is evaluated quantitatively, typically by assessing how well it predicts an evaluation dataset (arrow back to (b)).

Table 1. Glossary of selected key terms.

commission (= false positive rate; = 1 - specificity; see also omission)
Error that represents the failure of a model to predict an absence of the species once a binary prediction of suitability is obtained (by applying a threshold above which a grid cell is considered suitable or predicted presence). In studies lacking true absence data, estimates of commission error are typically inflated.

evaluation
Process by which the model resulting from training is assessed. Ideally, the evaluation would be performed using fully independent testing data. However, because such information seldom exists, in most instances evaluation is conducted via a semi-independent validation dataset based on a partition not used in training (Hastie et al. 2009, Wenger and Olden 2012, Zurell et al. 2020b). We use evaluation as a blanket term that applies to either validation (via a partition of the same dataset used for training) or testing (based on independent data).

noise assumptions
The premise that within the study region, factors related to dispersal, establishment, and persistence of populations (e.g. dispersal barriers or small patch size); biotic interactions; and human modifications of the environment do not cause the species to occupy an environmentally biased subset of the areas abiotically suitable for it. Under these assumptions, even if the species does not inhabit all suitable areas in the study region (i.e. is not at spatial equilibrium), it occurs in environmental equilibrium, with such factors only adding statistical noise (Anderson 2013).

non-analog
Environments in which the conditions are different from those in which the model was trained. It usually refers to environmental values beyond the minimum and maximum present in the training data but can also refer to novel environmental combinations (Mesgaran et al. 2014, Guevara et al. 2018b).

omission (Peterson et al. 2011) (= false negative rate; = 1 - sensitivity; see also commission)
Error that represents the failure of a model to predict a presence of the species once a binary prediction of suitability is obtained (by applying a threshold above which a grid cell is considered suitable or predicted presence).

overfit
A model that is fit too tightly to the training data. This reduces the generality and utility of the model, which will have high performance in predicting the sample with which it was trained but perform poorly on other datasets (e.g. those withheld for validation, deriving from additional sampling, or corresponding to transfer of the model to other areas or time periods; see transfer).

sampling bias
When known occurrences of the species reflect vagaries associated with patterns of biological sampling (e.g. greater efforts near roads or biodiverse regions). This sampling bias in geographic space usually translates into a bias in environmental space.

sensitivity
The ability of a model to predict a presence of the species (see omission).

specificity
The ability of a model to predict an absence of the species (see commission).

study region
The geographic region in which a species distribution model is trained, and from where the sample of presences and any environmental background (or pseudoabsence) information is obtained.

training (= fitting; referred to as calibration in some literature; Peterson et al. 2011)
The process by which a particular model is built (Hastie et al. 2009, Zurell et al. 2020b). Often this includes automated iterations with subsets of the data, when plausible environmental associations are explored and evaluated by the algorithm, aiming to achieve an optimal solution (e.g. maximizing likelihood in regression, or minimizing relative entropy in MaxEnt; Merow et al. 2013, Phillips et al. 2017).

transfer
Use of a model in a place or time period different from that in which it was trained. If non-analog environments are encountered (different from those in which the model was trained), extrapolation of the model is required (e.g. following response curves beyond the point of truncation; Guevara et al. 2018b).

tuning
Approximating optimal parameterization and model complexity for a given dataset. This is achieved by building multiple models differing in the underlying parameters (or settings that constrain their estimation from the data) and choosing the best one(s) according to specific evaluation criteria (Elith et al. 2011, Merow et al. 2013).
Figure 2. Illustration of interactions between pairs of the ten most problematic hazards in species distribution modeling covered in this paper (arrows; Table 2). The colored shapes denote the corresponding step of modeling (orange hexagons: data gathering and processing; purple circles: model building; blue squares: model evaluation; green ovals: model interpretation). Red arrows highlight an especially misleading interaction among hazards from three different steps of modeling that together can trick researchers into making overfit models that falsely appear to have excellent performance (Hazards 2, 4 and 8). Other interactions appear in black. Notably, Hazard 6 shows interactions with over half of the other individual hazards, emphasizing the far-reaching repercussions of violating its associated assumptions. In addition to these direct interactions explained in the text, the various hazards also have diffuse effects on each other (not shown), including the cascading impacts of Hazard 1 (overlooking errors in occurrence datasets).