Occurrence–habitat mismatching and niche truncation when modelling distributions affected by anthropogenic range contractions

Human‐induced pressures such as deforestation cause anthropogenic range contractions (ARCs). Such contractions present dynamic distributions that may engender data misrepresentations within species distribution models. The temporal bias of occurrence data—where occurrences represent distributions before (past bias) or after (recent bias) ARCs—underpins these data misrepresentations. Occurrence–habitat mismatching results when occurrences sampled before contractions are modelled with contemporary anthropogenic variables; niche truncation results when occurrences sampled after contractions are modelled without anthropogenic variables. Our understanding of their independent and interactive effects on model performance remains incomplete but is vital for developing good modelling protocols. Through a virtual ecologist approach, we demonstrate how these data misrepresentations manifest and investigate their effects on model performance.


| INTRODUCTION
As global changes increasingly threaten biodiversity and ecosystems, there is a need to model and understand the effects of these changes on species distributions (Guisan & Thuiller, 2005; Guisan et al., 2013; Newbold, 2018). The most extensive and prevalent changes in recent times are caused by human pressures such as poaching or deforestation, which manifest in anthropogenic range contractions (ARCs) (Maxwell et al., 2016; Newbold et al., 2015).
Hence, predictions of biodiversity scenarios need to integrate species responses to such pressures through anthropogenic variables such as land-use/land-cover (Titeux et al., 2016).
A common approach towards measuring and predicting changes in species distributions is the use of species distribution models (SDMs), which estimate habitat suitability by correlating species' occurrences to prevailing environmental conditions (Soberon & Peterson, 2005).
However, SDMs typically assume distributions to be at equilibrium with their environment (Araújo & Pearson, 2005), which may lead to data misrepresentations when modelling species already affected by ARCs. The crux of these data misrepresentations stems from the incongruence between static occurrence data and temporally dynamic distributions, where occurrence datasets will inevitably exhibit some form of temporal bias relative to the progression of ARC (Boakes et al., 2010;Ryo et al., 2019). Datasets biased to the past or to more recent times, respectively, are more likely to have occurrences sampled before or after an ARC.
Consider the occurrence records of a forest-dependent species sampled from forests in the decades prior to clearance in the year 1980 for urban land-use (Figure 1). When an SDM using those occurrence data incorporates a land-use variable from the year 2000, occurrence-habitat mismatching occurs, as occurrences are mismatched with the unsuitable urban land-use class (Milanesi et al., 2020; Pang et al., 2021; Ryo et al., 2019). Thus, the resultant model may wrongly infer urban land-use as suitable and underestimate the true impact of deforestation on the species' range. This situation arises because of the lack of temporal range and resolution for many anthropogenic variables, and the conventional use of static predictors in most SDM studies; hence, contemporary anthropogenic variables are often utilized in SDM studies despite possible mismatching with historical occurrences (Garcia et al., 2013; Marshall et al., 2018; [...] can be accounted for post-hoc (see Gomes et al., 2019; Manchego et al., 2017; Newbold, 2018), excluding anthropogenic variables facilitates niche truncation when occurrences are sampled after an ARC.
Consider, alternatively, the same species and deforestation scenario, but wherein occurrences were sampled in 1990, that is, after land-use conversions (Figure 1). Those occurrences would represent only a subset of the species' historical range, and by extension, a subset of the species' historically realized niche (Colwell & Rangel, 2009; Hutchinson, 1957; Scheele et al., 2017). When an SDM using those occurrence data lacks a land-use variable, niche truncation occurs, because absences due to urban land-use are erroneously attributed to the available environmental variables (Barve et al., 2011; Owens et al., 2013). This truncated estimate of the niche may undermine model transferability, leading to inaccurate predictions of species distribution across space (e.g. invasive potential) and time (e.g. climate change) (Guisan et al., 2014; Peterson et al., 2018).
Occurrence-habitat mismatching and niche truncation are thus conceptually related outcomes of ARCs but arise out of opposing temporal biases and modelling protocols.
In comparison, multiple real species studies have found evidence for niche truncation due specifically to ARCs (Faurby & Araújo, 2018; Gibson et al., 2019; Martínez-Freiría et al., 2016; Rutrough et al., 2019). However, utilizing real species hinders assessments of model performance because knowledge of historical distributions and the full extent of ARCs is incomplete. Indeed, neither data misrepresentation can be accurately quantified for real species datasets, making it difficult to disentangle their effects on SDMs.
Furthermore, because previous studies have investigated each data misrepresentation in isolation, possible interactions between them have gone unstudied. Specifically, although studies have shown that incorporating an anthropogenic variable allows SDMs to account for niche truncation (Gibson et al., 2019; Nüchel et al., 2018; Requena-Mullor et al., 2019; Silva et al., 2012), occurrence-habitat mismatching may undermine the effectiveness of such an approach.
Developing effective solutions for overcoming these data misrepresentations requires a thorough understanding of their independent and interactive effects on SDM performance (Araújo et al., 2019); in other words, evaluating these data misrepresentations in tandem.
In this study, we explore how data misrepresentations may arise from interactions between ARCs and temporal sampling bias and evaluate them simultaneously to isolate their independent and interactive effects on SDM performance. We overcome limitations associated with incomplete knowledge on historical distributions and extent of ARC by adopting a virtual ecologist approach (Austin et al., 2006;Meynard et al., 2019;Zurell et al., 2010). Using 100 virtual species, we simulated ARCs with real anthropogenic land-use data, and generated occurrence datasets with temporal biases based on real-world trends. First, we quantify and characterize these data misrepresentations as a function of both ARC and the temporal bias. Second, we model datasets with and without a contemporary anthropogenic variable and with a temporally dynamic anthropogenic variable, before comparing model predictions against the true historical and contemporary distributions of our virtual species. Based on the findings, we then propose strategies to tackle these data misrepresentations and improve model reliability.

| Create virtual species
The virtual ecologist approach works by simulating species distributions and observer models to generate virtual occurrence data (Zurell et al., 2010). The virtual occurrence data can be modelled and evaluated against the 'true' distribution, thus supporting an assessment of factors contributing to data quality, and a quantification of SDM performance (Austin et al., 2006; Meynard et al., 2019). The true distribution of virtual species, henceforth referred to as species unless otherwise specified, was based on environmental suitability and fractional cover of anthropogenic land-use. Environmental suitability was determined by the species' environmental niche, which was created from 15 bioclimatic (5 arc min resolution) and nine soil variables (250 m resolution) sourced from WorldClim.org and SoilGrid.org respectively (Fick & Hijmans, 2017; Hengl et al., 2017) (Table S1.2). These variables were first resampled (bilinear) to 5 arc min resolution before a principal component analysis was conducted, where the first five PC-axes (~85% variance) were retained (Table S1.3). Retained PC-axes were then used to randomly create environmental niches through the 'generateRandomSp' function from the 'virtualspecies' package in R, which produced continuous estimates of environmental suitability in geographical space based on realistically viable responses (for details see Leroy et al., 2016). One hundred environmental niches and their corresponding estimates of environmental suitability were generated, which were allowed a maximum pairwise Pearson's correlation coefficient of .85 to prevent duplicate results. Although distributional shifts due to intra-/inter-annual climate variability may also result in misrepresentation of suitable habitats (Milanesi et al., 2020), we kept environmental variables static across the entire study period as this study focused on distributional changes caused by anthropogenic pressures.

| Simulate anthropogenic range contractions
Annual land-use maps (0.25° resolution rasters) from 1900 to 2000 were obtained from the Land Use Harmonization dataset (LUH2 v2h; <https://luh.umd.edu/data.shtml>), which indicated fractional cover for several classes (Hurtt et al., 2020). We generated an anthropogenic land-use class by summing urban, rangelands, managed pastures, C3 and C4 annual crops, C3 and C4 perennial crops, and C3 nitrogen-fixing crops land-use classes for each year. The maps were resampled (bilinear) to 5 arc min resolution to match environmental variables. Fractional cover of anthropogenic land-use was the only land-use class considered for this study, which was simulated to negatively affect habitat suitability. The negative effect was defined by a negative logistic regression that interacted with environmental suitability on a multiplicative scale, which replicated the contagion-like spread of extinction forces observed to induce ARCs (Figure S1.4) (Channell & Lomolino, 2000). This produced the final, annual, true distribution maps of continuous habitat suitability for the 100 species, based on their environmental niche and the fractional cover of anthropogenic land-use. The continuous true distribution maps of habitat suitability for each species were then converted to binary presence-absence distributions using species-specific thresholds of habitat suitability. Using the 'convertToPA' function from the 'virtualspecies' package in R (Leroy et al., 2016; R Core Team, 2013), thresholds were determined by fixing the initial prevalence (year 1900) of species at 0.25. Additionally, four other prevalence values were used to test for sensitivity (Figures S2.3-S2.6).
FIGURE 2 Methods flowchart for this study indicating its six phases: (1) create virtual species; (2) simulate anthropogenic range contractions; (3) generate temporally explicit occurrence datasets; (4) model virtual species distributions; (5) evaluate model performance; and (6) quantify data misrepresentations and their effects on model performance. A more detailed version can be found in Figure S2.

Anthropogenic range contraction (proportion of initial range lost) for a species was calculated as the proportion of 'presence' pixels that changed to 'absence' from 1900 to 2000, due to an increase in fractional cover of anthropogenic land-use. We utilized the 'raster' package in R for the resampling of variables and the summation of the different land-use classes (Hijmans & Etten, 2012; R Core Team, 2013).
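The steps of this phase can be illustrated with a minimal Python sketch. It is not the R workflow used in this study: the function names are hypothetical, the logistic midpoint and steepness are assumed values (the fitted curve is shown only in Figure S1.4), and the eight-cell landscape is toy data. The sketch scales environmental suitability by a decreasing logistic function of anthropogenic land-use cover, fixes the presence threshold from the initial (year 1900) prevalence, and computes the proportion of initially occupied cells that are lost.

```python
import math

def landuse_effect(frac_cover, midpoint=0.5, steepness=10.0):
    """Decreasing logistic multiplier in (0, 1); midpoint and steepness are assumed."""
    return 1.0 / (1.0 + math.exp(steepness * (frac_cover - midpoint)))

def habitat_suitability(env_suit, frac_cover):
    """Environmental suitability scaled multiplicatively by the land-use effect."""
    return [e * landuse_effect(f) for e, f in zip(env_suit, frac_cover)]

def prevalence_threshold(suitability, prevalence=0.25):
    """Suitability cut-off that leaves roughly `prevalence` of cells as presences."""
    ranked = sorted(suitability, reverse=True)
    n_present = max(1, round(prevalence * len(ranked)))
    return ranked[n_present - 1]

def range_contraction(suit_start, suit_end, threshold):
    """Proportion of initially occupied cells that fall below the threshold."""
    present_start = [s >= threshold for s in suit_start]
    lost = sum(1 for p, s in zip(present_start, suit_end) if p and s < threshold)
    return lost / sum(present_start)

# Toy landscape of eight cells (illustrative values only).
env = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2, 0.1, 0.05]
landuse_1900 = [0.0] * 8
landuse_2000 = [0.9, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

suit_1900 = habitat_suitability(env, landuse_1900)
suit_2000 = habitat_suitability(env, landuse_2000)
threshold = prevalence_threshold(suit_1900, prevalence=0.25)  # fixed from year 1900
arc = range_contraction(suit_1900, suit_2000, threshold)
```

In this toy landscape two cells are initially occupied and one of them is converted to anthropogenic land-use, so the contraction is 0.5.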

| Generate temporally explicit occurrence datasets
We simulated six temporal sampling patterns (two temporal sampling distributions crossed with three temporal sampling biases) based on generalizations of real species occurrence datasets (Figure 3; Figures S1.5 and S1.6) (GBIF.org, 2020). This ensured realistically possible data misrepresentations within each generated occurrence dataset. Temporal sampling distributions were either clustered (sampled over a short period) or spread (sampled more evenly across time).
Temporal sampling biases were either past (majority sampled near the start of the study period), recent (near the end), or intermediate (neither near the start nor end). Two extreme sampling patterns were also included to act as both positive and negative controls: [...]
To generate temporally explicit species occurrence datasets (i.e. presence-only), 2000 points were randomly sampled across the study region. For each point, the probability of sampling from a particular year was based on the simulated sampling frequencies of a temporal sampling pattern (Figure 3). This was repeated for each temporal sampling pattern (including controls), which generated eight temporally unique datasets. Lastly, the occurrence probability for each sampling point was derived from the suitability of the habitat (both environment and land-use) and the binary range of the species for that year; occurrence probability outside the range was always zero. As replicates, this entire process was repeated 10 times.
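A minimal Python sketch of this sampling scheme is given below. It is an illustration under stated assumptions rather than the code used in this study: the function names are hypothetical, the two-year, three-cell landscape and the sampling frequencies are toy values, and treating the occurrence probability as a Bernoulli success probability is our assumption about how a point becomes an occurrence record.

```python
import random

def sample_years(year_freq, n_points, rng):
    """Draw one sampling year per point, weighted by the sampling frequencies."""
    years = list(year_freq)
    weights = [year_freq[y] for y in years]
    return rng.choices(years, weights=weights, k=n_points)

def occurrence_probability(suitability, threshold):
    """Habitat suitability inside the binary range, zero outside it."""
    return suitability if suitability >= threshold else 0.0

def simulate_occurrences(suit_by_year, year_freq, threshold, n_points, seed=0):
    """Return (cell, year) records for points at which the species was observed."""
    rng = random.Random(seed)
    n_cells = len(next(iter(suit_by_year.values())))
    records = []
    for year in sample_years(year_freq, n_points, rng):
        cell = rng.randrange(n_cells)  # random location in the study region
        p = occurrence_probability(suit_by_year[year][cell], threshold)
        if rng.random() < p:  # assumed Bernoulli draw on the occurrence probability
            records.append((cell, year))
    return records

# Toy example: cell 0 loses its suitability between 1900 and 2000.
suit_by_year = {1900: [0.9, 0.8, 0.2], 2000: [0.1, 0.8, 0.2]}
past_bias = {1900: 0.9, 2000: 0.1}  # hypothetical past-biased sampling frequencies
records = simulate_occurrences(suit_by_year, past_bias, threshold=0.5, n_points=2000)
```

Because occurrence probability is zero outside the range, every record falls in a cell that was within the species' range in its own sampling year, even if that cell is no longer suitable in the year 2000.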

| Model virtual species distributions
Species distribution models were developed for each occurrence dataset using the MaxEnt algorithm with its default settings (Phillips et al., 2006). The SDMs included the same environmental variables used to generate species as predictors (first five PC-axes), with 10,000 randomly sampled background points (Chefaoui & Lobo, 2008; VanDerWal et al., 2009). Models were trained under three protocols: without a land-use predictor, with a contemporary land-use predictor, and with a temporally dynamic land-use predictor.
Land-use predictors here were the same fractional cover of anthropogenic land-use used to simulate ARC. Models using a contemporary land-use predictor had occurrence and background points associated with land-use values from the year 2000, whereas models using a temporally dynamic land-use predictor had occurrence and background points associated with land-use values from the year of their collection (background points were temporally sampled to match the temporal sampling pattern of the occurrence dataset).

FIGURE 3 The six temporal sampling patterns (and median sampling year) visualized using sampling frequency (relative to the total number of occurrences for that species' dataset); simulated sampling frequencies in red and observed sampling frequencies for an example species in black. Patterns followed either a clustered or spread distribution (rows) and either a past, intermediate, or recent bias (columns). The simulated sampling frequencies for the six temporal sampling patterns were used to determine the probability of sampling from a specific year. Sampling frequency/probability for the controls (not shown) was 1 for their respective year.

For models trained with a land-use predictor, the land-use variable for the year 2000 was used to predict contemporary distributions and for the year 1900 to predict historical distributions. To delineate species' ranges, the probability distributions were converted to binary presence-absence using the maximizing sum of sensitivity and specificity threshold (Liu et al., 2005). All models were built with the 'sdmtune' package using the 'maxnet' function in R (Phillips et al., 2017;Vignali et al., 2019).
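The thresholding rule can be sketched in a few lines of Python. This is a generic illustration of the maximum sum of sensitivity and specificity criterion, not the routine used in this study: the function name is hypothetical, the scores and labels are toy values, and which points serve as the negative class (absences or background points) is left as an assumption.

```python
def max_sss_threshold(scores, labels):
    """Threshold maximizing sensitivity + specificity over the observed scores."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_t, best_sum = None, -1.0
    for t in sorted(set(scores)):
        # A cell is predicted present when its score is at or above the threshold.
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < t)
        total = tp / n_pos + tn / n_neg
        if total > best_sum:
            best_t, best_sum = t, total
    return best_t

# Toy predicted suitabilities and their labels (1 = presence, 0 = negative class).
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 0, 1, 0, 0]
threshold = max_sss_threshold(scores, labels)
binary = [int(s >= threshold) for s in scores]
```

For these toy values the selected threshold is 0.7, which keeps three of the four presences while excluding all three negatives.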

| Quantify data misrepresentations and their effects on model performance
We used two indices to quantify occurrence-habitat mismatching and niche truncation within each occurrence dataset: mean difference in habitat suitability and niche dissimilarity respectively. For each occurrence point, a mismatch between the occurrence and associated habitat was calculated by subtracting the true suitability of the habitat for the year 2000 from its true suitability at the time of sampling, that is, the difference in habitat suitability. Difference values were then averaged across points to quantify occurrence-habitat mismatching within a dataset.
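As a concrete illustration, the mean difference in habitat suitability can be computed as in the following Python sketch. The function name, the (cell, year) record format and the toy suitability values are hypothetical choices made for this example.

```python
def occurrence_habitat_mismatch(records, suit_by_year, reference_year=2000):
    """Mean of (suitability at sampling time - suitability in the reference year)."""
    diffs = [
        suit_by_year[year][cell] - suit_by_year[reference_year][cell]
        for cell, year in records
    ]
    return sum(diffs) / len(diffs)

# Toy example: cell 0 was suitable in 1950 but is degraded by 2000.
suit_by_year = {1950: [0.8, 0.6], 2000: [0.2, 0.6]}
records = [(0, 1950), (1, 1950), (1, 2000)]  # (cell index, sampling year)
mismatch = occurrence_habitat_mismatch(records, suit_by_year)
```

Only the first record contributes a non-zero difference (0.8 − 0.2 = 0.6), so the dataset-level index is approximately 0.2.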
Niche dissimilarity was calculated as 1 − I, where I is the overlap between niches represented by the true historical distribution (year 1900) and occurrence dataset for each species (Warren et al., 2008).
The niches were derived from environmental variables only (five PC-axes), using the 'nicheOverlap' function from the 'dismo' package in R (R Core Team, 2013). The niche dissimilarity score from the 10 negative controls for each species served as a baseline value for species-specific correction (Figure S1.7). Because of this correction, negative values could result, which indicate that the calculated niche truncation was less than the established baseline.
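The index can be illustrated with the short Python sketch below. It uses the Hellinger-based form of the I statistic as we understand it, I = 1 − ½ Σ(√p_x − √p_y)², applied to two suitability surfaces that are each normalised to sum to one. The function names are hypothetical, and applying the baseline correction by simple subtraction is an assumption, since the exact form of the correction is described only in Figure S1.7.

```python
import math

def normalise(values):
    """Rescale non-negative values so that they sum to one."""
    total = sum(values)
    return [v / total for v in values]

def warren_i(suit_x, suit_y):
    """I overlap statistic between two suitability surfaces on the same cells."""
    px, py = normalise(suit_x), normalise(suit_y)
    h2 = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(px, py))
    return 1.0 - 0.5 * h2

def niche_dissimilarity(suit_x, suit_y, baseline=0.0):
    """1 - I, minus a species-specific baseline (subtraction is assumed here)."""
    return (1.0 - warren_i(suit_x, suit_y)) - baseline

identical = niche_dissimilarity([0.2, 0.5, 0.3], [0.2, 0.5, 0.3])
disjoint = niche_dissimilarity([1.0, 0.0], [0.0, 1.0])
corrected = niche_dissimilarity([0.2, 0.5, 0.3], [0.2, 0.5, 0.3], baseline=0.05)
```

Identical surfaces give a dissimilarity of zero and fully disjoint surfaces give one; subtracting a positive baseline from a near-zero score produces the negative values mentioned above.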
The two indices, mean difference in habitat suitability and niche dissimilarity (hereafter simply occurrence-habitat mismatching and niche truncation), were plotted against ARC (proportion of range lost) for each temporal sampling pattern. Generalized linear models (GLMs) were used to assess the effects of data misrepresentations on the performance of SDMs under the three modelling protocols.
GLMs were built separately to evaluate model predictions of contemporary and historical distributions, for each model evaluation metric. The GLMs were built in R using the 'stats' package (R Core Team, 2013).

| The manifestation of occurrence-habitat mismatching and niche truncation
Our results showed that the proportion of range lost during ARC was positively correlated to both data misrepresentations (Figure 4).
This was clearest for the positive controls, which maximized their respective misrepresentations and had regression lines against ARC with the highest slope and R² values (slope = 0.352, R² = .73 for occurrence-habitat mismatching; slope = 0.232, R² = .44 for niche truncation). Regression slopes, however, varied among temporal sampling biases: datasets with a past bias exhibited a higher slope for occurrence-habitat mismatching but a lower slope for niche truncation and vice versa for datasets with a recent bias (Figure 4).
This demonstrates ARC as the primary driver of both data misrepresentations, while the temporal sampling bias determined the specific data misrepresentation that manifests. We also observed a significant negative partial correlation between data misrepresentations when controlling for ARC (n = 8000, r = −.63, p < 0.001), indicating a negative relationship between them. For clarity and simplicity, clustered and spread temporal sampling distributions were aggregated since trends between them were similar (for disaggregated results, see Figure S2.1).
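The logic of a partial correlation can be illustrated with the first-order formula r_xy·z = (r_xy − r_xz r_yz) / √((1 − r_xz²)(1 − r_yz²)). The Python sketch below is a generic illustration with hypothetical function names and toy data; the source does not state how the partial correlation was computed, so this should not be read as the procedure used here.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_correlation(x, y, z):
    """First-order partial correlation of x and y, controlling for z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

# Toy data: both indices increase with ARC, but a temporal-bias term pushes
# them in opposite directions for any given level of ARC.
arc = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
bias = [0.05, -0.05, 0.05, -0.05, 0.05, -0.05]
mismatching = [a + b for a, b in zip(arc, bias)]
truncation = [a - b for a, b in zip(arc, bias)]

raw_r = pearson(mismatching, truncation)
partial_r = partial_correlation(mismatching, truncation, arc)
```

In this toy example the raw correlation between the two indices is positive because both track ARC, whereas the partial correlation is strongly negative once ARC is controlled for.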

| Relationship between model performance and data misrepresentations
The effects of occurrence-habitat mismatching and niche truncation on model performance depended on the prediction scenario (that is, historical or contemporary distributions) and the modelling protocol. For predictions of historical distributions (Figure 5a), models without the land-use predictor were negatively affected by niche truncation. As niche truncation increased, predictions of models without the land-use predictor were less correlated with historical distributions and more severely underpredicted them. Models with either land-use predictor, however, were relatively unaffected by increases in either data misrepresentation; model correlation scores remained high while under- and overprediction rates remained low.
Notably, for models with the contemporary land-use predictor, this meant niche truncation was accounted for despite occurrence-habitat mismatching and its effects. The spatially explicit example also confirmed these results. When niche truncation was high, underpredictions of the historical distribution only occurred for the model without the land-use predictor (Figure 6a). Although statistically significant, interactions between data misrepresentations were minor (for GLM coefficient estimates, see Figure S2.2).
On the other hand, for predictions of contemporary distributions (Figure 5b), reductions in model performance were observed primarily for models with and without the contemporary land-use predictor. For models without the land-use predictor, both data misrepresentations had a strong negative effect on model performance: as either increased, predictions became less correlated with contemporary distributions and more severely overpredicted them. This was because those models lacked the necessary predictor required to predict ARCs. For models with the contemporary land-use predictor, occurrence-habitat mismatching had strong negative and independent effects on model performance (lower correlation and higher overprediction). Niche truncation had no independent effects but interacted with occurrence-habitat mismatching to negatively affect model performance. However, this negative effect, as for models without the land-use predictor, was due to model inability to predict ARCs. Specifically, as occurrence-habitat mismatching increased, model performance became increasingly similar to that observed for models without the land-use predictor, suggesting a convergence in model predictions. The spatially explicit example confirmed this convergence between models with and without the contemporary land-use predictor, showing near-identical range estimates of contemporary distribution when occurrence-habitat mismatching was high (Figure 6b). Additionally, in response to a linear increase in occurrence-habitat mismatching, an exponential reduction in the variable importance of the contemporary land-use predictor was observed (Figure S2.7). While the exponential reduction in variable importance suggests model sensitivity to mismatching, the near zero variable importance implies an uninformative contemporary anthropogenic predictor.
Although occurrence-habitat mismatching was also found to negatively affect models with the temporally dynamic land-use predictor, the effects were substantially reduced under even minor increases in niche truncation (Figure 5b). This indicates that models with the temporally dynamic land-use variable generally performed well, except for predictions of contemporary distributions based on datasets almost entirely dominated by occurrence-habitat mismatching. This is because such datasets would not have captured any of the effects of ARC on species occurrences. Nevertheless, models with the temporally dynamic anthropogenic predictor typically outperformed the other two models, regardless of the prediction scenario or data misrepresentations present (Figures 5b and 6b).

| The insensitive AUC metric
[...] using the training data) (AUC > 0.74; Figure S2.8). This indicates that the AUC metric was relatively insensitive to model degradations due to occurrence-habitat mismatching and niche truncation. We did not find TSS or KAPPA to be as insensitive to model degradations; TSS was insensitive to model overpredictions only (i.e. contemporary predictions), whereas reductions in KAPPA reflected decreases in correlation scores and increases in either under- or overpredictions (Figures S2.4-S2.6).
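One plausible reason for the differing sensitivity of TSS and KAPPA to overprediction can be seen from their standard definitions. The Python sketch below applies those definitions to toy confusion-matrix counts; the function names and counts are hypothetical, and the example illustrates the metrics' general behaviour at low prevalence rather than reproducing the analysis in this study.

```python
def tss(tp, fp, fn, tn):
    """True skill statistic: sensitivity + specificity - 1."""
    return tp / (tp + fn) + tn / (tn + fp) - 1.0

def cohen_kappa(tp, fp, fn, tn):
    """Cohen's kappa: agreement beyond that expected by chance."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    return (observed - expected) / (1.0 - expected)

# Toy landscape of 1000 cells with 50 true presences (prevalence 0.05).
perfect = dict(tp=50, fp=0, fn=0, tn=950)
overpredicted = dict(tp=50, fp=100, fn=0, tn=850)  # range overestimated threefold

tss_perfect, kappa_perfect = tss(**perfect), cohen_kappa(**perfect)
tss_over, kappa_over = tss(**overpredicted), cohen_kappa(**overpredicted)
```

With these counts, predicting three times the true range lowers TSS only to about 0.89, whereas KAPPA falls to about 0.46, because the false positives are diluted among many true absences in the specificity term.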

| DISCUSSION
Understanding the processes leading up to data misrepresentations and their subsequent effects on SDM performance is essential for developing good modelling practices (Fourcade et al., 2018;Guillera-Arroita, 2017;Peterson et al., 2018;Yates et al., 2018). Yet, our understanding of occurrence-habitat mismatching and niche truncation, which arise out of interactions between ARCs and temporally biased data, remains incomplete.
Importantly, their interactive effect on model performance is unknown. A complete understanding of these data misrepresentations is especially pertinent considering the extent and prevalence of ARCs and the unavoidability of temporally biased data (Boakes et al., 2010; Maxwell et al., 2016; Newbold et al., 2015). Through the virtual ecologist approach, we demonstrated how ARCs and temporal sampling biases can result in occurrence-habitat mismatching and niche truncation, and significantly impact model performance.
[...] to account for niche truncation as independent processes; mismatching hinders the former but not the latter. This independence probably stems from the fact that range contractions due to anthropogenic pressures are spatially heterogeneous and can spread independently of environmental suitability and demographic processes (Channell & Lomolino, 2000; Scheele et al., 2017). As such, within the SDM itself, a species' response to the anthropogenic variable is modelled independently of the species' niche (Colwell & Rangel, 2009; Hutchinson, 1957; Scheele et al., 2017). Likewise, when modelling the niche, accounting for niche truncation through the anthropogenic variable is not based on the modelled response to that same variable.

| Implications for modelling ARC-affected species
The effects of occurrence-habitat mismatching have important methodological implications for modelling distributions shaped by anthropogenic pressures. In terms of modelling protocols, simply incorporating a contemporary anthropogenic variable into an SDM does not assure accurate predictions of distributions after ARC, regardless of the accuracy or precision of the anthropogenic variable, or its representativeness of the factors driving ARC. Should occurrence-habitat mismatching occur, models that incorporate a contemporary anthropogenic variable would severely underestimate [...] Titeux et al., 2016). The biased assessment of species vulnerabilities may then divert conservation efforts and funds away from species that need it.
Using resultant model outputs to inform systematic conservation planning can also skew habitat prioritizations and undermine the effective demarcation or expansion of protected areas, or worse, result in the allocation of resources into protecting already degraded habitats (McShea, 2014; Struebig et al., 2015). Our results also suggest that models were sensitive to occurrence-habitat mismatching, where relatively minor mismatching could engender the problems described.

FIGURE 5 The plotted and modelled effects of occurrence-habitat mismatching and niche truncation on the performance of SDMs under different modelling protocols, for predictions of (a) historical and (b) contemporary distributions. Points represent the observed performance of SDMs, and regression lines indicate the estimated performance. Model performance is indicated by the y-axis, niche truncation by the x-axis, and occurrence-habitat mismatching by the spectral colours, specifically 0.0 (blue), 0.1 (yellowish green) and 0.2 (orange) for the regression lines. Plots were faceted for each modelling protocol (horizontal facet) and metric of model performance (vertical facet). The metrics of model performance were for probabilistic performance (AUC and Pearson correlation; higher is better) and binary error rates (underprediction and overprediction rates; lower is better). Note that niche truncation values here are after species-specific corrections, where a negative value indicates that the truncation was less than the established baseline.

A key step towards ensuring model reliability is the detection of models affected by data misrepresentations (Araújo et al., 2019).
Models affected by occurrence-habitat mismatching, however, will be challenging to detect. Without quality absence data, overpredictions are notoriously difficult to verify (Leroy et al., 2018;Warren et al., 2019). This detection difficulty is exacerbated by the uninformative outcome of the anthropogenic variable, in that multiple factors other than mismatching may lower variable importance as well; whereas an erroneous response may be intuitively recognized (Gibson et al., 2019;Guevara et al., 2018;Soley-Guardia et al., 2016).
It is also possible to misinterpret the uninformative anthropogenic variable as a legitimate response, resulting in false inference of species resilience. On a related note, the supposed uninformative contemporary anthropogenic variable is somewhat misleading since it could remain important for extrapolating truncated niches. This highlights a need to reconsider how we understand and evaluate variable importance within dynamic landscapes, especially for variables with atypical responses (i.e. constrain-only) (Fourcade et al., 2018;Harisena et al., 2021;Smith & Santos, 2020).
Our findings reveal the viability and robustness of incorporating contemporary anthropogenic variables as a solution to niche truncation arising from ARC. The importance of this is twofold.
First, data and technological constraints often restrict anthropogenic variables to relatively contemporary timescales; datasets with large temporal ranges generally lack spatial resolution (e.g. the LUH dataset used in this study; Hurtt et al., 2020) or coverage (e.g. SERVIR-Mekong Land Cover; Saah et al., 2020). Thus, contemporary anthropogenic variables are often all that is available.
Second, although the alternative (incorporating historical data to reduce niche truncation) was done in past studies, those same studies recognized the limitation of such an approach; a complete niche cannot be guaranteed unless data from times predating early human civilization are obtained (Faurby & Araújo, 2018; Martínez-Freiría et al., 2016; Rutrough et al., 2019). Additionally, given the paucity and imprecisions of available occurrence records (Chen et al., 2012; Feeley & Silman, 2011; Ficetola et al., 2015; Goodwin et al., 2015; Meyer et al., 2016), obtaining sufficient decades-old yet accurate data to substantially reduce niche truncation is an improbable luxury for most species. Incorporating contemporary anthropogenic variables is therefore the most feasible approach to overcome niche truncation.

FIGURE 6 The spatially explicit binary accuracy of SDMs for predictions of an example species' (a) historical and (b) contemporary distribution. Binary accuracy maps were faceted for each modelling protocol (horizontal facet) and for two different datasets (vertical facet). While one dataset exhibited high occurrence-habitat mismatching (0.25) but low niche truncation (0.02) (i.e. sampled following a clustered past temporal sampling pattern), the other exhibited high niche truncation (0.20) but low occurrence-habitat mismatching (0.00) (i.e. sampled following a clustered recent temporal sampling pattern). In this example, the selected species experienced a 62.5% contraction of its historical range (virtual species N27, see Supporting Information). For illustration purposes, only mainland Southeast Asia is shown, as the main distribution of the selected species occurred there.

Our observed underpredictions of historical distributions due to niche truncations emphasize the wide-ranging implications of leaving truncations unaccounted for. When predicting potentially suitable habitats based on truncated niches: reforestation or rewilding projects will struggle to identify suitable habitats lost due to anthropogenic pressures (Jarvie & Svenning, 2018); invasive species models may fail to map potential dispersal corridors or areas of invasion (Liu et al., 2020); and model forecasts under climate change would exaggerate rates of range shift or loss (Faurby & Araújo, 2018; Martínez-Freiría et al., 2016). We also found niche truncation to degrade the accuracy of continuous estimates of relative suitability. This implies erroneous response curves and skewed estimates of variable importance, making SDMs affected by niche truncation inappropriate for empirical applications in general (Guevara et al., 2018; Harisena et al., 2021; Smith & Santos, 2020; Warren et al., 2019). Furthermore, although several studies for or against niche conservatism have accounted for geographical niche truncations, few have accounted for truncations due to anthropogenic factors (Atwater et al., 2018; Guisan et al., 2014; Petitpierre et al., 2012; Zhu et al., 2017). In other words, previously observed deviations from niche conservatism may be an artefact of unresolved anthropogenic niche truncation instead.
These examples show that anthropogenic niche truncation can lead to artefacts in SDM predictions, even in applications not focused on anthropogenic impacts.

| A recommended strategy for handling data misrepresentations
Based on our findings, we propose a general strategy for modelling species affected by ARCs: the three-step strategy of Maximize, Exclude and Test (MET) (Figure 7).
Maximize the temporal range of anthropogenic variables without over-sacrificing spatial resolution and relevance to ARCs, among others. Practitioners will have options for representing the anthropogenic pressures that drive range contractions. Although our results support selecting anthropogenic variables that can maximize the temporal range of temporally dynamic models, we cannot neglect other variable qualities. High spatial resolution may be prioritized over temporal range for regional/local SDMs, or for species whose responses are better explained by fine-scale variables (Mangiacotti et al., 2013; Marshall et al., 2021). Alternatively, for ARCs due to other factors such as poaching, human-density estimates, regardless of their temporal range, may be more appropriate as a predictor than land-use/land-cover (Gibson et al., 2019).
Therefore, it is important to select anthropogenic variables with large temporal ranges, while maintaining spatial resolutions aligned with the study objective(s) and representations of anthropogenic pressures relevant to the target species. Fortunately, the trade-off between temporal range and spatial resolution is being gradually reduced as the quality of anthropogenic variables like land-use/land-cover continues to improve (e.g. see Cao et al., 2021; Chen et al., 2020; Winkler et al., 2021).
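The trade-off described in this step can be sketched as a simple screening routine. This is only an illustration under stated assumptions, not the procedure used in the study: the class `CandidateVariable`, the functions `temporal_coverage` and `select_variable`, and all example values are hypothetical. Candidates are first filtered by a maximum acceptable cell size and by the pressure they represent, and the remaining candidate covering the largest share of occurrence sampling years is preferred.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateVariable:
    """A hypothetical anthropogenic predictor under consideration."""
    name: str
    start_year: int
    end_year: int
    resolution_km: float  # cell size; smaller means finer
    pressure: str         # e.g. "deforestation" or "poaching"


def temporal_coverage(var, occurrence_years):
    """Fraction of occurrence records sampled within the variable's temporal range."""
    if not occurrence_years:
        return 0.0
    inside = sum(var.start_year <= y <= var.end_year for y in occurrence_years)
    return inside / len(occurrence_years)


def select_variable(candidates, occurrence_years, max_resolution_km, pressure):
    """Pick the eligible candidate with the highest temporal coverage.

    Eligibility (resolution and pressure relevance) is checked first, so that
    temporal range is maximized without sacrificing the other qualities.
    Ties are broken in favour of the finer resolution.
    """
    eligible = [
        v for v in candidates
        if v.resolution_km <= max_resolution_km and v.pressure == pressure
    ]
    if not eligible:
        return None
    return max(
        eligible,
        key=lambda v: (temporal_coverage(v, occurrence_years), -v.resolution_km),
    )


# Hypothetical candidates and sampling years, for illustration only.
candidates = [
    CandidateVariable("fine_recent_lulc", 1992, 2015, 0.3, "deforestation"),
    CandidateVariable("coarse_long_lulc", 1900, 2015, 10.0, "deforestation"),
    CandidateVariable("human_density", 1990, 2020, 1.0, "poaching"),
]
years = [1950, 1975, 1995, 2005, 2010]

broad_scale_choice = select_variable(candidates, years, 10.0, "deforestation")
local_scale_choice = select_variable(candidates, years, 1.0, "deforestation")
```

With a 10 km tolerance the longer-running variable is selected because it covers every sampling year, whereas a 1 km tolerance leaves only the finer but shorter variable, which mirrors the regional/local case discussed above.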
Exclude mismatched occurrences when incorporating the selected anthropogenic variable to account for niche truncation. We demonstrated a negative relationship between occurrence-habitat mismatching and niche truncation; that is, although problematic occurrence data are often removed (Guélat & Kéry, 2018; Guillera-Arroita, 2017; Soley-Guardia et al., 2016), excluding mismatched occurrences will exacerbate niche truncation. However, we also found that when models incorporated an appropriate anthropogenic variable, niche truncations were rendered relatively inconsequential. Hence, we recommend reducing occurrence-habitat mismatching while relying on the selected anthropogenic variable to account for niche truncation. Studies can identify these mismatches via inferred/informed species-habitat relationships or their time of sampling (e.g. a forest-dependent bird occurrence matched against an urban land-use class or sampled before the earliest anthropogenic variable and/or any major ARC). However, in cases where mismatched occurrences cannot be removed (e.g. too few occurrences or difficult to verify), species responses to anthropogenic pressures can still be accounted for post-hoc (see Gomes et al., 2019; Manchego et al., 2017; Newbold, 2018).
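The two screening criteria mentioned above (an unsuitable land-use class at the record's location, or a sampling year that predates the earliest layer of the anthropogenic variable) can be sketched as follows. This is a minimal illustration rather than the study's implementation: the `Occurrence` class, the function names, the land-use labels and the cut-off year are all hypothetical, and in practice the set of unsuitable classes would come from inferred or informed species-habitat relationships.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Occurrence:
    """A hypothetical occurrence record."""
    record_id: str
    year: int      # year of sampling
    land_use: str  # class of the anthropogenic variable at the record's location


def is_mismatched(occ, unsuitable_classes, earliest_variable_year):
    """Flag a record that likely misrepresents the species-habitat relationship."""
    in_unsuitable_habitat = occ.land_use in unsuitable_classes
    predates_variable = occ.year < earliest_variable_year
    return in_unsuitable_habitat or predates_variable


def split_occurrences(occurrences, unsuitable_classes, earliest_variable_year):
    """Return (kept, excluded); excluded records are retained for later validation."""
    kept, excluded = [], []
    for occ in occurrences:
        if is_mismatched(occ, unsuitable_classes, earliest_variable_year):
            excluded.append(occ)
        else:
            kept.append(occ)
    return kept, excluded


# Hypothetical records for a forest-dependent species.
records = [
    Occurrence("r1", 1985, "forest"),  # sampled before the earliest layer
    Occurrence("r2", 2010, "urban"),   # matched against an unsuitable class
    Occurrence("r3", 2012, "forest"),
]
kept, excluded = split_occurrences(records, {"urban", "cropland"}, 1992)
```

Returning the excluded records instead of discarding them is deliberate: they remain available as an independent check on model hindcasts.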
Test for residual data misrepresentations by validating model hindcasts and measuring variability in anthropogenic variable importance. Because of real-world data limitations, studies must verify the anthropogenic variable selected in step 'Maximize', and the approach used to exclude mismatched occurrences in step 'Exclude'.
Model hindcasts, or projections onto a manually calibrated null/zero anthropogenic pressure predictor, can be validated using those previously excluded mismatched occurrences (Dobrowski et al., 2011; Maiorano et al., 2013), where high omission rates (underprediction rates) would indicate residual niche truncation unresolved by the anthropogenic variable (Boyce's index is a continuous alternative; Boyce et al., 2002; Hirzel et al., 2006). For species without mismatched occurrences, spatial and temporal cross-validations are a viable alternative (Roberts et al., 2017). Next, although models affected by occurrence-habitat mismatching are difficult to detect, it is possible to flag likely candidates using cross-validation or resampling techniques (Nisbet et al., 2018; Roberts et al., 2017; Ryo et al., 2019).
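The omission-rate check on hindcasts can be sketched in a few lines. The sketch assumes that hindcast suitability values have already been extracted at the locations of the excluded mismatched occurrences; the function names, the binarization threshold and the tolerance used to flag residual truncation are hypothetical choices, not values from the study.

```python
def omission_rate(suitability_at_presences, threshold):
    """Share of known presences that the hindcast predicts as unsuitable."""
    if not suitability_at_presences:
        raise ValueError("at least one presence is required")
    omitted = sum(s < threshold for s in suitability_at_presences)
    return omitted / len(suitability_at_presences)


def residual_truncation_suspected(rate, tolerance=0.1):
    """Flag the model when omissions exceed an (assumed) acceptable tolerance."""
    return rate > tolerance


# Hypothetical hindcast suitability at four excluded mismatched occurrences.
hindcast_values = [0.8, 0.6, 0.1, 0.4]
rate = omission_rate(hindcast_values, threshold=0.5)
flagged = residual_truncation_suspected(rate)
```

A high omission rate here would mean the hindcast fails to recover habitat that was demonstrably occupied before the contraction, which is the signature of niche truncation left unresolved by the anthropogenic variable.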
Model sensitivity to mismatching asserts that minor differences in

| Study limitations
Our study was limited to the presence-background machine learning algorithm MaxEnt (Phillips et al., 2006, 2017). Hence, data misrepresentations might affect other statistical models differently, particularly those using dissimilar input data, such as demographic or presence-absence distribution models (Holden et al., 2021; Norberg et al., 2019). However, statistical models that, like MaxEnt, use background or pseudo-absence points can be expected to behave similarly, though they may be more or less robust against data misrepresentations. Nevertheless, our limited testing of statistical models does not diminish the importance of recognizing occurrence-habitat mismatching and niche truncation and their implications for model performance.
FIGURE 7 Decision flowchart detailing the three-step strategy of Maximize, Exclude and Test (MET), with four possible modelling protocols (differentiated by their outlined colour). Blue has the highest reliability since the temporal range of the anthropogenic variable matches that of the occurrence data. Green has high reliability as a temporally dynamic SDM; although the loss of data may affect model estimates of environmental niches, the validation of model hindcasts tests for niche truncation. Orange depends on the process used to exclude potential mismatched occurrences, which may be more reliable for range contractions driven by spatially observable pressures (e.g. deforestation as opposed to poaching). Red has the lowest reliability, as it essentially bets on the incorporated anthropogenic variable to account for anthropogenic niche truncation.
Our findings may not apply to anthropogenic variables that integrate natural factors, for example variables that represent deforestation through canopy structure or vegetation indices. Since such variables may contain information relating to environmental suitability, mismatching with those variables may result in outcomes that differ from those observed in our study (Burns et al., 2020; Gibson et al., 2019; Girma et al., 2016). Moreover, while the anthropogenic land-use variable in this study perfectly represented our simulated ARCs, real-world anthropogenic variables will almost certainly have lower accuracy and precision (Prestele et al., 2016; Verburg et al., 2011), or exist at spatial scales too coarse to properly reflect species responses (Mangiacotti et al., 2013; Mertes & Jetz, 2018). Because of these real-world data constraints, incorporating a contemporary or temporally dynamic anthropogenic variable may not, in practice, result in model improvements comparable to those observed in our study.

| FUTURE PERSPECTIVES AND CONCLUSION
Our study reveals substantial challenges in modelling species distributions shaped by anthropogenic pressures. We showed how ARCs and temporally biased data can result in occurrence-habitat mismatching that impedes the predictions of anthropogenic-related absences, and niche truncations that lead to underestimations of environmentally suitable habitats. The effects of both data misrepresentations have important implications for SDM applications in various fields (McShea, 2014; Peterson et al., 2018; Yates et al., 2018). These data misrepresentations may be underappreciated in part because of the relative insensitivity of the popular metric, AUC, which could have obscured the impacts of these data misrepresentations in past studies. Our findings also reveal that variables can be important on different fronts, which current methods seldom consider when evaluating variable importance (Elith et al., 2011; Harisena et al., 2021; Smith & Santos, 2020).
Besides cross-validations and resampling techniques to examine deviations in variable importance, studies should consider measuring changes in predictive outcomes outside the training data, as well as within.
More research is equally required for developing and testing modelling protocols to overcome the challenges associated with modelling ARC-affected species. Our study highlighted the limitations of conventional modelling protocols: simply incorporating a contemporary anthropogenic variable or relying only on environmental variables. Although our study demonstrated the potential of temporally dynamic SDMs (Milanesi et al., 2020), selecting appropriate variables is but one facet of developing good modelling practices (Araújo et al., 2019). Methods regarding the handling of presences and absences will be a logical next step, such as restricting background points to enable niche extrapolations without the anthropogenic variable (as hinted at in geographical niche truncation studies; see Barve et al., 2011; Owens et al., 2013; Saupe et al., 2012). For that matter, our recommended strategy aims to guide rather than dictate exact modelling processes and could serve as a precedent for future innovations. Crucially, given how ubiquitous ARCs are (Maxwell et al., 2016; Newbold et al., 2015), these data misrepresentations are likely the rule rather than the exception.
Therefore, future studies should recognize occurrence-habitat mismatching and niche truncation borne from ARCs as highly pervasive and relevant and consider our recommended strategy for resolving them as an integral process of modelling species distributions.

ACKNOWLEDGEMENTS
We thank Ryan A. Chisholm for his comments, which greatly improved the final version of this manuscript. Funding for part of this research was from a Ministry of Education-Singapore AcRF Tier 1 Grant to ELW.

CONFLICT OF INTEREST
The authors of this study declare that they have no conflict of interest.

PEER REVIEW
The peer review history for this article is available at https://publons.com/publon/10.1111/ddi.13544.

DATA AVAILABILITY STATEMENT
For reproducibility, all supporting data are archived in Dryad <https://doi.org/10.5061/dryad.ttdz08m0g> and the R scripts