Predicting range expansion of invasive species: Pitfalls and best practices for obtaining biologically realistic projections

Species distribution models (SDMs) are widely used to forecast potential range expansion of invasive species. However, invasive species occurrence datasets often have spatial biases that may violate key SDM assumptions. In this study, we examined alternative methods of spatial bias correction and multiple methods for model evaluation for seven invasive plant species.


| INTRODUCTION
The geographic ranges of invasive species are shifting, both because they are often in disequilibrium with the environment (i.e. limited range filling) and because climate change has altered the distribution of suitable habitat. Early detection and prevention increasingly rely on species distribution models (SDMs) that forecast suitable habitats where surveillance and eradication efforts can be focused (Lodge et al., 2016; Marbuah, Gren, & McKie, 2014). However, invasive species pose unique challenges that stymie the model building, evaluation and selection process, as invasive species commonly violate underlying model assumptions and contain multiple sources of spatial bias (Elith, Kearney, & Phillips, 2010). Although a variety of approaches have been used to correct these biases, it is poorly understood how alternative methods perform on a common set of invasive species.
Building SDMs of invasive species has been challenging for two main reasons. First, invasive species inherently violate the model assumption that species are in equilibrium with the environment and occupy all regions suitable for their persistence. Second, modelled relationships are correlative and only hold over the range of climate variables included in the model-building process; therefore, projecting into uninvaded or incipient regions outside of the current invasive range may be problematic (Elith et al., 2010; Hastie & Fithian, 2013; Lodge et al., 2016).
Additionally, niche shifts are common between native and invaded ranges, which may lead to differences in the underlying environmental relationships in different regions of the global range (Le Maitre, Thuiller, & Schonegevel, 2008). Many researchers have attempted to address these model violations (Broennimann et al., 2007; Fourcade, Besnard, & Secondi, 2018; Mainali et al., 2015; Petitpierre et al., 2012). Nevertheless, these methods have not completely addressed spatial biases in invasive species SDMs.
Uneven species sampling is a common source of bias that can impede accurate modelling (Bradley, Blumenthal, Wilcove, & Ziska, 2010; Fourcade, Engler, Rödder, & Secondi, 2014; Merckx, Steyaert, Vanreusel, Vincx, & Vanaverbeke, 2011). Sampling bias can arise if species are challenging to detect or not a priority for detection, which results in occupied regions having no documented occurrence records. In addition, spatial biases unique to invasive species may further complicate the modelling approach, such as recurrent eradication efforts, highly uneven sampling programs across political units, introduction into unsuitable habitats (Elith et al., 2010; Lodge et al., 2016) and introduction via stochastic or anthropogenic long-distance dispersal events (e.g. ship ballast) (von der Lippe & Kowarik, 2007). These biases manifest in oversampling in the established portion of the invaded range and undersampling in regions where the invasive species is incipient, leading to uneven prevalence across the invaded range. Identifying the sources of spatial bias in invasive species, and understanding how they interact with different modelling methods, may help to build more robust and reliable SDMs.
Among the many tools available for building SDMs, Maxent is the most widely used, as it is found to perform well for most species (Hao, Elith, Guillera-Arroita, & Lahoz-Monfort, 2019). Maxent contrasts environments of occurrence points with those sampled from background locations to determine which combinations of variables best predict occurrences (Elith et al., 2010, 2011; Merow, Smith, & Silander, 2013). One key assumption in SDMs is that sample prevalence (the frequency of sampled sites in the total study area) accurately represents species prevalence (the frequency of species over the total study area). Maxent uses a constant species prevalence value across the species' range, despite the fact that many species vary in prevalence geographically or have been documented with differing levels of sampling effort in different regions. While a number of methods have been developed to mitigate the potential effects of spatially biased occurrence records, two methods have shown promise: downsampling occurrence points and modifying the sampling distribution of background points (Kramer-Schadt et al., 2013; Phillips & Dudík, 2008). Downsampling addresses spatial bias by thinning dense clusters of occurrences where oversampling may have occurred (Fourcade et al., 2014). In model building, this effectively reduces the environmental importance placed on oversampled regions (Fourcade et al., 2014).
Alternatively, one can modify the distribution of background point sampling to match the spatial bias that exists in the occurrence points (Barbet-Massin, Jiguet, Hélène Albert, & Thuiller, 2012;Kramer-Schadt et al., 2013;Phillips & Dudík, 2008). Background sampling with a Gaussian Kernel Density (GKD) distribution can be used to sample a greater proportion of background points in areas of highly sampled occurrence points (Elith et al., 2010). GKD sampling draws more background points in areas with more occurrences, which can enable greater discriminating power in areas that are heavily sampled.
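To make the idea concrete, GKD-weighted background sampling can be sketched in a few lines. The study's models were built with Maxent and R packages; the standalone Python sketch below (the function name `gkd_background` and its arguments are ours, not from the study) samples background cells with probability proportional to a surface built by summing Gaussian kernels centred on the occurrence points:

```python
import numpy as np

def gkd_background(occ_xy, candidate_xy, n_points, lam, seed=0):
    """Sample background points from candidate cells with probability
    proportional to a surface built by summing Gaussian kernels
    (dispersion parameter lam) centred on each occurrence point.
    Smaller lam concentrates sampling around dense occurrence clusters."""
    occ = np.asarray(occ_xy, dtype=float)
    cand = np.asarray(candidate_xy, dtype=float)
    # Squared distance from every candidate cell to every occurrence
    d2 = ((cand[:, None, :] - occ[None, :, :]) ** 2).sum(axis=2)
    weights = np.exp(-d2 / (2.0 * lam ** 2)).sum(axis=1)
    probs = weights / weights.sum()
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(cand), size=n_points, replace=False, p=probs)
    return cand[idx]
```

With a small λ, sampled background points concentrate near occurrence clusters; with a large λ, the surface flattens and sampling approaches uniform random selection.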
Model discrimination metrics, such as those relying on a confusion matrix, are commonly used to evaluate model performance.
The area under the curve (AUC) is most frequently used due to its threshold-independence, but it has been criticized due to its equal weighting of commission and omission errors and dependence on prevalence (Leroy et al., 2018). Alternative metrics such as the partial AUC and sensitivity (TP/(TP + FN)) have been proposed in order to mitigate problems caused by assuming background points in presence-only modelling are errors of commission (Allouche, Tsoar, & Kadmon, 2006). The Continuous Boyce Index provides a spatially explicit metric to account for correlation between a model's prediction and occurrences (Boyce et al., 2002; Hirzel, Le Lay, Helfer, Randin, & Guisan, 2006). Last, model complexity should be considered in the model evaluation process; complex models tend to be overfit (i.e. high performance on training datasets, low performance on testing datasets) and often lack generalization (Yates et al., 2018; Brun, Thuiller, Chauvier, Pellissier, & Wüest, 2019). Model complexity may be assessed by recording the number of model parameters or using information criteria, such as AICc, which balances model gains with penalties for added parameters (Akaike, 1974).
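For reference, AICc is AIC with a small-sample correction that grows as the parameter count approaches the sample size. A minimal sketch (the function name is ours):

```python
def aicc(log_likelihood, k, n):
    """AICc: Akaike Information Criterion with small-sample correction.
    k = number of fitted parameters, n = number of occurrence records."""
    aic = 2 * k - 2 * log_likelihood
    # Correction term penalizes parameter-rich models fitted to few records
    return aic + (2 * k * (k + 1)) / (n - k - 1)
```

Holding the likelihood fixed, adding a parameter always increases AICc, which is why it penalizes the highly parameterized models discussed later in the paper.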
For invasive species, model selection needs to explicitly consider transferability because one goal of invasive species SDMs is to transfer predictions into incipient or unoccupied habitats (Morey & Venette, 2020; Wenger & Olden, 2012; Yates et al., 2018). One approach partitions cross-validation datasets using a spatially explicit block method, where the modelled area is divided into geographic regions that contain roughly the same number of occurrences (Wenger & Olden, 2012; Yates et al., 2018). Modellers can then estimate performance based on spatially explicit cross-validation; overall model performance across folds is a useful indicator of how well SDMs transfer between geographic regions (Finnoff, Shogren, Leung, & Lodge, 2007; Lodge et al., 2016; Yates et al., 2018). Additionally, the coefficient of variation (CoV) among cross-validated data partitions may provide insight into how well SDMs generalize among geographic regions, where greater CoV indicates that model performance varies depending on the geographic region.
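One simple way to build four geographic blocks with roughly equal occurrence counts is a median split on longitude followed by median splits on latitude within each half. This hypothetical sketch illustrates the idea and is not necessarily the study's exact blocking procedure:

```python
import numpy as np

def spatial_blocks(lon, lat):
    """Assign occurrences to four geographic blocks with roughly equal
    counts: split at the median longitude, then at the median latitude
    within each longitudinal half. Returns a block id (0-3) per record."""
    lon = np.asarray(lon, dtype=float)
    lat = np.asarray(lat, dtype=float)
    block = np.zeros(len(lon), dtype=int)
    west = lon < np.median(lon)
    for half, offset in ((west, 0), (~west, 2)):
        south = lat < np.median(lat[half])  # full-length mask
        block[half & south] = offset
        block[half & ~south] = offset + 1
    return block
```

Each block can then serve in turn as the held-out fold for spatially structured cross-validation.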
For this study, we developed SDMs for seven invasive species with different invasion histories. We refer to these models as SDMs throughout the text, as we use climate variables to determine potential areas of occupancy but do not attempt to model the niche globally; the study species are listed in Table S1. All species presented in this paper are native to Europe and portions of western Asia, are considered invasive across the continental United States and are of special concern in the Upper Midwest (Minnesota, Wisconsin, Illinois, Iowa, North Dakota and South Dakota).

| Spatial data
We downloaded occurrence records from two publicly available databases (Table S2) and processed them to remove duplicate, erroneous and/or imprecise records.

| Spatial data processing
Datasets were trimmed to include only the invaded range of North America. We used occurrence records from only the invaded range to capture unique niche shifts that may have occurred during invasion (Broennimann et al., 2007). Additionally, we produced a set of models containing the occurrences from both the native and invaded range, the results of which did not appreciably change modelling outcomes (See Supplementary Materials). To reduce pseudoreplication among the invaded-only occurrence records, we resampled occurrences to a baseline resolution of 1 km, the resolution of the climate data (hereafter: filtered datasets).

| Occurrence record sampling and background point selection
For each species, we developed two occurrence datasets and three background datasets as inputs for model construction in a 2 × 3 factorial design. For the occurrence datasets, our filtered datasets were downsampled to one occurrence per 1 km raster grid cell to match the resolution of the environmental data. We generated an additional dataset downsampled to one occurrence per 20 km raster grid cell, which effectively reduced prevalence in very densely sampled clusters without removing overall patterns of landscape occurrence. We then generated three different forms of background point selection: randomly selected background points and two different Gaussian kernel density (GKD) distributions. For the random background points, we used a polygon created by buffering occurrence points by 200 km and dissolving overlaps, from which we selected the default 10,000 background points. GKD background point selection aligns the background point density more closely with the density of occurrence points; we generated two GKD background datasets for each occurrence dataset. The probability of choosing a background point was based on a surface generated by summing Gaussian kernels (with dispersion parameter λ) centred at each occurrence point. For the GKD background points, we varied the dispersion parameter λ to create both highly clustered (λ = 1) and moderately clustered (λ = 3) background points.
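Grid-based occurrence downsampling can be sketched as follows. This Python illustration keeps at most one record per raster cell; the function name and the random choice of which record survives are our assumptions, not the study's implementation:

```python
import numpy as np

def downsample_to_grid(lon, lat, cell_size, seed=0):
    """Return (sorted) indices of at most one occurrence per grid cell.
    cell_size is in the same units as the coordinates; which record
    survives within each cell is chosen at random."""
    lon = np.asarray(lon, dtype=float)
    lat = np.asarray(lat, dtype=float)
    # Integer cell id for every record
    cells = np.stack([np.floor(lon / cell_size),
                      np.floor(lat / cell_size)], axis=1)
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(lon))          # randomize survivor per cell
    _, first = np.unique(cells[order], axis=0, return_index=True)
    return np.sort(order[first])
```

Running the same function with a 1 km and then a 20 km `cell_size` would produce the two occurrence datasets of the factorial design.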

| Model construction
We constructed SDMs using Maxent v. 3.4.1 (Phillips, Anderson, Dudík, Schapire, & Blair, 2017) implemented through the ENMEval package (Muscarella et al., 2014). We developed six Maxent models per species using each combination of downsampled occurrence data (1 and 20 km) and background point selection (random; λ = 1: highly clustered; λ = 3: moderately clustered). For all models, we set the regularization beta multiplier to 2 and allowed all feature classes. In preliminary model building, we tested four beta multipliers from 1.0 to 4.0 in increments of 1.0; differences among beta multipliers were minimal.

| Model evaluation
We evaluated models by partitioning occurrences into four geographically structured blocks containing an approximately equal number of occurrence records. We assessed performance with fourfold cross-validation using three blocks as training data and retaining one block for testing. For each model, we calculated AUC (area under the curve), the partial AUC (pAUC) ratio, the Continuous Boyce Index (CBI) and sensitivity (true positives/(true positives + false negatives)). AUC characterizes model prediction and discrimination ability relative to random chance (Phillips & Dudík, 2008), with values ranging from 0 to 1, where higher values indicate greater model performance and 0.5 indicates predictive discrimination that is approximately random. The use of AUC has received criticism due to the equal weighting of omission and commission error and dependence on prevalence (Lobo et al., 2008). As commission error and true negatives are difficult to estimate in presence-only SDMs and may be particularly problematic for invasive species, we explored model evaluation metrics that are less reliant on commission error. We calculated the pAUC as the area under the portion of the curve (Peterson, Papeş, & Soberón, 2008) where omission errors are less than or equal to 10%. To assess whether pAUC was larger than and significantly different from random expectation, we calculated the pAUC ratio, defined as the pAUC divided by the area under the 1:1 line that indicates random expectation for the same portion of the AUC curve. The pAUC ratio ranges from 0 to 2 and has a random expectation of 1, with values above 1 indicating better-than-random performance (Escobar, Qiao, Cabello, & Townsend Peterson, 2018). pAUC ratios were calculated and their significance assessed using the 'pROC' function in the package 'NicheToolbox' (Osorio-Olivera, 2016) with 50% random withholding and 500 bootstrap iterations.
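A deliberately simplified sketch of the pAUC ratio follows. It reduces the bootstrapped procedure in NicheToolbox to a single deterministic calculation over unique score thresholds; the function names and this reduction are ours:

```python
import numpy as np

def _trapezoid(y, x):
    """Trapezoidal area under y(x) for paired, ordered samples."""
    y, x = np.asarray(y, float), np.asarray(x, float)
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

def pauc_ratio(pres_scores, bg_scores, omission_limit=0.1):
    """Simplified pAUC ratio: area under the model curve restricted to
    sensitivity >= 1 - omission_limit, divided by the area under the
    1:1 (random-expectation) line over the same points. Values near 2
    indicate near-perfect discrimination; 1 is random expectation."""
    pres = np.asarray(pres_scores, float)
    bg = np.asarray(bg_scores, float)
    ts = np.unique(np.concatenate([pres, bg]))[::-1]   # high -> low thresholds
    sens = np.array([(pres >= t).mean() for t in ts])  # 1 - omission
    prop = np.array([(bg >= t).mean() for t in ts])    # proportion predicted present
    keep = sens >= 1.0 - omission_limit
    model_area = _trapezoid(sens[keep], prop[keep])
    random_area = _trapezoid(prop[keep], prop[keep])
    return model_area / random_area if random_area > 0 else float("nan")
```

A perfectly separating model returns 2; a model no better than random returns roughly 1.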
Additionally, we calculated CBI, which measures the correlation between occurrence probability in the data and predicted probability (Boyce et al., 2002;Hirzel et al., 2006). CBI ranges from −1 to 1, where positive values indicate an accurate model that correlates with the data, zero indicates no difference from random, and negative values indicate a negative correlation. Sensitivity quantifies the correctly predicted positive fraction of occurrences and was evaluated at the threshold of 0.5 for all models.
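Both metrics can be approximated in a few lines. The Boyce calculation below is a binned simplification of the continuous (moving-window) index used in the study, ties in the P/E ratios are not specially handled, and the function names are ours:

```python
import numpy as np

def sensitivity(pres_scores, threshold=0.5):
    """TP / (TP + FN): fraction of known presences predicted suitable."""
    return float((np.asarray(pres_scores, float) >= threshold).mean())

def boyce_index(pres_scores, bg_scores, n_bins=10):
    """Binned approximation of the Boyce index: Spearman rank correlation
    between suitability-class midpoints and predicted/expected (P/E)
    frequency ratios. Positive values indicate predictions correlate
    with observed occurrences; 0 is no better than random."""
    pres = np.asarray(pres_scores, float)
    bg = np.asarray(bg_scores, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    p_freq = np.histogram(pres, bins=edges)[0] / len(pres)
    e_freq = np.histogram(bg, bins=edges)[0] / len(bg)
    valid = e_freq > 0
    pe = p_freq[valid] / e_freq[valid]
    mids = ((edges[:-1] + edges[1:]) / 2.0)[valid]
    # Spearman correlation via rank transforms
    r1 = np.argsort(np.argsort(pe)).astype(float)
    r2 = np.argsort(np.argsort(mids)).astype(float)
    r1 -= r1.mean(); r2 -= r2.mean()
    return float((r1 * r2).sum() / np.sqrt((r1 ** 2).sum() * (r2 ** 2).sum()))
```

A model whose presences fall increasingly often in higher-suitability classes than background expectation yields an index near 1.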

| Model predicted area
To quantify how occurrence point downsampling and GKD background points affected the quantity of predicted suitable habitat, we computed the area of suitable habitat across North America using thresholds varying from 0.01 to 0.99 at 0.01 intervals. We then plotted the relationship between proportion of suitable habitat and threshold value for each projection (hereafter: "threshold-area curves"). We tested for differences in the effects of model input modifications on projected suitable area at a 50% threshold (i.e. total area of predicted habitat suitability at or above 0.5; N = 36 models) using ANOVA. Species grouping (common and historically established; incipient and diffuse; incipient and clustered), occurrence dataset (filtered or downsampled), background dataset (random, moderately clustered or highly clustered) and all second-order interaction terms were included as fixed, independent variables, with the area predicted to be suitable habitat at a 50% probability of occurrence threshold as the dependent variable. We also quantified model complexity by the number of model parameters and AICc (Elith et al., 2010; Merow et al., 2013). More complex models had a greater number of parameters and generally higher AICc scores. We tested for significant differences in model performance (mean cross-validated AUC, mean cross-validated pAUC ratio, mean cross-validated sensitivity and mean cross-validated CBI), variability (AUC coefficient of variation and sensitivity coefficient of variation) and complexity (number of parameters) using ANOVA, including species, occurrence dataset (1 or 20 km), background sampling (λ = 0: random; λ = 1: highly clustered; λ = 3: moderately clustered) and all second-order interactions as independent variables. We corrected for multiple tests using a sequential Bonferroni (Holm) correction.
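The threshold-area curve itself is straightforward to compute from a flattened suitability raster; a minimal sketch (names ours):

```python
import numpy as np

def threshold_area_curve(suitability, thresholds=None):
    """Proportion of map cells predicted suitable at each threshold.
    suitability: flat array of per-cell predictions (NoData removed)."""
    s = np.asarray(suitability, float)
    if thresholds is None:
        thresholds = np.arange(0.01, 1.0, 0.01)  # 0.01..0.99 as in the text
    thresholds = np.asarray(thresholds, float)
    props = np.array([(s >= t).mean() for t in thresholds])
    return thresholds, props
```

Because raising the threshold can only shrink the set of qualifying cells, the curve is always non-increasing.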

| Structured decision process for model selection
We designed a model selection process based on measures of model performance, variability and complexity (Figure 1). We used this process to assess each of the six models per species and choose the model that optimally balances performance, variability and complexity. For each performance, variability and complexity metric, we ranked the models and assigned each a score from one to six, where higher values indicate better evaluation metrics. For example, the model with the highest mean cross-validated AUC received the highest score of six, and models with lower AUC received descending scores. Similarly, the model with the lowest coefficient of variation received the highest score of six. For model complexity, we assigned two scores: the model with the fewest parameters received a score of six, and the model with the lowest AICc also received a score of six. Once all metrics were scored, the model with the highest overall summed score was selected as the optimum model among the six for each species.
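The rank-and-sum scoring can be sketched generically. Here `metrics` holds one value per model for each criterion, and lower-is-better criteria (coefficients of variation, parameter count, AICc) are flagged so their ranking is reversed; the names and data structure are our illustration, not the study's code:

```python
import numpy as np

def select_model(metrics, higher_is_better):
    """Rank-and-sum model selection: each metric scores models 1..n
    (n = best); the model with the greatest summed score is selected.
    metrics: dict of metric name -> list of values (one per model).
    higher_is_better: dict of metric name -> bool."""
    n = len(next(iter(metrics.values())))
    total = np.zeros(n)
    for name, values in metrics.items():
        v = np.asarray(values, float)
        # Reverse the sort direction for lower-is-better metrics
        order = np.argsort(v if higher_is_better[name] else -v)
        scores = np.empty(n)
        scores[order] = np.arange(1, n + 1)   # worst -> 1, best -> n
        total += scores
    return int(np.argmax(total)), total
```

For example, a model with middling AUC but low variability and few parameters can outscore a higher-AUC but overfit competitor.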

| RESULTS
3.1 | Theme 1: How did methods that reduce the effects of spatial bias affect measures of model complexity and amount of predicted suitable habitat?

| Model complexity
Models constructed with modified background sampling and/or occurrence data downsampling were less complex. This was most apparent for number of model features; across all species, models with GKD (clustered) background sampling had fewer features relative to random background sampling (F df=2,25 = 16.81, p < .001; Figure 2; Table S9), suggesting the latter are highly parameterized and more complex. Similarly, models with 20 km occurrence downsampling had fewer model features (F df=2,12 = 71.07, p < .001; Figure 2; Table S9).
Additionally, these models had lower AICc scores (Table S15), suggesting models with 20 km occurrence downsampling are more likely to balance model fit with model complexity.

| Response curves
Upon visual inspection, model response curves for all species were smoother in models constructed with occurrence downsampling and GKD background sampling ( Figures S8-S14).

| Threshold-area curves
Regardless of the occurrence dataset used (1 km filtered; 20 km downsampled), models using random background sampling predicted substantially less suitable habitat than models that used background points sampled using a GKD of 1 or 3 (highly or moderately clustered sampling) at nearly every possible threshold value (Figure 3). When suitable habitat was assessed using a threshold probability of occurrence of 50% (i.e. all area with predicted probability of occurrence greater than or equal to 50%), models using GKD background sampling (highly or moderately clustered) had greater predicted suitable habitat area relative to models built using random background sampling (F df=2,28 = 6.58, p = .005; Table S10). Further, using that same threshold, models using occurrence downsampling (to 20 km) had significantly greater predicted suitable habitat area relative to models using the filtered (downsampled only to 1 km) occurrence datasets (F df=2,28 = 10.49, p = .003; Table S10).
The effects of background sampling design were more consistent than those of occurrence downsampling.

FIGURE 1 Model construction, evaluation and structured decision-making process flowchart. A total of 36 models for six species were constructed and evaluated, resulting in one selected model per species


| Variability
The coefficients of variation (among the four cross-validation datasets) for both AUC and sensitivity were significantly greater when models were built using either form of GKD background sampling relative to random sampling (AUC: F 2,12 = 9.49, p = .003; sensitivity: F 2,12 = 20.07, p < .001; Table S14).

| Model projections
The results for themes 1 and 2 were agnostic to model selection and were meant to explore common patterns in model outputs generated by changes in data modifications. Here, we focus on the models that were retained following our structured decision process (Figures 1 and 4). Six of seven selected models used the 20 km occurrence downsampling and three of seven models used GKD background sampling (Figure 4; Tables S15 and S16). Overall, projections of selected models using current climate aligned well with current species distributions both across the continent and in the region of interest (Figures S15-S21).
For the common and historically established species (Common Tansy, Wild Parsnip, and Leafy Spurge), selected models used 20 km occurrence downsampling and either highly clustered (GKD λ = 1) or random background sampling (Table S16). Occurrence downsampling reduced bias associated with uneven sampling effort at local scales. The projected suitable but unoccupied area primarily represents range filling, rather than range expansion.
For incipient but diffusely distributed species (Brown Knapweed and Common Teasel), the selected models used 20 km occurrence downsampling and highly clustered (GKD λ = 1) background sampling (Table S16). These occurrence and background point modifications reduced bias associated with uneven sampling effort at both local and regional levels. The projected suitable but unoccupied area primarily represented range expansion, not range filling.
For incipient but highly clustered species (Black Swallowwort and Dalmatian Toadflax), there were no clear patterns (Table S16).
For Dalmatian Toadflax, the selected model was built using the 1 km filtered dataset and random background sampling, suggesting this dataset may not have readily identifiable spatial bias patterns in the occurrence records. Conversely, for Black Swallowwort, the selected model used the 20 km downsampled occurrence dataset (Table S16).

FIGURE 2 Interaction plot of occurrence downsampling by background point clustering effects for number of model parameters. Models were constructed with either 1 km filtered datasets (circles) or 20 km downsampled datasets (triangles) and either random (GKD 0, blue), highly clustered (GKD 1, grey) or moderately clustered (GKD 3, green) background sampling. (a) Model parameters by occurrence and background sampling types

| DISCUSSION
Building climate-based SDMs that can successfully predict suitable habitat for invasive species remains a fundamental challenge despite their widespread adoption for management purposes. We implemented a factorial design of data modifications to comprehensively evaluate and reduce spatial bias for multiple invasive species. Our results suggest that models constructed with downsampled occurrence data and GKD background sampling had fewer parameters (i.e. were less complex) and were more regularized. We found that model performance and model complexity metrics were often not in agreement and that models that were overly complex often failed to predict occurrences in a region at the advancing front of geographic ranges (low transferability). Using our full matrix of models, we designed a decision-making process that considered model complexity, performance and geographic variability for multiple model evaluation metrics. Overall, our results suggest that careful consideration of spatial bias in model inputs and evaluation criteria is essential for identifying SDMs that have the capacity to predict suitable habitat at expanding range fronts of invasive species.
Our approach is uncommon in that we used a factorial design to independently assess the effects of different data modifications on model outcomes.

FIGURE 3 Species threshold-area curves and proportion of predicted suitable habitat at 50% threshold. For each species and species grouping: (a) common and historically established species including (b) Common Tansy, (c) Wild Parsnip, (d) Leafy Spurge; (e) incipient and diffuse species including (f) Common Teasel, (g) Brown Knapweed; and (h) incipient and clustered species including (i) Black Swallowwort, (j) Dalmatian Toadflax. The curves depict the relationships between proportion of predicted suitable habitat (y-axis) and the threshold used to determine suitability (x-axis). Each threshold-area curve panel has six curves that depend on the combination of occurrence dataset used (1 km filtered dataset = solid lines; 20 km downsampled dataset = dashed lines) and background sampling (random = blue; highly clustered/GKD 1 = grey; moderately clustered/GKD 3 = green). In the left column, the interaction between GKD and downsampling is shown at the 50% threshold; error bars represent ± 1 SE

There has been growing recognition that model input for SDMs may require modification to mitigate the effects of spatial bias. For both native and invasive species, downsampling of occurrence data (e.g. Briscoe Runquist, Lake, Tiffin, & Moeller, 2019; Kramer-Schadt et al., 2013) and background bias correction (Clements et al., 2012; Kramer-Schadt et al., 2013; Phillips et al., 2009; Syfert, Smith, & Coomes, 2013) have been applied to improve model quality. While occurrence and background data manipulation may improve the quality of models and their projections of habitat suitability, reducing the number of input records or manipulating background sampling to achieve a balanced design may reduce model performance measures (Hernandez, Graham, Master, & Albert, 2006).
Due to this known tradeoff between improving model quality and decreasing measured model performance, the most commonly used metric to evaluate model performance (test AUC) may be misleading (Lobo et al., 2008; Warren & Seifert, 2011). Overall, maximizing AUC will often not result in SDMs that are successful at predicting suitable but unoccupied territory. Rather, high-AUC models reflect success in predicting occurrences in the oldest regions of the invaded range, where records are numerous and dense across the landscape. These SDMs also often have sharp transitions between habitat predicted to be highly suitable and highly unsuitable, which indicates that the model is likely overly complex and not appropriately regularized (Figures S15-S21).
One of the overarching patterns for our set of selected models was that they consistently had low AUC scores despite high pAUC ratio, CBI and sensitivity scores. Additionally, AUC did not agree well with model fit statistics, such as AICc (which balances model fit against model complexity); this result is common across model-building efforts (Lobo et al., 2008). AUC favoured models built using standard datasets (i.e. filtered to 1 km with random background points); however, these models generally performed poorly when evaluated using alternative methods. Perhaps because of this, we found no obvious relationship between maximizing AUC and model selection. AUC equally weights errors of omission and commission (Lobo et al., 2008). When AUC fails, it often does so because validation statistics assume errors of omission and commission are well characterized (Oreskes, Shrader-Frechette, & Belitz, 1994).
Because background points are not true absences, true species prevalence and true omission and commission errors are not well captured, and AUC may be misleading (Leroy et al., 2018). Indeed, a recent simulation study of AUC found that discrimination accuracy metrics were only informative about model quality when the models were simple, and the input data closely mimicked the true ecological niche (e.g. nearly complete range filling where background points were more similar to absences; Warren, Matzke, & Iglesias, 2019).
In our models, we did see a pattern in which AUC was more likely to partially agree with other metrics for our common, historically established species, which are the most likely to have filled much of their available ecological niche in their invaded range.
Due to the known challenges of AUC, we advocate that modellers use multiple evaluation metrics and integrate that information during model selection. In particular, modellers should incorporate metrics that do not equally weight commission and omission errors such as pAUC ratio and sensitivity (Anderson, Lew, & Peterson, 2003;Lobo et al., 2008;Peterson, 2006). The pAUC ratio limits evaluation of models to regions of the AUC curve with low omission (Peterson, 2006). Likewise, sensitivity evaluates only the proportion of true positives (the omission rate) and does not rely on commission errors. CBI is valuable because it is spatially explicit and useful for incorporating knowledge about variation in prevalence across a landscape, but still may be prone to errors of commission. In the end, because all metrics have potentially complementary pitfalls, we used a structured process to incorporate information from them all during model selection.

| Invasive species distributions
Our selected models predicted variable but increased habitat suitability at and beyond species' range margins (in the Upper Midwest) under current climate. For common and historically established species (Common Tansy, Wild Parsnip and Leafy Spurge), predicted suitable habitat mostly reflected range infilling, with some geographic expansion to the north and west. The other four species all showed potential for range expansion to the north and west. Other invasive species have similarly been shown to have greater potential for range infilling and northward expansion (e.g. Jarnevich & Stohlgren, 2009; Mainali et al., 2015). Likewise, non-native plant species of North America occupied a smaller fraction of their potential range than native species, suggesting greater potential for range infilling (Bradley, Early, & Sorte, 2014). Last, non-native species have also been found to have potential for east-west range expansion, which was also common among our species (Bradley et al., 2014).

| CONCLUSIONS
Invasive species test the limits of SDM capabilities. Navigating these challenges requires developing and selecting models using data modifications to understand model behaviour and guard against common model pitfalls. It is important that SDMs are evaluated holistically with a diverse set of model diagnostics to avoid selecting and relying on models that do not generalize beyond the oldest regions of invaded ranges and cannot predict suitable habitat in incipient areas of invasion. Our study suggests that modifications of occurrence and background data should be implemented when modelling invasive species to minimize model overfit and spatial biases. Finally, interpretation of these models as the description of a species' ecological niche should be done with caution and include additional sources of evidence beyond SDMs.

ACKNOWLEDGEMENTS
We thank R. Venette for thoughtful discussion and comments and Z. Radford and K. Wilson for help gathering occurrence record data.