Predictor complexity and feature selection affect Maxent model transferability: Evidence from global freshwater invasive species

Ecological niche models (ENMs) are widely used to address urgent real‐world problems such as climate change effects or invasive species; however, the generality of models when projected through space and/or time, that is transferability, remains a key challenge. Here, we explored the effects of complex predictors and feature selection on ENM transferability in a widely employed algorithm, Maxent, using five globally invasive freshwater species as case studies.


| INTRODUC TI ON
Ecological niche models (ENMs; also known as environmental niche models or species distribution models) are tools that use species occurrence records in conjunction with environmental data to infer species' niches and predict the relative suitability of habitats across geographic landscapes (Elith et al., 2011;Radosavljevic & Anderson, 2014;Warren & Seifert, 2011). They have been widely used to answer increasingly complex questions in various fields such as ecology and biogeography (e.g., , epidemiology (e.g., Cardoso-Leite et al., 2014), conservation biology (e.g., Warren et al., 2014) and invasion biology (e.g., Rodda et al., 2011). Of the numerous ENM techniques available, the most extensively used algorithm is Maxent (Phillips et al., 2006), a presence-only method based on the principle of maximum entropy. Its popularity has been attributed to its high predictive accuracy compared to other modelling methods , robustness to small sample sizes (Pearson et al., 2007), wide range of customizable settings, and ease of use Merow et al., 2013). Despite its flexibility in model construction, and perhaps in part owing to its ease of use, the algorithm is often not applied optimally by users relying on default settings without consideration for data treatment and model parameterization-despite such practices being deemed inappropriate (Merow et al., 2013).
A key challenge in ecological niche modelling is one of transferability, that is the generality of models when transferred and projected into broad unsampled environments across time and space . Transferability typically involves two components-interpolation (model predictions within the environmental constraints encountered during model calibration) and extrapolation (predictions outside of environmental ranges encountered during model calibration) (Elith & Leathwick, 2009). These have far-reaching implications for many conservation applications such as forecasting biological invasions (e.g., Rodda et al., 2011;Rödder et al., 2009), the identification of priority conservation areas for rare or threatened species (e.g., Breiner et al., 2015), quantification of climate change risks (e.g., Warren et al., 2014), and elucidating ecological processes driving species diversification (e.g., Acevedo et al., 2014). Consequently, uncertain model transferability can hinder efforts to address urgent real-world problems or lead to misleading conclusions (Merow et al., 2013). This issue arises because presence-only or presence-background ENMs (such as Maxent) typically quantify realized species-environment niche relationships using species occurrence data that may be in part influenced by dispersal limitations and biotic interactions, whereas accurate model projections into unsampled regions require reasonable estimations of fundamental-niche relationships based on realized niche information (Soberón & Peterson, 2005;Peterson et al., 2011). Projections are therefore only robust if modelled species-environment relationships are aligned with biological expectations, and are simple enough to allow for accurate extrapolations (Rodda et al., 2011).
One aspect that has received less attention, however, is the selection of environmental predictors, despite predictor selection being fundamental to any modelling effort (Araújo & Guisan, 2006).
A common practice in many ENM studies is to use "standard" sets of environmental predictors (Rödder et al., 2009), such as bioclimatic variables secondarily derived from annual or monthly means in temperature or precipitation to reflect biologically relevant energy and water balances (Xu & Hutchinson, 2013), although this may lead to overfitted models if not tuned for appropriate levels of model complexity (Moreno-Amat et al., 2015;Warren & Seifert, 2011;Warren et al., 2014). In contrast, using a subset of predictors based on the known physiology or ecology of species can produce ENMs that project more accurately into novel environments (Rödder et al., 2009;Zeng et al., 2016;Petitpierre, Broennimann, Kueffer, Daehler, & Guisan, 2017), although such biological information is often unavailable. Alternatively, heuristic approaches in predictor selection and reduction (e.g., stepwise removal of variables using variable contribution scores; Zeng et al., 2016) may also improve model transferability, though the efficacy of such procedures remains highly dependent on the initial set of starting variables.
It has been suggested that the indiscriminate use of complex composite bioclimatic variables (secondarily calculated from annual or monthly means in temperature or rainfall) without a priori ecological hypotheses may explain poor predictivity in certain cases where models are transferred across geographic regions (Peterson & Nakazawa, 2008), although this has sometimes also been attributed has severe implications when ENMs are used to infer species' niches, and quantify ecological or evolutionary change across impacted landscapes.

K E Y W O R D S
aquatic invasions, bioclimatic variables, ecological niche model, feature selection, Maxent, model tuning, niche conservatism, predictor selection, species distribution model, transferability to shifts in species' niches (Broennimann et al., 2007). Composite variables are widely utilized in many ENM studies; in particular, they comprise the majority (13 out of 19 variables) of the WorldClim bioclimatic dataset (Hijmans et al., 2005). Nonetheless, their overall impact on model transferability is not well understood.
Maxent can construct simple to highly complex, nonlinear species-environment relationships using various mathematical transformations of variables (termed features). Six feature classes are available-linear, quadratic, product, hinge, threshold and categorical. To reduce overfitting, Maxent uses a regularization procedure (L1-regularization) to balance model fit with complexity, by penalizing models based on the magnitude of their coefficients (Phillips et al., 2006). By default, the program uses all feature classes in model fitting (contingent on a sufficient number of species occurrences) and a relatively low regularization multiplier of 1, although this can potentially lead to models that are overparameterized (Elith et al., 2011;Merow et al., 2013). Many recent studies (e.g., Moreno-Amat et al., 2015;Radosavljevic & Anderson, 2014;Warren & Seifert, 2011) have demonstrated the importance of species-specific tuning of regularization parameters for optimal model complexity, though few have examined the effects of fitted feature forms on model transferability. While the use of appropriate regularization parameters can reduce the need for feature selection, it may be advisable to constrain the feature classes used in model fitting to produce models that are more interpretable and transferable (Merow et al., 2013).
In recent years, the Akaike's information criterion correction for small sample sizes (AICc) has been increasingly used as an evaluation tool to tune Maxent parameters. Models tuned using AICc have been shown to generally be more parsimonious and less overfitted, while better estimating habitat suitability and relative importance of predictor variables (Warren & Seifert, 2011). While AICc may serve as a useful alternative to traditional cross-validation model evaluation, it is calculated on unpartitioned species occurrence datasets and therefore does not truly infer model transferability (Warren & Seifert, 2011;Warren et al., 2014). There has been evidence suggesting that AICc-tuned models may occasionally exhibit poorer performance in some cross-validation transferability assessments (e.g., slightly lower discriminatory ability on withheld testing data; Muscarella et al., 2014), although this may be confounded by geographic sampling bias present within training datasets (Galante et al., 2018). Additionally, a recent study utilizing simulated species (Velasco & González-Salazar, 2019) showed weak correspondence between AICc and geographical predictive accuracy. Nonetheless, the overall trend between model parsimony and transferability has yet to be clarified using empirical species datasets.
Here, we modelled the global distributions of five notorious freshwater invasive species (African sharptooth catfish Clarias gariepinus, Mozambique tilapia Oreochromis mossambicus, American bullfrog Lithobates catesbeianus, red swamp crayfish Procambarus clarkii, and Australian redclaw crayfish Cherax quadricarinatus), using three predictor datasets of varying complexities derived from two commonly used climatic data sources (WorldClim and IPCC) and three methods of model tuning that differentially incorporated feature selection. We then conducted spatially explicit transferability validations using a suite of evaluation metrics previously used to quantify Maxent model performance (Hirzel et al., 2006;Warren et al., 2010;Peterson et al., 2011;Warren & Seifert, 2011;Radosavljevic & Anderson, 2014). Our five focal species were chosen as they are (a) invaders with long histories of introductions and establishments (Ficetola et al., 2007;Lodge et al., 2012;Lowe et al., 2000;Palaoro et al., 2013;Weyl et al., 2016), and thus more likely to be at equilibrium with the environment in non-native regions, and (b) possibly already occupying a large proportion of their fundamental niches across native and invasive distributions (Araújo & Pearson, 2005;Václavík & Meentemeyer, 2009), thereby making them prime candidates for transferability assessments.
Our main objective is to understand how methodological considerations can impact model transferability, and our results serve to guide future modelling efforts and improve ENM specification techniques. To this end, we specifically addressed the following

| Species occurrences and accessible areas
We collated occurrence data for the five focal species from published literature (e.g., Both et al., 2011;Russell et al., 2012;Snovsky & Galil, 2011;Teugels, 1986;Torres & Álvarez, 2012;Zeng et al., 2016;Zeng & Yeo, 2018), online databases (e.g., Global Biodiversity Information Facility (GBIF) (http://www.gbif.org), Global Invasive Species Database (GISD) (http://www.iucng isd.org/gisd)), and natural history museum collections (e.g., Smithsonian National Museum of Natural History, Lee Kong Chian Natural History Museum; Appendix S1). These comprised both native occurrences as well as records of non-native establishments. A species is considered established at a locality if there is evidence of a self-sustaining breeding population, or if the reported locality is situated within a region where the species is already known to be well-established. In total, we collated 545 occurrence records ( Datasets of species occurrences often exhibit strong geographic bias, whereby certain areas (e.g., near busy roads or towns) tend to be more heavily sampled than others. This bias can lead to the pseudo-replication of points clustered around accessible areas and the over-representation of certain environmental conditions during model fitting, therefore negatively impacting ENM transferability Kramer-Schadt et al., 2013). To reduce the effect of sampling bias, spatial filtering Fourcade et al., 2014) was carried out on the occurrences datasets.
While Maxent, by default, removes duplicate occurrences within each predictor raster grid cell, occurrences were further thinned using the thin function in the R package "spThin" (Aiello-Lammens et al., 2015), which utilizes a randomization algorithm to return the highest number of occurrences for a specified thinning distance. We chose a thinning distance of 50 km, as visual examinations of model outputs during preliminary runs showed that predictions were generally overfitted on training datasets at lower thinning distances (e.g., 20 km), whereas higher thinning distances (e.g., 100 km) often produced underfitted models lacking in spatial resolution. Following Maxent fits models by contrasting predictor variables at known occurrence points against random "background" localities in the study region (Phillips & Dudík, 2008). Conceptually, the background should comprise areas accessible to a species over relevant time periods (Barve et al., 2011), though delimiting its extent can be challenging for invasive species outside their native distributions as population boundaries might still be expanding (Elith et al., 2010).
Native-range background extents were defined using a conventional minimum convex polygon around native occurrences with an additional 4° buffer (Rodda et al., 2011). On the other hand, the areas accessible to non-native populations were defined using distance buffers estimating maximum dispersal distance around known invasive occurrences (Barve et al., 2011;Figure S2.1). Maximum dispersal distances were calculated as the average known natural and/ or human-mediated dispersal rates for a species (Appendix S3 for details) multiplied by the number of years since first introduction/ record at a locality. Native-and invasive-range background extents were concatenated and 10,000 background points randomly drawn for model calibration.

| Environmental datasets
ENMs built for terrestrial organisms often utilize climatic data to directly infer species-environment relationships, as many key physiological processes (e.g., photosynthesis, evapotranspiration) are directly affected by air temperature and rainfall (Austin, 2002). The use of such variables for modelling freshwater species may, however, be counterintuitive, given that their distributions are typically limited by instream conditions (e.g., hydrology, water temperature).
While high-resolution instream data are currently unavailable on a global scale, there has been evidence (e.g., McGarvey et al., 2018) demonstrating that climatic variables, in conjunction with physical habitat data, can serve as reasonable surrogates for instream variables, given that these distal variables are often causally linked to proximal drivers of freshwater species distributions (e.g., air temperature directly affecting water temperature; altitude and slope directly influencing channel morphology) (Zeng et al., 2016).
The WorldClim dataset was downloaded at a spatial resolution of 0.0416° (~5 km at the equator), whereas IPCC layers were downloaded at a native resolution of 0.5° (~55 km at the equator) and were artificially resampled to 0.0416° resolution to match the WorldClim layers.
Three climatic predictor datasets were prepared for comparison: 1. BC1: a dataset that comprised all 19 WorldClim bioclimatic variables. This comprised variables describing annual trends in temperature and precipitation, climatic seasonality, and extreme conditions that might limit species distributions.
2. BC2: a dataset using a reduced subset of the BC1 dataset, excluding complex composite variables (secondarily derived from annual or monthly means in temperature or precipitation; Table S4.1).
Additionally, in all three datasets, we included additional variables summarizing topographic and streamline characteristics, namely altitude, slope and total upstream area (Lehner & Grill, 2013). These variables have been previously shown to be informative in modelling freshwater species distributions, and served as indicators of habitat integrity (e.g., presence of refugia), hydrologic regimes (e.g., flow velocity) and channel properties (e.g., size of water bodies, stream order) (McGarvey et al., 2018).
While some studies recommend the removal of highly collinear predictor variables prior to analyses (e.g., Dormann et al., 2013), we did not do so because removing correlated variables may confound direct comparisons of our predictor datasets across species. This is given that interspecific differences in collinearity patterns (Appendix S7) would necessitate the removal of different numbers of correlated variables across species, leading to varying numbers of final predictors utilized per species for each predictor dataset.
Additionally, in the absence of detailed knowledge of the biology of our study species, the selection of which collinear variables to remove is not straightforward and can be a source of additional error (Dormann et al., 2013;Zeng et al., 2016). Recent studies (e.g., Feng et al., 2019) have shown that Maxent is highly robust to the effects of predictor collinearity during model calibration. Nonetheless, to better understand how predictor datasets can differentially influence model transferability, we quantified predictor collinearity and collinearity shifts across geographic regions by calculating pairwise Pearson's correlation coefficients (Pearson's r) between predictor variables, and differences in raster correlations between native-and invasive-range background extents, using the raster.cor.plot function in the R package "ENMTools" . Additionally, we quantified the degree of environmental similarity between training localities (native and invasive occurrences) and entire background extents using multivariate environmental similarity surface (MESS) maps (Elith et al., 2010), to compare the extent of extrapolation (i.e., environmental novelty) between predictor datasets.

| Model fitting and tuning
Maxent can model complex species responses via transformations of predictor variables (i.e., features). Six feature classes (FCs) are available-linear (the untransformed variable itself), quadratic (square of the variable), product (product of two variables), threshold (a "step" function generating a different constant function above a threshold or knot; equivalent to a piecewise constant spline), hinge (similar to threshold features but generates a linear function above the knot; equivalent to a piecewise linear spline), and categorical (Elith et al., 2011). By default, Maxent determines the FCs allowed in model calibration based on the number of occurrence points, though this has the potential to create complex fitted functions that might overfit the training data and impact model transferability (Merow et al., 2013). Additionally, Maxent uses a procedure known as L1regularization to select individual features that contribute most to model fit while penalizing for excess parameters (Elith et al., 2011;Merow et al., 2013). The default regularization parameters in Maxent were based on tuning experiments conducted on a large number of species across multiple taxonomic groups (Phillips & Dudík, 2008 2. "RM-only": Models were built using default FCs (LQHP) and RM values ranging from 0 to 5 with increments of 0.5. Following the methodology of Radosavljevic and Anderson (2014), we used low overfitting as the primary criteria, and high discriminatory ability as a secondary criteria to select for optimal model parameters.
Specifically, from each set of initial candidate models constructed per species/predictor dataset combination, we retained the model that minimized the difference between training and testing AUC (AUC DIFF ) and reduced 10% omission rates (OR 10 ) to the lowest observed value (or close to it) (primary criteria), while also having an evaluation AUC value (AUC TEST ) that was close to the highest observed value (secondary criterion).
3. "RM + Hinge": Models were built using only hinge features and RM values ranging from 0 to 5 with increments of 0.5. Models built using only hinge features tend to produce relatively smooth fitted response curves, similar to a generalized additive model (GAM) (Elith et al., 2010(Elith et al., , 2011. We selected for optimal regularization parameters following the methodology of Radosavljevic and Anderson (2014), with low AUC DIFF and OR 10 as the primary criteria and high AUC TEST as the secondary criterion.  (2014), with low AUC DIFF and OR 10 as the primary criteria and high AUC TEST as the secondary criterion.
Following model tuning, we retained four final selected models ("Default," "RM-only," "RM + Hinge," "Full-tuning") for each species/ predictor dataset combination. All subsequent downstream comparisons (e.g., between predictor datasets and tuning methods) and analyses were only based on final selected models.

| Evaluation of model performance
Model transferability and overall performance were assessed using seven evaluation metrics that reflected aspects of discriminatory ability (AUC TEST ), overfitting (AUC DIFF ), omission errors (OR 10 , OR MIN ), niche conservatism (Schoener's D), predictive accuracy (Boyce Index) and overall parsimony (AICc; Table 1 for details). AICc scores were calculated on full unpartitioned species occurrence datasets (Warren & Seifert, 2011;Warren et al., 2014), whereas AUC TEST , AUC DIFF , OR 10 , OR MIN and Boyce Index scores were calculated using masked geographically structured cross-validations, which are transferability assessments in the strictest sense as calibration and testing are carried out using occurrences from different geographic regions (Radosavljevic & Anderson, 2014). Species occurrences were first partitioned into two spatially explicit bins-(a) native-range records and (b) invasive-range records-and each bin was in turn used to calibrate a model while the other was withheld for testing. Schoener's D was calculated between geographic predictions of the two spatial folds, with a high value indicating that models separately calibrated using each occurrence bin (native-or invasive-range records) resulted in similar geographic predictions (Schoener, 1968;Warren et al., 2010). "Clamping" was performed during model projections into testing regions, which treats environmental conditions not encountered during model training as though they were at the limits of the training range. This holds fitted species responses at constant probabilities outside of training conditions, therefore restricting model extrapolations when projecting into novel environments (Elith et al., 2011). Geographically structured cross-validations, as well as calculations of metric scores for AUC TEST , AUC DIFF , OR 10 , OR MIN and AICc, were performed using the ENMevaluate function ("user" partitioning option) in the R package "ENMeval" (Muscarella et al., 2014), whereas metric scores for Schoener's D and Boyce Index were calculated using the nicheOverlap function in the R package "dismo" (Hijmans et al., 2017)

Evaluation metric Description
Model transferability AUC TEST The area under the receiver operating characteristic curve (AUC) calculated using withheld testing data during cross-validation . A higher value indicates a model's increased ability to discriminate between conditions at withheld testing points and background points AUC DIFF The difference between AUC values calculated using training points (AUC TRAIN ) and withheld testing points (AUC TEST ) during cross-validation (Warren & Seifert, 2011). A higher value indicates that a model is overfitted on training data such that it performs poorly when evaluated on independent testing data

OR 10
The threshold-dependent 10% training omission rate, defined as the percentage of withheld testing points with Maxent suitability scores lower than that of the 10% of training points with the lowest predicted suitability scores Radosavljevic & Anderson, 2014 Schoener's D The degree of similarity between geographic predictions of sub-models calibrated separately using native-and invasive-range occurrences, and projected across accessible areas (background extents) for each species (Schoener, 1968;Warren et al., 2010). A higher value indicates a greater degree of similarity between geographic predictions of the spatial folds (i.e., greater geographic niche conservatism), but does not reflect model fit (e.g., a high value will be obtained if native-and invasive-range submodels both grossly under-or overpredict suitability)

Boyce Index
The Continuous Boyce Index, which measures concordance between predicted habitat suitability gradients and the distribution of withheld testing points (Hirzel et al., 2006). This metric requires only presence data and is threshold-independent. A higher value indicates that model predictions are consistent with the observed distribution of presences in testing regions Model parsimony

AICc
The sample-size-adjusted Akaike's information criterion as adapted for ENMs (Warren & Seifert, 2011), which measures model overall goodness-of-fit while penalizing for excessive complexity. AICc is calculated using the full unpartitioned dataset and does not indicate model transferability. The model with the lowest AICc value (ΔAICc = 0) is considered the most parsimonious out of a set of models fitted using the same occurrence dataset TA B L E 1 Evaluation metrics used to assess model transferability and overall parsimony To ensure that model performance was directly comparable across multiple species, we standardized metric scores (with the exception of AICc) of all selected models retained for a particular species (12 selected models per species, i.e., 3 predictor datasets × 4 tuning methods) by scaling them by their standard deviation, and centring them around their per-species means. This was performed using the following formula: standardized X = [X -mean (X)] / SD (X), where X is a metric, and mean (X) and SD (X) are the mean value and standard deviation, respectively, of X across all selected models for a particular species (Schielzeth, 2010). AICc values were standardized per species by calculating their within-species relative rankings, with the model with the lowest AICc score (i.e., ΔAICc = 0) within a set of selected models for a particular species given an AICc rank of 1 and the model with the next lowest AICc score given a rank of 2 (and so on).
Metric scores of final selected models were compared between different predictor datasets (BC1, BC2, IPCC) and model-tuning methods ("Default," "RM-only," "RM + Hinge," "Full-tuning") using mixed-effects analyses of variance (ANOVA) with species as a random effect (i.e., metric score ~ predictor dataset + model-tuning method | species ), and post hoc Tukey tests. We further explored relationships between our seven performance metrics (AUC TEST , AUC DIFF , OR 10 , OR MIN , Schoener's D, Boyce Index and AICc) using linear mixed-effects models, with species as a random effect (i.e., metric A ~ metric B | species), and quantified the strength of correlation between metric pairs using Pearson's correlation coefficients (Pearson's r). Mixed-effects models were fitted using the R package "nlme" (Pinheiro et al., 2019), whereas the R package "multcomp" (Hothorn et al., 2008) was used to conduct post hoc comparisons.

| RE SULTS
Model calibration using different predictor datasets and tuning methods across all five focal species resulted in a final set of 60 selected models-45 models tuned for optimal Maxent parameters (RM and/or FCs) and 15 models built using default Maxent settings (

| Effects of predictor dataset
The effects of predictor dataset on model transferability are shown in Figure 1 (top panels of Figures S9.13-S9.17 in Supporting

F I G U R E 1
The effect of environmental predictor datasets on Maxent model (a) discriminatory ability, (b) degree of overfitting, (c-d) omission error rates, (e) geographic niche conservatism and (f) predictive accuracy. Only final selected models were used for comparisons. Statistically significant pairwise differences, based on post hoc Tukey comparisons following mixed-effects ANOVA, are marked with different letters. To ensure that model performance was directly comparable across multiple species, we standardized metric scores of all selected models for a particular species by scaling them by their standard deviation and centring them around their per-species means, using the formula: [X -mean (X)] / SD (X), where X is a metric, and mean (X) and SD (X) are the mean value and standard deviation, respectively, of X across all selected models for a particular species. A standardized score of 0 indicates average scores per species.   Figure 3 (bottom panels of Figures S9.13-S9.17 in Supporting Information for results of individual species). All model-tuning methods led to higher discriminatory ability (higher AUC TEST ), less overfitting (lower AUC DIFF ), and lower 10% omission errors (lower OR 10 ,), as compared to "Default" models. Incorporating some form of feature selection in the tuning process ("RM + Hinge" and "Full-tuning") further increased geographic niche conservatism (higher Schoener's D) and

| Effects of tuning method
parsimony (lower AICc ranks; Figure S8.12), as compared to models calibrated using default FCs ("Default" and "RM-only"). Additionally, "Full-tuning" models had less overfitting (lower AUC DIFF ) and lower To ensure that model performance was directly comparable across multiple species, we standardized metric scores of all selected models for a particular species by scaling them by their standard deviation and centring them around their per-species means, using the formula: [X -mean (X)] / SD (X), where X is a metric, and mean (X) and SD (X) are the mean value and standard deviation, respectively, of X across all selected models for a particular species. A standardized score of 0 indicates average scores per species. Solid black lines indicate median values, whereas boxes encompass the interquartile range. Whiskers extend to data extremes, excluding outliers. Outliers are indicated by open circles. Higher AUC TEST , Schoener's D and Boyce Index scores, and lower AUC DIFF , OR 10 and OR MIN scores indicate better model performance differences were statistically significant for AUC TEST (p < .001), AUC DIFF (p < .001), OR 10 (p < .001), Schoener's D (p < .001) and AICc (p < .01), though not for OR MIN (p = .059) and Boyce Index (p = .091).

| Relationship between AICc and model transferability
We found that model parsimony (as indicated by AICc) was not a significant predictor of model discrimination (AUC TEST , p = .170), minimum training presence omission error rates (OR MIN , p = .526) and predictive accuracy (Boyce Index, p = .100). On the other hand, models with lower AICc (i.e., more parsimonious models) generally exhibited less overfitting (lower AUC DIFF , p < .05), lower 10% omission rates (lower OR 10 , p < .05) and greater correspondence across

| Predictor complexity
Compared to the models fitted using simpler predictor datasets (BC2 and IPCC), models built using all 19 WorldClim bioclimatic variables (BC1), a common practice in many ENM studies (Warren et al., 2014), were generally overfitted on training data and performed poorly when transferred to testing regions. Poor model transferability has sometimes been attributed to shifts in species' fundamental or realized niches (Broennimann et al., 2007), although the improved performances of our BC2 and IPCC models suggest that predictor F I G U R E 4 Pairwise relationships between model transferability (AUC TEST , AUC DIFF , OR 10 , OR MIN , Schoener's D and Boyce Index) and parsimony (AICc rank) for all five focal species. Upper diagonals show pairwise Pearson's correlation coefficient (Pearson's r) between metrics. Only final selected models were used for comparisons. To ensure that model performance was directly comparable across multiple species, we standardized metric scores (with the exception of AICc) of all selected models for a particular species by scaling them by their standard deviation and centring them around their per-species means, using the formula: [X − mean (X)] / SD (X), where X is a metric, and mean (X) and SD (X) are the mean value and standard deviation, respectively, of X across all selected models for a particular species. AICc values were standardized by calculating their within-species relative rankings (i.e., the model with the lowest AICc per species (ΔAICc = 0) is given an AICc rank of 1). Higher AUC TEST , Schoener's D and Boyce Index scores, and lower AUC DIFF , OR 10 , OR MIN and AICc rank scores indicate better model performance choice may be the more parsimonious explanation for the low predictive ability of BC1 models (Peterson & Nakazawa, 2008 tributions but truly having no effect on species responses, may be erroneously treated as important by the modelling algorithm and lead to non-predictivity when projected into non-analogue (novel) environments where correlation patterns differ (Rödder et al., 2009;Warren et al., 2014).
Comparisons of models constructed using the two simpler datasets showed that IPCC and BC2 models exhibited very similar overall performance across all evaluation metrics, and both correctly inferred habitat suitability in known areas of invasion where BC1 models did not ). This finding was unexpected, given the coarser native spatial resolution of IPCC variables (0.5°) as compared to BC2 variables (0.0416°). Our results corroborated previous studies highlighting that ENMs are generally insensitive to the effects of changes in predictor grain size (e.g., Guisan et al., 2007), although our spatial-filtering procedure (thinning to 50-km resolution) also likely ensured that few occurrence points would fall into the same original IPCC grid cells. Nonetheless, we expect that negative impacts of increased predictor grain sizes on ENM performance may become more pronounced when fine-scale environmental gradients or dispersal barriers shape species distributions, such as in the case of island endemics (Fourcade et al., 2014).

| Constraining fitted functions during model tuning
All  (Elith et al., 2010;Radosavljevic & Anderson, 2014;Warren & Seifert, 2011). Additionally, while we expect the choice of FCs selected during model tuning to differ between models built using different sets of predictor variables, our results showed that the "Full-tuning" method produced models that were largely (93% of all "Full-tuning" models) built using only hinge or linear features (the latter which are special cases of hinge features; Phillips & Dudík, 2008), and exhibited the highest degrees of model transferability. It is not a new concept that models with overly complex species responses to environmental gradients will be overfitted on idiosyncrasies within input datasets and thus perform poorly when extrapolated beyond training regions Warren et al., 2014), although fitted functions that are overly simple (i.e., underfitted) will also lead to predictions of low discriminatory ability (Barry & Elith, 2006). Allowing for feature flexibility during model fitting and using quantitative measures to select for optimal parameters appear to produce smooth responses (e.g., from hinge or linear features) that focus on the strongest ecological patterns while excluding spurious fits (Elith et al., 2010).
Nonetheless, we highlight caution when interpreting model projections into novel environmental space. Visual examination of our predictive outputs showed that linear-only "Full-tuning" models had the occasional tendency to overpredict habitat suitability in certain regions known to be suboptimal for our focal species, such as deserts and near latitudinal extremes ( Figures S6.2-6.6). By definition, linear features will be unable to characterize certain nonlinear aspects of species physiologies such as thermal tolerances (Merow et al., 2013), and can result in highly asymmetrical response curves predicting high suitability at one end of background environmental ranges . While model clamping (i.e., holding predicted suitability at constant beyond the limits of training environments) can restrict model extrapolations, it can still lead to highly erroneous predictions when environmental conditions in prediction regions are highly dissimilar to those in training regions (i.e., extremely novel conditions, so requiring a lot of extrapolations). We therefore emphasize that when model transference into unsampled regions is required, it will be necessary to quantify the degree of environmental novelty in prediction regions (e.g., by using multivariate environmental similarity surface maps) (Elith et al., 2010(Elith et al., , 2011 as well as examine model outputs (e.g., predictive maps, response curves) for biological plausibility (Guevara et al., 2018).

| Relationship between AICc and model transferability
We did not find any significant relationship between AICc and AUC TEST or Boyce Index, although there were significant but weak effects on AUC DIFF , OR 10 and Schoener's D. In contrast, pairwise comparisons of all metrics calculated solely on cross-validation data found highly concordant relationships. Taken together, these suggest that models chosen via AICc may have a higher tendency (cf. models selected using cross-validation metrics) to underfit on training data such that they overpredict wide areas of suitability in testing regions (hence low overfitting and greater niche overlap between native-/invasive-range sub-models), thereby biasing predictions towards higher errors of commission (i.e., predicting suitability in unsuitable areas).
While AICc has been shown to be useful in restricting models to a reasonable range of complexity (Moreno-Amat et al., 2015;Warren & Seifert, 2011), it can bias model selection towards certain feature types owing to its use of non-zero Maxent parameters as an estimate of degrees of freedom (Hastie et al., 2009). As tens to hundreds of hinge or threshold features can be fitted on a single covariate (as compared to a single linear or quadratic feature per predictor), models relying more heavily on these FCs will be expected to have considerably inflated parameter counts and therefore be excessively penalized by AICc (Elith et al., 2011;Merow et al., 2013).
Consequently, model selection using AICc may produce models with insufficient parameters and/or overly simple response curves that do not reflect complex species-environment relationships.
Additionally, AICc is calculated on the full unpartitioned occurrence dataset and does not account for spatial differences in environmental niche space (Rodda et al., 2011), and may therefore perform poorly at identifying important predictors when strong correlations exist and differ across study regions (Warren et al., 2014).
Our results corroborate a recent simulation study (Velasco & González-Salazar, 2019) demonstrating poor correspondence between AICc and geographical predictive accuracy. While their study found no correlation between AICc and six threshold-dependent evaluation metrics across nine simulated species, we found weak but significant relationships between model parsimony, and the degree of overfitting and geographic niche conservatism. We postulate that these differences may arise from discrepancies in species characteristics (i.e., virtual vs. empirical species), data partitioning methods (random partitioning vs. masked geographically structured cross-validations) or performance metrics (threshold-dependent vs. threshold-independent metrics) applied. Nonetheless, given that both simulated and empirical studies showed a tenuous link between AICc and other model evaluation metrics, we suggest that model parsimony may not be the optimal model-selection criterion when requiring projections into unsampled environments.

| Implications for the ecological niche modeller
Our results demonstrate that methodological considerations in predictor choice and model parameterization can have substantial impacts on Maxent model transferability, and highlight the inherent uncertainties in many applications of ENMs. Nevertheless, given the urgent need to address conservation issues such as the impacts of climate change or biological invasions (e.g., Rodda et al., 2011;Warren et al., 2014), imprecise ENMs are often the best tools available to quantify ecological or evolutionary change in impacted landscapes and can help guide conservation interventions (Warren, 2012 (Tucker, 1979), have been shown to perform well in some cases (Peterson & Nakazawa, 2008)  We also found that incorporating feature selection into model tuning can substantially improve model transferability as the type of feature functions limiting species distributions will be expected to differ from species to species, and using Maxent defaults (LQHP) may introduce noise that overfit onto idiosyncrasies or errors within occurrence datasets (e.g., georeferencing errors, sampling bias). We therefore recommend that feature selection be considered an integral part of the model-tuning process though we emphasize that detailed examinations of model outputs, on top of quantitative measures of model performance, will always be necessary to determine biological plausibility.
Our results also indicate a somewhat tenuous relationship between AICc and model transferability, suggesting that the indiscriminate use of AICc in model selection may lead to erratic model performance. Cross-validation may generally be preferable because it allows for straightforward evaluations of model transferability (arguably the primary objective for most ENM applications), and quantitative measures of predictive accuracy and uncertainty. "Traditional" measures of model performance, such as presence-background AUC or threshold-dependent omission rates, have previously been criticized as providing misleading indications of model fits (Merow et al., 2013), although we found high correspondences with the presence-only, threshold-independent Boyce Index. Nonetheless, further empirical testing using simulated or real species across a wide range of taxonomic groups and habitats will be needed to determine the generality of our observations.
Lastly, while shifts in species' realized ecological niches undoubtedly occur when species undergo range shifts (e.g., Broennimann et al., 2007), we highlight that inferences of niche divergence or conservatism (i.e., effect sizes of niche shifts) can be sensitive to choices in environmental predictors and model parameters. For example, while we made no prior assumptions of niche conservatism for our five focal freshwater invasive species, models exhibited varying levels of cross-geographic niche similarity dependent on combination of predictor dataset and tuning method (which in the most extreme case can range from low niche overlap (Schoener's D = 0.2-0.4) to very high niche overlap (Schoener's D = 0.8-1.0), even within a single species; niche overlap categories following Rödder and Engler (2011)). Conclusions of realized niche shifts based on arbitrary sets of predictors or model parameters need to be examined in the context of ecological processes and fundamental-niche relationships that limit species distributions (Rödder et al., 2009). ENM users should therefore prioritize understanding the underlying biotic and abiotic mechanisms limiting the survival and dispersal of their study species in order to make reliable inferences of ecological niches.

ACK N OWLED G EM ENTS
We thank Dan Warren, Rudolf Meier, Ryan Chisholm and Chong Kwek Yan for valuable insights and comments on early drafts of the manuscript. This study was supported by an AcRF

PE E R R E V I E W
The peer review history for this article is available at https://publo ns.com/publo n/10.1111/ddi.13211.

DATA AVA I L A B I L I T Y S TAT E M E N T
Data sources for occurrence localities and details on dispersal rates for our five study species can be found in Appendices S1 and S3, respectively. Global climatic layers were obtained from the WorldClim bioclimatic dataset version 1.4 (Hijmans et al., 2005)