SDMtune: An R package to tune and evaluate species distribution models

Abstract

Balancing model complexity is a key challenge of modern computational ecology, particularly so since the spread of machine learning algorithms. Species distribution models are often implemented using a wide variety of machine learning algorithms that can be fine‐tuned to achieve the best model prediction while avoiding overfitting. We have released SDMtune, a new R package that aims to facilitate training, tuning, and evaluation of species distribution models in a unified framework. The main innovations of this package are its functions to perform data‐driven variable selection, and a novel genetic algorithm to tune model hyperparameters. Real‐time and interactive charts are displayed during the execution of several functions to help users understand the effect of removing a variable or varying model hyperparameters on model performance. SDMtune supports three different metrics to evaluate model performance: the area under the receiver operating characteristic curve, the true skill statistic, and Akaike's information criterion corrected for small sample sizes. It implements four statistical methods: artificial neural networks, boosted regression trees, maximum entropy modeling, and random forest. Moreover, it includes functions to display the outputs and create a final report. SDMtune therefore represents a new, unified and user‐friendly framework for the still‐growing field of species distribution modeling.

Machine learning algorithms have become increasingly popular for building species distribution models (SDMs), thanks to the availability of high computational power and their ability to fit complex nonlinear relationships without requiring an a priori definition of a data model (Breiman, 2001). However, there are still many decisions to be made at various steps of the model building process that can influence the final output (Guisan & Thuiller, 2005). For example, the amount of complexity should be cautiously controlled to avoid models that underfit or overfit the underlying data (Merow et al., 2014; Warren & Seifert, 2011).
In general, the complexity of a model depends on the number of chosen predictors and their transformations (Merow et al., 2014). Moreover, each machine learning algorithm has a series of parameters, known as hyperparameters. In contrast to model parameters, which are estimated from the data during model training, hyperparameters have a fixed value that must be defined before model training. Even though most machine learning algorithms have predefined default values, the optimal value of each hyperparameter is unknown, as it is specific to the modeling problem and the dataset. Nevertheless, the choice of hyperparameter values affects model complexity and/or performance.
For example, in a neural network, the maximum number of iterations controls how many iterations the optimization algorithm executes. This value does not affect model complexity, but if it is too low the algorithm might not converge, thus generating a model with lower performance. On the other hand, increasing the size of the hidden layer increases the number of parameters of the model and consequently its complexity, which in turn might affect its performance. In a Maxent model (Phillips, Anderson, & Schapire, 2006), the amount of regularization controls overfitting by shrinking some parameters toward zero, which consequently penalizes model complexity. Although several authors have stressed the importance of inspecting the hyperparameters because default settings did not always yield optimal performance (Elith et al., 2010; Merow, Smith, & Silander, 2013; Warren & Seifert, 2011; Warren, Wright, Seifert, & Shaffer, 2014), the time-consuming task of comparing models trained with a multitude of possible combinations of hyperparameter values (e.g., Zeng, Low, & Yeo, 2016) may discourage many researchers from doing so in practice.
In order to optimize model complexity and performance, both the predictors used to build the model and the values of the hyperparameters should be carefully selected, which represents a challenge given the often numerous possible options. The new package SDMtune described here offers a framework to build and systematically tune SDMs. The package includes utilities that help R users (R Core Team, 2019) along the entire analysis process, from data preparation to graphical representation of the results and reporting. In particular, it contains dedicated functions to perform variable selection and hyperparameter tuning. Hyperparameter tuning, also called hyperparameter optimization, is a process usually based on a trial-and-error experiment during which several models with different hyperparameter values are trained and evaluated in order to identify which combination yields the best performance. The simplest algorithm for hyperparameter tuning, grid search, trains and compares models with all possible combinations of the defined hyperparameter values and can thus be a very time-consuming process.
While other available R packages contain functions for tuning either one statistical model type (e.g., ENMeval (Muscarella et al., 2014), wallace (Kass et al., 2018), and kuenm (Cobos, Townsend Peterson, Barve, & Osorio-Olvera, 2019)) or several (e.g., biomod2 (Thuiller, Georges, & Breiner, 2019), sdm (Naimi & Araújo, 2016), zoon (Golding et al., 2018), and caret (Kuhn et al., 2019)), functions for data-driven variable selection are not always included, and hyperparameter tuning is always based on grid search or random search algorithms. SDMtune offers an alternative that relies on a genetic algorithm for exploring the hyperparameter configuration space (Lessmann, Stahlbock, & Crone, 2005; Young, Rose, Karnowski, Lim, & Patton, 2015), applicable to the most commonly used SDM algorithms. This method significantly reduces the time required to find a near-optimal or the optimal model configuration. As an additional advantage, all functions for selecting the variables and tuning the hyperparameters are supported by an interactive chart, displayed in real time, that shows the change in model performance during the different steps of function execution. The chart is created in the RStudio (RStudio Team, 2018) viewer pane using the open source library Chart.js (https://www.chartjs.org). This graphical representation of the output facilitates the understanding of the underlying algorithm's action and avoids the black-box feeling that often arises when dealing with complex methods.

| PACKAGE WORKFLOW AND DESCRIPTION
In this section, we present a possible use of the SDMtune package that covers a complete analysis in seven steps (Figure 1): (1) preparing the data for the analysis, (2) training and evaluating a model, (3) performing the variable selection, (4) tuning the model hyperparameters, (5) optimizing model parsimony, (6) evaluating the final model, and (7) generating an output report. Users can combine the available functions in a way that best suits them. For example, step 4 could be repeated after step 5 to further fine-tune model hyperparameters.

| Preparing data for the analysis
SDMtune uses a special object to compile the data for the analysis.
This object, called SWD (samples with data, a format similar to the one used by the Maxent software), bundles all the information related to each record (name of the species, coordinates of the species' presence and absence/background locations, and the values of the environmental variables at these locations), thereby reducing the risk of mistakes in further analyses.
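For illustration, a minimal sketch of this step is given below; presence_coords, background_coords, and predictors are hypothetical placeholder objects standing for the user's own coordinate data frames and raster predictors, and argument names should be checked against the installed package version.

    library(SDMtune)

    # Minimal sketch: bundle species data into an SWD object (the three
    # input objects are assumed to exist already in the workspace)
    data <- prepareSWD(species = "My species",
                       p = presence_coords,    # presence coordinates (data frame)
                       a = background_coords,  # background coordinates (data frame)
                       env = predictors)       # raster object with the predictors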
Before starting the analysis, the user should decide which evaluation strategy to use. SDMtune provides two methods: (1) simple hold-out validation and (2) k-fold cross-validation. The k folds for the cross-validation can be created either randomly, using the provided randomFolds function, or spatially/environmentally clustered, using functions included in the packages ENMeval and blockCV (Valavi, Elith, Lahoz-Monfort, & Guillera-Arroita, 2019); in this case, SDMtune will internally convert the folds into the required format. The selected validation strategy is used to perform the variable selection and/or tune the model hyperparameters in order to optimize the model performance and address overfitting. When tuning the hyperparameters, several models with different configurations are trained and evaluated in an iterative process that aims at improving the predictive performance on the validation dataset or, if cross-validation is used, on the arithmetic mean of the evaluation metric across all folds. During this process, part of the information contained in the validation dataset is inevitably transferred into the trained model, even if the validation data are not directly used to train the model (Chollet & Allaire, 2018; Müller & Guido, 2016). It is therefore advisable to hold apart an independent partition of the data, that is, the testing dataset, to obtain an unbiased evaluation of the final model (Hastie, Tibshirani, & Friedman, 2009; Merow et al., 2014).

FIGURE 1 Package workflow illustrating the seven steps of the model tuning process. The functions required to perform the different steps are given in the headline.

TABLE 1 Overview of the hyperparameters that can be tuned per statistical method and underlying package.
The selection of a metric and a validation strategy should therefore be done early in the model tuning process, because it has implications for how the data should be split before training the first model. Note that the AICc score is computed using all the observation locations (Warren & Seifert, 2011) and does not require partitioning the observations into training and validation datasets.
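As a minimal sketch, assuming data is the SWD object created above (the function names follow the package documentation; the seed and the split proportions are illustrative):

    # Hold apart 20% of the data as testing dataset, then create random
    # folds for cross-validation on the remaining part
    datasets <- trainValTest(data, test = 0.2, only_presence = TRUE, seed = 25)
    train_data <- datasets[[1]]
    test_data <- datasets[[2]]
    folds <- randomFolds(train_data, k = 4, only_presence = TRUE, seed = 25)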

| Training and evaluating a model
Currently, four machine learning methods are available (Table 1): artificial neural networks (ANN), boosted regression trees (BRT), maximum entropy (ME), and random forest (RF). Two different implementations of the ME method can be selected: "Maxent" to use the Java implementation (version 3.4.1 or higher) and "Maxnet" for the R implementation using the maxnet package (Phillips, Anderson, Dudík, Schapire, & Blair, 2017;Phillips et al., 2006). There are specific arguments of the train function that can be used to set the model hyperparameters. By default, these arguments are set to the same values as implemented in the dependent packages.
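For example, the following sketch trains a Maxnet model, first with default hyperparameters and then with custom ones; fc and reg are the documented hyperparameter arguments for the maximum entropy method, and the values shown are illustrative.

    # Train with default hyperparameters, using the folds created earlier
    model <- train(method = "Maxnet", data = train_data, folds = folds)

    # Train with custom hyperparameters: linear, quadratic, and hinge
    # feature classes and a regularization multiplier of 0.5
    model_custom <- train(method = "Maxnet", data = train_data, folds = folds,
                          fc = "lqh", reg = 0.5)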
A trained model can be evaluated using one of the three implemented metrics: (1) the area under the receiver operating characteristic (ROC) curve (AUC) (Fielding & Bell, 1997), (2) the true skill statistic (TSS) (Allouche, Tsoar, & Kadmon, 2006), and (3) Akaike's information criterion corrected for small sample sizes (AICc, only for ME method) (Burnham & Anderson, 2004;Warren & Seifert, 2011). It should be noted that AICc is a relative measure describing the model fit in relation to complexity (parsimony) but holds no information on predictive performance. It can thus only be used to compare competing models trained using the same data but not for final model evaluation.
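In code, the evaluation might look as follows (a sketch; for cross-validated models the test = TRUE argument is assumed to return the metric averaged across the validation folds):

    auc(model)               # AUC computed on the training dataset
    auc(model, test = TRUE)  # mean AUC across the validation folds
    tss(model, test = TRUE)  # mean TSS across the validation folds

    # For a maximum entropy model trained without cross-validation, the
    # AICc additionally requires the environmental rasters:
    # aicc(model, env = predictors)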

| Performing the variable selection
When the environmental variables used to train the model are highly correlated, it is difficult to interpret the model output, especially the relative importance of the variables and their response curves. A common practice is thus to select a subset of variables among which collinearity falls below a predefined threshold. A reasonable approach to remove highly correlated variables is to base the selection on expert knowledge, that is, retaining the environmental variable that is most likely to be ecologically meaningful for the target species. When this is unknown, an alternative approach is a "data-driven" variable selection that uses the information contained in the data to select the variable with the highest explanatory value among those that are highly correlated. The function varSel iterates through several steps: Starting from a trained model, it checks if the variable ranked as the most important (using the permutation importance or, optionally for Maxent models, the percent contribution (Phillips, 2017a)) is correlated with any of the other variables, using a given correlation criterion (e.g., Spearman's rho) and correlation threshold. If so, a leave-one-out Jackknife test is performed, starting with the full model, and among all correlated variables the one whose removal decreases the model performance on the training dataset the least is discarded. A new model without this variable is then trained and again checked for highly correlated variables. The process is repeated until the correlations among all retained variables fall below the predefined threshold. During the execution of the function varSel, a real-time chart shows which variable is removed and the relative effect on the model performance (a usage sketch of varSel is given in the case example below).

| Tuning the model hyperparameters

Grid search is a brute force method that results in a very time-consuming process with high computational costs. A possible alternative is to randomly select some hyperparameter combinations among the user-defined values (Bergstra & Bengio, 2012). This approach is implemented in the randomSearch function, which usually finds a better performing model compared with the starting one. However, the disadvantage of both the grid search and random search methods is that they do not use any information acquired while iterating through the hyperparameter configuration space in order to improve the model performance. The function optimizeModel instead applies a genetic algorithm (Holland, 1992) to optimize the combination of the hyperparameters more quickly (an example of a genetic algorithm used to define the hyperparameters and architecture of a deep neural network is presented by Miikkulainen et al. (2018)). The algorithm (Figure 2) starts by generating a random initial "population" of models (using the randomSearch algorithm) with a given "population size". The "fitness" of the population is measured with the chosen evaluation metric computed on the validation dataset, and models are ranked accordingly. During the evaluation of the "fitness," underfitting is controlled by ensuring that models for which the evaluation metric computed on the validation dataset is higher than the one computed on the training dataset are ranked in the last positions. At this point the selection process starts, during which some models ("individuals") are selected according to their "fitness" from the initial "population" to create the first "generation." There are two selection criteria. At first, a predefined proportion of the "fittest" models (i.e., models ranked in the first positions) is retained.
Afterward, a small portion of the poorly performing models (i.e., those not selected as "fittest") is randomly retained in order to keep more variation in the population and reduce the possibility that the algorithm falls into a local optimum. The retained models are then submitted to the optimization process: they are "bred" (i.e., combined) to create other "individuals" and to reach again the predefined "population" size. In this process, two models, called "parents," are randomly selected from the retained models ("selected individuals") to "breed" and generate a "child." This new model randomly inherits a value for each hyperparameter from one of the "parents," a process called "crossover." A "mutation" chance with a predefined probability is added to increase the variation in the population. When the "mutation" is triggered, one of the model's hyperparameters is randomly selected and its value is randomly sampled from those available but not included in the "parents." Once the population reaches the defined size, the "fitness" is calculated again, and the process is repeated for the number of generations specified in the function. The user can set all the arguments: population size, number of generations, fractions of both best and worst performing models to be retained at each generation, as well as the probability of mutation during crossover episodes; default values that will work in most cases are also defined. All the functions described in this section come with a real-time chart showing the model performance while the algorithm is running in the background.
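The three tuning functions can be sketched as follows, using the model trained earlier; the hyperparameter grid is illustrative, and the pop and gen values are taken from the defaults reported for optimizeModel in this article.

    # Illustrative grid of hyperparameter values to explore
    hypers <- list(fc = c("l", "lq", "lqp", "lqph"),
                   reg = seq(0.5, 4, 0.5))

    # Exhaustive grid search: trains one model per combination
    gs <- gridSearch(model, hypers = hypers, metric = "auc")

    # Random search: evaluates a random subset of the combinations
    rs <- randomSearch(model, hypers = hypers, metric = "auc", pop = 20)

    # Genetic algorithm: evolves a population of 20 models over 5 generations
    om <- optimizeModel(model, hypers = hypers, metric = "auc",
                        pop = 20, gen = 5, seed = 25)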

| Optimizing model parsimony
As soon as an optimal hyperparameter combination has been selected, we may want to reduce model complexity by removing the environmental variables ranked as least important. This is the purpose of the function reduceVar, which removes, in a stepwise fashion, all variables whose importance falls below a given threshold; optionally, using the Jackknife approach, a variable is removed only if its removal does not decrease the model performance.
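As a sketch, assuming the tuning output stores the ranked models in a models slot (an assumption based on the package documentation):

    # Take the best model found during tuning and test all variables that
    # contribute less than 2%, removing them only if model performance is
    # preserved (Jackknife approach)
    best_model <- om@models[[1]]
    pruned_model <- reduceVar(best_model, th = 2, metric = "auc",
                              permut = 10, use_jk = TRUE)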

FIGURE 2 Flowchart illustrating the steps of the genetic algorithm implemented in the function optimizeModel, with orange ovals representing the beginning and the end of the algorithm, blue boxes the main operations executed by the algorithm, and green hexagons the iteration loops. The default values used by the function are given in gray, with "size" indicating the initial population size; "keep best" the proportion of best models retained; "keep random" the proportion of less performing models retained; and "mutation chance" the probability that a mutation event occurs. Keep best and keep random are provided as proportions of the initial population size. The dotted box shows an example of crossover during which two models, randomly chosen from the selected "individuals", are combined to generate a child model that inherits the first and third hyperparameter values from Model 2 and the second from Model 1. When the number of generations is zero, the flowchart represents the algorithm implemented in the function randomSearch.

| Evaluating the final model
At this point, after the variable set has been optimized (varSel and reduceVar) and the hyperparameters of the model have been tuned (gridSearch, randomSearch, or optimizeModel), the model can be evaluated on the held-apart testing dataset, which was never used during the tuning procedure, using one of the functions that compute the chosen metric (i.e., AUC or TSS). Another possibility is to train a new model using the selected variables and hyperparameter combination on the full dataset (i.e., without applying cross-validation or data partitioning) and evaluate it on the held-apart testing dataset (Chollet & Allaire, 2018). This way the model can avail of a greater amount of information and might thus be able to generalize better.
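A minimal sketch of the second option, with placeholder values standing in for the tuned hyperparameter configuration:

    # Retrain on the full training dataset (no cross-validation) using the
    # tuned hyperparameters; the fc and reg values are placeholders
    final_model <- train("Maxnet", data = train_data, fc = "lqh", reg = 1.5)

    # Unbiased evaluation on the held-apart testing dataset
    auc(final_model, test = test_data)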

| PERFORMANCE ASSESSMENT OF THE GENETIC ALGORITHM
We evaluated the performance of the genetic algorithm in terms of time saving and model accuracy for the four SDM methods available in SDMtune by comparing the outputs of the optimizeModel and gridSearch functions. We used the virtualSp dataset provided with the package, which contains simulated presence, absence, and background locations generated with the package virtualspecies (Leroy, Meynard, Bellard, & Courchamp, 2016). For the artificial neural network, boosted regression trees, and random forest methods we used the presence and absence data, while for the maximum entropy method we used the presence and background data.
The maximum entropy method was run with the "Maxnet" implementation. In all cases, a 10-fold cross-validation was used as validation strategy and the AUC was used as evaluation metric. As a first step, we trained a model with default hyperparameter values (for the artificial neural network we used a hidden layer with a size equal to the number of environmental variables), and then executed the two functions, testing 1,200 possible hyperparameter combinations (Table A1; for the optimizeModel function we used the default arguments). The results of the analysis are presented in Table 2. In all cases, the optimizeModel function found a near-optimal solution in a significantly reduced amount of time.
TABLE 2 Performance assessment of the gridSearch compared to the optimizeModel function for model tuning regarding execution time (expressed as HH:MM:SS) and evaluation metric (on the training dataset "Train AUC" and the validation dataset "Val AUC," given as arithmetic mean across the folds of a 10-fold cross-validation) for the four methods implemented in SDMtune.

| EXAMPLE OF APPLICATION: BEARDED VULTURE IN THE SWISS ALPS
To illustrate a typical application of the package, we used SDMtune to model the distribution of the bearded vulture (Gypaetus barbatus) in the Swiss Alps. We randomly split the observations into two partitions and used 80% (1,363 observations) as training dataset and the remaining 20% (584 observations) as testing dataset. A set of 39 environmental predictors that might be relevant for the species was prepared for the analysis, as using numerous predictors together with a large number of species observations allows for a better illustration of the advantages and time-saving functionalities provided by our package. The variables included information on topography, climate, geology, anthropogenic infrastructure, land cover, and food availability, following Hirzel et al. (2004). All predictors were prepared as raster maps with a resolution of 100 × 100 m, with each cell containing the average value of the respective variable within a 1-km² circular moving window (a list of the variables is provided in Appendix A, Table A2). The whole analysis was conducted using R version 3.6.0 (R Core Team, 2019).
We performed the data-driven variable selection using the function varSel on the initial set of 39 predictors. As a first step, we trained a model using the "Maxent" method with default settings (i.e., linear, quadratic, product, and hinge as feature class combination, a regularization multiplier equal to 1, 10,000 background locations, and 500 iterations) and the 39 environmental variables. We then used the varSel function to execute the variable selection, using the percent contribution to rank variable importance and the AUC as performance metric. The function arguments were set to check for Spearman's correlation coefficients |r_s| greater than or equal to 0.7, based on 30,000 random background locations (Table A3).
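These calls can be sketched as follows; train_swd, val_swd, and bg_cor are hypothetical objects standing for the training data, the validation data, and the 30,000 background locations used for the correlation test, and use_pc is the documented argument for ranking variables by percent contribution.

    # Train a Maxent model with default settings
    maxent_model <- train("Maxent", data = train_swd)

    # Data-driven variable selection: iteratively discard highly correlated
    # variables (|r_s| >= 0.7) with low explanatory power
    selected_model <- varSel(maxent_model, metric = "auc", test = val_swd,
                             bg4cor = bg_cor, method = "spearman",
                             cor_th = 0.7, use_pc = TRUE)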
Starting with the model trained using the 28 selected variables (i.e., the output of the varSel function, Table A4), we conducted a simple experiment to compare the performance of the optimizeModel and the gridSearch function in terms of execution time and best hyperparameter combination. We selected the AUC as the performance metric and ran a fourfold cross-validation, with folds created by randomly splitting the training dataset using the function randomFolds. For the optimizeModel function, we used the default arguments: a population size of 20 models, five generations, retaining 40% of the best performing models, randomly retaining 20% of the less performing ones, and using a mutation chance of 40%. We tested different sets of hyperparameters (Table A5 and Table 3).

TABLE 3 Performance of the gridSearch compared to the optimizeModel function for model tuning regarding execution time (expressed as HH:MM:SS) and evaluation metric (on the training dataset "Train AUC," the validation dataset "Val AUC," and the difference between both "Diff AUC," given as arithmetic mean of the fourfold cross-validation) on the case example data of the bearded vulture. Note: The models were trained using the Maxent method. The number of tested hyperparameter combinations is given by "h". A description of the exact hyperparameter combinations is provided in Appendix A, Table A5. "FC" represents the feature class combination, "reg" the regularization multiplier, and "iter" the number of iterations for the best performing model. FC codes: (l) linear, (q) quadratic, (p) product, and (h) hinge.

As a last step, we optimized model parsimony by executing the reduceVar function on the two best models found by the gridSearch and optimizeModel functions. We considered the Maxent percent contribution to rank the environmental variables, set a threshold of 2% for variable removal, and used the Jackknife approach. We could remove nine and seven environmental variables, respectively, without reducing the mean validation AUC (Table A6 and Figure A2).
Finally, we trained a model using the full training dataset without cross-validation, the selected environmental variables, and the best hyperparameter configuration found by each of the two functions. We estimated the performance of these tuned models on the held-apart testing dataset, obtaining very similar results (Table 4).

| DISCUSSION
Although hyperparameter tuning is possible with other R packages, like ENMeval (Muscarella et al., 2014), wallace (Kass et al., 2018), and kuenm (Cobos et al., 2019), this process can be very time consuming (Tables 2 and 3) and limiting, especially when performed for multiple species. With SDMtune, we introduce a genetic algorithm that drastically reduces the computation time of hyperparameter tuning while achieving an optimal or near-optimal model configuration.

While the gridSearch function can be preferred for tuning a single or a few hyperparameters, it quickly comes to its limits when testing a large number of hyperparameter combinations. In the performance assessment of the genetic algorithm and in the example of application presented here (Tables 2 and 3), the optimizeModel function found near-optimal hyperparameter configurations in a fraction of the time required by the exhaustive grid search.

Not only the tuning of hyperparameters, but also the selection of environmental variables for SDMs has gained attention in recent years (Jueterbock, Smolina, Coyer, & Hoarau, 2016; Warren et al., 2014; Zeng et al., 2016). Although highly correlated environmental variables are not a problem when the aim of the study is prediction within the same extent as the observed data, reducing collinearity is recommended in order to reduce model complexity and increase the interpretability of the predictors (Dormann et al., 2013; Merow et al., 2013). In addition, although the degree of accepted model complexity varies according to the modeling scope(s) (Halvorsen, 2012; Halvorsen, Mazzoni, Bryn, & Bakkestuen, 2015), it has been pointed out that models might perform best when trained with a reduced number of predictors (Brun et al., 2020; Halvorsen et al., 2015). Even though the selection should be driven by knowledge of the modeled species, this might be difficult when the user must decide among several a priori ecologically relevant predictors for the species, or if the ecology of the species is poorly known. Moreover, the correlation structure among predictors may change in space and time (Braunisch et al., 2013), causing unexpected predictions (Warren et al., 2014). This risk is reduced with a smaller number of predictors. Reducing the number of predictors may also limit overfitting, and thus result in a model that generalizes better and yields more accurate predictions for data not used during training.

The selection of a threshold to reduce the number of variables with the function reduceVar is quite arbitrary. If the aim is to remove as many variables as possible while preserving model performance, the threshold can be set to 100 and the Jackknife method must be selected. On the contrary, if the user, based on their expertise, judges a certain variable as ecologically important for the species and wants to retain it in the model, they can define a threshold that is lower than the importance of this variable.
Nevertheless, the functions presented in this article should not be applied blindly. For this reason, SDMtune provides interactive real-time charts to visualize every step of the algorithms, with the idea that the user further evaluates the validity of the final output.
These charts are particularly useful for two reasons. First, because they are updated in real time, they confirm that the running function is working properly and is not frozen at some unknown step. This is especially important for functions that take a long time to execute. Second, because they are interactive, different types of information can be provided without overloading a single graph, since extra information is embedded in a tooltip that appears when the user hovers over a specific element of the chart. Interactive real-time charts are well established in other fields at the state of the art of machine learning, and are available in a few R packages such as keras (Allaire & Chollet, 2020), which allows the user to build complex deep learning models.

| CONCLUSION
The new R package SDMtune enables data-driven variable selection and hyperparameter tuning for the most commonly used species distribution modeling methods in a unified framework, supported by interactive real-time charts that make every step of the process transparent to the user.

| INSTALLATION
The package SDMtune is available in the CRAN repository at https://CRAN.R-project.org/package=SDMtune and can be installed in R with the command install.packages("SDMtune"). The package is under development and the source code is hosted on GitHub (https://github.com/ConsBiol-unibern/SDMtune). We encourage future users to provide feedback and report bugs by opening an issue on the GitHub platform.
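In code (installing the development version via the remotes package is a suggestion, not prescribed by the package documentation):

    # Install the stable release from CRAN
    install.packages("SDMtune")

    # Optionally, install the development version from GitHub
    remotes::install_github("ConsBiol-unibern/SDMtune")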

CONFLICT OF INTEREST
None declared.

DATA AVAILABILITY STATEMENT
The bearded vulture locations and the environmental variables used in the case example have not been archived because part of the data cannot be made publicly available due to data property rights and the conservation vulnerability of the species. However, the analysis steps illustrated in the case example can also be reproduced following the articles published on the SDMtune website (https://consbiol-unibern.github.io/SDMtune/) and using the virtualSp dataset provided with the package. The code necessary to reproduce the performance assessment of the genetic algorithm is provided in Appendix A.

R CODE TO REPRODUCE THE PERFORMANCE ASSESSMENT OF THE GENETIC ALGORITHM
The output presented in the article has been produced using R version 3.6 on a Linux operating system. Results might differ slightly with other versions of R or operating systems, because the random numbers generated after setting the seed may differ.
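The full appendix code is not reproduced in this extract. The following condensed sketch outlines its structure for the Maxnet case; loading the example predictors from the dismo package follows the SDMtune documentation, while the hyperparameter grid shown is illustrative rather than the exact Table A1 values.

    library(SDMtune)

    # Simulated species data shipped with the package
    data(virtualSp)

    # Example predictors (assumption: the dismo example rasters, as used
    # in the package documentation)
    files <- list.files(path = file.path(system.file(package = "dismo"), "ex"),
                        pattern = "grd", full.names = TRUE)
    predictors <- terra::rast(files)

    swd <- prepareSWD(species = "Virtual species",
                      p = virtualSp$presence,
                      a = virtualSp$background,
                      env = predictors)

    folds <- randomFolds(swd, k = 10, only_presence = TRUE, seed = 25)
    model <- train("Maxnet", data = swd, folds = folds)

    # Illustrative grid; the article tested 1,200 combinations (Table A1)
    hypers <- list(fc = c("l", "lq", "lqp", "lqph"),
                   reg = seq(0.1, 3, 0.1))

    gs_time <- system.time(gs <- gridSearch(model, hypers = hypers,
                                            metric = "auc"))
    om_time <- system.time(om <- optimizeModel(model, hypers = hypers,
                                               metric = "auc", seed = 25))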

TABLE A3 Spearman's correlation coefficients of highly correlated environmental variables (r_s ≤ −0.7 or r_s ≥ 0.7). For variable codes see Table A2.

TABLE A6 List of the remaining environmental variables after the execution of the reduceVar function (Figure A2) and their relative percent contribution, rounded to the first decimal place. Note: The model parsimony optimization was performed based on the outputs of the optimizeModel and gridSearch functions, respectively, executed to tune 1,200 possible combinations of hyperparameters.
FIGURE A1 Snapshot of the real-time chart after executing the optimizeModel function (with a "population size" of 20 models and five model generations) using 1,200 different hyperparameter combinations on the case example data of the bearded vulture. The scatterplot on top shows the training AUC (in orange) and the validation AUC (in blue) of the 20 ranked models at the end of the fifth generation, given as the arithmetic mean of the fourfold cross-validation. The line plot at the bottom shows the increase in model performance (based on the AUC on both the training and validation datasets) at each generation, with "start": the starting model before running the optimization process; "0": the best performing model after the random population is created; and "1-5": the best performing models in each of the five generations of the optimization process.

FIGURE A2 Snapshot of the real-time chart after executing the reduceVar function on the case example data of the bearded vulture. The bar chart on top shows the 28 uncorrelated environmental variables with the percent contribution of the retained environmental variables at the end of the selection process, calculated according to Phillips et al. (2006). The line chart at the bottom shows the change in model performance (based on the AUC on both the training and validation datasets, given as arithmetic mean of the fourfold cross-validation) at each iteration where a single variable is removed. In the RStudio viewer pane, the chart is interactive: when the user hovers over the line or bar chart, a tooltip reports the variable that has been removed and the relative model performance values. Variable codes are provided in Table A2.