A review of evidence about use and performance of species distribution modelling ensembles like BIOMOD

The idea of combining predictions from different models into an ensemble has gained considerable popularity in species distribution modelling, partly due to free and comprehensive software such as the R package BIOMOD. However, despite the proliferation of ensemble models, we lack an overview of how and where they are used for modelling distributions, and of how well they perform. Here, we present such an overview.


| INTRODUCTION
Species distribution models (SDMs) are an increasingly important tool in ecology, biogeography and conservation sciences.
Correlative SDMs are models that use species-environment relationships to explain and predict distributions of species. SDMs have broad applications, including exploring ecological and evolutionary hypotheses, invasive species management, reserve planning and predicting the impact of past and future climate change on species and communities (Guillera-Arroita et al., 2015; Guisan & Thuiller, 2005; Hijmans & Graham, 2006). There are now many methods used for distribution modelling, varying in how they approach model selection, how they define fitted functions and interactions, whether they can handle imperfect detection and sampling biases, and so on (Franklin, 2010; Guisan, Thuiller, & Zimmermann, 2017; Peterson et al., 2011). For example, statistical regression models such as generalized linear models (GLMs; McCullagh, 1984) compute species occurrence as parametric functions of environmental variables, whereas random forests (RF; Breiman, 2001) are based on machine-learning decision tree approaches. Predictive outcomes across SDM methods are known to be variable, and the choice of modelling method can substantially affect model predictive performance. However, no one method is consistently superior in performance across species, regions and applications (Elith et al., 2006; Pearson et al., 2006; Segurado & Araújo, 2004). This makes it difficult to choose which method (individual model hereafter) to use, prompting the idea of combining predictive outputs from different models in a so-called ensemble (Araújo & New, 2007).
The underlying philosophy of ensemble modelling is that each individual model carries both some true "signal" about the relationships the model is aiming to capture, and some "noise" created by errors and uncertainties in the data and the model structure.
Ensembles combine models with the intention of obtaining a better separation of the signal from the noise (Araújo & New, 2007; Dormann et al., 2018). Ensemble modelling has a history in other fields dealing with complex and dynamic systems, such as economics (Gregory, Smith, & Yetman, 2001) and meteorology (Sanders, 1963), and is thought to be traceable to the earliest days of statistical sciences (de Laplace, 1818). The concept of ensembles is widely used in machine learning, often with complex classifiers built by combining many simple modelling units. These types of ensembles have been shown in many cases to have superior predictive performance to individual models (Friedman & Popescu, 2008; Seni & Elder, 2010). They use a variety of methods (including bagging and boosting; Dietterich, 2000) to form the ensemble and are integral to a number of methods used in species distribution modelling (e.g., Cutler et al., 2007; Elith, Leathwick, & Hastie, 2008; Hardy, Lindgren, Konakanchi, & Huettmann, 2011).
A popular variant of SDM ensembles emerged over a decade ago from within the species distribution modelling research community, and it is that variant we study here.
Early ideas emphasised use of SDM ensembles for forecasting species distribution changes under future climates (the term "consensus forecast" is often used in this context; Araújo & New, 2007; Thuiller, Lafourcade, Engler, & Araújo, 2009) and focused on ensembles across modelling methods. This contrasts with the idea of ensembles or averages across multiple instances of the same method, such as model averaging of regression models with different combinations of predictors (Burnham & Anderson, 2003), or ensembles of trees such as boosted trees or RF (Hastie, Tibshirani, & Friedman, 2009). From here on, we focus on the ensembles-across-methods ideas that have emerged in the SDM literature, in the interest of targeting a widely used approach in species distribution modelling and understanding its use and performance.
We note that, within this literature, some users focusing on climate change view the consensus forecast as consensus across model predictions and future climate change scenarios (e.g., combinations of emission scenarios and global circulation models). However, to enable review across the range of applications, we focus here on ensembles-across-methods (i.e., algorithms), regardless of whether they are also across future climate scenarios.
Various strategies exist to combine predictions from individual models into an ensemble, the most intuitive of which is simply taking the mathematical mean or median across predictions, irrespective of whether those predictions are binary or continuous. More complex approaches involve "weighting": scaling the predictions of different models by weights based on some measure of predictive performance (Araújo & New, 2007). These weights are often derived by validating predictions from the individual models on test data. Weighting is thought to improve how well the ensemble predicts (Araújo & New, 2007; Marmion, Parviainen, Luoto, Heikkinen, & Thuiller, 2009), although weighted ensembles also require more effort to produce, as the individual models need to be validated before they can be combined (although users of unweighted ensembles may also choose to validate them).
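To make these combination strategies concrete, here is a minimal sketch in plain R (not tied to BIOMOD or any particular package; the prediction values and AUC scores are hypothetical, purely for illustration):

```r
# Hypothetical continuous predictions (e.g., probability of occurrence)
# from three individual models at the same four sites
preds <- cbind(
  glm = c(0.62, 0.10, 0.81, 0.33),
  gbm = c(0.55, 0.18, 0.90, 0.41),
  rf  = c(0.70, 0.05, 0.77, 0.28)
)

# Unweighted ensembles: simple mean and median across models
ens_mean   <- rowMeans(preds)
ens_median <- apply(preds, 1, median)

# Weighted ensemble: scale each model's contribution by a performance
# score estimated on test data (here, illustrative AUC values)
auc          <- c(glm = 0.78, gbm = 0.84, rf = 0.88)
weights      <- auc / sum(auc)          # normalise weights to sum to 1
ens_weighted <- as.vector(preds %*% weights)
```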
We choose BIOMOD (Thuiller, 2003; Thuiller et al., 2009) as our focus because BIOMOD is the most well-known and well-established ensemble software created within the SDM community, it is freely available, and we expect its use to continue. Other approaches are also available, but these are often tailor-made by specific groups of researchers and not necessarily widely used by the broader community. BioEnsembles is a notable example (Diniz-Filho et al., 2009), but there are also several ensemble modelling examples without specialised tool sets (e.g., Crimmins, Dobrowski, & Mynsberge, 2013; Hardy et al., 2011). Mention of BIOMOD-based ensembles is increasingly common in the SDM literature.
A search of citations of the two BIOMOD introductory papers (Thuiller, 2003; Thuiller et al., 2009) on the ISI Web of Science database (searched on 7 June 2017) yields 829 unique results, 254 of them published in the 3 years prior to June 2017. Despite this proliferation, there is not yet an overview of published studies employing these types of ensemble methods. Such an overview is an important part of understanding modelling approaches and how they perform. Here, we aim to fill this knowledge gap by providing a review of SDM studies that used BIOMOD-like ensemble modelling. We summarize these studies in terms of their studied species, geography, data and choices in modelling methods, as a prelude to analysing the evidence these studies report on the performance of both the ensembles and the individual models. From the outset, our ultimate aim was to enhance understanding of how these types of ensemble models perform in their applications, how sensitive they are to choices made in the modelling process, and whether they are particularly suited to some situations over others.

| BACKGROUND: USE OF ENSEMBLES LIKE BIOMOD
Many users cite the superior predictive performance of ensembles over individual models (as seen in Crossman & Bass, 2008; Marmion et al., 2009) as justification for choosing them. However, individual models have also been shown to outperform ensemble models (Crimmins et al., 2013). It is reasonable to believe that another important contributor to the popularity of ensembles is the existence of comprehensive, free ensemble modelling tools such as BIOMOD.
BIOMOD provides a suite of methods and tools relevant to the problem of modelling distributions, such as the ability to quickly build individual models and to combine them in different ways.
BIOMOD was first developed in the S-Plus language environment in 2003 (Thuiller, 2003) and was later ported to the R statistical language environment (R Core Team, 2016) as a package under the name "biomod2." The latest version of "biomod2" (ver. 3.3-7; Thuiller, Georges, Engler, & Breiner, 2016) is capable of computing SDMs with up to 10 different modelling methods (for a full list, including abbreviations of model names used from here on, see Table 1). "biomod2" can combine these individual models using various approaches to form ensembles, including: Weighted Mean, Mean, Median, and Committee Averaging. The first three approaches correspond to the ensemble strategies previously introduced, while the last one, Committee Averaging, transforms probabilistic predictions from the single models into binary predictions using a threshold, and then averages them. Users of "biomod2" can choose which individual models they want to include in the ensemble. For example, they could use all individual models available, or use only models that performed better than a specified threshold on a selected metric.
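As an illustration of the Committee Averaging idea, here is a minimal sketch in plain R (our own rendering of the approach as described above, not "biomod2" internal code; the 0.5 threshold is an arbitrary choice, and the hypothetical preds matrix from the earlier sketch is reused):

```r
# Committee Averaging sketch: binarise each model's probabilistic
# prediction at a threshold, then average the 0/1 votes across models
committee_average <- function(preds, threshold = 0.5) {
  votes <- preds >= threshold   # logical matrix: one column per model
  rowMeans(votes)               # proportion of models voting "presence"
}

# Values of 1 (or 0) mean all models agree on presence (or absence);
# intermediate values quantify disagreement among the models
committee_average(preds, threshold = 0.5)
```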
TABLE 1 Classes of SDM modelling methods available in BIOMOD. (a) Note that the "biomod2" package uses different names for some of the classes (cf. Thuiller et al., 2009), that is, generalized boosted model (GBM) instead of boosted regression tree (BRT), and surface-range envelope (SRE) instead of BIOCLIM (Busby, 1991). (b) Presence-only data can be supplied to some presence/absence methods by selecting background points; however, one should interpret model outputs in these situations as relative but not absolute probability of species occurrence, consistent with other presence-background methods.

Another major module in "biomod2" is model evaluation. The evaluation statistics, useful in their own right, are also used as a basis for weighting models in Weighted Mean ensembles, and for determining which individual models to exclude from the ensemble due to poor performance. "biomod2" allows evaluation of models based on several metrics derived from a confusion matrix, including sensitivity (the proportion of true positives correctly identified), specificity (the proportion of true negatives correctly identified), the area under the receiver operating characteristic curve (AUC; Hanley & McNeil, 1982) and the true skill statistic (TSS; Allouche, Tsoar, & Kadmon, 2006). These latter two metrics are indicators of discrimination capacity, which quantifies how well the model can distinguish presences from absences (or presences from background samples, when absences are unavailable). As held-out subsets of the original dataset are often used for model evaluation (known as cross-validation, CV), strategies for partitioning data into subsets are provided in "biomod2". The package also allows users to define their own subsets, or to use a completely independent dataset for evaluation.
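For concreteness, the following sketch (plain R with made-up observations and predictions, not "biomod2" code) computes sensitivity, specificity and TSS from a confusion matrix; TSS is defined as sensitivity + specificity - 1 (Allouche et al., 2006):

```r
# Hypothetical observed presences/absences (1/0) and binarised predictions
obs  <- c(1, 1, 0, 0, 1, 0, 1, 0)
pred <- c(1, 0, 0, 1, 1, 0, 1, 0)

tp <- sum(obs == 1 & pred == 1)  # true positives
fn <- sum(obs == 1 & pred == 0)  # false negatives
tn <- sum(obs == 0 & pred == 0)  # true negatives
fp <- sum(obs == 0 & pred == 1)  # false positives

sensitivity <- tp / (tp + fn)    # proportion of presences correctly predicted
specificity <- tn / (tn + fp)    # proportion of absences correctly predicted
tss         <- sensitivity + specificity - 1  # true skill statistic
```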

| METHODS FOR LITERATURE REVIEW
To scan for BIOMOD-relevant studies, we considered those articles that cited either of the two introductory BIOMOD papers, Thuiller (2003) and Thuiller et al. (2009), supplemented by a keyword search for "ensemble" in Google Scholar. Alternative keywords "model averaging" and "consensus" were also tested in Google Scholar, but the SDM-relevant results tended to be a subset of the "ensemble" search results and these terms were thus not pursued further.
After combining the pools of results and removing duplicates, we further removed any publication that was not a peer-reviewed journal article (theses, book chapters, etc.). We then screened the abstracts of the remaining articles, removing any that did not include a specific study using correlative SDMs. These removed papers were generally either not SDM-oriented or were SDM discussion/review papers that did not include actual analysis. After screening, 694 papers were retained and read in full (see Supporting Information Appendix S1 for the complete list of assessed papers). From these, we retained only those papers that unambiguously used an SDM ensemble of the type defined above.

| Target species
We categorised modelled species into eight broad categories: plants, birds, herpetofauna (amphibians and reptiles), mammals, invertebrates, fish, mixed taxa and viruses or bacteria; we also made the distinction between marine and terrestrial/freshwater species.

| Data details
The quality and quantity of data are especially important to the performance of models (Araújo & Guisan, 2006) and to their suitability for particular end-uses (Guillera-Arroita et al., 2015). Therefore, we documented: (a) the number of species records; (b) whether species absence records were used in modelling; and (c) the spatial extent and resolution of the data.

| Predictor variables
We documented the number of predictors considered by studies, and broadly categorised predictors into four classes based on the nature of the variables.

| Performance results reported
We documented results on model performance where provided. In particular, we noted whether ensemble models were compared to individual models, and whether models were cross-validated or validated on independent data (here we define independent data as data collected independently of the data used in model fitting, e.g., data collected in a different time period or location, or by a different group of recorders).

| RESULTS AND DISCUSSION
We observe that BIOMOD-like ensembles have been used in a diverse range of applications; here, we summarize the main aspects of these ensemble SDM studies.

| Species and geography
FIGURE 1 Workflow diagram illustrating our literature review protocol, modified from the PRISMA scheme (Moher et al., 2009)

A wide variety of species are modelled. The most represented taxonomic group is terrestrial plants, followed by invertebrates and birds (Figure 2a), consistent with the broader applications of SDMs (Elith & Leathwick, 2009). Seventy reviewed studies model the distribution of a single species and the remaining 154 model more than one species (the number of species in multispecies studies ranges from 2 to 8,472, with median = 29 and mean = 434). Among reviewed studies, 26 model species across taxonomic groups, or model the distribution of response variables other than biological species (e.g., landform processes in Aalto & Luoto, 2014). The geographical focus of reviewed studies is heavily biased towards Europe (106 studies, Figure 2b), which in some studies also includes surrounding regions (the Mediterranean areas of Asia and North Africa). Terrestrial or freshwater species are modelled more often than marine species (208 and 16 studies, respectively). Data used in these studies vary greatly in raster resolution, with grid cell size ranging from 5 × 5 m in small-scale regional studies (e.g., Engler et al., 2013) to ~110 × 110 km (1-degree longitude/latitude), commonly used in global models. The median cell size is ~5 × 5 km and the mean is ~16 × 16 km. This summary of cell size excludes 30 studies that use stream segments as spatial units to model aquatic species or that do not make spatial predictions.

| Data and predictors
Among reviewed studies, 85 use species absence data in their models (among which two also use data on species abundance, or density).

| Model transfer
We observe that the scope of ensemble SDM applications has expanded to include transferring models in space and time.

| Methodology
In contrast to the diverse applications of the models, the modelling methodologies employed by these studies are generally similar. This is largely a consequence of how we constructed our review set, and of the consistency in approach across ensemble applications. Among reviewed studies, 12 explicitly report not using BIOMOD, instead using their own code or other toolboxes such as BioEnsembles. However, the methodologies employed by these 12 studies are closely related to those of BIOMOD and thus relevant to our review focus. The remaining studies all use BIOMOD, most likely in the form of the R package "biomod2" (a specific BIOMOD version number is reported in 57 studies).
On average, six individual models are used in ensembles (median = 6, mean = 6.2); details and acronyms of the specific modelling approaches are given in Table 1. In our reviewed studies, individual models are often subject to an initial round of validation before they are combined, as a means to provide weighting scores for Weighted Mean ensembles and to inform whether some models should be excluded from ensembles due to poor performance. These procedures often assess predictive accuracy using one or more measures of discrimination derived from a confusion matrix (e.g., 168 studies use AUC and 116 use TSS). It is noteworthy that when "biomod2" is used to validate models, sensitivity and specificity are always reported by default, although not all users choose to discuss these results.

| Performance
We distinguish studies using presence-only data (PO) from those using presence-absence data (PA), because MaxEnt and BIOCLIM are specifically designed for PO data and we thus expect higher usage of these methods among PO studies.
We find 46 studies that validate their final ensemble models. In 20 of these, the final validations use different data from those used in the initial validations (eight use a held-out proportion of the original data different from that used in the initial validation, and 12 use independent data). The remaining 26 studies use the same data for validating both the initial individual models and the final ensemble models. In theory, model performance assessed in this way could be biased towards "performance-informed" ensembles (i.e., Weighted Mean ensembles, or ensembles selecting the models performing best on evaluation data). This is because these ensembles could favour individual models that performed better on the held-out evaluation data, and are therefore primed to behave well on that same dataset in the final validation. However, we do not observe evidence of such a bias in the studies reviewed (there is no evidence that the performance of ensembles relative to individual models differs between those 26 studies and the other 20 studies; Pearson's chi-squared test, p ≈ 0.40; see also Figure 4).
In our selected set of papers, only two studies specifically aim to compare performance between ensemble and individual models. Common shortcomings across the reviewed studies are a lack of evaluation on data not used to tune the ensembles and, more generally, a lack of reporting of model performance.

| Use of ensembles
In the Introduction, we noted the continual growth in popularity of the ensemble approach in SDMs, and this review provides relevant data. Figure 5 presents the number of reviewed papers published each year since the initial release of BIOMOD in 2003.
The figure also shows some key events in the development of ensemble SDMs, which may have contributed to the proliferation of ensembles.
The results above summarize the main trends in ensemble SDM use. Further specifics that may be of interest to particular readers are available in Supporting Information Appendix S2.

| LEARNING MORE
Our review provides a broad overview of the patterns and habits of BIOMOD ensemble users. However, we were unable to insightfully characterize these studies in terms of modelling choices and the predictive performance of models, because documentation on these aspects is often lacking or ambiguous. For instance, while most studies reported which algorithms were included in the ensemble, relatively few detailed the version of BIOMOD used, whether default settings were retained, or how the individual models were tuned.
FIGURE 5 Number of reviewed papers published each year since the initial release of BIOMOD (Thuiller, 2003). Key events shown include the publication of Araújo and New (2007), laying a popular theoretical framework for ensemble modelling in SDMs; Marmion et al. (2009), demonstrating predictive superiority of ensemble models (particularly weighted ensembles); the release of "biomod2"; and the implementation of the popular MaxEnt algorithm in "biomod2" (Thuiller et al., 2016)

We know that these methodological details are not central to the aims of many studies, but choices such as the cross-validation strategy can matter: for instance, at times it may be appropriate to spatially or environmentally separate the folds using block cross-validation (Roberts et al., 2017; Valavi, Elith, Lahoz-Monfort, & Guillera-Arroita, 2018). Different strategies will impact how models are weighted in a weighted ensemble, and thus potentially affect how the ensemble models predict. We suggest this is well worth exploring further for users of BIOMOD ensembles, to build understanding of optimal strategies for the range of likely applications. In addition, for users who aim to fairly compare the performance of "performance-informed" ensembles to other models but who do not have independent validation data, we suggest a "two-step" internal validation approach (Figure 6), which we observe in Marmion et al. (2009) and Meller et al. (2014). This approach involves first dividing all data into "outer" training and testing sets, and then further dividing the outer training set into "inner" training and testing sets.
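To make the two-step partitioning concrete, here is a minimal sketch in plain R (the sample size and 80/20 split proportions are our own illustrative choices, not prescriptions from the reviewed studies):

```r
set.seed(1)
n   <- 1000        # hypothetical number of species records
idx <- sample(n)   # randomly shuffle record indices

# Step 1 (outer split): reserve a test set that is never used to build,
# tune or weight the ensemble; it serves only for the final comparison
outer_test  <- idx[1:200]                      # e.g., 20% held out
outer_train <- idx[201:n]

# Step 2 (inner split): divide the outer training set again; inner test
# data provide the scores used to weight or filter individual models
inner_test  <- outer_train[1:160]              # e.g., 20% of outer training
inner_train <- outer_train[161:length(outer_train)]

# Individual models are fitted on inner_train and scored on inner_test;
# the ensemble and the individual models are then all evaluated once on
# outer_test, so "performance-informed" ensembles gain no advantage from
# the data that informed their weights
```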
We gathered limited evidence from our sampled studies on whether and how model tuning affects the performance of these types of ensembles. The lack of information on model tuning may reflect a belief among authors that when an ensemble approach is used, tuning of individual models is no longer relevant, or it may reflect the practical difficulty of implementing both tuning and ensemble procedures, as both can be complex. However, one would expect model tuning to be important because: (a) there is strong evidence that model tuning affects the performance of individual models (e.g., Anderson & Gonzalez, 2011); and (b) it has been repeatedly suggested that ensemble model performance depends on the quality of the individual models that compose the ensemble (Araújo & New, 2007; Araújo, Whittaker, Ladle, & Erhard, 2005; Marmion et al., 2009). Presumably many authors used BIOMOD defaults, either due to inexperience with the modelling methods or because they assumed these settings to be optimal. Evidence is lacking on whether these defaults are optimal. By extension, we also cannot discern from the available literature how ensembles with default tuning of individual models predict compared to well-tuned individual models, or to ensembles of well-tuned models.
To increase our knowledge of how the performance of these types of ensembles is affected by choices in the modelling process, it would be beneficial for the future ensemble SDM literature to cover these methodological choices thoroughly. We emphasise again that, while model performance is not the main focus of many SDM studies, reporting methodological details is beneficial both to the accumulation of knowledge in the area and to the reproducibility of SDM studies in general. We provide an outline of the types of information that could be documented in Figure 6. This figure is by no means an exhaustive checklist of all possible choices in an ensemble modelling workflow, especially if the workflow is very different from that of BIOMOD. However, with explicit reporting of software version and selected methodology, or ideally with published code and data, one could fully reproduce ensemble SDM studies and build a comprehensive picture of the performance of ensembles across datasets and applications. We encourage authors publishing new work to provide such details to support an enhanced understanding of these methods.

FIGURE 6 Schematic of a typical ensemble modelling workflow, including BIOMOD function calls and partitioning of data used in each stage of modelling, and items to report for transparent and reproducible documentation of the modelling process. *Partitioning data for inner training and validation ensures unbiased comparison between the performance of "performance-informed" ensembles and that of other models. When only "performance-naïve" ensembles are used (e.g., Mean or Median), this step is not necessary. **The validation set can be a held-out proportion of the original data or independently collected data. ***Currently "biomod2" does not support two-step validations within its natural workflow, so users must manually reserve the outer validation set. However, users can use the "biomod2" function "evaluate" to validate models on any data, including those held out in the validation set
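As a companion to the workflow of Figure 6, here is a sketch of the core "biomod2" (ver. 3.x) calls. The simulated data are purely illustrative, all settings shown are arbitrary choices rather than recommendations, and argument names should be checked against the documentation of the installed package version:

```r
library(biomod2)

# Simulated presence/absence data, coordinates and predictors (illustrative)
set.seed(1)
n      <- 200
myResp <- sample(c(0, 1), n, replace = TRUE)            # presence/absence
myXY   <- cbind(runif(n, -5, 5), runif(n, 40, 50))      # site coordinates
myExpl <- data.frame(bio1 = rnorm(n), bio12 = rnorm(n)) # predictors

# Format the data for modelling
bm_data <- BIOMOD_FormatingData(resp.var = myResp, expl.var = myExpl,
                                resp.xy = myXY, resp.name = "my.species")

# Fit individual models, holding out 20% of records for initial validation
bm_models <- BIOMOD_Modeling(bm_data, models = c("GLM", "GBM", "RF"),
                             NbRunEval = 3, DataSplit = 80,
                             models.eval.meth = c("ROC", "TSS"))

# Combine the individual models into ensembles; supplying a value for
# eval.metric.quality.threshold would exclude poorly performing models
bm_ensemble <- BIOMOD_EnsembleModeling(bm_models, eval.metric = "TSS",
                                       eval.metric.quality.threshold = NULL,
                                       prob.mean = TRUE, prob.median = TRUE,
                                       committee.averaging = TRUE,
                                       prob.mean.weight = TRUE)
```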