A comparison in species distribution model performance of succulents using key species and subsets of environmental predictors

Abstract Identifying the environmental drivers of the global distribution of succulent plants using the Crassulacean acid metabolism pathway of photosynthesis has previously been investigated through ensemble‐modeling of species delimiting the realized niche of the natural succulent biome. An alternative approach, which may provide further insight into the fundamental niche of succulent plants in the absence of dispersal limitation, is to model the distribution of selected species that are globally widespread and have become naturalized far beyond their native habitats. This could be of interest, for example, in defining areas that may be suitable for cultivation of alternative crops resilient to future climate change. We therefore explored the performance of climate‐only species distribution models (SDMs) in predicting the drivers and distribution of two widespread CAM plants, Opuntia ficus‐indica and Euphorbia tirucalli. Using two different algorithms and five predictor sets, we created distribution models for these exemplar species and produced an updated map of global inter‐annual rainfall predictability. No single predictor set produced markedly more accurate models, with the basic bioclim‐only predictor set marginally out‐performing combinations with additional predictors. Minimum temperature of the coldest month was the single most important variable in determining spatial distribution, but additional predictors such as precipitation and inter‐annual precipitation variability were also important in explaining the differences in spatial predictions between SDMs. When compared against previous projections, an a posteriori approach correctly does not predict distributions in areas of ecophysiological tolerance yet known absence (e.g., due to biotic competition). An updated map of inter‐annual rainfall predictability has successfully identified regions known to be depauperate in succulent plants. High model performance metrics suggest that the majority of potentially suitable regions for these species are predicted by these models with a limited number of climate predictors, and there is no benefit in expanding model complexity and increasing the potential for overfitting.

insight into the fundamental niche of succulent plants in the absence of dispersal limitation, is to model the distribution of selected species that are globally widespread and have become naturalized far beyond their native habitats. This could be of interest, for example, in defining areas that may be suitable for cultivation of alternative crops resilient to future climate change. We therefore explored the performance of climate-only species distribution models (SDMs) in predicting the drivers and distribution of two widespread CAM plants, Opuntia ficus-indica and Euphorbia tirucalli. Using two different algorithms and five predictor sets, we created distribution models for these exemplar species and produced an updated map of global inter-annual rainfall predictability. No single predictor set produced markedly more accurate models, with the basic bioclim-only predictor set marginally out-performing combinations with additional predictors. Minimum temperature of the coldest month was the single most important variable in determining spatial distribution, but additional predictors such as precipitation and inter-annual precipitation variability were also important in explaining the differences in spatial predictions between SDMs. When compared against previous projections, an a posteriori approach correctly does not predict distributions in areas of ecophysiological tolerance yet known absence (e.g., due to biotic competition). An updated map of inter-annual rainfall predictability has successfully identified regions known to be depauperate in succulent plants. High model performance metrics suggest that the majority of potentially suitable regions for these species are predicted by these models with a limited number of climate predictors, and there is no benefit in expanding model complexity and increasing the potential for overfitting.

| INTRODUC TI ON
Identifying the environmental conditions under which a species can thrive is an important question in biogeography and ecology both to understand the environmental tolerances of individual organisms and to be able to predict their distributions across current and future climates. Many parts of the world are likely to experience warmer climates and reduced and/or more variable precipitation in the decades ahead, so there is interest in determining which organisms may be relatively well adapted to these future climate regimes. A group of plants that are particularly characteristic of warm, semi-arid parts of the world with strong seasonal rainfall patterns are succulents using the specific mode of photosynthesis known as Crassulacean acid metabolism (CAM). By virtue of being able to fix most of their carbon dioxide from the atmosphere at night rather than during the day time, CAM plants typically show high water-use efficiency and can survive in environments with high daily temperatures and relatively limited water availability (Cushman, 2001;Lüttge, 2010;Osmond, 1978;Winter, 1985;Winter & Smith, 1996). The environmental resilience of these plants makes them attractive species for cultivation on marginal land for a variety of potential uses, such as fodder, bioethanol production, or as feedstock for anaerobic digestion (Acharya et al., 2019;Borland et al., 2009;Davis et al., 2011;Hastilestari et al., 2013;Holtum et al., 2011;Loke et al., 2011;Mason et al., 2015;Mwine et al., 2013;Yan et al., 2011). Such crops may be of particular value in semi-arid regions most likely to experience increased drought risk (e.g., Marthews et al., 2019;Otto et al., 2018).
The growth and the ecophysiological controls on the natural distribution of CAM species have been widely studied and observed across a range of environments. Broadly speaking, the methods previously used to observe the distribution of specific CAM species can be split into those that are: observation based; growth/trial based; and those that are based on models-both process and data-driven (Ringelberg et al., 2020). However, a comparison of the importance in different environmental parameters and derived indices in explaining the variability in CAM plant distribution has not yet been completed. Using existing studies published in the literature it is possible to compare areas of expected growth and productivity suitability (i.e., the locations with the environmental conditions required for specific species growth) (Guisan et al., 2017) based on processbased models (e.g., Owen et al., 2015) or using climatic envelope methods (e.g., Louhaichi et al., 2015). However, there is also the potential to use methods based on derived environmental parameters and those driven by a posteriori models (e.g., species distribution modeling (Guisan et al., 2017)) to identify the relationship between known observations of CAM species and predictor variables; thus projecting maps of suitable biotic conditions for species to occur based on climatological, environmental, and/or biotic correlations (Aguirre-Gutiérrez et al., 2013;Soberón & Nakamura, 2009).
Correlative species distribution models (SDMs) have been commonly employed as predictive tools to quantify relationships between species occurrence datasets and measurements of environmental variables (Dormann et al., 2012) across ecology, but seldom applied to the specific mapping of CAM plants. Equally, as noted by Bucklin et al. (2015), there remains no consensus on which variables should be included as predictors in SDM analysis more generally. While many climate-only SDMs (i.e., using only climatic parameters) have been highlighted as important tools for both projecting current and future ecological niches (e.g., for guiding future conservation efforts   (Bucklin et al., 2015)), some studies have criticized this approach for providing only an incomplete representation of complex environmental systems (Araújo & Peterson, 2012;Bahn & McGill, 2007;Beale et al., 2008;Heikkinen et al., 2006).
Using different combinations of bioclimatic and derived environmental indices, this study tests and compares the relative importance of parameters in explaining the distribution of CAM plants, focusing specifically on Opuntia ficus-indica (L.) Mill. And Euphorbia tirucalli L.
as example species. In doing so, this study attempts to define the best, minimal predictors of plant distribution so that models have the greatest predictive power without being over-parameterized (Merow et al., 2014;Raes & Aguirre-Gutiérrez, 2018).
Unlike recent analyses which have ensemble-modeled numerous species with the aim of identifying the wider natural succulent biome distribution (e.g., Ringelberg et al., 2020), this study takes an alternative approach by selecting a minimal number of species of interest, but for which their distribution is successfully wide, occupying all available climatic niches, and with minimal dispersion limitations. There are numerous rare succulent species that have very restricted ranges on account of being dispersal-limited for which this analysis would not be appropriate. By comparison, O. ficus-indica is a successful invasive species having established itself across every continent (except Antarctica) (CABI, 2019) and found across all latitudes. Opuntia ficus-indica and E. tirucalli have also shown great potential suitability for bioeconomic uses Mason et al., 2015); and are therefore suitable test species to use for this analysis which is interested in exploring the possibility for these plants to be actively grown as a crop-highlighting the potential that can be achieved with CAM plantation for bioeconomic and land restorative purposes. While most previous distribution modeling exercises have been built on the natural distribution of native species, additional novel information might be obtained from explicitly considering the extent

T A X O N O M Y C L A S S I F I C A T I O N
Agroecology; Applied ecology; Biogeography; Ecophysiology and spread of introduced invasive species, once they are given the opportunity to spread into other parts of the "potential niche." Specifically, this study will compare different sets of variables to predict zones of potential suitability for Opuntia ficus-indica and Euphorbia tirucalli growth. In doing so, this study aims to first predict the current locations with suitable biotic conditions for the occurrences of O. ficus-indica and E. tirucalli using different SDMs tested in this study. Second, the results will help identify the most important set of variables that help define the environmental niche of two CAM species of interest. While the natural distribution of both species has generally been restricted to semi-arid regions as outcompeted by other plants, their natural ecological requirements permit them growing in wetter areas, and competition factors have largely restricted the spread of the species to regions with annual rainfall <500 mm (Luttge, 2004). Opuntia ficus-indica is a successful invasive which has been widely sighted across regions outside of central America (e.g., Africa, southern Europe), while E. tirucalli is native to Africa (Palgrave, 1977;Webb et al., 1984) but has also been found in central America, Europe, and other locations globally. Given the successful expansion, but different origins of these two species, comparison of the potential regions through which they could be successfully cultivated for bioeconomic (e.g., biogas) uses across a region (e.g., sub-Saharan Africa) with low levels of energy access, increased agricultural pressure in the face of drought, and high climatic suitability for these species is particularly interesting (Buckland & Thomas, 2021). For this reason, this study will initially calibrate and project models based on a global view, before taking a deeper focus on Africa as a potential region for cultivation, bioenergy and bioeconomic uses.

| MATERIAL S AND ME THODS
Using SDM techniques, this study compares the relative performance of five SDMs to predict the potential distribution of O. ficusindica and E. tirucalli based on current climatic conditions. The five SDMs each capture different combinations of environmental variables defined in the WorldClim 2.1 bioclim database (Fick & Hijmans, 2017) and derived indices or parameters that have previously been cited as impacting upon the spatial distribution of CAM plants: the Hellmann-Eberle quotient (a measure of inter-annual rainfall predictability used by Ellenberg, 1981), the aridity index (the ratio between annual precipitation and potential evapotranspiration (PET)), cloud cover (as a proxy for light intensity), and the R-index (the ratio between actual and PET) (Yao, 1974). As noted in Title and Bemmels (2018), the inclusion of more complex climatic indices may characterize environmental conditions that are more directly physiologically relevant to particular species than more primary climatic parameters (e.g., temperature, precipitation). Due to the successful invasive nature of both species, we have considered their expansion to be largely limited by environmental conditions rather than distribution-limited, and thus only climatic-based parameters have been used.

| Predictor datasets
The choice of environmental variables selected should ideally be based on the known ecology of the species (Title & Bemmels, 2018), as this has previously demonstrated more realistic SDMs (Rödder et al., 2009;Saupe et al., 2012). With this in mind, a combination of bioclim datasets from the WorldClim 2.1 catalogue (Fick & Hijmans, 2017) and derived environmental metrics were compiled and a sensitivity analysis (Pearson's Correlation) was used to remove highly correlated variables. Inclusion of co-variant parameters leads to over-parameterization of the model. All predictor datasets were bilinearly resampled to the same 2.5 min resolution.

| Bioclim datasets
Based on existing research of the parameters impacting the growth and distribution of succulents and CAM plants more generally (Acharya et al., 2019;Inglese & Scalenge, 2009;Le Houérou, 1996;Louhaichi et al., 2015;Masocha & Dube, 2018), and the results from covariance testing (Appendix A), four bioclim variables were selected for use as explanatory parameters ( Table 1).

| Hellmann-Eberle quotient
The Hellmann-Eberle quotient provides a measure of inter-annual precipitation variability and is defined as the ratio between precipitation of the wettest year and precipitation of the driest year over an extended period of time. Ellenberg (1981) examined the distribution pattern of tall stem succulents in relation to climate and found that they tended to occur in areas where rainfall was low (i.e., <500 mm per annum), but regularly received (i.e., where

Bioclim 15
Precipitation seasonality (coefficient of variation) TA B L E 1 Bioclim parameters (from Fick & Hijmans, 2017) used in the final model iterations the Hellmann-Eberle quotient <5 over a series of years) (Cowling et al., 1997). Ellenberg's original study from 1981 was based on 35 years of observations  and has since been referred to and expanded in more recent studies exploring the controls on CAM distribution (e.g., Holtum et al., 2016Holtum et al., , 2017Lüttge, 2010;Ringelberg et al., 2020). For this reason, combining the Hellmann-Eberle quotient with other bioclimatic parameters in the SDM analysis has the potential to improve our distributional understanding of key species of interest.

| Aridity index, R-index and cloud cover
The Aridity Index (AI) is commonly considered to provide a measure of overall water availability, a central component to all vegetative growth. Based on global raster data from 1970 to 2000 AD, a global aridity index based upon the implementation of the Penman-Monteith reference evapotranspiration equation (Allen et al., 1998) was used in this study (Trabucco & Zomer, 2018). The R-index is calculated as the ratio between actual evapotranspiration (AET) and PET and is a measure of plant water supply in relation to plant water demand (Yao, 1974). A global R-index raster was calculated using the average annual AET and PET rates available via the Consultative Group for International Agricultural Research (Trabucco & Zomer, 2018). Finally, as a proxy for photosynthetically active radiation, cloud cover was included as a potential parameter that could be inversely related to plant growth. CAM plant growth shows a saturation-type relationship to light intensity (Nobel, 1988;Nobel & Valenzuela, 1987) with the three main environmental limitations on CAM plant growth considered water, light, and temperature (Nobel, 1988;Owen et al., 2015). Process-based models have thus included a proxy for light intensity as a measure to predict the variability in spatial productivity of CAM plant species in existing literature (e.g., Owen et al., 2015). In this study, a global raster of mean annual cloud

| Occurrence data
Opuntia ficus-indica and Euphorbia tirucalli were the two species of interest selected for analysis in this study. The former is an especially suitable test species for this analysis since its occurrences are already occupying most of its geographic range allowing us to model a potential distribution closer to its fundamental niche (i.e., all the environmental conditions where a species could potentially exist) as opposed to the realized niche (i.e., those conditions in which the species currently does exist) (Chase & Leibold, 2003;Hutchinson, 1957). By comparison, often the current distributions of localized or very rare species are restricted by dispersal limitations and species interactions; in such cases the realized niche will be smaller than the fundamental niche, and we cannot independently test the impact of different climatic and environmental parameters on defining areas suitable for species occurrence.
Opuntia ficus-indica and E. tirucalli occurrence data were downloaded from the GBIF data repository (GBIF.org, 2020) (Accessed 09/06/2020) and cleaned according to the method described in Zizka (2019). Species occurrence data from both the native and introduced ranges was used for both species. One of the main aims of this study is to identify regions which could support the cultivation of these species under current climatic conditions (i.e., to map the fundamental niche of the species). As such, we do not need to limit the training dataset to the native distribution, rather observations of the species across a range of geographic zones are useful in identifying the scope of environmental settings which are suitable.

| Pseudo-absences
Unlike "presence" datasets, "absence" datasets are not often readily available. Since some SDM algorithms require both datasets, pseudo-absence (PA) datasets are created as a replacement for true absence records (Raes & Aguirre-Gutiérrez, 2018

| Model fitting
There are numerous options for algorithms to use in SDM studies (summarized in Raes & Aguirre-Gutiérrez, 2018), but there is often no model of "best" choice (Qiao et al., 2015).

| Evaluating model comparison
As well as TSS and AUC (ROC) scores calculated for each of the individual models, the TSS and AUC scores of the ensemble models were compared to determine the relative best performing model and identify whether the additional parameters used in SDMs 2-5 increased the predictive accuracy of SDM 1 (bioclim-only predictors). As discussed in Komac et al. (2016), the AUC provides us with a measure of the performance of ordinal score models and a threshold measure of accuracy (Thuiller et al., 2005), while the TSS score provides us with a measure of evaluative performance which has all the advantages associated with the Cohen's kappa statistic (Cohen, 1968) but is not

F I G U R E 2 Final 1085
Euphorbia tirucalli occurrences downloaded from the GBIF dataset (GBIF.org, 2020) after spatial bias analysis completed sensitive to prevalence (Allouche et al., 2006). Ensemble models from the five SDM scenarios were initially projected on to the world to generate a continuous map showing variations in the suitability/ probability of occurrence for the two species of interest. Then, using the ensemble model cut-off values to provide a binary measure of habitat suitability, projections were then compared against projections based on existing methods from the literature (e.g., Louhaichi et al., 2015) to identify the spatial variability in identified suitable regions between the methods. Ensemble binary cut-off values are calculated as those that give the maximum "sensitivity" and "specificity" scores (Thuiller et al., 2005).

| Assessing variable importance
Individual variable importance was approximated using the

| Ensemble model projections and comparisons
A total of 500 individual models and projections were produced for each species and ensembled to produce weighted mean projections with coefficient of variation (uncertainty between the individual projections) measurements for each of the scenarios. Ensemble results from SDM scenario 1 are presented in Figures 3 and 4, with TSS scores across the five ensembles shown in Table 3 When the deviation in environmental suitability is compared between SDM scenarios ( Figures 5 and 6), the inclusion of either the Hellmann-Eberle quotient, aridity index, or R-index all produced overall results with lower suitability projections than those predicted using bioclim variables alone (SDM 1). It is only in SDM 4 (Figures 5c and 6c) that ensemble model projections suggest that some regions (typically those with reduced overall certainty) have a higher level of environmental suitability than projections based on the four bioclim variables alone. However, these results are not necessarily corroborated when we consider the binary cutoff values at a regional scale for example (i.e., where maximum specificity and sensitivity are achieved) and the results are presented as either "suitable" or "unsuitable" areas (

| Environmental variable importance
Results from individual variable importance analysis were calcu-  (1992) who consider low temperatures a key limiting factor in succulent growth when referring to succulent growth on hill slopes in Tenerife. The relatively minor variation in overall model performance between the SDMs with and without the additional parameters is also in agreement with the results noted by Bucklin et al. (2015), who have suggested that climate-only predictor sets may be equally as effective in producing environmental suitability maps.
Following the role of extreme cold temperatures, moisture availability measured either through annual precipitation or the aridity index or R-index is shown to be the second most important independent variable on overall model performance. When an alternative precipitation metric is included in the model (i.e., SDM scenarios 3 and 5), the relative importance of annual precipitation is reduced. The compound variable, aridity index, is defined as the ratio between annual precipitation and PET-reflecting the amount of moisture potentially available for vegetation growth.
Equally, the R-index as calculated as the ratio between AET and PET, provides a measure of water supply in relation to water demand (Yao, 1974); unsurprising that the relative importance of annual precipitation as an individual metric is reduced when considered alongside these compound variables. However, it is also worth noting that the R-index used in this study (derived from AET and PET datasets (Trabucco & Zomer, 2018)) is based on spatially standardized vegetation and soil coefficients (i.e., based on typical n/a n/a n/a n/a 2 1% 70% 17% 4% 8% n/a n/a n/a 3 2% 74% 5% 4% n/a 15% n/a n/a 4 1% 78% 13% 5% n/a n/a 4% n/a 5 1% 75% 5% 4% n/a n/a n/a 14% TA B L E 6 Standardized mean variable importance of each parameter across the five different species distribution model (SDM) scenarios for E. tirucalli

Annual precipitation
Precipitation seasonality Hellmann Eberle quotient Aridity index Cloud cover R-index 1 2% 71% 26% 1% n/a n/a n/a n/a 2 2% 68% 23% 1% 7% n/a n/a n/a 3 2% 74% 3% 1% n/a 19% n/a n/a 4 1% 82% 14% 1% n/a n/a 2% n/a 5 2% 76% 4% 1% n/a n/a n/a 18% agronomic crops at maturity and an average soil texture for plant rooting depth at 2 m). Variations in both the vegetation and soil stress coefficients specific to the characteristics of the species of interest would perhaps produce a more useful spatial representation and metric to test.
Moreover, it is important to note that the variable importance This being said, while the results in the spatial deviation of individual SDM scenarios from SDM 1 projections (Figures 5 and 6) suggest variation in the continuous likelihood profiles, binary cut-off levels ( 1) across Africa, the potential yields will vary within these locations/ SDM projections and hence a combination of both the continuous scale likelihood and the binary cut-off values is useful in assessing the true scale of potential niche that could be used for growing these species.

| Land suitability estimates
A key advantage of the SDM approach is the capacity to produce a more refined estimate of land area that is potentially available, after ficus-indica SDM 1 model with existing methods from previous literature focusing specifically on Africa as an example region. Figure 7 presents the comparison of the land suitability estimates found in this study following SDM 1 (binary cut-off) and the predicted suitable areas for O. ficus-indica growth according to the parameters detailed in Louhaichi et al. (2015), and the adapted productivity index displayed in Owen et al. (2015). where Figure 7c suggests a zone of high productivity. This is a good demonstration that our approach has taken the "competi-

| Updated Hellmann-Eberle quotient map
While minimum temperatures were demonstrated as key in determining the majority of the variability in spatial distribution of the species, analysis of an updated Ellenberg index (Hellmann-Eberle quotient combined with average annual precipitation) also highlighted the importance of precipitation predictability in the distribution of succulents. As noted above, Ellenberg (1981) examined the distribution pattern of tall stem succulents in relation to climate (von Willert et al., 1992) and found that they tended to occur in areas where rainfall was low (i.e., <500 mm), but regularly received (Hellmann-Eberle quotient <5 over a long series of years) (Cowling et al., 1997). Since Ellenberg's original study, which was based on precipitation data from 1905 to 1940, further studies have also explored the predictability of rainfall as a parameter by which to explain succulent distributions (Holtum et al., 2016;Ringelberg et al., 2020). As part of this study, an updated global Hellmann-Eberle quotient based on a longer time-series of monthly precipitation data from 1960 to 2018 was used as a predictor parameter for the ensemble model. In addition to use in the ensemble modeling, the updated map of a revised "Ellenberg index" shown in Figure 8 provides further valuable discussion to unresolved problems regarding succulent distribution. The near absence of stem succulents from arid Australia, for example, is one particular example which has invited discussion among research groups (Holtum et al., 2016;Ringelberg et al., 2020). While Ellenberg (1981)  America the Atacama. This updated visualization based on a longer time series than previously studied suggests that perhaps high variability in annual precipitation levels over the long term is key to explaining succulent absence, such as the lack of endemic terrestrial species with CAM in arid Australia.

| CON CLUS IONS
In comparison with existing methods of land suitability estimation for these species, this study has taken an a posteriori modeling approach using SDMs and known occurrences to extrapolate wider areas of potential suitability for cultivation of these species. In doing so, it has allowed us to qualify the models of suitability estimates with a level of evaluative performance, incorporates the nuances and complexities of relationships between environmental parameters and known occurrences, and produce a more refined estimate of Funding acquisition (equal); Investigation (supporting); Supervision (supporting).

ACK N OWLED G EM ENTS
This research was funded by the Oxford Martin School as part of the Dryland Bioenergy project. We would also like to extend our gratitude to the two anonymous reviewers for providing valuable feedback on an earlier draft of this manuscript.

CO N FLI C T O F I NTE R E S T
The authors declare no competing interests.

DATA AVA I L A B I L I T Y S TAT E M E N T
The authors confirm that the results supporting the findings of this study are available within the article and its supplementary materials. Raw data and code (R script) to reproduce the results are available at Dryad: https://doi.org/10.5061/dryad.wwpzg msmt.