Using a multi‐model ensemble approach to determine biodiversity hotspots with limited occurrence data in understudied areas: An example using freshwater mussels in México

Abstract Species distribution models (SDMs) are an increasingly important tool for conservation particularly for difficult‐to‐study locations and with understudied fauna. Our aims were to (1) use SDMs and ensemble SDMs to predict the distribution of freshwater mussels in the Pánuco River Basin in Central México; (2) determine habitat factors shaping freshwater mussel occurrence; and (3) use predicted occupancy across a range of taxa to identify freshwater mussel biodiversity hotspots to guide conservation and management. In the Pánuco River Basin, we modeled the distributions of 11 freshwater mussel species using an ensemble approach, wherein multiple SDM methodologies were combined to create a single ensemble map of predicted occupancy. A total of 621 species‐specific observations at 87 sites were used to create species‐specific ensembles. These predictive species ensembles were then combined to create local diversity hotspot maps. Precipitation during the warmest quarter, elevation, and mean temperature were consistently the most important discriminatory environmental variables among species, whereas land use had limited influence across all taxa. To the best of our knowledge, our study is the first freshwater mussel‐focused research to use an ensemble approach to determine species distribution and predict biodiversity hotspots. Our study can be used to guide not only current conservation efforts but also prioritize areas for future conservation and study.


| INTRODUC TI ON
Understanding processes and constraints influencing the distribution and abundance of species is a fundamental goal of basic and applied ecological research (Austin, 2002;Jones & Lawton, 2012;Poff, 1997). Of particular interest over the last few decades has been understanding how the spatial and temporal configuration of habitat shapes ecological processes (Newton et al., 2008). In lotic systems, habitat is viewed as a hierarchy with stream segments, reaches, mesohabitats, and microhabitats nested within a watershed. These habitat levels are simultaneously parts and wholes (Miller, 2008), such that each level shapes the environment of all habitat levels nested within it (Frissell et al., 1986). Poff (1997) recognizing the hierarchical nature of environmental constraints and the role it plays in shaping the distribution and abundance of lotic species developed a conceptual framework contrasting species traits with multi-scale habitat data after. The resulting model acknowledged that each habitat level has its own functional and structural properties shaped by the previous level, which served as environmental filters of species traits over varying spatiotemporal periods. In practice, this idea suggests a species can only be present at one level of habitat (i.e., mesohabitat scale) if it is able to pass through preceding habitat levels, each having its own unique properties that filter taxa lacking certain prerequisite traits (Poff, 1997;Southwood, 1977Southwood, , 1988. Inoue et al. (2017) demonstrated this point by evaluating the distribution of freshwater mussels and fish in central and northern Europe, where species occupancy was determined, in part, by a suite of nested environmental variables ranging from landscape to the mesohabitat scale. Similar findings have been observed for other taxa, such as reef fish assemblages (López-Pérez et al., 2013), forest plant communities (Kolb & Diekmann, 2004), and wetland species (Quesnelle et al., 2013). These studies suggested a hierarchical perspective of habitat configuration might be useful to inform basic and applied ecological research.
Species distribution models (SDMs) have become an increasingly useful tool for understanding how habitat filters (i.e., land use, topography, and climate) shape species distributions. Because of this utility, SDMs have been widely applied to estimate species ranges, identify environmental factors shaping distribution, abundance patterns, determine areas of conservation importance, and predict ranges under past and future environmental scenarios (Daniel et al., 2018;Dormann, 2007;Thuiller, 2003;Wilson et al., 2011). SDMs are rooted in the concepts presented by Poff (1997), as well as niche theory, which proposes the distribution of a species is related to its ability to cope with varying environmental conditions (Grinnell, 1917;Hutchinson, 1957;Peterson et al., 1999). SDMs work by relating environmental and biological data to species presence to determine habitat limits, which are subsequently used to predict occupancy or environmental suitability across a geographic area (Buckley et al., 2010;Guisan & Thuiller, 2005). Because SDMs are based on multi-scaled habitat data, predicted ranges for a given species represent geographic areas across time and space that permit species occurrence.
Ensemble SDMs (ESDMs), a type of SDM, have become an increasingly popular approach to predicting the occupancy of rare species (De Marco & Nóbrega, 2018;Sousa-Silva et al., 2014).
Ensemble modeling works by taking individual SDMs (e.g., MAXENT, Random Forest, Boosted Regression Trees, etc.) and averaging their output into one final prediction, which greatly reduces the generalization error of single model approaches (Araújo & New, 2007;Hao et al., 2019;Seni & Elder, 2010). Individual models can be combined in a variety of ways, with the simplest being the mathematical mean or median (Hao et al., 2019), while more complex options include weighted averages, or scaling predictions based on model evaluation statistics (Araújo & New, 2007;Hao et al., 2019;Marmion et al., 2009). The combined predictions, regardless of the approach, are often better than standalone SMD methods. Hao et al. (2019) found through a meta-analysis of SDM studies that ensemble model performance was generally higher than individual models. Elder and Lee (1997) comparing the model potential of ensembles versus single model type found predicted occupancy was similar between ensemble and single models, but ensembles had lower out-of-the-bag error (mean prediction error of training data). Seni and Elder (2010), building off Elder and Lee (1997), noted ensemble models were generally better than single models because of weighted averages, model selection, and variable pruning to optimize total model performance.
Additionally, ESDMs distribute individual model type bias more evenly and reduce overprediction bias based on data types (i.e., continuous, binomial, and categorical), which creates less disjointed and abrupt predicted ranges compared to single-model methods.
In this study, we use ESDMs to predict distributions of an imperiled and understudied group of freshwater bivalves native to central and southern México. This region harbors some of the most biologically diverse freshwater streams in the world (Graf & Cummings, 2021a;Smith & Bermingham, 2005). The Pánuco River Basin (hereafter PRB), located in East Central México, is one of these diverse drainages, and contains more than 95 species of fish, dozens of species of freshwater mollusks (gastropods and bivalves), and a diverse collection of other macroinvertebrates (Martínez-Lendech et al., 2020). Included among the mollusks are 14 species of native freshwater mussels, eight of which are endemic to the Pánuco River Basin and one introduced species. The status and distribution of freshwater mussels in the PRB, including factors that contribute to their persistence, remain unknown. As a result, it is unclear how habitat at various spatial and temporal scales shapes the distribution and abundance of the mussel fauna in the PRB or how species traits facilitate occupancy at those various scales. It is also unclear if species distributions and composition have changed overtime. The lack of accurate distribution and composition data can be problematic and will likely negatively affect the conservation and management of the mussel fauna in this region.
To begin addressing the knowledge gaps for mussels within the PRB, we evaluated the role of habitat in shaping mussel occurrence across the landscape. We also provide baseline information for a fauna that is poorly known in a region considered as one of the 25 major global biodiversity hotspots in need of conservation, protection, and further study (Myers et al., 2000). Additionally, we assess the potential of ESDMs for predicting occurrence with limited environmental and presence data for conservation and management of species. The specific objectives of our study were to (1) use SDMs and ESDMs to predict the distribution of freshwater mussels in the PRB; (2) determine habitat factors shaping mussel occurrence in the PRB; and (3) use predicted occupancy across a range of taxa to identify mussel biodiversity hotspots to guide conservation and management.

| Study area, focal taxa, and environmental data
The PRB is the 3rd largest river basin by size (98,227 km 2 ) in México and contains three major sub-basins, the Moctezuma (42,726 km 2 ), Tamuin (also called the Tampaón) (33,260 km 2 ), and Tamesí (19,127 km 2 , Hudson et al., 2005). The Pánuco River is the primary stream within the PRB and is considered the 10th longest river in México (510 km) and the 4th largest by discharge (20.3 billion m 3 /annually, Hudson, 2003. The Pánuco River is formed from the Moctezuma and Tamuin rivers after crossing the Sierra Madre Oriental and joining with the Tamesí River north of Tampico, México, before discharging into the Gulf of México (Hudson et al., 2005). The PRB is located primarily in a subtropical climate zone with annual precipitation of approximately 30 cm near the headwaters, increasing to 240 cm near the coast (Hudson, 2003). Rainfall is typically greatest from May to October, with July and September being the wettest months. Temperature varies greatly with elevation, but on average ranges from 15°C in January to 24°C in June (Hudson, 2003).
Land use within the PRB is largely farming and ranching with primary crops being sugarcane, citrus, and coffee (Hudson et al., 2005).
Occurrence data includes recent and historical observations collected by Texas A&M University, the Illinois Natural History Survey, Urbana-Champaign, MUSSELp (http://musse l-proje ct.uwsp.edu/), and U.S. Geological Survey. Nomenclature for mussels followed Graf and Cummings (2021a). We focused on 15 species found in the PRB ( Figure 1) whose species relationships and distribution remain unresolved (Graf & Cummings, 2021a). Occurrence data for our focal species came from surveys conducted in 2017 and 2018 (Inoue et al., 2020) and from data aggregated by MUSSELp (Graf & Cummings, 2021b unique observations were omitted due to poor model success. To model the distribution of our focal species, we used climate, elevation, and land use data. We chose these variables because previous studies have shown all three are important determinants of mussel occupancy (Gama et al., 2016;Hopkins, 2009;Santos et al., 2015) and because these data were readily available within our remote study area. Climate data were obtained from the Worldclim bioclimatic database (Fick & Hijmans, 2017) and include data on the minimum and maximum levels, mean values and ranges, and quarterly summaries for temperature and precipitation on a global scale (Hijmans et al., 2005;Maria & Udo, 2017;Nix, 1986). We also used air temperature as a proxy for water temperature (Caissie, 2006;Mohseni & Stefan, 1999). Elevation data were taken from WorldClim, and land cover data were obtained from DIVA-GIS (Hijmans et al., 2004), both of which were taken at the same resolution and extent as the Worldclim bioclimatic data. Percent slope changes were calculated from the elevation data set using the terrain function in the "raster" package of R (Hijmans, 2020). Multicollinearity was

| Modeling protocol
To predict mussel occurrence in the PRB, we used an ensemble modeling approach. In our analysis, we used the following models, which are all available within the "SDM" package in R (Naimi & Araújo, 2016) To train the individual models, we used 80% of the occurrence records for a given species to develop predictions, which were then tested against the remaining 20% of data. To improve parameter estimates, each training and testing group was randomly resampled using 100 bootstrap replicates. Both individual model methods that were used to create the final ensemble model and the ensemble model for each species were assessed using the true skill statistic (TSS) and area under the curve (AUC) of the receiver operating characteristic (ROC). TSS takes into consideration both omission (false absences) and commission (false presences) errors and is unaffected by prevalence (Allouche et al., 2006). TSS ranges from 0 to 1, and values from 0.2 to 0.5 indicate poor model fit, those values from 0.6 to 0.8 denote adequate model fit, and values greater than 0.8 are considered excellent model fit (Coetzee et al., 2009). The ROC shows the classifying performance based on a threshold parameter (Fielding & Bell, 1997;Phillips et al., 2004). It plots the true positive rate against the false-positive rate relative to each threshold, creating a curve of expected outcomes. The area under the ROC, or AUC is the probability of a model classification correctly predicting the outcome (i.e., presence vs. absence). Models with AUC values <0.5 are considered worse than random, values from 0.5 to 0.7 are considered poor, 0.7-0.9 are considered fair, and values >0.9 are considered excellent fit (Swets, 1988). In addition to model performance, the variable of importance, which identifies the variable that contributes the most to model accuracy, was also assessed using To map the predicted presence of a given species within a cell (30 arc-second grid), optimum threshold (OT) values for the species were created from the weighted average based on the maximum (sensitivity + specificity) of all species-specific models. Predicted values of a cell greater than the OT of a species were given a value of 1 to indicate predicted presence, and 0 for predicted absence.
These binomial presence/absence ensemble maps for each species were then stacked by summing the collected outputs to create local richness maps which denote the number of predicted species within a given cell. All statistical analysis was performed using R version 4.1.1 (R Core Team, 2021).

| RE SULTS
A total of 11 species with 621 species-specific observations at 87 unique sites were used in our ensemble modeling after removing species with small sample sizes (i.e., n ≤ 10: U. imbecillis (2), P. semigranosa  For all individual type model outputs (Figure 2), model performance was fair to excellent based on testing parameters (Table 2).

| DISCUSS ION
In this study, we modeled distributions of some of the mussel species found in the PRB, understudied drainage in East Central México, using a limited suite of available environmental and climatic variables. We accomplished this using an ensemble approach, wherein multiple SDM methodologies were combined to create a single ensemble map of predicted occupancy. To the best of our knowledge, this is the first study to use an ensemble approach to determine the species distribution of freshwater mussels and predict biodiversity hotspots. Of the variables assessed in our study, precipitation during the warmest quarter, elevation, and mean temperature during the driest quarter were consistently the most important discriminatory environmental variables among species, whereas land use had limited influence across all taxa. We suggest these findings can be used to guide current conservation efforts and prioritize objectives for future conservation planning for freshwater mussels in the PRB.

| Ensemble species distribution models
SDMs are an important component of conservation and natural resource management for the game and non-game species (Elith & Leathwick, 2009;Randklev et al., 2015;Sherwood et al., 2018).

In the United States, the U.S. Fish & Wildlife Service (USFWS) has
used ensemble models to support species conservations status assessments because of its predictive power, accuracy, and ability to handle understudied species and/or areas with limited environmental data (Breiner et (Hao et al., 2020). This issue is of particular note with the current study, where both occurrence data and information on each taxon were limited. Marini et al. (2010) found that ESDMs could be used to predict the potential distribution of seven tropical bird species with as few as 10 observations. Our results mirror those by Marini et al. TA B L E 2 Individual species distribution model performance for each species of freshwater mussels found within the Pánuco River Basin, México  Conditions in the landscape are known to affect instream habitat, which in turn, can shape species occupancy (Daniel & Brown, 2013;Newton et al., 2008;Poff, 1997). In our study, precipitation during the warmest quarter or mean temperature during the driest quarter was the most important predictor of habitat suitability for 6 of the 11 species assessed. Specifically, we found areas with higher precipitation during summer months harbored greater mussel diversity compared to areas where summer precipitation was reduced. This relationship likely illustrates the effect of low flows on mussels during thermally stressing events such as periods of low precipitation. It is well known that elevated water temperatures can affect mussel survival (Khan et al., 2019(Khan et al., , 2020Pandolfo et al., 2010), and these effects often occur during periods of reduced flow (Archambault et al., 2014). Reduced flows are inherent to a rivers flow regime and are biologically important (Biggs et al., 2005;Bovee, 1986;Poff et al., 2006), but they can become problematic during periods of reduced precipitation and/or overuse by humans (Golladay et al., 2004;Randklev et al., 2018). Golladay et al. (2004) found significant declines in mussel fauna where a stream reaches ceased to flow, which was exacerbated by increased irrigation draws during drought. For the PRB, it remains unknown if low precipitation, water consumption, or a combination of both are contributing to mussel distribution or absence. However, given the role both have played in mussel declines elsewhere in North America, our finding should serve as a warning to conservationists and resource managers in this basin.
We also found that elevation was an important determinant of occupancy for all 11 species in this study. The elevation is known to shape mussel occurrence by affecting stream flows, water level, scouring events, and access for larval mussels to host fish (Hastie et al., 2001;Wilson et al., 2011). The latter is of note because freshwater mussels possess a unique life cycle involving an obligate parasitic larval stage (glochidia or lasidia in the case of the Mycetopodidae) that must attach to aquatic vertebrates (primarily fish) to develop into a free-living juvenile (Barnhart et al., 2008), which also means their dispersal is tied to that of their host fish.
The elevation is rooted in past geological events, which are known to influence patterns of aquatic biodiversity (e.g., Hoagstrom et al., 2014). In our study, known observations, predicted presence, and diversity are situated between the coastal fall line, which represents the maximum extent of the Gulf of México during the Cenozoic period, and the arid central plateau. The Gulf of México has slowly retreated since the Cenozoic period (Galloway et al., 2011;Smith & Bermingham, 2005), which has likely isolated taxa once connected by fish infected with glochidia or lasidia that migrated along coastal inlets or between pirated stream systems. The arid central plateau exhibits a marked decrease in precipitation compared to the lower coastal reaches of the PRB (Hudson, 2003), has a limited number of spring-fed stream systems, and is characterized by highly variable temperatures with summer temperatures reaching upwards of 30°C. These factors have likely constrained mussel occurrence and expansion by limiting fish infected with glochidia through physical barriers such as sharp increases in slope and shifts in geology and constraining habitat available for colonization due to reductions in water quantity, quality, and colonization of stream reaches that extend to the plateau.
We found land-use contributed very little to model performance, which is interesting given that land-use change is often cited as a primary factor responsible for mussel declines (Allan, 2004;Box & Mossa, 1999;Randklev et al., 2015). We suspect this finding is due to the coarse resolution of our land use data and homogeneity of land use across the landscape. We also suspect it could be due to differences in scale of land-use changes between regions where mussels have been well-studied (e.g., United States and Europe) and the PRB. In the United States, the landscape has been altered at a scale (i.e., time and space) that is much broader and more intensive than what we observed in the PRB. In the United States, these alterations have left many streams and rivers devoid of any sort of significant mussel fauna (Haag & Williams, 2014). In contrast, the

| Conservation implications
We successfully predicted the occupancy of mussels in a region of México where the fauna is largely unknown. Future studies could build off these efforts through ground-truthing our predictions utilizing methods laid out in Randklev et al. (2015) for model validation.
In that study, the authors classified unsampled reaches using probabilities of occupancy into the following classes: 0%-20%, 20%-40%, 40%-60%, 60%-80%, and 80%-100%. The authors then randomly generated sample points within each probability class and performed surveys at those locations. The new survey data collected by Randklev et al. (2015) was then fed back into their models to improve model performance. Given the influence of temperature and rainfall in our study, both of which will be affected by changing climates, the models and baseline data presented in this study could be used to predict the effects of future climate conditions on freshwater biodiversity in general, and specifically, freshwater mussels. These efforts could be enhanced by including sites and reaching specific variables.
For example, Wilson et al. (2011)  Given that species, occupancy is an interplay between species traits and selective characteristics of habitat (Poff, 1997), the inclusion of species trait information in future modeling efforts will help with understanding mechanisms behind species occurrence and optimize model prediction. Unfortunately, very little if anything is known about the life history of mussels in the PRB, which is a problem for mussel conservation and management regardless of geographic location. Because of the lack of data, future studies could focus on enumerating demographic traits such as fecundity, longevity, host use, and growth rates known to be useful for explaining occupancy for other aquatic taxa (Mims & Olden, 2012;Winemiller et al., 2015). That said, our study develops a framework for future research studies in biodiversity research and conservation biology planning in this region and provides baseline information for a poorly understood mussel fauna considered of high conservation concern (e.g., Myers et al., 2000).

F I G U R E 3 (a)
Map of local biodiversity hotspots within the entire Pánuco River Basin; (b) biodiversity map outputs restricted to within 2 km buffer of major streams within the basin. Biodiversity maps were created using stacked ensemble model distributions for 11 species of freshwater mussels found within the reach: Actinonaias coyensis, Actinonaias medellina, Actinonaias signata, Disconaias disca, Disconaias fimbriata, Cyrtonaias tampicoensis, Friersonia iridella, Nephronaias aztecorum, Popenaias berezai, Popenaias semirasa, and Anodontites cylindracea. Black circles denote known presence data from all species modeled and may represent multiple species within the same location

ACK N OWLED G M ENTS
We give special thanks to Aaron Koch, Heather Warman, Enrique Aguado, and Charlie Morones from SUP Kentucky and Kayak Huasteca for guiding us safely during our field excursions in México.
We thank Edna Naranjo for assistance with permit acquisition. We without their support this study could not have happened. Any use of trade, firm, or product names is for descriptive purposes only and does not imply endorsement by the U.S. Government.

CO N FLI C T O F I NTE R E S T
The authors declare that they have no conflict of interest.

DATA AVA I L A B I L I T Y S TAT E M E N T
Mussel occurrence records are available: https://doi.org/10.5061/ dryad.18931 zczf and through http://musse l-proje ct.uwsp.edu/.