Performance evaluation of cetacean species distribution models developed using generalized additive models and boosted regression trees

Abstract Species distribution models (SDMs) are important management tools for highly mobile marine species because they provide spatially and temporally explicit information on animal distribution. Two prevalent modeling frameworks used to develop SDMs for marine species are generalized additive models (GAMs) and boosted regression trees (BRTs), but comparative studies have rarely been conducted; most rely on presence‐only data; and few have explored how features such as species distribution characteristics affect model performance. Since the majority of marine species BRTs have been used to predict habitat suitability, we first compared BRTs to GAMs that used presence/absence as the response variable. We then compared results from these habitat suitability models to GAMs that predict species density (animals per km2) because density models built with a subset of the data used here have previously received extensive validation. We compared both the explanatory power (i.e., model goodness of fit) and predictive power (i.e., performance on a novel dataset) of the GAMs and BRTs for a taxonomically diverse suite of cetacean species using a robust set of systematic survey data (1991–2014) within the California Current Ecosystem. Both BRTs and GAMs were successful at describing overall distribution patterns throughout the study area for the majority of species considered, but when predicting on novel data, the density GAMs exhibited substantially greater predictive power than both the presence/absence GAMs and BRTs, likely due to both the different response variables and fitting algorithms. Our results provide an improved understanding of some of the strengths and limitations of models developed using these two methods. These results can be used by modelers developing SDMs and resource managers tasked with the spatial management of marine species to determine the best modeling technique for their question of interest.

have been used to predict habitat suitability, we first compared BRTs to GAMs that used presence/absence as the response variable. We then compared results from these habitat suitability models to GAMs that predict species density (animals per km 2 ) because density models built with a subset of the data used here have previously received extensive validation. We compared both the explanatory power (i.e., model goodness of fit) and predictive power (i.e., performance on a novel dataset) of the GAMs and BRTs for a taxonomically diverse suite of cetacean species using a robust set of systematic survey data  within the California Current Ecosystem. Both BRTs and GAMs were successful at describing overall distribution patterns throughout the study area for the majority of species considered, but when predicting on novel data, the density GAMs exhibited substantially greater predictive power than both the presence/absence GAMs and BRTs, likely due to both the different response variables and fitting algorithms. Our results provide an improved understanding of some of the strengths and limitations of models developed using these two methods. These results can be used by modelers developing SDMs and resource managers tasked with the spatial management of marine species to determine the best modeling technique for their question of interest.
When evaluating the performance of SDMs, there has been an emphasis on statistically comparing models with different conceptual frameworks (Austin, 2007;Elith & Leathwick, 2009;Franklin, 1998;Guisan & Thuiller, 2005). A robust comparison sometimes involves simulated data so that the "true" relationship between the response and predictor variables is known and results are not confounded by differences in responses, predictors, or model parameterization (Austin, 2007;Brodie et al., 2019). The use of real data is also valuable, because cross-validation with spatially and/or temporally novel datasets can be used to quantitatively assess model performance with data that were not used to build the models (Hijmans, 2012;Shabani et al., 2016). Comparing the results from different models built with real data can provide important insights for the spatial management of marine species  and increase our understanding of both the strengths and weaknesses of different modeling techniques, helping to guide future modeling efforts.
Comparative modeling studies have been developed for marine species such as fish (BRTs, RFs, GAMS; Leathwick et al., 2006;Stock et al., 2019), seabirds (GLMs, GAMs, RFs, BRTs, and MAXENT;Oppel et al., 2012), and additional taxa (Robinson et al., 2017), but results from these marine-based model comparisons have not been consistent across species. Few comparison studies have focused on cetacean SDMs (e.g., GLMs vs. GAMs, Becker et al., 2010;GAMs vs. MAXENT, Fiedler et al., 2018). Studies that have compared cetacean SDMs have primarily used nonsystematic survey data for model development (e.g., GLMs, GAMs, BRTs, MAXENT, and support vector machines;Derville, Torres, Iovan, & Garrigue, 2018; BRTs vs. generalized additive mixed models [GAMMs], Abrahms et al., 2019 and have rarely explored how species distribution characteristics (i.e., spatial distribution and habitat preference) affect model performance. Finally, ensemble modeling has emerged as a robust method for combining multiple modeling results (e.g., Abrahms et al., 2019;Forney, Becker, Foley, Barlow, & Oleson, 2015;Marmion, Parviainen, Luoto, Heikkinen, & Thuiller, 2009;Oppel et al., 2012;Pikesley et al., 2013;Scales et al., 2016;Woodman et al., 2019), prompting a need to better understand the strengths and weaknesses of different modeling approaches to inform uncertainty-based weightings. is a need to understand the spatial and temporal habitat use of these species. SDMs for cetaceans have been developed for U.S. west coast waters from systematic ship survey data collected by the Southwest Fisheries Science Center (SWFSC) since 1991, and these GAMs have been extensively evaluated using cross-validation (Barlow et al., 2009;Becker et al., 2010;Forney, 2000;Forney et al., 2012) and predictions on independent datasets (Barlow et al., 2009;Becker et al., 2012Becker et al., , 2014Becker et al., , 2018Calambokidis et al., 2015;Forney et al., 2012). The most recent models provide spatially explicit density predictions at a 0.1˚ (approximately 10km x 10km) grid resolution , and they have been used by the Navy to assess potential impacts on cetaceans as required by U.S. regulations such as the MMPA and ESA (U.S. Department of the Navy, 2013Navy, , 2015Navy, , 2017. However, a comparison between different model types developed using these systematic data has not been performed, despite the potential for insight into both model performance and management of these species. The objective of this study was to compare the explanatory and predictive power of the two most prevalent modeling frameworks used to develop SDMs for cetaceans, BRTs and GAMs, and evaluate how species distribution characteristics affect model performance. Ultimately, a better understanding of model performance will improve the application of SDMs for marine spatial planning and conservation efforts. We used systematic cetacean survey data collected by SWFSC between 1991 and 2014 to develop SDMs for seven taxonomically diverse species or subspecies that have different spatial distributions and habitat preferences: short-beaked common dolphin (Delphinus delphis delphis; Figure 1), long-beaked common dolphin (Delphinus delphis bairdii), striped dolphin (Stenella coeruleoalba), northern right whale dolphin (Lissodelphis borealis), Risso's dolphin (Grampus griseus), fin whale (Balaenoptera physalus), and humpback whale (Megaptera novaeangliae). We selected these species to provide a comparison of (a) species with widespread distributions (short-beaked common dolphin and fin whale) versus restricted distributions (long-beaked common dolphin and humpback whale) in the study area, (b) species that occur in more dynamic nearshore habitat (northern right whale dolphin) versus less variable offshore habitat (striped dolphin), and (c) a species for which previous density GAMs did not perform as well as expected (Risso's dolphin; e.g., Becker et al., 2010;Forney et al., 2012). Since the majority of BRTs developed for marine species have been used to predict habitat suitability (probability of presence or relative abundance), which is then converted to presence/absence, ideally using a meaningful threshold value (Abrahms et al., 2019;Brodie et al., 2018;Derville et al., 2018;Hazen et al., 2017;Maxwell et al., 2019;Scales et al., 2017), we first compared BRTs to GAMs that used presence/ absence as the response variable. We then compared results from these models to GAMs that predict absolute density (animals per km 2 ), since density models built with these data have previously received extensive validation. Results enhance our scientific understanding of how the ecology and life history of different species affect the accuracy of models developed in different frameworks and, thus, the accuracy of potential management advice.

| Survey data
Cetacean survey data used to build the SDMs were collected in the California Current Ecosystem (CCE) during the summer and fall (July through early December) of 1991, 1993, 1996, 2001, 2005, 2008, 2009, and 2014 using systematic line-transect methods (Buckland et al., 2001). With the exception of 2009, which covered a limited area to target common dolphins (Carretta, Chivers, & Perryman, 2011), transect lines were arranged in a systematic grid to provide even coverage of the survey region over the course of each survey. When combined across years, the surveys provided dense coverage of waters from the west coast of the United States to approximately 556 km offshore (Figure 2; Barlow, 2016;Barlow & Forney, 2007;Carretta et al., 2011). We used on-effort sampling data from transect segments where Beaufort sea state (a wind index inversely correlated with animal detection rate) was ≤5. The survey protocol was the same for all years (see Barlow & Forney, 2007;Kinzey, Olson, & Gerrodette, 2000). Research vessels traveled at approximately 18 km/hr along predetermined transect lines while two experienced observers searched with pedestal-mounted 25 × 150 binoculars (approximately 10-15 m above sea-level depending on the research vessel). A third observer searched with unaided eyes and 7× hand-held binoculars and also recorded data on survey conditions and cetacean sightings. When cetaceans were sighted, the research vessel approached the group as needed to identify the species and estimate the number of individuals in the group. All observers independently provided best, high, and low group size estimates; we averaged the best estimates for each species to obtain a single group size estimate for each sighting.
The modeling dataset was created by dividing continuous portions of survey effort into approximate 5-km segments using the approach F I G U R E 1 Short-beaked common dolphin (Delphinus delphis delphis) in the California Current Ecosystem study area. Photograph taken by K.A. Forney under NMFS Permit No. 19091 described by Becker et al. (2010). Species-specific sighting data were assigned to each segment (total number of sightings and average group size), and habitat covariate values were derived based on the segment's geographic midpoint. Sighting data were truncated at a distance of 5.5 km perpendicular to the track line to eliminate the most distant groups (Buckland et al., 2001) and to maintain consistency with the species-specific effective-strip-width estimates (key parameters in line-transect density analyses that provide measures of how far animals are seen from the transect line) derived by Barlow, Ballance and Forney (2011) and used in this study to estimate cetacean densities.

| Habitat variables
As is the case for most cetacean SDMs, the selected habitat predictors are most likely proxies for unmeasured underlying ecological processes driving species distributions. The same suite of predictor variables was used for both model types (GAM, BRT) and included a combination of dynamic, bathymetric, and spatial covariates as described below. We also offered year as a potential predictor in all models to capture population trends for species whose abundance has increased substantially during the time period considered in our analyses: the short-beaked common dolphin (Barlow, 2016), fin whale (Moore & Barlow, 2011), and humpback whale (Barlow, Calambokidis, et al., 2011).

| Dynamic variables
Dynamic variables used in this study are defined as those that change on temporal scales of days to weeks in the CCE study area (Bograd et al., 2009). Dynamic predictors derived from a data assimilative CCE configuration of the Regional Ocean Modeling System (ROMS), produced by the U.C. Santa Cruz Ocean Modeling and Data Assimilation group Neveu et al., 2016), have been shown to be effective in similar SDMs for these species in this study area . We used daily output for each ROMS variable at the 0.1 degree (~10 km) horizontal resolution of the model. We used output from both a historical reanalysis (1980Neveu et al., 2016) and a near-real-time data assimilation system (2011-present; Moore et al., 2013;ocean model ing.ucsc. edu) to cover the broad temporal span of our survey data . Both systems provide data-constrained state estimates for our study area, but they differ in assimilation details and the specific data used. We limited the predictors to those consistent between the two sources : sea surface temperature (SST) and its standard deviation (sdSST), calculated for a 3 × 3 pixel box centered on the pixel containing the modeling segment midpoint, mixed layer depth (MLD, defined by a 0.5°C deviation from the SST), sea surface height (SSH), and its standard deviation, sdSSH. An offset (+0.035 m) was applied to the near-real-time SSH data to match the historical reanalysis dataset, which had a different reference level (Scales et al., 2017).

| Bathymetric variables
Bathymetric data were derived from ETOPO1 (obtained from https://www.ngdc.noaa.gov/mgg/globa l/global.html; 0.1-degree resolution; Amante & Eakins, 2009). Given its prevalence as an important predictor in past modeling studies in this ecosystem (e.g., Becker et al., 2010Becker et al., , 2016Becker et al., , 2018, we selected water depth (m) as a habitat variable to represent bathymetry, obtained for the midpoint of each transect segment.

| Spatial variables
Latitude and longitude are prevalent as covariates in many cetacean modeling studies (e.g., Cañadas & Hammond, 2008; Forney

F I G U R E 2 Completed transects for the Southwest Fisheries
Science Center systematic ship surveys conducted between 1991 and 2014 in the California Current Ecosystem study area and the eight geographic strata used to evaluate the accuracy of the spatial patterns of predicted habitat suitability/density. The four north-south strata are consistent with those used for line-transect abundance estimation (Barlow & Forney, 2007) and an offshoreonshore division occurs at the 2,000 m isobath. Region names are as follows: (1) OR/WA west, (2) OR/WA east, (3) NorCA west, (4) NorCA east, (5) CenCA west, (6) CenCA east, (7) SoCA west, and (8) SoCA east. The green lines show on-effort transect coverage in Beaufort sea states ≤ 5. Also shown are names of geographic places mentioned in the text (SCB, Southern California Bight) et al., 2015;Hedley et al., 1999;Pirotta, Matthiopoulos, MacKenzie, Scott-Hayward, & Rendell, 2011;Tynan et al., 2005;Williams, Hedley, & Hammond, 2006). They were included as covariates in our study as they have been shown to increase the explanatory power of SDMs because they often account for unmeasured variables that might be important for driving species distributions . The inclusion of spatial covariates prohibits predictions outside of the study area, but allowed us to explicitly evaluate how the different modeling methods handled discrete spatial data.

| Interaction terms
One of the advantages of BRTs is their ability to automatically fit interactions between predictor variables, while interactions must be explicitly defined when fitting GAMs. Previous comparative modeling studies have recognized the importance of interaction terms in SDMs and explicitly included them in GAMs to enable a more equitable comparison (Leathwick et al., 2006). Given the importance of spatial interaction terms in past cetacean SDMs Forney, 2000;Palacios et al., 2019;Yuan et al., 2017), bivariate interaction terms between latitude and each of the dynamic variables (SST, MLD, and SSH) were included individually when building the GAMs (see below for a description of the GAM modeling framework).

| Generalized additive models
Both the habitat suitability and density GAMs were developed in R (R Core Team, 2017) using the package "mgcv" (Wood, 2011), which uses cross-validation as part of the model selection process. We used restricted maximum likelihood (REML) to optimize the parameter estimates and a variable selection process that uses a shrinkage approach to modify the smoothing penalty and effectively remove nonsignificant variables from the model (Marra & Wood, 2011). REML provides more accurate smooth term estimates than other methods such as Akaike's information criterion (AIC) that have been shown to be prone to undersmoothing (Marra & Wood, 2011). To ensure that models were not overfit, we also removed variables that had p-values > .05 and then refit the models to ensure that all remaining variables had p-values < .05 (Redfern et al., 2017;Roberts et al., 2016). Pairwise interaction terms were considered separately in the GAMs to avoid overfitting and to aid in the ecological interpretation of the interaction term . Correlations among the predictor variables in our study ranged from 0.003 to 0.66 (absolute values), but mgcv is robust to such effects (termed "concurvity"; Wood, 2008).

| Habitat suitability GAMs
To develop presence/absence models from the systematically collected survey data, we assigned values of 1 to those segments that included sightings and values of 0 to those segments with no sightings. We fit binomial GAMs using a logit link function so that the resultant models describe the probability of species presence, also termed "habitat suitability" (Brodie et al., 2018) or "habitat preference" ).

| Density GAMs
The methods used to develop the density GAMs followed those described in Becker et al. (2018). For species that occur in small groups (i.e., fin and humpback whales), we fit a single response model using the number of individuals per transect segment as the response variable with a Tweedie distribution to account for overdispersion (Miller, Burt, Rexstad, Thomas, & Gimenez, 2013). The other species are all members of the Family Delphinidae that tend to occur in groups with large and variable sizes, so we fit separate encounter rate and group size models.
Encounter rate (number of sightings per segment) models were fit with all transect segments using a Tweedie distribution (i.e., assume the number of groups sighted per segment is Tweedie distributed, e.g., Foster & Bravington, 2013). Group size models were fit with only those transect segments that included sightings, using the natural log of group size as the response variable and a Gaussian link function. To account for observed geographic differences in the size of delphinid groups (Barlow, 2015;Cañadas & Hammond, 2008;Ferguson et al., 2006), group size was modeled using a tensor product smooth of latitude and longitude Wood, 2003). The natural log of the effective area searched (described below) was included as an offset in both the single response and encounter rate models to account for both varying segment lengths and the different detection probabilities recorded during the surveys.
Density (number of animals per km 2 ) was estimated by incorporating the model results into the standard line-transect equation (Buckland et al., 2001): where i is the segment, n is the number of sightings, s is the average group size, and A is the effective area searched: where L is the length of the effort segment (km), ESW is the effective strip half-width (km), and g (0) is the probability of detection on the transect line. Following the methods of Becker et al. (2016), species-specific estimates of ESW and g(0) were derived based on the recorded detection conditions on each modeling segment using coefficients estimated by Barlow, Ballance, et al. (2011) for ESW andBarlow (2015) for g(0).

| Boosted regression trees
BRTs use machine-learning methods whereby predictions from singletree models are combined to maximize predictive performance (Elith , 2008). We fit the BRTs in R (R Core Team, 2017) using the package "dismo" (Elith et al., 2008), following the methods described in Leathwick et al. (2006) and Elith et al. (2008). For each set of models, we built presence-absence BRTs specifying a binomial distribution consistent with the habitat suitability GAMs described above. The BRTs were assigned a tree complexity of 3, a bag fraction of 0.6, and we adjusted the learning rate ("shrinkage") for each model to ensure that at least 1,000 trees were included in the final model configuration (Elith et al., 2008). Due to the tendency of BRTs to overfit (Leathwick et al., 2006), we included a random number as a potential predictor in each model run to compare against the other variables' contributions; only relevant predictors (i.e., those more significant than the random variable) were included in the final BRTs (Eguchi, Benson, Foley, & Forney, 2017).
We built BRTs for each species using four combinations of variables to reduce the potential for overfitting and to explore the effect of including geographic (latitude, longitude) terms in the models: (a) dynamic and bathymetric variables only; (b) dynamic, bathymetric, and latitude; (c) dynamic, bathymetric, and longitude; (d) all variables. These models were compared using explained deviance, the receiver operating characteristic curve (AUC; Fawcett, 2006), and the true skill statistic (TSS; Allouche, Tsoar, & Kadmon, 2006), all metrics commonly used to assess BRTs (Brodie et al., 2018;Elith et al., 2006;Franklin et al., 2009;Oppel et al., 2012;Scales et al., 2017). Each BRT iteration is stochastic, and although generally the key variables (i.e., those with the most influence) are consistent among individual models runs, we used the best BRT of 10 model iterations for this analysis.

| Model predictions
For each species, the three models (suitability BRT, suitability GAM, and density GAM) were each used to make predictions on distinct daily composites of environmental conditions for all 1991-2014 survey days (n = 1,312) used to develop the models. We used the average of all composites to represent expected long-term patterns in species distributions that account for the varying oceanographic conditions during the 1991-2014 summer/fall SWFSC cetacean surveys. Log-normal 90% confidence intervals for the spatial predictions principally reflect temporal variability in population density/habitat suitability since this has been shown to contribute the greatest source of uncertainty in these models (Barlow et al., 2009;Becker et al., 2014;Boyd et al., 2018;Ferguson et al., 2006). The prediction grid was clipped to the boundaries of the approximate 1,141,800-km 2 study area to ensure that predictions were not extrapolated outside the region used for model development.

| Performance evaluation
The explanatory power of the models was compared using a set of established SDM performance metrics including AUC, TSS, and visual inspection of predicted and observed distributions during the 1991-2014 summer/fall SWFSC cetacean surveys (Barlow et al., 2009;Becker et al., 2010Becker et al., , 2016Forney et al., 2012;Oppel et al., 2012;Scales et al., 2016;Woodman et al., 2019). AUC and TSS measure the discriminatory ability of an SDM and can be calculated using any type of prediction value. To calculate TSS for the GAM density models, we used the sensitivity-specificity sum maximization approach (Liu, Berry, Dawson, & Pearson, 2005) to obtain thresholds for species presence. To assess the ability of the models to predict spatial distribution patterns, we used the presence/absence GAMs and BRTs to estimate habitat suitability and the density GAMs to estimate abundance specific to eight geographic strata within the CCE study area (Figure 2): four north-south strata consistent with those used for line-transect abundance estimation (Barlow & Forney, 2007), and an offshore-onshore division at the 2,000-m isobath, which roughly represents the transition from the continental slope to the continental rise. The four north-south strata included wa- ranked predicted values across the eight geographic strata to those derived from the actual survey data (Becker et al., 2014).
To compare the models' ability to predict on novel data, we also built the models without the 2014 survey data and then used each of these models to predict on the 2014 environmental conditions of the summer/fall SWFSC survey. We selected 2014 for this evaluation because during this time waters in the CCE became anomalously warm as an unprecedented marine heatwave spread over the area (Bond, Cronin, Freeland, & Mantua, 2015;Cavole et al., 2016;Di Lorenzo & Mantua, 2016;Leising et al., 2015), providing a unique opportunity for cross-validation. A previous study  assessed the ability of the density GAMs to predict abundance and distribution during this novel year for some of the species considered here, and given their success, we wanted to compare novel predictions from the presence/ absence GAMs and BRTs. Following the methods of Becker et al. (2018), models were built with different combinations of the habitat variables and the model with the highest predictive performance was carried forward to represent that model type. For each of the three models, we then computed overall study area ratios of observed-to-predicted values and inspected predicted 2014 distribution patterns as compared to the 2014 survey observations (Barlow et al., 2009;Becker et al., 2010Becker et al., , 2016Becker et al., , 2018Forney et al., 2012;Redfern et al., 2013).
Results were examined in light of the species-specific characteristics that could affect model performance including study area distribution patterns and habitat preferences.

| Explanatory performance
The 1991-2014 SWFSC surveys provided 82,659 km of on-effort data in Beaufort sea states ≤5 which were used to develop the three different SDMs. The number of sightings available for modeling varied between species, ranging from 115 for northern right whale dolphin to 906 for short-beaked common dolphin (Table 1). Key predictor variables selected by the two different (habitat suitability and density) GAMs and BRT fitting algorithms (Table 1) were consistent with those found in previous studies that used subsets of the same survey data (Barlow et al., 2009;Becker et al., 2010Becker et al., , 2016Becker et al., , 2018Forney et al., 2012;Redfern et al., 2013), although the BRTs included the greatest number of predictor variables.
Year was included as a continuous variable in the final habitat suitability and density GAMs and BRT for humpback whale, capturing the increasing population trend of this species during the time period considered in our analysis (Barlow, Calambokidis, et al., 2011;Calambokidis, Barlow, Flynn, Dobson, & Steiger, 2017). Year was also a key predictor variable in the short-beaked common dolphin and fin whale habitat suitability and density GAMs, but was not significant in the BRTs for these species (i.e., year had lower relative variable contribution than the random variable).
The overall study area ratios of observed-to-predicted density/ habitat suitability values were very similar for the GAMs and BRTs, yet the percentage of explained deviance, AUC, and TSS metrics were generally highest for the BRTs and lowest for the density GAMs (Table 1). The nonparametric rank correlations were significant for all species for both types of GAMs and the BRTs, although the density GAM exhibited better performance overall (Table 2). AUC, TSS, and the rank correlations all measure the discriminatory ability of an SDM (i.e., how well a model separates occupied from unoccupied sites; Vaughan & Ormerod, 2005); however, results from the rank correlations suggest that the density GAMs were better able to capture large-scale spatial distribution patterns throughout the study area.  (Carretta et al., 2011;Gerrodette & Eguchi, 2011). There have been recent sightings of long-beaked common dolphin north of 36°N (Ford, 2005;Huggins et al., 2011) andFord (2005) anticipated that additional records of the species would occur during anomalously warm water periods. However, long-beaked common dolphins are not sighted north of Pt. Conception consistently enough to validate the BRT predictions.
During the model development phase, we attempted to improve the BRT prediction for long-beaked common dolphin by building BRTs with various combinations of the dynamic, bathymetric, and spatial variables that could potentially eliminate the extension of low to moderate habitat suitability north of 36°N. Unfortunately, no combination of predictor variables we offered the BRT was able to successfully capture the limited distribution pattern of long-beaked common dolphin in the southern inshore portion of the study area, and all model predictions were worse (Figure 4) than the original ( Figure 3b). As an experiment, we also built BRTs using data only south of 37°N, but when making predictions on the entire study area, these models still predicted high habitat suitability in the northern portions of the study area and in some cases well offshore ( Figure 4d). In addition, the inclusion of latitude created apparent modeling artifacts in some of the BRT predictions (Figure 4a,d).
The lower and upper 90% confidence intervals (CIs) of density/ habitat suitability for the GAMs and BRTs showed overall similarities throughout the study area, with regional differences apparent in species distribution patterns for both the lower and upper CIs that were similar to differences apparent in the multiyear average density/habitat suitability comparisons (Figure 3). For the majority of the species, the upper CIs for the BRTs were higher across the study area than those of the GAMs (e.g., Figure 3b,c,e-g).

Species
Predictor variables SST, sea surface temperature; SSTsd, standard deviation of SST. The "LON:LAT" and "SST:LAT" terms in the GAMs indicate an interaction term. All models for short-beaked common dolphin, fin whale, and humpback whale were also offered a year covariate to capture their change in abundance during the 1991-2014 survey years (see text for details). Variables are listed in the order of their importance in each model. Comparative explanatory performance metrics (i.e., model goodness of fit) included percentage of explained deviance (Exp.Dev.), the area under the receiver operating characteristic curve (AUC), the true skill statistic (TSS), and the ratio of observed:predicted habitat suitability/density for the study area (Obs:Pred).
TA B L E 1 Summary of the final GAM and BRT habitat suitability (HS) and GAM density (Dens) models built with the 1991-2014 survey data and the number of sightings available for modeling (n)

| Predictive performance
The ability of the different model types to make accurate predictions during the novel 2014 year differed substantially, and based on the observed: predicted ratios, the density GAMs generally outperformed both the presence/absence GAMs and BRTs, particularly for northern right whale dolphin, fin whale, and humpback whale (Table 3). The density GAM had the best observed:predicted ratio (i.e., closest to 1) for six of the seven species, while the presence/ absence GAM had the best ratio for northern right whale dolphin (Table 3).
The plots of predicted 2014 distribution patterns as compared to the 2014 survey observations showed that both types of GAM were better able to predict shifts in distribution during the anomalously warm conditions in 2014 as compared to the BRTs ( Figure 5). For example, two warm temperate/tropical species in our study, shortbeaked common and striped dolphins, have continuous distributions southward into Mexican waters (Mangels & Gerrodette, 1994;Perrin, Scott, Walker, & Cass, 1985), and the distribution of both species expanded to the north during the warm 2014 conditions, increasing their abundance throughout the CCE study area (Barlow, 2016;Becker et al., 2018). The density GAM was able to capture the northward expansion of both short-beaked common and striped dolphins (i.e., swaths of higher-than-average density were predicted north of 40°N; Figure 5a,d). The presence/absence GAM was able to capture this northward shift for striped dolphin, while the BRTs predicted average to lower-than-average habitat suitability north of 40°N for both species (Figure 5a,d). The 2014 study area abundance estimate for short-beaked common dolphin was almost twice as high as in previous years. For waters off Oregon and Washington, average abundance was more than five times higher than in previous years (Barlow, 2016) due to the northward expansion in distribution during the warm 2014 conditions. Both types of GAMs for short-beaked common dolphin were able to capture the absolute increase in abundance/habitat suitability, while the BRT predicted average to lower-than-average habitat suitability throughout most of the study area (Table 3 and Figure 5a).
The long-beaked common dolphin BRT predicted suitable habitat extending north along the coast of the entire study area in 2014, but lower than what was predicted for previous years. The 2014 predictions from both types of GAMs better matched the known distribution of this species in the southern nearshore region of the study area, and areas with the highest predicted density were consistent with long-beaked common dolphin sighting locations during the 2014 survey (Figure 5b).
Among all the BRTs, the best observed:predicted habitat suitability ratio for the novel 2014 year was for northern right whale dolphin (0.88; Table 3). Interestingly, the difference plot for this species revealed that the BRT predicted lower-than-average habitat suitability for the majority of the study area, with a very small region near the northern border predicted to have higher-than-average habi- The density GAM also captured this northward shift in distribution in 2014, but the difference plot for the GAM contrasted sharply with that of the BRT, with higher-than-average density predicted for the northern portion of the study area and lower-than-average density predicted for the south (Figure 5c). For this species, the habitat suitability GAM had the worst observed:predicted ratio of the three models (1.17), and the difference plot was very similar to the BRT, with lower-than-average northern right whale dolphin habitat suitability predicted for the entire study area (Figure 5c).
All three types of models for Risso's dolphin overestimated density/habitat suitability in the study area in 2014, with observed:predicted ratios ranging from 0.52 to 0.61 (Table 3) Table 3).

TA B L E 2
Summary of the Spearman rank correlation coefficients (Rs) for geographic regions within the study area based on observed values (i.e., estimates from the 1991-2014 survey data) and predictions from the final GAM and BRT habitat suitability (HS) and GAM density (Dens) models The density GAM for humpback whale captured the increase in the number of individuals in the study area in 2014 (Barlow, 2016;Becker et al., 2018), with higher-than-average predictions for the region between approximately 34°N and 38°N where there were multiple sightings during the 2014 survey ( Figure 5g). Conversely, the presence/ absence GAM and BRT predicted lower-than-average habitat suitability for this region in 2014 (Figure 5g), and the observed:predicted ratios for these two models show that they substantially underestimated habitat suitability for humpback whales in 2014 (Table 3).

| D ISCUSS I ON
GAMs and BRTs have been established as two commonly used modeling frameworks to guide spatial management and conservation strategies for cetaceans (e.g., Abrahms et al., 2019;Gilles et al., 2016;Hazen et al., 2017;Redfern et al., 2019). Improving the application of SDMs for spatial planning and conservation efforts requires a better understanding of the strengths and weaknesses of these modeling methods. Both methods are used to model nonlinear covariate responses, but the mechanics of the two approaches differ, as GAMs use flexible smoothing functions while BRTs use binary splits (regression trees). Our study compared both the explanatory power (i.e., model goodness of fit) and predictive power (i.e., performance on a novel dataset) of habitat suitability GAMs and BRTs, as well as density GAMs, for a taxonomically diverse suite of cetacean species using a robust set of systematic survey data. Below we provide details on the models' performance and discuss species-specific characteristics that could have affected these results.

| Model comparison: explanatory versus predictive performance
The key environmental variables (i.e., those that had the most influence on the respective model) and general trend of their response curves ( Figure S1) were similar in the final 1991-2014 GAMs and BRTs, as were the study area ratios of observed: predicted density/ habitat suitability and the overall distribution patterns for the majority of species (Table 1, Figure 3). These results are similar to those of Scales et al. (2016), who found the ranking of variable performance, model response curves, and spatial predictions of GAMs, BRTs, and RFs similar when predicting the foraging habitats of gray-headed albatross (Thalassarche chrysostoma). The percentage of explained deviance, AUC, and TSS metrics were consistently higher for the BRTs (Table 1), suggesting that this type of model has higher explanatory ability than the GAMs; however, the predictive power of the BRTs was lower than both types of GAMs based on both the ratios of observed-to-predicted values for the novel 2014 year (Table 3)  showed excellent explanatory performance when discriminating between presence/absence, but poor performance when predicting on independent test data. They found a similar pattern when they used RF to develop density models, and attributed the inferior predictive performance to machine-learning techniques overfitting more than parametric models (Oppel et al., 2012). In our case, it is likely that overfitting in the BRTs made it more difficult to predict on the anomalous 2014 oceanic conditions that were not reflected in the training datasets, whereas the smooths in the GAMs were better able to handle such differences. Specifically, GAMs extend the splines between predictor and variable partial response to predict on novel conditions while BRTs assume a static relationship when predicting out of bounds (Zurell, Elith, & Schröder, 2012). However, with any modeling framework, when extrapolating outside the range of values used to build the models, results should be interpreted cautiously, particularly if data are not available for cross-validation (Becker et al., 2014;Mannocci, Roberts, Miller, & Halpin, 2017).
Year was included in both types of GAMs for short-beaked common dolphin, fin whale, and humpback whale, while the BRT only included year for humpback whale (i.e., year had lower relative variable contribution than the random variable in both the short-beaked common dolphin and fin whale models). Thus, the GAMs appeared to capture absolute changes in population size as populations recovered (fin and humpback whales) or moved into the CCE (shortbeaked common dolphin).
One of the advantages of BRTs is the implicit incorporation of interaction terms, such as between latitude and longitude, which must be explicitly defined in a GAM. However, for many of our BRTs, latitude and/or longitude created odd modeling artifacts in the prediction surfaces. For example, a spatial interaction term (latitude:longitude) was included in both types of GAM for long-beaked common dolphin, and these models accurately captured this species' limited distribution in the study area ( Figure 3b). However, when latitude and/or longitude was included in the long-beaked common dolphin BRT, these models produced ecologically unreasonable swaths of habitat suitability along latitude lines (e.g., Figure 4a,d). Further, latitude did not capture the expected patterns of long-beaked common dolphin habitat in the final 1991-2014 BRT, because low-to-moderate habitat suitability was predicted for areas along the entire U.S. west coast (Figure 3b), outside the normal range of this species (Carretta et al., 2011;Gerrodette & Eguchi, 2011). Spatial terms were effective in the BRTs for some of the species considered here (e.g., striped dolphin, Figure 3d), so we suggest that modelers use care when including spatial terms in BRTs.

| Model comparison: species with widespread vs. limited distribution
During summer and fall, short-beaked common dolphins and fin whales are known to occur throughout large portions of our study area (Barlow, 2016;Barlow & Forney, 2007;Becker et al., 2016;Calambokidis et al., 2015). Both types of the 1991-2014 GAMs and the BRTs successfully captured the distribution patterns of shortbeaked common dolphins and fin whales in the study area ( the BRTs underpredicted habitat suitability by more than 60% for fin whales and almost 90% for short-beaked common dolphins (Table 3).
In contrast, humpback whales and long-beaked common dolphins have more limited coastal distributions in our study area, with the latter typically occurring south of about 36°N (Barlow, 2016;Barlow & Forney, 2007;Becker et al., 2016;Calambokidis et al., 2015;Carretta et al., 2011). Elith et al. (2008) suggested that one of the advantages of BRTs over GAMs is that they could handle sharp discontinuities when modeling species with distributions that occupied only a small proportion of the sampled environmental space. Our results are inconsistent with this finding, as the humpback common dolphin yet the worst for humpback whale, and the BRTs for both these species exhibited poor predictive ability (Table 3). The BRT 1991-2014 spatial predictions for both species also had issues as evident from the habitat suitability plots; the long-beaked common dolphin BRT showed suitable habitat north of the typical range for this species (Figure 3b), and the humpback whale BRT showed low-to-moderate habitat suitability in areas to the northwest and well offshore, where there have been no sightings of this species during the SWFSC surveys ( Figure 3g). Surprisingly, the BRTs for both species had some of the highest explained deviance, AUC, and TSS values among all the models (Table 1). This illustrates that both threshold-independent (AUC) and threshold-dependent (TSS) measures can be misleading in cases when species "prevalence," that is, the proportion of the study area in which a species occurs, is low (Fiedler et al., 2018;Fourcade, Besnard, & Secondi, 2018;Somodi, Lepesi, & Botta-Dukát, 2017). Our results are consistent with others who have suggested that AUC alone is not a robust measure of SDM predictive performance because it does not provide information on the spatial distribution of model errors (Lobo, Jiménez-Valverde, & Real, 2007) and that model selection based solely on TSS can be misleading (Ruete & Leynaud, 2015). This result has important implications for management efforts in areas where the distribution of a species is poorly known, because reliance on the BRT AUC and TSS metrics alone could result in misguided conservation strategies (e.g., ill-defined boundaries for protected areas, ineffective mitigation measures, etc.).

| Model comparison: species occurring in more versus less heterogeneous habitats
The northern right whale dolphin is a cool-temperate species that occurs primarily in slope and shelf waters in the study area, which are oceanographically dynamic. Northern right whale dolphins exhibit southward distribution shifts into the Southern California Bight during cool-water periods, such as the winter months (Becker et al., 2014;Dohl, Norris, Guess, Bryant, & Honig, 1980;Forney & Barlow, 1998). The striped dolphin is a tropical species inhabiting warm offshore waters of the study area (Barlow, 2016;Becker et al., 2016;Forney et al., 2012), which are oceanographically less dynamic than the shelf and slope waters of the California Current Ecosystem (Chelton, Bernal, & McGowan, 1982;Haury, 1976;Hickey, 1979). Becker et al. (2010) found that the complexity of a species' habitat influenced the predictive ability of GAMs and that greater sample sizes were required to parameterize models for species that inhabit more heterogeneous or dynamic environments.
The 1991-2014 spatial distribution patterns of northern right whale dolphins and striped dolphins were successfully captured by all three models (Table 2, Figure 3c,d), suggesting that both model types have strong explanatory capability for species inhabiting habitats of varying complexity. However, the novel predictions for 2014 underestimated striped dolphin abundance (density GAM) and habitat suitability (presence/absence GAM and BRT) by over a factor of two (Table 3). The range of striped dolphin extends continuously from the study area south to waters offshore Mexico (Mangels & Gerrodette, 1994;Perrin et al., 1985). During the anomalously warm water conditions in 2014, the available striped dolphin habitat within The difference plots (Figure 5c) for the three models were dissimilar; however, as the density GAM predicted higher-than-average density in the north and lower-than-average density in the south. In contrast, the presence/absence GAM and the BRT predicted lower habitat suitability for almost the entire study area. In this case, both habitat suitability models erroneously implied a lower study area TA B L E 3 Summary of the final GAM and BRT models built with the 1991-2009 survey data, the number of sightings available for modeling (n), and their ability to accurately predict study area habitat suitability (HS)/density (Dens) for the novel year (2014) abundance of northern right whale dolphins, while the observed survey results indicated that there were nearly twice as many animals within a smaller (northern) area (Barlow, 2016 The anomalously warm conditions in 2014 provided a unique opportunity to assess the predictive ability of the models given the substantial shifts in distribution exhibited by many of the species considered here, including both striped and northern right whale dolphins. Our study area represents the northern range of striped dolphin (Mangels & Gerrodette, 1994;Perrin et al., 1985), and in

| Model comparison: a species for which previous GAMs have been challenging
Previous GAMs developed for Risso's dolphin using subsets of the data used here did not perform as well as expected, and there was poor correlation between predicted density patterns and sighting data used to build the models (Becker et al., 2010;Forney et al., 2012). Sighting data reveal a longitudinal hiatus in the distribution of Risso's dolphins within the study area, with sightings concentrated either along the continental shelf (mainly south of 38°N) or in offshore deep waters (Barlow, 2016;Barlow & Forney, 2007 (Figure 3e). The fact that none of the three models were able to capture the distribution patterns of Risso's dolphin likely indicates that the environmental variables offered to the models are not effective proxies for their habitat and prey. Large and small squid account for approximately 85% of the diet of Risso's dolphin (Pauly, Trites, Capuli, & Christensen, 1998). Squid are typically found at depths >200 m (Childress & Seibel, 1998), and identifying an available proxy that better captures the ecological processes driving squid distribution may improve the explanatory power of Risso's dolphin SDMs. Given their poor explanatory performance, it is not surprising that both the GAMs and BRT for Risso's dolphin also had poor predictive performance (Table 3).

| CON CLUS IONS
This study provided a unique opportunity to compare the performance of two commonly used SDM modeling frameworks, GAMs and BRTs, for a diverse suite of cetacean species to better understand strengths and limitations of each approach. All three models (density GAMs and presence/absence GAMs and BRTs) exhibited good explanatory performance and did well at predicting spatial patterns for species that have widespread distributions throughout the study area and for species that inhabit oceanographically diverse (i.e., more or less dynamic) environments. For species with limited distributions in our study area, the BRTs were not able to accurately capture their spatial distribution patterns despite strong performance as indicated by commonly used model evaluation metrics, confirming previous studies that have suggested that both AUC and TSS can be misleading when used to evaluate SDMs (Fiedler et al., 2018;Lobo et al., 2007;Ruete & Leynaud, 2015). Further, the inclusion of latitude and longitude in some of the BRTs, most notably for long-beaked common dolphin, resulted in odd modeling artifacts and predicted spatial distribution patterns opposite to those documented for this species.
When predicting on anomalous novel data, the density GAMs exhibited higher predictive performance than the presence/absence GAMs and substantially higher predictive performance than the BRTs. This is likely due to both the different response variables and the different fitting algorithms. Since the density GAMs are predicting absolute abundance, they are better able to respond to changes in the number of animals present in the study area, particularly for species whose distributions shrink but abundance increases (i.e., northern right whale dolphin in 2014). Similar to previous studies (Leathwick et al., 2006;Oppel et al., 2012), we found that BRTs had good explanatory power for most species but were not able to make accurate predictions on novel data, likely due to overfitting. Perhaps a better method of model selection to avoid overfitting could improve the predictive power of BRT models.
While there may be no single best modeling framework for predicting cetacean density or presence/absence, our results have provided an improved understanding of some of the strengths and limitations of both GAMs and BRTs. These findings support the use of both model types for describing species relationships, but suggest that a cautionary approach should be used when applying BRTs to anomalous novel data, and when including spatial terms (latitude, longitude) in the suite of potential predictors. Model ensembles have been shown to be a powerful tool for leveraging the weaknesses and strengths of different model types (Abrahms et al., 2019;Woodman et al., 2019) and may be a useful option for future species distribution modeling work. Continual efforts to evaluate and improve the predictive performance of species distribution models will aid in the conservation and management of cetacean species worldwide.

ACK N OWLED G M ENTS
The California Current Ecosystem survey data used in this analysis were collected by a large dedicated team from NOAA's Southwest Fisheries Science Center. We thank the cruise leaders, marine mammal observers, survey coordinators, ships' officers, and crew who all contributed to collecting these data. Chief Scientists for the ship surveys included Tim Gerrodette, Susan Chivers, and two of the coauthors (JB and KAF). We thank the UCSC Ocean Modeling group for providing ROMS output. The ROMS near-real-time system is supported by NOAA through a grant to the CeNCOOS Regional Association. This project was funded by the Navy, Commander,