Deciphering ecology from statistical artefacts: Competing influence of sample size, prevalence and habitat specialization on species distribution models and how small evaluation datasets can inflate metrics of performance

Sample size and species characteristics, including prevalence and habitat specialization, can influence the predictive performance of species distribution models (SDMs). There is little agreement, however, on which metric of model performance to use. Here, we directly compare AUC and partial ROC as metrics of SDM performance through analyses of the effects of species traits and sample size on model performance.


| INTRODUCTION
Species distribution models (SDMs) tie occurrence data to environmental covariates to predict a species' distribution over geographic space (Araújo & Peterson, 2012). SDMs are used to address conservation and management issues from the potential impacts of climate change on the efficacy of our current reserve systems (Araújo, Cabeza, Thuiller, Hannah, & Williams, 2004; Téllez-Valdés & Dávila-Aranda, 2003) to the spread of invasive species (Bae, Murphy, & García-Berthou, 2018). Given the growing use of SDMs for ecological research and conservation, developing clear guidelines for practitioners is increasingly important.
The quadratic effect of sample size on SDM performance, in which gains in accuracy diminish at larger sample sizes, has led to research on the minimum sufficient sample size with which to build accurate SDMs (Proosdij et al., 2015; Stockwell & Peterson, 2002; Wisz et al., 2008). Estimates of minimum sufficient sample size range from 3 (Proosdij et al., 2015) to 30 (Wisz et al., 2008) and appear to be affected by SDM algorithm (Stockwell & Peterson, 2002; Wisz et al., 2008), species prevalence and landscape complexity (Proosdij et al., 2015).
The negative relationship between species' prevalence and SDM performance is well documented (Guo et al., 2015;Sor et al., 2017).
Species' prevalence on a landscape is driven by a combination of the degree of species' habitat specialization and the degree of landscape heterogeneity. Prevalence and habitat specialization should be correlated. Species that can use diverse habitats are likely to be more prevalent on a landscape. When examined, species habitat specialization is positively correlated with SDM performance (Connor et al., 2018;McPherson & Jetz, 2007). Habitat specialization may be the driver behind the relationship between prevalence and SDM performance. Deciphering the interwoven influences of sample size, prevalence and habitat specialization on SDM performance may influence guidelines for effective use of SDMs.
Partial receiver operating characteristic (partial ROC) has been introduced as an alternative for model evaluation. Researchers can address many of the critiques of AUC by using partial ROC, replacing commission error on the x-axis with the proportion of area in which a species is predicted to be present. Furthermore, through bootstrapping and partial ROC ratios, predictive performance can be compared to random models. AUC ranges from 0 to 1 and can be interpreted as the probability that a randomly chosen occupied site has a higher score than a randomly chosen unoccupied site; an AUC of 0.5 indicates a random model. Partial ROC ratios, on the other hand, range from 0 to 2, with 1 representing a random model, and can be interpreted as model accuracy relative to a random model. Though not commonly done, a similar ratio could be used for AUC. The comparative sensitivity of these two metrics of SDM performance to sample size, species prevalence and species specialization has, to our knowledge, not yet been explored.
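The two metrics described above can be illustrated with a short sketch. The following Python functions are simplified, illustrative implementations, not the software used in the study; the acceptable omission level `e = 0.05` and the trapezoidal integration are our assumptions.

```python
import numpy as np

def auc(pres, absent):
    """AUC as the probability that a randomly chosen presence scores
    higher than a randomly chosen absence (ties count half)."""
    pres, absent = np.asarray(pres, float), np.asarray(absent, float)
    gt = (pres[:, None] > absent[None, :]).sum()
    eq = (pres[:, None] == absent[None, :]).sum()
    return (gt + 0.5 * eq) / (pres.size * absent.size)

def _trapezoid(y, x):
    # manual trapezoidal rule so the sketch works on any NumPy version
    return float(np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2.0))

def partial_roc_ratio(pres, background, e=0.05):
    """Ratio of the area under the (proportion of area predicted present,
    sensitivity) curve to that of a random model, restricted to
    sensitivity >= 1 - e; 1 = random, approaching 2 for perfect models."""
    pres, background = np.asarray(pres, float), np.asarray(background, float)
    thresholds = np.unique(np.concatenate([pres, background]))
    sensitivity = np.array([(pres >= t).mean() for t in thresholds])
    prop_area = np.array([(background >= t).mean() for t in thresholds])
    keep = sensitivity >= 1.0 - e
    order = np.argsort(prop_area[keep])
    x, y = prop_area[keep][order], sensitivity[keep][order]
    if x.size < 2:
        return float("nan")
    null = _trapezoid(x, x)  # random model: sensitivity equals proportion of area
    return _trapezoid(y, x) / null if null > 0 else float("nan")
```

A model whose presences consistently score above the background yields an AUC near 1 and a partial ROC ratio above 1, matching the interpretations given above.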
In this study, we examine the effects of sample size, prevalence and habitat specialization on SDM discrimination, an important aspect of model performance, estimated with AUC and partial ROC.
We use a machine learning algorithm to predict the distributions of 22 species of bird from datasets of systematically reduced sizes. Our primary objective is to evaluate and compare AUC and partial ROC as metrics of model performance for real species on real landscapes.
We calculate AUC and partial ROC with subset and independent evaluation datasets to more comprehensively assess differences in results from the two metrics. Despite being essential to model evaluation, evaluation datasets are rarely explicitly studied. Our secondary objective is to compare the relative influence of sample size, prevalence and habitat specialization on model performance.
Given the abundance of literature on the effects of sample size and prevalence on model performance, the novelty in our study stems from deciphering between the competing effects of prevalence and habitat specialization, and the comparison of AUC and partial ROC as metrics of model performance.

| Study area

Land cover in the study counties (Thorson et al., 2003) is dominated by agricultural lands in the valley and wet coniferous forests in the mountains. Patches of riparian forest, oak forest, oak savanna, and wet and upland prairie speckle the landscape.
| Avian surveys

We conducted five-minute, unlimited-distance stationary avian surveys. We converted counts to presence-absence data in this study to make our results comparable to data from the commonly used eBird database (Sullivan et al., 2009). We selected 22 passerine species representing a wide range of sample size (number of positive occurrences), prevalence (number of occurrences divided by the number of surveys) and habitat specialization within the study area (Table 1). Due to small territory size and frequent auditory cues (singing), passerines are well suited to short-duration stationary visual and auditory surveys.

| Environmental data
We compiled 25 environmental variables previously used to characterize land cover, topography and forest structure in our study area (T. A. Hallman & W. D. Robinson, personal communication) (Table S1.1). Environmental variables were obtained in or converted to 30-m resolution rasters. Since birds can be detected far from point count locations, we used focal statistics in ArcMap to characterize environmental variables at four spatial scales: 165, 315, 615 and 1,215 m radii from cell centres. These radii span a range of scales that have previously been shown to be relevant to avian occurrence (Baladrón, Isacch, Cavalli, & Bó, 2016; Boscolo & Metzger, 2009; T. A. Hallman & W. D. Robinson, personal communication). For all land cover covariates, we calculated percentage cover within each radius. For all forest structure and topographic covariates, we calculated mean values within each radius.
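The focal-statistics step above was performed in ArcMap. Purely as an illustration, a NumPy sketch of a circular focal mean over a 30-m raster might look like the following; the function names and the edge-padding choice are ours, not from the study.

```python
import numpy as np

def focal_mean(raster, radius_m, cell_size_m=30):
    """Mean of all cells whose centres lie within radius_m of each cell
    centre: a rough stand-in for ArcMap's circular focal statistics."""
    arr = np.asarray(raster, float)
    r = int(radius_m // cell_size_m)
    padded = np.pad(arr, r, mode="edge")  # replicate edges beyond the raster
    rows, cols = arr.shape
    out = np.zeros((rows, cols))
    n = 0
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            if (dy * cell_size_m) ** 2 + (dx * cell_size_m) ** 2 <= radius_m ** 2:
                out += padded[r + dy:r + dy + rows, r + dx:r + dx + cols]
                n += 1
    return out / n

def percent_cover(class_raster, class_id, radius_m):
    """Percentage cover of one land-cover class within the radius,
    computed as the focal mean of the class's 0/1 indicator."""
    return 100.0 * focal_mean(np.asarray(class_raster) == class_id, radius_m)
```

Mean values for forest structure and topographic covariates would use `focal_mean` directly on the continuous raster.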

| Systematic reductions in sample size
To create datasets with systematically reduced sample sizes and relatively stable prevalence for all study species, we iteratively halved the full dataset three times (Figure S1.1). Each iteration was a systematic thinning performed by removing every other count.
Where the distance between roadside counts was approximately 0.8 km in the full dataset, the distance between roadside counts was around 1.6 km in the least reduced dataset, 3.2 km in the second most reduced dataset and 6.4 km in the most reduced dataset (Figure S1.2).
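The thinning procedure can be sketched in a few lines of Python (an illustrative analogue, not the original code; the function name is ours):

```python
def systematic_reductions(counts, levels=3):
    """Iteratively halve an ordered sequence of survey locations by
    keeping every other count. Each level roughly doubles the spacing
    between retained counts while leaving prevalence nearly unchanged,
    because presences are removed in proportion to surveys."""
    datasets = [list(counts)]
    for _ in range(levels):
        datasets.append(datasets[-1][::2])  # drop every other count
    return datasets
```

Applied to a transect of 16 ordered counts, three halvings yield datasets of 16, 8, 4 and 2 counts, mirroring the 0.8, 1.6, 3.2 and 6.4 km spacings described above.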

| Species distribution models
We used binomial boosted regression trees (BRTs) for all species distribution models (Figure S1.1). Boosted regression trees frequently perform among the best of the commonly used SDM algorithms (Elith & Graham, 2009; García-Callejas & Araújo, 2016; Guisan, Graham, Elith, & Huettmann, 2007). For all BRTs, tree complexity was 3 (Johnston et al., 2015), bag fraction was held constant at 0.75, and learning rates were optimized to allow for a minimum of 1,000 trees (Elith, Leathwick, & Hastie, 2008). BRTs were constructed through a 10-fold cross-validation method within a multiscale framework using the covariates described above. For each model, we withheld a random 20% of the dataset for model evaluation (herein: subset evaluation data). We ran ten iterations of each model to address the stochasticity inherent in the data splitting process (into training and evaluation datasets) and the BRT algorithm.
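The models were built with the dismo package in R. For readers working in Python, a rough analogue under stated assumptions uses scikit-learn's gradient boosting with the tree complexity and bag fraction reported above; the toy data, learning rate and random seeds below are ours, and dismo's gbm.step additionally tunes the number of trees by cross-validation.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Toy data standing in for occurrences (y) and environmental covariates (X)
X = rng.normal(size=(400, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=400) > 0).astype(int)

# Withhold a random 20% for evaluation, as with the subset evaluation data
X_tr, X_ev, y_tr, y_ev = train_test_split(X, y, test_size=0.2, random_state=1)

# Rough analogue of the BRT settings: tree complexity (max_depth) 3,
# bag fraction (subsample) 0.75, at least 1,000 trees, small learning rate
brt = GradientBoostingClassifier(
    n_estimators=1000, learning_rate=0.005,
    max_depth=3, subsample=0.75, random_state=1,
)
brt.fit(X_tr, y_tr)
suitability = brt.predict_proba(X_ev)[:, 1]  # predicted probability of presence
```

The `suitability` scores on the withheld 20% are what a discrimination metric such as AUC or partial ROC would then be computed from.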

| Model evaluation
We used withheld subset evaluation data and survey data from similar habitats in an adjacent county for model evaluation (herein: independent evaluation data). Though drawn from the same Oregon 2020 dataset, no independent evaluation data were used in model training, and within a species, the same independent evaluation data were used for each iteration. We calculated AUC (Pearce & Ferrier, 2000) and partial ROC. To create confidence intervals and compare partial ROC from our models to random models, we resampled 50% of each dataset with replacement for 500 iterations. We used a Wilcoxon test to statistically compare the AUC ratios from our models to those of random models. We report partial ROC values as partial ROC ratios (compared to random models) throughout. We ran BRTs with the dismo package in R (Hijmans, Phillips, Leathwick, & Elith, 2017).
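The resampling scheme can be sketched as follows. This is an illustrative Python analogue of the bootstrap, not the NicheToolBox implementation; the `metric` callable, function name and seed are our assumptions.

```python
import numpy as np

def bootstrap_metric(pres, background, metric, n_iter=500, frac=0.5, seed=0):
    """Resample 50% of the evaluation data with replacement n_iter times
    and return the metric's bootstrap distribution, from which confidence
    intervals can be taken and compared against a random-model value
    (e.g. a partial ROC ratio of 1)."""
    rng = np.random.default_rng(seed)
    pres, background = np.asarray(pres, float), np.asarray(background, float)
    values = []
    for _ in range(n_iter):
        p = rng.choice(pres, size=max(1, int(frac * pres.size)), replace=True)
        b = rng.choice(background, size=max(1, int(frac * background.size)), replace=True)
        values.append(metric(p, b))
    return np.array(values)

# A 95% interval excluding the random-model value suggests the model
# performs better than random:
# lo, hi = np.percentile(values, [2.5, 97.5])
```

Any discrimination metric taking presence and background scores can be plugged in as `metric`.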

Note: Training and evaluation sample sizes are the mean of ten iterations of the same model in the full and most reduced datasets; habitat specialization from top models is included.

We calculated partial ROC with NicheToolBox (NicheToolBox, 2017). Spatial autocorrelation can inflate metrics of model performance when evaluated with randomly withheld data.
Since subset AUC and partial ROC were calculated on the same subset evaluation data, they are subject to identical spatial autocorrelation, and comparison between metrics should be unaffected (the primary objective of this study). Independent AUC and partial ROC were calculated on independent data that should be free from issues of spatial autocorrelation in training and evaluation datasets.

| Metrics of habitat specialization
To quantify habitat specialization, we calculated indices of species habitat specialization for all study species. To calculate our indices, we created a 300 m buffer around each count location. We used the Tabulate Area tool in ArcMap to calculate the area of each of the six most common classes from the subclass level of the National GAP Project land cover classification database. For each species, we calculated the area of all six land cover classes at sites where the species occurred. From these data, we adopted a series of indices used for resource partitioning (Devictor et al., 2010): the Simpson's Diversity Index, the Shannon Index, the Simpson's Diversity Index weighted by habitat availability (herein: Weighted Simpson's Diversity Index) and a similarly weighted index of diversity proposed by Smith (1982). We also calculated the coefficient of variation of the proportion of available habitat occupied (herein: CV of Proportional Habitat) by the species for each of the six land cover classes. Like the weighted indices above, the CV of Proportional Habitat is inherently weighted by habitat availability and accounts for differences in overall species prevalence.

| Mixed-effects models of prevalence, sample size and habitat specialization
A species' true prevalence on the landscape is an unknown latent variable. In this study, we used sample prevalence as a surrogate for landscape prevalence. We calculated sample prevalence as the sample size divided by the total number of surveys within a single dataset (e.g. full or reduced datasets). Within a single dataset, species prevalence and sample size were therefore inextricably linked, and their effects on SDM performance could not be evaluated independently. Through the systematic reductions in dataset size, however, sample size was manipulated for each species with little effect on prevalence (see below).
We used mixed-effects models to evaluate the influence of species prevalence, sample size and habitat specialization on model performance. To do so, we used data from all full-and reduced-dataset SDMs of all study species within the same mixed-effects models.
We ran separate analyses with AUC and partial ROC as dependent variables. Furthermore, we ran separate analyses for model performance as estimated with subset and independent evaluation datasets. We scaled and centred all variables included in mixed-effects models. For all mixed-effects models, we treated species as a random effect. Before incorporating prevalence and sample size, we ran mixed-effects models that included each of our five indices of habitat specialization. We used the habitat specialization index included in the BIC top-performing model as the single metric of habitat specialization included in the subsequent model set. We then built a mixed-effects model set with habitat specialization, sample size and prevalence as independent variables. We also examined quadratic effects of sample size and prevalence. We included single- and additive multivariable models as well as models with an interaction effect of prevalence and sample size. We compared models through BIC.
The model with the lowest BIC was considered the best model, and models within four BIC were considered competitive. We conducted this analysis with both subset evaluation data and independent evaluation data.
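The BIC comparison can be sketched as follows. This is an illustrative Python implementation; computing model weights from BIC differences, analogous to Akaike weights, is our assumption of how "model weight" was obtained.

```python
import numpy as np

def bic(loglik, n_params, n_obs):
    """Schwarz's Bayesian information criterion: k*ln(n) - 2*ln(L)."""
    return n_params * np.log(n_obs) - 2.0 * loglik

def bic_weights(bics):
    """Relative model weights from BIC differences; the best (lowest-BIC)
    model gets the largest weight."""
    delta = np.asarray(bics, float) - np.min(bics)
    w = np.exp(-0.5 * delta)
    return w / w.sum()

# Example with three hypothetical candidate models: the best model,
# a competitive model (within 4 BIC) and a non-competitive model
bics = np.array([100.0, 104.0, 112.0])
weights = bic_weights(bics)
competitive = bics - bics.min() <= 4.0
```

With these hypothetical values, the first two models are flagged competitive and the first carries the largest weight.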
To better understand the relationship between AUC and partial ROC, we ran regressions between the two for metrics calculated with both the subset evaluation and the independent evaluation data. To examine consistency of test metrics based on source of evaluation data, we ran a regression of AUC calculated with the subset evaluation data on AUC calculated with the independent evaluation data. Likewise, we ran a regression of partial ROC from each of our two sources of evaluation data. We ran these linear regressions on the mean values of the ten iterations of each SDM to avoid excessive pseudoreplication. To better visualize the effects of sample size on each metric of model performance, we calculated the percentage full AUC and percentage full partial ROC from models from each of the reduced datasets. Finally, we ran regressions to examine the relationship between prevalence and habitat specialization in the full dataset.

| Evaluation dataset size
Subset evaluation datasets decreased in size with dataset thinning.

| Prevalence, sample size and specialization
In the full dataset, mean species prevalence ranged from 0.07 to 0.49 while mean species sample size ranged from 193 to 1,430 (Table 1). In the most reduced dataset, prevalence across species ranged from 0.06 to 0.51 while sample size ranged from 21 to 185.
The mean within-species standard deviation of prevalence across full and reduced datasets was 0.009, indicating negligible changes in prevalence with reductions in dataset size. Species habitat specialization ranged from highly generalist to highly specialist values in each of the indices (Tables 1 and S1.2).

| AUC
Across datasets, mean AUCs of SDMs for a species calculated with the subset evaluation data ranged from 0.67 to 0.93 with an across-species mean of 0.81 (Table S1.3). Within habitat specialization mixed-effects models, the model with CV of Proportional Habitat was over four BIC better than the next best model and had over 83% of the model weight (Table S1.4). In the subsequent mixed-effects models comparing the effects of prevalence, sample size and habitat specialization, the model including specialization and a quadratic effect of occurrence was over four BIC better than the next best model and had over 96% of the model weight (Table 2).
Patterns were similar in the AUCs calculated with the independent evaluation data. Across species, AUCs ranged from 0.62 to 0.92 with a mean of 0.78 (Table S1.3). In the mixed-effects models comparing the effects of prevalence, sample size and habitat specialization, the model including specialization and a quadratic effect of occurrence was over four BIC better than the next best model and had over 86% of the model weight.

| Partial ROC
Across species, partial ROCs of SDMs calculated with the subset evaluation data ranged from 1.05 to 1.58 with a mean of 1.27 (Table S1.3).
Within habitat specialization mixed-effects models, the model with Weighted Simpson's Diversity Index was over two BIC better than the next best model and had over 77% of the model weight (Table S1.5).
In the subsequent mixed-effects models comparing the effects of prevalence, sample size and habitat specialization, the model including specialization, prevalence and a quadratic effect of occurrence was over four BIC better than the next best model and had over 88% of the model weight (p-value = .0075). The Weighted Simpson's Diversity Index is an index in which lower values indicate greater habitat specialization. The negative slope of its effect on partial ROC therefore indicates greater model performance with greater habitat specialization. Aside from habitat specialization, these coefficients contradict those found for AUC.
Patterns in partial ROC calculated with the independent evaluation data contrasted starkly with partial ROC calculated with subset evaluation data. Across species, partial ROCs ranged from 1.08 to 1.52 with a mean of 1.25 (Table S1.3). In the mixed-effects models comparing the effects of prevalence, sample size and habitat specialization, the model including specialization, prevalence and a quadratic effect of occurrence was over two BIC better than the next best model and had over 59% of the model weight.

| AUC and partial ROC comparison
There was a strong positive correlation between AUCs calculated with subset and independent evaluation data (R² = .501, p-value < .0001; Figure 1a). The correlation between partial ROCs calculated with subset and independent evaluation data was much weaker (R² = .056, p-value = .0149; Figure 1b). There was a weak correlation between AUCs and partial ROCs calculated with the subset evaluation data (R² = .151, p-value = .0001; Figure 2). In contrast, there was a strong correlation between AUCs and partial ROCs calculated with the independent evaluation data (R² = .790, p-value < .0001). There was a significant correlation between CV of Proportional Habitat and prevalence (R² = .212, p-value = .0179; Figure 3a), and between Weighted Simpson's Diversity Index and prevalence (R² = .316, p-value = .0038; Figure 3b). This relationship was nearly identical when model performance was calculated with subset or independent evaluation data.

| Evaluation dataset size
We examined evaluation dataset size as an explanation for the inconsistency of the results in partial ROC calculated with the subset evaluation data compared to all other results in this study. In the full dataset, evaluation dataset size in the subset evaluation data ranged from 26 to 304. In the most reduced dataset, evaluation dataset size in the subset evaluation data ranged from 5 to 40. Evaluation dataset size in the independent evaluation data was not subject to reductions in dataset size. As we iteratively increased the minimum evaluation dataset size, the effect of sample size on partial ROC became positive (Figure 4a), the quadratic term for sample size became negative (Figure 4b), and the effect of prevalence became negative (Figure 4c). The effect of habitat specialization decreased, which may be due to the exclusion of habitat specialists with sample sizes below the threshold (Figure 4d).

TABLE 4: BIC mixed-effects model set comparing the effects of habitat specialization, sample size and prevalence on partial ROC of species distribution models from 22 study species, calculated with subset evaluation data.

| DISCUSSION
We used a rigorous study design to compare AUC and partial ROC as metrics of SDM performance. We used subset and independent evaluation data to investigate the impacts of evaluation dataset on both metrics. Although our primary objective was to compare metrics of SDM performance, we did so by examining the comparative influence of prevalence, sample size and habitat specialization on performance estimated by each of these metrics. We used 22 species of bird that reflected a range of prevalence, habitat specialization and sample size in our analyses. Birds are "easily" detected during the breeding season, but reflect the messy reality of biodiversity surveys; many individuals and species go undetected even when present. We see this noise as an essential element in studying the effects of species and dataset characteristics on SDM performance.

| Model performance measured by AUC
Model performance for AUC compared well with values reported in the literature (Illán et al., 2014; Shirley et al., 2013). We found good agreement on the effects of sample size and habitat specialization on model performance measured by AUC calculated with subset and independent evaluation data. Similar to past studies, increases in sample size increased model performance, with a stronger effect at small sample sizes (Bean et al., 2012; McPherson & Jetz, 2007; Proosdij et al., 2015) (Figure 5a,b). Model performance was also higher in more specialized species (Connor et al., 2018; McPherson & Jetz, 2007; Soultan & Safi, 2017) (Figure 6a). Overall model performance and the strength of each effect were less pronounced in the independent evaluation data, where predictions are an extrapolation to a neighbouring landscape. These extrapolations, however, were geographic in nature, with both landscapes occupying similar environmental space (Franklin, 2013).
Prevalence appeared to have little effect on model performance as measured by AUC. As landscape prevalence is an unknown latent variable, we used sample prevalence as a surrogate measure. Given that our surveys were conducted periodically throughout the landscape, it is reasonable to assume that sample prevalence would be positively correlated with landscape prevalence. That said, we conducted more counts in targeted rarer habitats within national wildlife refuges. These surveys could add noise to the sample prevalence-landscape prevalence relationship and could ultimately diminish the perceived effect of prevalence on model performance in our models (Meynard & Kaplan, 2012; Ruiz-Sánchez et al., 2015; Yañez-Arenas et al., 2018).

FIGURE 1: Scatter plots of (a) AUC and (b) partial ROC from species distribution models calculated with subset and independent evaluation data. Colours represent the full and reduced datasets from dataset thinning. The black line represents a one-to-one ratio. Values below the line indicate higher model performance calculated with the subset evaluation data, while values above the line indicate higher model performance calculated with independent evaluation data.

FIGURE 2: Scatter plots and regressions of AUC and partial ROC from species distribution models calculated with subset (blue) and independent (yellow) evaluation data. There was a weak correlation between AUCs and partial ROCs calculated with the subset evaluation data (R² = .151, p-value = .0001) and a strong correlation between AUCs and partial ROCs calculated with the independent evaluation data (R² = .790, p-value < .0001).

| Model performance measured by partial ROC
Calculated with the independent evaluation data, the effects of sample size (Figure 5c,d) and habitat specialization (Figure 6b) on partial ROC matched results for AUC and literature expectations (Bean et al., 2012;Soultan & Safi, 2017). Unlike results for AUC, there was also a weak negative effect of prevalence on partial ROC.
There may be two explanations for the weak strength of the effect of prevalence on model performance compared to the literature (Guo et al., 2015; Sor et al., 2017). First, as discussed above, our substitution of sample prevalence for landscape prevalence may have introduced noise that obscured the effect. As we used real species on real landscapes, we do not know true landscape prevalence. Second, habitat specialization and prevalence are correlated (Figure 3a,b); specialization may therefore have absorbed some of the variation otherwise attributed to prevalence.

Calculated with the subset evaluation data, the effects of prevalence and sample size on model performance measured by partial ROC strongly contradicted our results for partial ROC calculated with independent evaluation data, our results for AUC and our expectations from the literature (Bean et al., 2012; Guo et al., 2015; McPherson & Jetz, 2007; Proosdij et al., 2015; Sor et al., 2017).
Model performance decreased with increasing sample size and increased with increasing prevalence (Figure 5c,d). These contradictory results appear to stem from the small evaluation datasets created when datasets were split into training and evaluation datasets. In some cases, evaluation datasets in this study had as few as five occurrences. This may seem unreasonably small, but when studies suggest that SDMs can be run on data with as few as three to ten occurrences (Hernandez et al., 2006; Proosdij et al., 2015; Stockwell & Peterson, 2002), the question becomes "what size would the related evaluation data be?" As we iteratively increased the minimum evaluation dataset size, these contradictions diminished (Figure 4). Ten-fold cross-validation (Elith et al., 2008) attempts to skirt this issue by withholding ten per cent of the model data and using those withheld data to calculate model performance in ten separate folds (i.e. iterations), then averaging across folds. In this way, all data are used for model building and the evaluation data in each fold are independent of the training data. When compared to data that have been withheld overall, however, metrics of model performance are again inflated. For rarer species with small sample sizes, it may therefore be best to avoid the use of metrics that are artificially inflated by the sample size in evaluation datasets. In this study, AUC appeared much less sensitive to small sample sizes in evaluation datasets than partial ROC. Importantly, since these results are based on how partial ROC is calculated, we believe they are highly generalizable to other taxonomic groups and geographic locations.
That said, more research is needed before the generalizability of our results can truly be assessed.

| CONCLUSIONS
Much research has been conducted to evaluate the effect of sample size on the performance of SDMs (Bean et al., 2012; A. Jiménez-Valverde, Lobo, & Hortal, 2009; Mitchell, Monk, & Laurenson, 2016). Due to the cost of surveys and the importance of SDMs to conservation of rare species, identifying minimum sufficient sample sizes for SDMs has been of particular interest (Proosdij et al., 2015; Stockwell & Peterson, 2002; Wisz et al., 2008). Few studies, however, specifically investigate the effect of evaluation dataset sample size on SDM assessment. Though accurate SDMs can be created with small sample sizes in the training dataset, small sample sizes in the evaluation dataset can lead to inaccurate assessment of model performance. Specifically, we found that small sample sizes in the evaluation dataset can inflate partial ROC. At small evaluation dataset sample sizes, partial ROC may therefore be an inappropriate metric of model performance.
AUC appears to be less sensitive to sample size in evaluation datasets, though as noted elsewhere, low sensitivity does not always indicate greater accuracy (Fernandes et al., 2018). Without critical inspection of performance metrics, studies with small evaluation datasets should be wary of seemingly high-performing SDMs.

ACKNOWLEDGEMENTS
We appreciate assistance with surveys from R. Moore, database work and maintenance from R. DeMoss and data entry from the Robinson Lab. The Robinson graduate laboratory provided helpful comments and discussion. The work was supported by the generous endowment of the Bob and Phyllis Mace Watchable Wildlife Professorship.

DATA AVAILABILITY STATEMENT
All survey data are available in the eBird database (eBird.org).