- Top of page
- Materials and methods
Species distribution models should provide conservation practioners with estimates of the spatial distributions of species requiring attention. These species are often rare and have limited known occurrences, posing challenges for creating accurate species distribution models. We tested four modeling methods (Bioclim, Domain, GARP, and Maxent) across 18 species with different levels of ecological specialization using six different sample size treatments and three different evaluation measures. Our assessment revealed that Maxent was the most capable of the four modeling methods in producing useful results with sample sizes as small as 5, 10 and 25 occurrences. The other methods compensated reasonably well (Domain and GARP) to poorly (Bioclim) when presented with datasets of small sample sizes. We show that multiple evaluation measures are necessary to determine accuracy of models produced with presence-only data. Further, we found that accuracy of models is greater for species with small geographic ranges and limited environmental tolerance, ecological characteristics of many rare species. Our results indicate that reasonable models can be made for some rare species, a result that should encourage conservationists to add distribution modeling to their toolbox.
Effective conservation plans require accurate estimates of the spatial distributions of the species they are trying to protect. With such information conservationists can predict how a species’ distribution will respond to landscape alteration and environmental (climate) change. Species distribution modeling can provide a measure of a species’ occupancy potential in areas not covered by biological surveys and consequently is becoming an indispensable tool to conservation planning (Guisan and Zimmermann 2000, Corsi et al. 2000, Elith and Burgman 2003, Loiselle et al. 2003). These models combine points of known occurrence with spatially continuous environmental layers to infer ecological requirements of a species, generally using a statistical algorithm. The geographic distribution of a species is then predicted by mapping the area where these environmental requirements are met (Elith et al. 2006). Depending on data quality and the application at hand, these models can assist in identifying previously unknown populations, determining sites of high candidacy for reintroductions, guiding additional surveys, and informing selection and management of protected areas (Graham et al. 2004).
Of particular interest to conservation biologists are rare species. By definition, rare species have sparse and/or restricted spatial distribution patterns (Rabinowitz et al. 1986, Kattan 1992, Gaston 1997), which often means they are habitat specialists and that there is a limited number of sites of known occurrence. Species ecological characteristics (i.e. range sizes and ecological specialization) have been shown to influence model performance (Segurado and Araujo 2004, Brotons et al. 2004, McPherson et al. 2004, Elith et al. 2006). Generally, models for species with broad geographic ranges and environmental tolerances tend to be less accurate than those for species with smaller geographic ranges and limited environmental tolerance (Manel et al. 2001, Boone and Krohn 2002, Kadmon et al. 2003, Thuiller et al. 2004, Luoto et al. 2005, Elith et al. 2006).
Small sample sizes pose challenges to any statistical analyses and result in decreased predictive potential when compared to models developed with more occurrences (Stockwell and Peterson 2002, McPherson et al. 2004). As sample size increases, accuracy should also increase until achieving its maximum accuracy potential thereby reaching an asymptote. The maximum accuracy potential and the sample size at which the asymptote is reached will depend on the study area and species, the quality and spatial resolution of the environmental and species occurrence data used to develop the model, and the modeling method itself. While many previous researchers have investigated the effect of sample size on model accuracy, most have not explicitly manipulated sample size by species (Segurado and Araujo 2004, Brotons et al. 2004), making it difficult to evaluate the effects of species ecological characteristics versus sample size on model accuracy. For instance, if a species has a limited range it is likely that proportionally more of its environmental space is sampled with fewer points, than a species with a large range. Two studies, Stockwell and Peterson (2002) and McPherson et al. (2004) manipulated sample size and determined that models built with fewer points were generally less accurate, however the latter used sample sizes (50, 100, 300, and 500) much larger than those typically available for species of conservation concern. To further explore the variation among sample size and model performance, we generated models for 18 California taxa with varying degrees of habitat specialization using four modeling methods and a variety of sample sizes characteristic of rare species (n=5, 10, 25, 50, 75, or 100).
The potential for a predictive species’ distribution model to aid conservation planning will depend on the model's ability to accurately depict the species’ occupancy potential in the geographic region in question. Evaluating these models with presence/absence data is challenging (Fielding and Bell 1997, Pearce and Ferrier 2000, Manel et al. 2001, Elith and Burgman 2003). This task is further complicated when only presence occurrence data are available because it is very difficult to evaluate false-positive prediction (commission) errors. Given this limitation, many modelers opt to evaluate only omission errors and ignore commission errors. This is insufficient because a model that has no omission errors can also have high commission errors because as omission errors decrease, commission errors tend to increase and vice versa (Fielding and Bell 1997). Evaluating a model for omission alone will fail to identify models that balance commission and omission errors (or weights one slightly above the other depending on which is considered by the modeler to be a more serious problem) and therefore fail to identify models with the highest predictive ability. Given these complexities, research on model evaluation for presence only modeling is vital. Here we take a multifaceted approach using three different evaluation approaches that vary in how they measure omission and commission errors. To evaluate overall model fit we use 1) receiver operating curves (ROC), which are threshold independent and include both omission and commission error. 2) We use predictive success to measure omission error while 3) the spatial comparison of predictions by models generated with the full species occurrence data to those of models generated with fewer observations provides an assessment of both commission error and model stability. This final measure is essential for evaluating which modeling method performs well with incomplete species occurrence data sets, a common condition when modeling rare species distributions.
Numerous species distribution modeling methods exist, each unique with regard to their data requirements, statistical methods and overall ease of use (Guisan and Zimmermann 2000, Elith and Burgman 2003, Elith et al. 2006). These different modeling methods can produce clearly different geographic predictions and therefore resultant conservation strategies, even when using the same data (Loiselle et al. 2003). We chose four modeling methods potentially useful to conservation planning, Bioclim (Nix 1986), Domain (Carpenter et al. 1993), GARP (Stockwell and Peters 1999), and Maxent (Phillips et al. 2004, 2006). These methods were chosen because they are easy to use, batchable, produced useful predictions in other research (Lindenmayer et al. 1991, Gillison 1997, Anderson and Martínez-Meyer 2004, Phillips et al. 2004), and do not require an explicit quantification of absence to formulate a predicted distribution model. Further, they varied in predictive performance in a recent comprehensive model study (Elith et al. 2006) in which Bioclim performed relatively poorly, Domain and GARP had intermediate performance, and Maxent performed very well.
In sum, we test models that use presence occurrence data, which is of great utility because the vast majority of biotic data available to modelers are presence-only. We manipulate sample sizes to include those very small samples typical of rare species and quantify ecological characteristics of species to evaluate the relative influence of both species ecology and sample size on model performance. Further, we use a suite of evaluation procedures that complement each other to obtain as complete a picture as possible of the predictive ability of each model. Our comprehensive analyses should stimulate continued use of distributional modeling in conservation management, particularly when only limited occurrence data are available.
- Top of page
- Materials and methods
The distribution of AUC values for the four modeling methods for each of the six sample size categories are represented in the box plots of Fig. 2. AUC was generally highest for the 100-sample size models for all methods except GARP where models generated with a sample size of 50 performed very well. AUC generally scores increased with sample size for all modeling methods. The AUC values reached near maximal levels at 75 observations for Bioclim; 50 for both Domain and Maxent; 10 for GARP whereby they remained relatively consistent for that sample size category and all categories above it. In some instances the AUC scores decreased at larger sample sizes but for the most part these changes were insignificant. Overall, Domain and Maxent achieved the highest AUC values followed by GARP and then Bioclim across most sample sizes. The 18 taxa modeled displayed considerable variation in model performance as evaluated by AUC. This result was mirrored in the other two evaluation methods, which similarly had a large spread of evaluation measure values.
Figure 2. Box plot displaying the interquantile range and outliers around the median AUC values of ROC plots for each modeling method by sample size category.
Download figure to PowerPoint
Similar box plots for prediction success using thresholds identified using the ROC curve are presented in Fig. 3. Again prediction success increased with sample size though it did not reach a maximum achievable value except possibly for GARP and Maxent with data sets of 75 samples. At samples sizes between 5 and 25, the method performance from highest to lowest was: Maxent, GARP, Bioclim, and Domain. This order changed for sample size categories between 50 and 100 to be GARP, Maxent, Domain, and Bioclim, though differences between the performances of the four modeling methods were more marked at the smaller sample sizes. The average range in values for prediction success between models built with 5 and 100 occurrences was smallest for Maxent (sample size 5 mean: 61.2; sample size 100 mean: 90.9), followed by GARP (sample size 5 mean: 32.1; sample size 100 mean: 95.6), Bioclim (sample size 5 mean: 11.6; sample size 100 mean: 85.4) and then Domain (sample size 5 mean: 8.5; sample size 100 mean: 92.3).
Figure 3. Box plot displaying the interquantile range and outliers around the median prediction success values for each modeling method by sample size category.
Download figure to PowerPoint
These values should be considered in conjunction with the area predicted present as a percentage of the total study area (Fig. 4). Given that we are evaluating the models in a presence-only framework we should be concerned with minimizing the potential of gross commission errors. Therefore when absence data is not available to quantitatively evaluate commission errors a good model could be considered one that achieves low omission errors (i.e. high prediction success) while still generating the most parsimonious model with regard to the total area (i.e. number of 1 km2 cells) predicted positive. In general, Domain and Bioclim predicted the smallest area, followed by Maxent and then GARP, though this relationship varied with sample size. The area predicted as suitable for a given species by Maxent remained fairly level at sample sizes of 25 and above while the other methods predicted more area with increasing sample sizes. The most noticeable difference among methods across samples sizes was the placement of Maxent, which predicted the largest area of all modeling methods at sample size 5 (mean: 8.0) and the smallest area using 100 occurrences (mean: 13.2) and therefore had the smallest range of average area predicted of all modeling methods. Bioclim followed with the next smallest range of average area predicted (sample size 5 mean: 0.8; sample size 100 mean: 14.7), followed by Domain (sample size 5 mean: 0.4; sample size 100 mean: 14.7), and the range was largest for GARP (sample size 5 mean: 3.0; sample size 100 mean: 21.6). At the 100 sample size category GARP predicted on average 64 percent more area than Maxent at the same sample size category.
Figure 4. Box plot displaying the interquantile range and outliers around the median percent area predicted present for each modeling method by sample size category.
Download figure to PowerPoint
The spatial comparisons of each replicate to its full 150-sample size model summarized using kappa, are displayed in the box plots of Fig. 5. The kappa coefficient increased with larger sample size categories for all modeling methods and an asymptote was not reached for any method, suggesting that maximal concordance was not achieved and will likely continue to increase with sample size until the full data set is used. The order of the four modeling methods level of concordance as evaluated by the kappa coefficient again differed by sample size category, but Maxent achieved the highest values for every size category.
Figure 5. Box plot displaying the interquantile range and outliers around the median Kappa for each modeling method by sample size category when considering the full 150 model as the observed.
Download figure to PowerPoint
Correlations among the three ecological characteristics, marginality, tolerance and distributional spatial extent were as follows: extent and tolerance were positively related (Spearman's R=0.74, p<0.05); extent and marginality were negatively correlated, though not significantly (Spearman's R=−0.34); and tolerance and marginality were negatively correlated (Spearman's R=0.57, p<0.05). Graphical representation of the relationship between the taxa's marginality, tolerance and distributional spatial extent, and AUC for models generated with sample sizes of 100 occurrences are presented in Fig. 6. These relationships were consistent across a range of sample sizes (see Table 3 for sample sizes 5, 50, and 100), and therefore only data for the largest sample size are displayed graphically and a subset in Table 3. Marginality was positively related to AUC (Fig. 6a, Table 3). Generally as tolerance (breadth of environmental space used) increased, model predictive accuracy decreased (Fig. 6b, Table 3). Likewise, as taxa spatial distributional extent increased, model prediction accuracy decreased (Fig. 6c, Table 3), though Spearman's R was generally higher for tolerance than for spatial distributional extent. Similar results were obtained for the relationship between the three ecological characteristics and predicted success (results not shown), although the response was not as strong as for AUC.
Figure 6. Mean AUC values for models built with a sample size of 100 occurrences for each taxon and modeling method (n= 10). Taxa are sorted in ascending order by (a) marginality (b) tolerance and (c) estimated spatial extent in California.
Download figure to PowerPoint
Table 3. Spearman's R correlations between model performance as measured by AUC and species characteristics for models built with 3 different sample sizes; 5, 50 and 100 samples.
- Top of page
- Materials and methods
Model accuracy increased with larger sample sizes for all modeling methods across the 18 California taxa tested. Nonetheless, useful models were produced with as few as 5–10 positive observations, and models produced with 50 observations were similar to those created with twice as many locations. This result, along with the indication that ecologically specialized species are easier to model than wide ranging species, is especially encouraging for modeling rare species. Also, given that distribution modeling can be used for a variety of different objectives ranging from guiding future exploration of a species range to creating an accurate model for conservation planning, models built with few points, while not as accurate as those built with large datasets and potentially not appropriate for all applications, are still useful. Our results increase the relevance of data housed in museum or herbarium collections, or similar databases such as those maintained by NatureServe and its network of natural heritage programs (Stein et al. 2000, Graham et al. 2004). As such occurrence databases become more widely available, thereby making species distribution modeling more accessible to conservation planners, research such as that presented in this paper is imperative to guide modelers.
Maxent had the strongest performance of the methods tested here because it performed well and remained fairly stable in both prediction accuracy and the total area predicted present across all sample size categories. Further, it often had the highest accuracy and spatial concordance, especially for the two smallest sample size categories. These results indicate that Maxent can somewhat compensate for incomplete, small species occurrence data sets and perform near maximal accuracy level in these conditions. The success of Maxent is likely due to its regularization procedure that counteracts a tendency to over-fit models when using few species occurrences (Phillips et al. 2006). Our results support those obtained by Elith et al. (2006) who also found that Maxent was one of the strongest performers in a large model comparison study.
In contrast, Bioclim does not appear to be capable of maximizing its accuracy potential with small sample sizes and did not perform as well as the other modeling methods using the datasets of larger samples. For these reasons we would not suggest its use when modeling with small numbers of species observations. It is interesting to remark that Bioclim did attain relatively high concordance at the larger sample size categories when spatially compared to models generated with all 150-occurrences. This result is not surprising given that the Bioclim algorithm does not extrapolate beyond the bounds of the environmental conditions at known locations of occurrence. As additional observations are included in the development of the Bioclim model, the envelope defining the environmental conditions at known occurrences will by default expand from defining a small portion of the species’ full environmental envelope (here developed with all 150-occurrences) towards defining a larger portion of that full envelope.
In some respects Domain and GARP performed fairly similarly, achieving relatively high prediction accuracy values at large sample size categories with low prediction success and spatial concordance at small sample sizes. However, the two evaluation measures of prediction accuracy (AUC and prediction success) revealed conflicting assessments of GARP's performance. The AUC evaluation indicated that GARP models reached near maximum accuracy (ca 10% lower AUC than the full model AUC) when using sample sizes of 10 observations, a result that is supported by Stockwell and Peterson (2002), but when evaluating the models using prediction success, the maximum accuracy was reached at the 75-sample size category. The inconsistent assessment of model prediction accuracy supports the necessity to evaluate models with multiple evaluation metrics. Specifically, metrics such as prediction success should be reviewed with caution when not accompanied with total suitable area predicted. GARP had the largest prediction success values for the larger sample size categories but predicted a spatial area that far exceeds the other three modeling methods, thereby increasing its chances of correctly classifying positive occurrences but as a result most likely increasing its commission error rate as well. Manel et al. (2001) also cautioned against basing evaluations of presence-absence model performance solely on prediction success.
All things considered we interpret our results to indicate that overall Domain performed better than GARP because Domain had higher AUC values for all sample size categories. The AUC of a ROC plot generated with presence and background data evaluates a model based on its prediction success but also penalizes it for predicting proportionately larger spatial areas (Phillips et al. 2006), thereby evaluating both omission and commission errors simultaneously. Hence, the low AUC values obtained by GARP at low sample size categories likely reflect commission error, while the Domain models appear to have less commission error. Further, the threshold selection strategy most likely artificially decreased Domain's prediction potential for the smaller sample size categories as assessed by both the prediction success and spatial comparison evaluations. Since Domain derives a point-to-point distance for each pixel based on its proximity in environmental space to the most similar occurrence, it follows that when models are built with fewer occurrences this distance will be greater.
We confirm the results of other researchers that the ecological characteristics of model species affect model accuracy potential, where species widespread in both geographic and environmental space are generally more difficult to model than species with compact spatial distributions (Araujo and Williams 2000, Stockwell and Peterson 2002, Thuiller et al. 2003, Segurado and Araujo 2004). In all but one case (Table 3), significant relationships existed between model performance (AUC) and spatial extent of a species distribution, tolerance, and marginality. These relationships were consistent across all sample size treatments indicating that the ability to model species effectively is strongly influenced by species ecological characteristics independent of sample size. Tolerance generally had the highest correlation with AUC, indicating that environmental space occupied by a given species might be a better measure than geographic space occupied, although the kernel density estimator used to estimate the species’ spatial ranges likely overestimated the area occupied for some species creating artificial outliers. In particular the range size estimator undoubtedly overestimated the spatial range of the monarch butterfly (Danaus plexíppus, range size estimated as 105 874 km2, tolerance 1.64) a mostly coastal and patchily distributed wintering butterfly species.
In this study the species with the smallest geographic extent of occurrence and very low tolerance (small niche breadth), the California gnatcatcher Polioptila californica generally had the highest AUC and prediction success values, whereas the opposite was found for the western pond turtle Clemmys marmorata, which has the widest geographic range of the study species within California and a very high ecological tolerance. Our results provide support for the explanation offered by Stockwell and Peterson (2002) that local ecological adaptation by sub-populations is more likely to occur for widely distributed species resulting in different habitat preferences in discrete parts of the species’ range. In climatic modeling each sub-population would have a distinct climatic range in which it occurs and therefore when the species is modeled as a whole over its entire geographic range, the total climatic range encompasses climatic conditions not suitable for occupancy, thereby overestimating the species’ ecological climatic breadth. The fact that the models for the two species in this study that have considerable taxonomic confusion regarding their Californian distributions, the western pond turtle and the coast horned lizard Phrynosoma coronatum, performed poorly provides support to notion that local ecological adaptation results in a decrease in model accuracy. It would be interesting to partition the data for these two species based on the geographic boundaries of the proposed subspecies or genetic lineages to determine whether the resulting models do indeed result in an increase in model accuracy.
Other possible explanations for variation in model performance not related to geographic range size or ecological niche breadth are that some species are just not suited for climatic modeling and/or the spatial grain (pixel resolution) was inappropriate for modeling some taxa's distribution in the geographic study area of California. Models for the red-legged frog Rana aurora would likely have benefited from the inclusion of a variable describing the distribution pattern of introduced bullfrogs and the western pond turtle models may have been improved with a description of the amount of wetlands present within an area surrounding an occurrence. These are examples of cases where the climatic variables may be insufficient to model the species’ distribution and where important variables that either positively or negatively contributed to the observed spatial distribution pattern are missing from model formulation, thereby resulting in relatively poor predictive distribution models.
Of course, as with any comparative modeling method exercise, these results may differ in a new study area, at a different spatial scale (extent and/or grain), with varying qualities of model data (species and environmental), and for study species of different ecological characteristics. Our results clearly indicate that future studies should use multiple evaluation measures, because each measure provides only a portion of the elusive “truth” of the predictive ability of a species distribution model. Further, while we found that reasonable models could be generated with low sample sizes, we sub-sampled from a larger set of data and presumably obtained a relatively representative, albeit small, number of points. However, if decreasing sample size increases bias, which may be the case with data collected in an ad-hoc fashion, then models built with small samples may be quite poor. The fact that we could develop models with small samples for some species does not mean this will be possible for all species.
In general, practitioners should remember that models are simply an estimate of a species’ potential distribution. Species distribution modeling cannot replace fieldwork intended to collect more distributional data but can be a useful tool for data exploration to help identify potential knowledge gaps and provide direction to fieldwork design (Engler et al. 2004). By carefully evaluating models and including both species characteristics and sample size in our analyses our results indicate considerable promise for modeling rare species. This result should encourage conservation practitioners to explore the use of distribution modeling across a variety of applications.
Subject Editor: Miguel Araújo.