Ecological niche models represent key tools in biogeography but the effects of biased sampling hinder their use. Here, we address the utility of two forms of filtering the calibration data set (geographic and environmental) to reduce the effects of sampling bias. To do so we created a virtual species, projected its niche to the Iberian Peninsula and took samples from its binary geographic distribution using several biases. We then built models for various sample sizes after applying each of the filtering approaches. While geographic filtering did not improve discriminatory ability (and sometimes worsened it), environmental filtering consistently led to better models. Models made with few but climatically filtered points performed better than those made with many unfiltered (biased) points. Future research should address additional factors such as the complexity of the species’ niche, strength of filtering, and ability to predict suitability (rather than focus purely on discrimination).
Ecological niche models (ENMs) – also called species distribution models or climatic envelope models (Araújo and Peterson 2012) – have been widely used to predict the potential geographic ranges of species. Publications using ENMs have doubled in the last 5 yr, while citations of those papers have increased over 5000% (data: ISI Web of Knowledge; search: ‘ecological niche model’; May 2013). Out of this boom, a variety of specific software tools and new ecological models have appeared (Elith et al. 2010, 2011).
Researchers have focused their attention on detecting and analyzing the differences in predictions of these diverse models (Elith and Graham 2009, Lobo et al. 2010, Ashcroft et al. 2011, Lobo and Tognelli 2011, Nenzén and Araújo 2011). Different numerical tools and evaluation strategies have been analyzed, and there is an ongoing debate about model-prediction accuracy (Peterson et al. 2011, Anderson 2012). Here we aim to contribute to the current improvement in model predictions by applying a new perspective focused on the problems associated with sampling bias, which can hinder the production of high-quality models (Wintle et al. 2005, Araújo and Guisan 2006, Anderson and Gonzalez Jr 2011).
Biodiversity or citizen science databases offer the possibility of using thousands of species records to calibrate models and map species distributions (Guralnick et al. 2007). However, data contained in these databases are highly heterogeneous. Distribution data sets include information from museums, herbaria, university databases or amateur field work, and usually compile hundreds of different surveys, each one designed with a different goal. As a consequence, they accumulate taxonomic and geographic sampling biases, which often result in environmental biases as well (Hortal et al. 2008, Boakes et al. 2010, Newbold 2010). For instance, taxonomic biases or uneven sampling effort affect current biodiversity databases (Loiselle et al. 2008) and taphonomic biases or dating biases affect fossil databases (Varela et al. 2011).
Despite these biases, it is essential that we take advantage of the huge quantity of accumulated data. Until now, ENMs have typically been calibrated without explicit steps to reduce the effects of sampling bias. This is undesirable because biased data sets can produce poor predictions (Kadmon et al. 2003, Barry and Elith 2006, Loiselle et al. 2008, Varela et al. 2009, Lobo and Tognelli 2011). As model predictions are affected by spatial and/or temporal biases in the calibration data sets, it is highly desirable to find methods to filter the data sets and determine an appropriate subsample for calibrating ENMs, regardless of the initial bias of the raw data.
Sampling is a sensitive step for any ecological analysis (Albert et al. 2010). However, few studies attempt to correct the sampling bias of species records when constructing ENMs. One such attempt used Maxent and drew the background sample with the same bias as the occurrence records (Phillips et al. 2009). Other studies have filtered occurrence records in geographic space (Hidalgo-Mihart et al. 2004, Iguchi et al. 2004, Anderson and Raza 2010). Both approaches hold promise, but their efficacies remain poorly documented.
Here we analyze the performance of two different types of filters for selecting calibration data when constructing an ENM. These were a geographic filter and an environmental filter (specifically, a climatic filter). Geographic filters have already been used as a tool to improve ENMs (Hidalgo-Mihart et al. 2004, Iguchi et al. 2004, Anderson and Raza 2010, Hijmans 2012, Rodríguez-Castañeda et al. 2012), while climatic filters remain generally unexplored. We aim to assess whether simple filtering rules in environmental space can improve model predictions. Our goal is to develop a procedure for improving model predictions that would work for many different kinds of sampling bias and across wide ranges of sample sizes.
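The contrast between the two filter types can be made concrete with a minimal sketch. It assumes that each filter lays a regular grid over its own space (geographic or climatic) and keeps one record per occupied cell; the record layout, the cell sizes and the function names are hypothetical and are not taken from the methods of this study.

```python
import math

def environmental_filter(records, cell_size):
    """Keep the first record found in each cell of a grid laid over climate space.

    records: list of dicts with 'lon', 'lat' and a 'climate' tuple (hypothetical layout).
    cell_size: one grid width per climatic variable (assumed values and units).
    """
    kept, seen = [], set()
    for rec in records:
        cell = tuple(math.floor(v / w) for v, w in zip(rec["climate"], cell_size))
        if cell not in seen:
            seen.add(cell)
            kept.append(rec)
    return kept

def geographic_filter(records, cell_deg):
    """Keep the first record found in each cell of a regular lon/lat grid."""
    kept, seen = [], set()
    for rec in records:
        cell = (math.floor(rec["lon"] / cell_deg), math.floor(rec["lat"] / cell_deg))
        if cell not in seen:
            seen.add(cell)
            kept.append(rec)
    return kept

# Invented records: the first two are close in geography but differ in climate;
# the third is far away but climatically redundant with the first.
records = [
    {"lon": -3.70, "lat": 40.40, "climate": (14.0, 450.0)},
    {"lon": -3.72, "lat": 40.42, "climate": (9.0, 900.0)},
    {"lon": -8.60, "lat": 42.90, "climate": (14.2, 460.0)},
]

env_kept = environmental_filter(records, cell_size=(1.0, 100.0))
geo_kept = geographic_filter(records, cell_deg=0.5)
```

With these invented records, the environmental filter retains the two climatically distinct points, whereas the geographic filter discards the second point (aggregated in geography but climatically unique) and retains the redundant third one.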
To test the performance of this approach, we create a virtual (simulated) species with a geographic distribution related to three climatic variables. Virtual species have been used to test different methodological aspects of ENMs (Hirzel et al. 2001, Jimenez-Valverde and Lobo 2007, Meynard and Kaplan 2013). In our case, the virtual species allows us to circumvent complications regarding dispersal limitations and biotic interactions inherent to the studies that use real species. We generate different geographically biased data samples to illustrate several common biases in biological databases (e.g. distance to roads). Subsequently, we apply geographic and environmental filters to the biased data sets to obtain different subsamples that we use to calibrate the models. After that, model results are evaluated against the real distribution of the virtual species.
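As an illustration of how a geographically biased sample might be drawn from a binary distribution, the following sketch weights each presence cell by its distance to the nearest road. The exponential decay, the `decay_km` parameter and the toy grid are assumptions made for illustration only; they do not reproduce the bias layers used in this study.

```python
import math
import random

def biased_sample(presence_cells, dist_to_road_km, n, decay_km=5.0, seed=0):
    """Draw n distinct presence cells, favouring cells close to roads.

    The sampling weight exp(-distance / decay_km) is an assumed bias function.
    """
    if n > len(presence_cells):
        raise ValueError("n exceeds the number of presence cells")
    rng = random.Random(seed)
    pool = list(range(len(presence_cells)))  # indices still available
    chosen = []
    for _ in range(n):
        weights = [math.exp(-dist_to_road_km[i] / decay_km) for i in pool]
        pos = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        chosen.append(presence_cells[pool.pop(pos)])  # sample without replacement
    return chosen

# Toy binary distribution: a 4 x 5 block of presence cells, with a road
# running along column 0 (distance grows by 2 km per column).
cells = [(r, c) for r in range(4) for c in range(5)]
dist = [2.0 * c for (_, c) in cells]
sample = biased_sample(cells, dist, n=6)
```

Other biases (e.g. distance to nature reserves or to populated areas) could be sketched in the same way by swapping the distance layer.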
Model results allow us to analyse three different aspects. First, we test the difference in performance between the two filters. We hypothesize that geographic filters could fail to select the optimal calibration data set if they discard aggregated points with unique climatic conditions. On the other hand, climatic filters might select optimal calibration data sets by removing redundant information (points with similar climatic conditions). Second, we investigate whether filters work regardless of the initial bias in the data sets. Clearly, we desire methods that are robust and are not affected by the initial bias of the data set. Finally, we address the issue of sample size. We hypothesize that small but unbiased (or less biased) data samples should produce better predictions than large but strongly biased data samples.
Thus, for this first exploration we assess the robustness of the filters to changes in sampling biases, and the sensitivity of the filters in relation to sample size. The current study does not address sensitivity of the filters to variation in species’ niches (for instance, complex/simple relationship with variables, number of variables, or broad/narrow niches), or the performance of filters when using different modelling approaches and algorithms. These and other questions should be addressed in future work.
First, results indicate better performance for the climatic filter than for the geographic one. The geographic filter did not increase model performance and even decreased it in some circumstances (Fig. 5). The Iberian Peninsula is environmentally highly heterogeneous (Gallardo et al. 2012). Thus, by using the geographic filter with the present data set, we discarded points with relevant (non-repetitive) climatic information (but aggregated in geographic space), and instead selected points tending to have similar climatic conditions (but located more distantly in geography) (Fig. 4). Therefore, the results suggest that species living in patchy or spatially heterogeneous environments could be negatively affected by using this kind of filter. Nevertheless, the intuitive reasoning of discarding geographically aggregated points might work only in some situations, and is likely related to the spatial distribution of the environmental variables and the manner in which the occurrence records were sampled (here, randomly from the binary map of the species’ distribution).
Figure 5. Average AUC scores from experiments with bias types pooled, showing differences among filtering treatments and across sample sizes. Generally, climatically filtered data sets (red points) led to models with very high performance (AUC > 0.95), and stable scores across the different sample sizes. Non-filtered (black squares) and geographically filtered experiments (blue diamonds) show an increase in model accuracy with increasing sample sizes. Small samples were the most sensitive to filtering. Interestingly, small samples using a climatic filter produced better models than did large non-filtered data sets.
Second, the climatic filter did improve model results. The models had higher discriminatory power when environmental biases were avoided by the use of climatic filters. This positive result suggests that at least under some circumstances researchers may be able to increase model performance by filtering by environmental variables. Here, climatic filters were effective in reducing redundant climatic combinations, likely especially those caused by biased sampling, without unduly removing the signal of the species’ niche (Fig. 4). Even when we did not use all three important variables for the species in the filtering, the results clearly demonstrate good performance of climatic filters. The efficacy of environmental filtering under other circumstances (e.g. for species with more complex niches, species with wide/narrow distributions, or when species records are more likely in increasingly more suitable areas) remains to be explored. In the future, the robustness of this method should be tested under different and more complex circumstances, including the possibility of the researcher adding n variables to the filters.
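Discriminatory power is summarised here by AUC, which can be read as the probability that a randomly chosen presence cell receives a higher predicted score than a randomly chosen absence cell. The sketch below computes that quantity directly by pairwise comparison; the function name and the example scores are invented for illustration, and this is not necessarily how AUC was computed in this study.

```python
def auc(scores_presence, scores_absence):
    """Rank-based AUC: share of presence/absence pairs in which the presence
    cell has the higher predicted score (ties count as half a win)."""
    wins = 0.0
    for p in scores_presence:
        for a in scores_absence:
            if p > a:
                wins += 1.0
            elif p == a:
                wins += 0.5
    return wins / (len(scores_presence) * len(scores_absence))

# Invented model scores for three presence and three absence cells.
example_auc = auc([0.9, 0.8, 0.4], [0.5, 0.3, 0.1])  # 8 of 9 pairs ranked correctly
```

A value of 1 indicates perfect separation of presences from absences, and 0.5 indicates discrimination no better than random.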
Additionally, the filters led to similar results for all five initial biases. Three of the treatments had a strong geographic pattern (distance to road, distance to nature reserves and distance to populated areas), while the other two had a much more diffuse geographic pattern (random and all biases together) (Fig. 3). The increase in model predictive power by using climatic filters was independent of the initial bias of the data sets (Supplementary material Appendix 1, Fig. A1). Future experiments could help understand the generality of this observed pattern, but, meanwhile, we conclude that this simple method increases discriminatory power regardless of the kind of bias. This is a key result, because it means that climatic filters likely could be used with heterogeneous databases to improve ENM predictions.
Finally, the study shows notable results with regard to sample size. Models calibrated with few climatically filtered data points produced better results (on average) than did models calibrated with large biased data sets (Fig. 5). Biodiversity databases have large and typically biased data sets of species records (Hortal et al. 2007). The present results indicate that it can be better to calibrate models using a climatically filtered subsample of those occurrences, than using the whole set of available species records. Conversely, real occurrence records are scarce for some endangered and/or rare species. Here, we show that small data sets might be able to produce good predictions (at least for species with simple niches), as long as records were satisfactory representations of the species’ environmental requirements.
On the other hand, large data sets indeed produced model results that were more consistent, with smaller standard deviations between experimental replicates than did small data sets (Supplementary material Appendix 1, Fig. A2). The minimum number of points needed to achieve maximal performance varied between filters (Tables 1 and 2, and Fig. 5). Using a climatic filter, optimal performance appeared with as few as 5 points, whereas non-filtered data required 50 points and geographically filtered data required 100. Here, we defined optimal performance as that reached when increases in AUC were less than 0.01 after adding more points to the calibration data set. Although the specific number of points necessary to achieve optimal performance surely depends on the complexity of the niche and the number of environmental variables used, we predict that the overall pattern (of smaller, unbiased data sets outperforming larger, biased ones) will hold.
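The plateau rule used to define optimal performance can be sketched as follows. The sketch assumes one reading of the rule, namely that a sample size is optimal when no larger sample size raises mean AUC by 0.01 or more; the function name, the sample sizes and the AUC values are invented for illustration and are not results of this study.

```python
def smallest_plateau_size(sample_sizes, mean_auc, tol=0.01):
    """Return the smallest sample size after which adding points never
    increases mean AUC by tol or more (assumed reading of the rule)."""
    for i, size in enumerate(sample_sizes):
        later = mean_auc[i + 1:]
        # True for the largest size too, since there is nothing left to gain.
        if all(a - mean_auc[i] < tol for a in later):
            return size

# Invented curves: one that keeps rising and one that is flat from the start.
sizes = [5, 10, 25, 50, 100]
rising = [0.80, 0.85, 0.90, 0.93, 0.935]
flat = [0.960, 0.962, 0.961, 0.965, 0.963]

plateau_rising = smallest_plateau_size(sizes, rising)
plateau_flat = smallest_plateau_size(sizes, flat)
```

Under this reading, the rising curve reaches its plateau only at the fourth sample size, whereas the flat curve is already at its plateau with the smallest sample.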
Conclusions and future directions
This exploration of geographic and climatic filtering of biased data sets with a virtual species allows several conclusions and points to various avenues for future research. Clearly, climatic filtering can improve model results, and here the improvement was independent of the initial biases. Furthermore, it allowed calibration of accurate models even when using extremely small data sets. In contrast, geographic filters generally did not improve model results. Nevertheless, the results presented here should be taken as a preliminary attempt to explore viable solutions for optimizing selection of calibration data when points are known or suspected to suffer from bias. Further investigation is necessary to reach general conclusions and produce guidelines regarding many issues, including: the optimal grain of the climatic and geographic filters, the optimal number of variables included in the filters, the level of sampling bias, and the filtering performance for different kinds of species (e.g. species with broad/narrow niches) or different modelling algorithms. Furthermore, the current experiments took points from the binary map of suitable vs. unsuitable areas (Fig. 2B). Future research should conduct parallel experiments where points are taken probabilistically from the suitability surface (e.g. Fig. 2A). Such models could be evaluated against the known suitability value of the virtual species, rather than compared with its binary geographic distribution, as here.
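The difference between the two sampling designs mentioned above (points taken from the binary map vs points taken probabilistically from the suitability surface) can be sketched as follows. The threshold, the acceptance rule (a cell is accepted with probability equal to its suitability) and the toy surface are assumptions made for illustration, not a description of any experiment performed here.

```python
import random

def sample_from_binary_map(suitability, threshold, n, seed=0):
    """Sample n cells uniformly among cells at or above an assumed threshold."""
    rng = random.Random(seed)
    suitable = [i for i, s in enumerate(suitability) if s >= threshold]
    return rng.sample(suitable, n)

def sample_from_suitability(suitability, n, seed=0):
    """Sample n distinct cells, accepting each candidate cell with a
    probability equal to its suitability value (assumed to lie in [0, 1])."""
    if sum(1 for s in suitability if s > 0) < n:
        raise ValueError("not enough cells with non-zero suitability")
    rng = random.Random(seed)
    chosen = []
    while len(chosen) < n:
        i = rng.randrange(len(suitability))
        if i not in chosen and rng.random() < suitability[i]:
            chosen.append(i)
    return chosen

# Toy suitability surface flattened to a list of cells.
surface = [0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0, 0.0]
binary_idx = sample_from_binary_map(surface, threshold=0.5, n=3)
prob_idx = sample_from_suitability(surface, n=3)
```

In the probabilistic design, cells of low but non-zero suitability can occasionally yield records, which the binary design excludes entirely.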