These models result in spatial predictions indicating locations of the most suitable (and unsuitable) habitats for a target species, community or biodiversity (i.e. indicating ‘hotspots’). Generalized linear and generalized additive models (GLM and GAM), implemented within a geographical information system (GIS), have become very popular for predicting such distributions (Guisan, Edwards & Hastie 2002).
However, as yet relatively few predictive models have been applied to rare and endangered species (Miller 1986; Myatt 1987; Carey & Brown 1995; Godown & Peterson 2000; Elith & Burgman 2002), despite their potential in conservation management, for instance in identifying sites with high potential for colonization. This may be because (i) data for rare and endangered taxa very often consist of a set of observed occurrences without sites of observed absences (hereafter called presence-only data); (ii) data for a single taxon are usually scarce (few observations); and (iii) often observations are not associated with any defined sampling unit (of known surface area) or they lack sufficient locational accuracy.
The first problem is commonly associated with data stored in large biological data bases. Such data have often been recorded by volunteers, usually without recourse to any predefined sampling strategy. Scarcity of data is specific to uncommon and rare species, for which prevalence in a data bank is, by definition, very low. Historical records, such as herbarium or museum collections, often lack precise details of location: at best they show proximity to a common site, a valley or village at a scale of a kilometre or more. These two problems make it more difficult to apply the usual statistical approaches. Such data contrast unfavourably with recent observations (≤ 10 years) sampled using a global positioning system (GPS) with a much higher spatial accuracy.
This highlights the dilemma of quantity (number of occurrences) vs. quality (locational accuracy). When the spatial accuracy associated with the geographical location of each observation site is known (e.g. the true site has a 95% probability of being within a 100-m radius), it becomes a major consideration in choosing the cell size (grain) of the study.
The choice of cell size may be determined by other criteria. A larger cell size might result in a more manageable data set or might be chosen if spatial autocorrelation is measured within the species’ data and, as a result, observations that cannot be considered independent need to be aggregated. In contrast, a smaller cell size might better represent the ecological processes. Here, we will focus mainly on situations where spatial accuracy is known and can vary from one observation to the other.
Data of varying spatial accuracy can be handled so as to avoid propagating measurement errors in the model (Elith, Burgman & Regan 2002) by either (i) aggregating all data in regular grid cells (or possibly other cell shapes) whose size matches the poorest locational accuracy of the observed occurrences, or (ii) dropping the most inaccurate data. The best balance between sample size and accuracy usually lies between these two options. This is illustrated in Fig. 1, which shows the decrease in the number of occurrences of the rare species Eryngium alpinum L. (Apiaceae) as the spatial resolution increases (i.e. as cell size decreases). The decrease arises because fewer occurrences have a locational accuracy high enough to be assigned to grid cells at high resolution (fine grain). Lowering the spatial resolution (coarser grain) allows less precise observations to be included, thus increasing their overall number. However, the relationship is not straightforward because, when lowering the resolution (i.e. increasing cell size), distinct occurrences can also be aggregated in the same grid cell. Hence, the choice of method depends on the resolution of the environmental layers available in the GIS, on the biology of the focal species and on the spatial distribution of its recorded occurrences.
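The interplay between locational accuracy, cell size and aggregation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the coordinates and the rule "keep a record if its accuracy radius does not exceed the cell size" are hypothetical and are not taken from the study.

```python
from math import floor

def usable_cells(records, cell_size):
    """Return (records kept, set of occupied grid cells) for a given cell size.

    records: iterable of (x, y, accuracy_m) in a projected metric system.
    Assumption: a record is usable when its accuracy radius is not larger
    than the cell size; other rules are possible.
    """
    kept = 0
    cells = set()
    for x, y, accuracy in records:
        if accuracy <= cell_size:
            kept += 1
            # Records falling in the same cell are aggregated into one occurrence.
            cells.add((floor(x / cell_size), floor(y / cell_size)))
    return kept, cells

# Hypothetical records: (easting, northing, locational accuracy in metres)
records = [
    (600010.0, 200020.0, 10.0),
    (600030.0, 200040.0, 25.0),
    (603400.0, 200300.0, 500.0),
    (612000.0, 215000.0, 100.0),
]

for size in (25, 100, 500):
    kept, cells = usable_cells(records, size)
    print(size, kept, len(cells))
```

With these made-up records, moving from 25 m to 100 m admits one more record, yet the number of occupied cells does not grow, because two nearby records now fall in the same cell.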
Such data configuration results in severe limitations to the fitting of many statistical models, such as linear models (Guisan, Edwards & Hastie 2002). However, one alternative is to use models based on presence-only data. These are called profile techniques, as opposed to group discrimination approaches that need presence–absence or abundance data (Robertson, Caithness & Villet 2001). A well-known example of a profile-type model is the climatic envelope approach developed largely in the late 1980s by Australian scientists and implemented in the bioclim package (Busby 1991; now anuclim; Houlder et al. 2000). Another, more recent, example is the ecological niche factor analysis (ENFA) implemented in the biomapper package (Hirzel et al. 2002; Hirzel, Hausser & Perrin 2002). However, a common problem of profile methods is that they tend to generate overoptimistic predictions, i.e. they predict the species at too many locations. This is easily understood by the fact that, from a quantitative evaluation perspective, a ‘perfect’ model with such data would be a model predicting the species everywhere (i.e. ‘1’ would be attributed to all cells in the area), as all observations would be correctly predicted as ‘1’ and no discriminating absence would be available to restrict the predictions to zero where needed (i.e. at environmentally inappropriate locations).
In this regard, GLM constitute a better choice because they can deal with many types of predictors (continuous, binary, qualitative, ordinal), but on the other hand they must have presence and absence data. In order to use GLM when no absence data are available, one approach is to generate ‘pseudo-absences’ (Zaniewski, Lehmann & Overton 2002) and to use them in the model as absence data for the species. The manner in which pseudo-absences are generated is particularly important because it can have a significant influence on the final quality of the model (Zaniewski, Lehmann & Overton 2002).
The easiest way to choose pseudo-absences is simply to generate them totally at random over the study area (Hirzel, Helfer & Métral 2001; Zaniewski, Lehmann & Overton 2002). However, this method runs the risk of generating an absence in an area that is, in fact, favourable to the species. Indeed, when dealing with common species, choosing such a ‘wrong absence’ may not be too problematical because the numerous presence records will counteract its effect. However, when working with rare species, data are often scarce and choosing a wrong absence could significantly reduce the quality of a model.
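The risk of drawing a 'wrong absence' can be made concrete with a small calculation. The sketch below assumes, purely for illustration, that a known fraction of the study area is truly suitable and that pseudo-absences are drawn independently and uniformly; neither quantity is known in practice, and the numbers are hypothetical.

```python
def prob_wrong_absence(suitable_fraction, n_pseudo_absences):
    """Probability that at least one random pseudo-absence falls in suitable habitat.

    Assumes independent, uniform draws over the study area (sampling with
    replacement), which is a reasonable approximation when the number of
    cells is much larger than the number of draws.
    """
    return 1.0 - (1.0 - suitable_fraction) ** n_pseudo_absences

# Hypothetical case: 5% of the area is suitable, 20 pseudo-absences are drawn.
print(prob_wrong_absence(0.05, 20))
```

Even when only a small share of the area is suitable, a handful of random draws is likely to include at least one wrong absence, and with few presence records each such point carries substantial weight in the fit.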
To avoid, or at least reduce, this problem, more subtle methods can be employed to generate the pseudo-absences. For example, Zaniewski et al. (2002) first create a habitat suitability (HS) map of all fern species (a presence can be the occurrence of any species) using a GAM with totally random pseudo-absences. Then, a second set of pseudo-absences is randomly selected proportionally to the predictions of the first HS map and used to fit GAM models for every species. Selecting pseudo-absences proportionally to the overall sampling effort aims at avoiding sampling pseudo-absences in sites that were under-sampled in the field. However, multi-species data are not always available. In such a situation, the first map (based on purely random pseudo-absences) is specific to the modelled species, and pseudo-absences can be selected in areas below a certain threshold in order to maximize the discriminating ability of the second model. This threshold must be defined as objectively as possible, for instance as the lowest value still encompassing 95% of the observed species' occurrences.
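One possible reading of the 95% rule is sketched below: the threshold is taken as the highest HS value such that at least 95% of the observed occurrences still have an HS value at or above it. The function name and the example values are hypothetical, and other interpretations of the rule are conceivable.

```python
import math

def coverage_threshold(hs_at_presences, coverage=0.95):
    """HS value at or above which at least `coverage` of the presences lie."""
    values = sorted(hs_at_presences, reverse=True)
    # Number of presences that must be retained; the small epsilon guards
    # against floating-point noise in coverage * n.
    k = max(1, math.ceil(coverage * len(values) - 1e-9))
    return values[k - 1]

# Hypothetical HS values read from the first map at 20 presence sites.
hs_at_presences = [i / 20 for i in range(1, 21)]
threshold = coverage_threshold(hs_at_presences)

# Cells with an HS value below the threshold are candidate pseudo-absences.
print(threshold)
```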
In this study, we propose another way to generate pseudo-absences, which combines the respective strengths of ENFA and GLM. It is also a two-step approach, but uses ENFA instead of a GLM with totally random pseudo-absences to calculate the first HS map that is used to weight the selection of pseudo-absences. The calculation of this first model is particularly straightforward with ENFA (e.g. no need to select predictors).
The aims of this study were twofold. The first was to evaluate different methods for predicting rare species distribution, using ENFA with presence-only data, GLM with presence and random pseudo-absences, and a combination of both approaches. The second aim was to assess the dilemma between quality and quantity, trying more specifically to answer the question: is it preferable to have a large number of observations, which is better from a statistical point of view, or should one favour locational accuracy of observations (dropping all inaccurate ones, thus using a reduced set to calibrate the model) to ensure a better correspondence with environmental predictors used to predict the observations? This part of the study was conducted by building models at two different resolutions (25 and 500 m) having a different number of occurrences associated with each (Fig. 1). Eryngium alpinum (Apiaceae), a flagship threatened species in the European Alps, was chosen as an illustration. Finally, results from field investigations demonstrate the usefulness of such a model for suggesting new observation sites for rare and endangered species.
Correlations between environmental predictors calculated at the 25-m resolution were very similar to those calculated at the 500-m resolution, so that the predictors retained by both ENFA analyses were the same. They were: slope, srad3, ddeg300, rain49 and topo500 (for their descriptions see Table 1).
The HS map was obtained by considering the first two components of the ENFA, which expressed, respectively, 92·8% and 88·1% of the variance at the resolutions of 25 m and 500 m. MPA values obtained for the two ENFA HS maps are given in Fig. 3d. For the two types of GLM models, box-plots in Fig. 3 show the distribution of (a) explained deviance (D2), (b) best kappa value (B-kappa), (c) Gini coefficient (Gini) and (d) MPA.
A Wilcoxon rank test confirmed that, for all these evaluation values, the averages were significantly different (P < 0·01) between the following pairs of models: GLM-ENFA 25 : GLM-R 25, GLM-ENFA 500 : GLM-R 500, GLM-ENFA 25 : GLM-ENFA 500 and GLM-R 25 : GLM-R 500. The number indicates the spatial resolution considered. Correlations between the different measures of evaluation calculated for each fitted model at each resolution are given in Table 2.
Table 2. Spearman rank correlation coefficients between the different evaluation values calculated for each fitted model. The correlation values are averages of the correlations obtained for GLM-ENFA and GLM-R methods at both 25-m and 500-m resolutions (n= 2 × 2 × 1500 = 6000 models)
| | Explained deviance | Best kappa value | Gini coefficient | MPA |
|---|---|---|---|---|
| Explained deviance | | 0·76 | 0·87 | −0·01 |
| Best kappa value | | | 0·74 | −0·21 |
| Gini coefficient | | | | −0·12 |
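For readers unfamiliar with the statistic reported in Table 2, a Spearman rank correlation can be computed as the Pearson correlation of the ranks, with tied values receiving their average rank. The sketch below uses only the Python standard library; the function names and the example vectors are hypothetical and do not reproduce the study's data.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(vx * vy)

# Hypothetical evaluation values for four fitted models.
d2 = [0.41, 0.55, 0.48, 0.62]
gini = [0.70, 0.81, 0.78, 0.90]
print(spearman(d2, gini))
```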
Based on the cartographic implementation (potential map; Fig. 4) of the GLM-ENFA model at a resolution of 25 m, four new populations of the species were discovered in the field, all of them in pixels characterized by a high to very high habitat suitability (probability values of 0·98, 0·93, 0·92 and 0·79).
Figure 4. (a) Potential distribution map for Eryngium alpinum in Switzerland drawn from one of the GLM-ENFA models at a resolution of 25 m. Black and dark grey tones indicate highly predicted areas, white circles indicate real presence points used to generate the map and white stars represent two new populations of the species discovered in the field. Highly predicted areas tend to be located in mountainous regions with higher rainfall (Jura and northern part of the Alps), which is consistent with the ecology of E. alpinum. (b) Magnified map showing the locations (star symbols) of the four new populations. Note that these are located within highly predicted areas (see text).
The first goal was to compare two existing approaches and one new approach to predicting rare species with presence-only (occurrence) data. Owing to the lack of true absences, a formal comparison between ENFA and GLM-based methods (i.e. GLM-ENFA and GLM-R) is difficult. Indeed, in our study, MPA is the only evaluation measure available for comparison. Results show that the HS maps obtained with ENFA yield an MPA value that is approximately twice the mean of the GLM-based methods, at both the 25-m and 500-m resolutions (Fig. 3d). This result seems to confirm the tendency of ENFA to over-predict species distribution (Zaniewski, Lehmann & Overton 2002), due to the lack of discriminating absences, as discussed in the introduction. Another possible explanation of the apparent (but not proved) lack of accuracy of ENFA models could be a violation of the assumption of normality of predictors that is required by the ENFA algorithm, as many of our predictors were actually not normally distributed (Kolmogorov–Smirnov goodness-of-fit test). Further testing would be needed to assess the robustness of ENFA (Hirzel, Helfer & Métral 2001; Hirzel et al. 2002) in such circumstances. This situation is likely to prevail in many other similar studies.
However, although we observed a large difference in MPA values between ENFA and GLM-based methods, it should not be concluded that the latter methods always prove better than the former. For instance, Hirzel, Helfer & Métral (2001), using a virtual species with known absences in a real landscape, have shown that ENFA can prove superior to GLM in the specific case of invading species (for their quantity of seed, expansion power and spread, as well as considering the virtual species’ predefined niche), i.e. when species do not yet occupy all their potential habitats in the landscape. This might be less likely to be the case for many rare and threatened species, which tend to occupy most of their potential habitats, as these have usually been drastically reduced and, as a result, cover only a small proportion of the territory. However, although E. alpinum is truly a rare species (in the sense of Rabinowitz et al. 1986), it always yielded rather large predictions (compared with other species; O. Broennimann & A. Guisan, unpublished results), which might suggest that its habitat is spatially not so restricted and that other reasons (such as cutting) have limited its occurrence in the past, and/or that important environmental predictors are missing. None the less, the results suggest that the performance of these methods also depends on the type of organisms being modelled, on the type of environmental predictors that are being used, and on the grain and extent considered.
Further comparisons were possible between GLM-ENFA and GLM-R because absence data were included in their evaluations. However, care should be taken when interpreting these results, as such evaluation measures are based on pseudo-absences and not on observed absences. This is a recurrent limitation of evaluating models based on occurrence data and a research area where progress is still required.
The first three evaluation measures (Fig. 3a–c) are consistent with each other, showing that GLM using ENFA-weighted pseudo-absences provide significantly better results on average (Wilcoxon rank test) and less deviance than those using randomly chosen pseudo-absences. This is true at both the 25-m and 500-m resolutions, which suggests that choosing pseudo-absences in an ENFA-weighted way rather than totally at random enhances the accuracy of an HS map.
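As an illustration of one of these measures, the sketch below computes Cohen's kappa over a range of probability thresholds and keeps the largest value, which is one common way of obtaining a 'best kappa'. It is a sketch under assumptions: the threshold grid, the function names and the example data are hypothetical, and the 'absences' here would be pseudo-absences, as in the evaluations discussed above.

```python
def cohen_kappa(observed, predicted):
    """Cohen's kappa for two equally long lists of 0/1 values."""
    n = len(observed)
    a = sum(1 for o, p in zip(observed, predicted) if o == 1 and p == 1)
    b = sum(1 for o, p in zip(observed, predicted) if o == 0 and p == 1)
    c = sum(1 for o, p in zip(observed, predicted) if o == 1 and p == 0)
    d = sum(1 for o, p in zip(observed, predicted) if o == 0 and p == 0)
    p_observed = (a + d) / n
    p_expected = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    if p_expected == 1:
        return 0.0  # degenerate case: no variation to agree on
    return (p_observed - p_expected) / (1 - p_expected)

def best_kappa(observed, probabilities, thresholds=None):
    """Return (kappa, threshold) maximizing kappa; ties go to the higher threshold."""
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 100)]  # assumed grid
    return max(
        (cohen_kappa(observed, [1 if p >= t else 0 for p in probabilities]), t)
        for t in thresholds
    )

# Hypothetical presences (1) and pseudo-absences (0) with fitted probabilities.
observed = [1, 1, 1, 0, 0, 0]
probabilities = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
print(best_kappa(observed, probabilities))
```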
Interestingly, results from MPA measures were not consistent with the other evaluation measures. Indeed, we did not expect such consistency because the MPA concept is based on the parsimony criterion that ‘the smaller the potential map the better the related model’, which does not necessarily fit the mathematical criterion of statistical evaluations. At the higher resolution (25 m) GLM-R models provided a smaller MPA average value (remember that for MPA, the smaller the value the better the model) and a smaller deviance, whereas at the lower resolution (500 m) GLM-ENFA models obtained an average MPA value similar to that of GLM-R models, but showed a much larger deviance.
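The parsimony idea behind MPA can be sketched as follows: find the highest HS threshold that still retains a chosen share of the observed presences, then report the fraction of the study area predicted at or above that threshold. The 90% coverage used as a default here, the function name and the toy map are assumptions made for illustration only.

```python
import math

def minimal_predicted_area(hs_map, presence_cells, coverage=0.9):
    """Fraction of cells predicted suitable at the highest threshold that
    still retains at least `coverage` of the presences (assumed default 0.9).

    hs_map: dict mapping a cell identifier to its HS value.
    """
    values = sorted((hs_map[c] for c in presence_cells), reverse=True)
    k = max(1, math.ceil(coverage * len(values) - 1e-9))
    threshold = values[k - 1]
    predicted = sum(1 for v in hs_map.values() if v >= threshold)
    return predicted / len(hs_map)

# Hypothetical map of 10 cells with HS values 0.1 .. 1.0 and five presences.
hs_map = {i: (i + 1) / 10 for i in range(10)}
presence_cells = [5, 6, 7, 8, 9]
print(minimal_predicted_area(hs_map, presence_cells))
```

A smaller value means that the model captures the same share of presences with a smaller predicted area.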
Furthermore, the comparison of the four evaluation measures, based on all individual model predictions (1500) in each GLM experiment (two types of GLM at two resolutions), did not show any correlation between MPA and any of the three quantitative measures (D2, B-kappa or Gini). We do not believe that these results imply that MPA is a non-reliable value because the rule of parsimony used in MPA has a practical justification in conservation ecology, especially in the case of rare species that are, by definition, not widely distributed. Hence, further research is needed, at least in the case of these species, to assess whether the best HS map would not be the one that maximizes quantitative evaluation statistics while at the same time minimizing the predicted area. One identified limitation to the generalized use of MPA might be that it fails to evaluate appropriately those potential maps produced by certain modelling techniques, such as bioclim (Busby 1991), as this type of model always fits the maximum possible predictions at observed presence sites (J. Elith, personal communication).
The second goal of this study was to determine whether it is preferable to use (i) less data with higher spatial accuracy or (ii) more data with lower accuracy. Comparing the evaluations of the 25-m and 500-m resolution HS maps reveals that averages for all these evaluation values are always better with the 25-m resolution. Overall, this conclusion is still valid when considering the three different types of models, GLM-ENFA, GLM-R and ENFA, although their deviances do not differ significantly.
The lower performances observed at the 500-m resolution could result from the combination of three factors. First, a loss of information is inevitable when aggregating environmental maps. Secondly, the low accuracy associated with some species’ occurrences used at this resolution might still be underestimated and a greater measurement error (as defined by Elith, Burgman & Regan 2002) might actually prevail in the data. This is less likely to occur in the case of observations that have a higher spatial accuracy. Thirdly, plants being fixed organisms, they are highly influenced by the local microclimate. Therefore, relating species data that have a high geographical precision (of site location) with high-resolution environmental data should have a better predictive power because they reflect very local ecological conditions, while aggregated data reflect smoother environmental gradients in the area (e.g. mesoclimate). Furthermore, some important combinations of environmental predictors might not be appropriately expressed in such aggregated data.
In turn, such superiority of higher resolution predictors and less data might not be true for non-fixed organisms, as the required resolution for these is certainly dependent on a larger home range of target species, as suggested for bats by Jaberg & Guisan (2001). None the less, many of our potential maps have spatial predictions that cover a large proportion of the mountainous areas of Switzerland, even those with a good evaluation. This primarily reflects the fact that large areas are probably truly suitable sites for the target species E. alpinum, from the single perspective of those predictors that were used to build the model. Other factors, not included in the model but which might potentially have a more direct influence on plants (i.e. more proximal in the sense of Austin 2002), probably account for the unexplained spatial variation, and thus could enhance map precision. The problem remains, however, that data on many of these very important environmental factors, such as nutrient content in soils or precise physiologically meaningful microclimatic measurements, are still difficult to obtain in a spatially explicit way.
The best option for improving the HS maps would certainly be to obtain additional data for the target species, but this is difficult in the case of rare species. HS maps prove particularly useful in this regard, by suggesting new sampling sites for the species, such as those pixels of high prediction values that are not in the close vicinity of an observed population of the species. Visiting such suggested sites in the field at the optimum flowering time for the species may produce new presences or, at least, attested absences. This was done at the end of our study and four new populations of E. alpinum were found at sites of high predicted probability of presence. Such new data should then optimally be used to refine the models and generate new predictions that will need to be verified in the field. Theoretically, reiterating such processes should lead to equilibrium, when new data (presences or absences) no longer contribute to improvement of the model (reaching a plateau).
Another solution for improving the accuracy of HS maps could be to refine the choice of pseudo-absences. In this study, the GLM-ENFA method was indeed used in a relatively simple manner. We used the ENFA predictions to divide the study area into two parts, one with probabilities of presence greater than 30% and the other with lower probabilities. Pseudo-absences were then randomly chosen in the latter category. A more subtle way of choosing the pseudo-absences could be, for instance, to stratify their distribution along a suitability gradient, like mean annual temperature. This could be a more precise method for choosing suitable areas.
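The simple two-part selection described above can be sketched as follows. The grid, the HS values and the function name are hypothetical; drawing as many pseudo-absences as there are presences, and excluding presence cells from the candidates, are assumptions of this sketch.

```python
import random

def low_suitability_pseudo_absences(hs_map, presence_cells, threshold=0.3, seed=None):
    """Draw as many pseudo-absences as presences from cells with HS < threshold.

    hs_map: dict mapping a cell identifier to a first-step HS value in [0, 1].
    """
    rng = random.Random(seed)
    presences = set(presence_cells)
    candidates = sorted(
        cell for cell, hs in hs_map.items()
        if hs < threshold and cell not in presences
    )
    if len(presences) > len(candidates):
        raise ValueError("not enough low-suitability cells to sample from")
    return rng.sample(candidates, len(presences))

# Hypothetical 10 x 10 grid with HS increasing from 0.00 to 0.99.
hs_map = {(r, c): (r * 10 + c) / 100 for r in range(10) for c in range(10)}
presence_cells = [(9, 9), (8, 5), (7, 2)]

# Repeating the draw with different seeds gives one pseudo-absence set per model run.
runs = [low_suitability_pseudo_absences(hs_map, presence_cells, seed=s) for s in range(5)]
print(runs[0])
```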
However, an alternative exists to the process of selecting as many pseudo-absences as there are presences (usually a very limited number), and ideally repeating the process a number of times (e.g. 1500). This might be to sample, once only, a larger number of pseudo-absences (say 10 000) and to assign a weight, w = k/10 000, in the GLM to each, so that the weights of all pseudo-absences sum to the number of presences k (i.e. ensuring a prevalence of 0·5) (M. Wisz, personal communication). This could prevent the inherent risk of inappropriately choosing a limited number of pseudo-absences (i.e. providing a low fit) when only one selection run is made.
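The arithmetic of this weighting scheme can be checked with a short sketch. The number of presences used here (k = 40) is hypothetical; presences are assumed to keep a weight of 1, and the resulting vector would be supplied as case weights to whatever GLM routine is used.

```python
def pseudo_absence_weights(n_presences, n_pseudo_absences):
    """Case weights: 1 for each presence, k/n for each of n pseudo-absences."""
    w = n_presences / n_pseudo_absences
    return [1.0] * n_presences + [w] * n_pseudo_absences

k = 40                      # hypothetical number of presences
n = 10_000                  # pseudo-absences sampled once
weights = pseudo_absence_weights(k, n)

weighted_presences = sum(weights[:k])
weighted_absences = sum(weights[k:])
prevalence = weighted_presences / (weighted_presences + weighted_absences)
print(weights[k], weighted_absences, prevalence)
```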
Our third goal was to use these maps for suggesting new sampling sites for rare species. Although this study was not focused on this particular application of predictive models, a preliminary field campaign based on the selected HS map (Fig. 4) led to the discovery of four new populations of this highly endangered species. Indeed, this is a very promising result that strongly supports the use of predictive habitat distribution models for nature conservation planning.