Dr S. Suarez-Seoane, Instituto de Recursos Naturales y Ordenación del Territorio, Universidad de Oviedo, C/Independencia 13, 33004-Oviedo, Spain (e-mail firstname.lastname@example.org).
1. Predictive models of species’ distributions are used increasingly in ecological studies investigating features as varied as biodiversity, habitat selection and interspecific competition. In a pilot study, we based a successful model for the great bustard Otis tarda on advanced very high resolution radiometer (AVHRR) satellite data, which offer attractive predictor variables because of their global coverage, high temporal frequency of overpasses and low cost. We wished to assess whether the approach could be applied at very large spatial scales, and whether the coarse resolution of the imagery (1 km2) would limit application to those bird species with large home ranges or to simple recognition of broad habitat types.
2. We modelled the distributions of three agricultural steppe birds over the whole of Spain using a common set of predictor variables, including AVHRR imagery. The species, great bustard, little bustard Tetrax tetrax and calandra lark Melanocorypha calandra, have similar habitat requirements but differently sized home ranges, and are all species of conservation concern. Good models would reveal differences in distribution between the species and have high predictive power despite the large geographical extent covered.
3. Generalized additive models (GAMs) were built with the presence–absence of the species as the response variable. Individual species’ responses to the habitat variables were identified using partial fits and compared with each other. We found that this modelling framework could successfully distinguish the habitats selected by the three species, while the response curves indicated how the habitats differed. Model fits and cross-validations assessed using receiver operating characteristic (ROC) plots showed the models to be successful and robust.
4. We overlaid the predictive maps to identify key areas for agricultural steppe birds in Spain and compared these with the present network of protected sites in two sample regions. In Castilla León the provision of protected sites appears appropriate, but in Castilla La Mancha large areas of apparently suitable habitat have no protection.
5. These results confirm that large-scale models are able to increase our understanding of species’ ecology and provide data for conservation planning. AVHRR imagery, in combination with other variables, has sufficient resolution to model a range of bird species, and GAMs have the flexibility to model subtle species–habitat responses.
The increased availability of powerful statistical techniques and geographic information systems (GIS) has led to the rapid development of predictive species’ distribution models in ecology (for recent reviews see Guisan & Zimmermann 2000; Corsi, Leeuw & Skidmore 2000). Such models are beginning not only to provide accurate predictions of distributions (Jaberg & Guisan 2001) but also to answer ecological questions about habitat selection, the co-occurrence of species and competition between them (Leathwick & Austin 2001). Several authors (Rogers & Williams 1994; Frescino, Edwards & Moisen 2001; Osborne, Alonso & Bryant 2001) have built models using predictors derived from advanced very high resolution radiometer (AVHRR) imagery. AVHRR imagery is routinely archived by universities and national bodies in several countries, which distribute it at low cost. Its high temporal frequency (daily coverage for the entire world) provides ample data for cloud-free images to be selected, while its spatial resolution down to 1 km2 is acceptable for species mapping at large scales. Osborne, Alonso & Bryant (2001) demonstrated how a combination of AVHRR imagery and cartographic data layers could accurately model the distribution of great bustards Otis tarda L. around Madrid Province, Spain. Their study, motivated by the need to predict the impact of changing agricultural land use on birds (Pain & Pienkowski 1997; Tella et al. 1998), was established as a pilot for eventual scaling-up to whole countries or continents. If coarse resolution AVHRR imagery is to be of general use in species’ models and at large spatial scales, certain issues remain to be addressed. First, the success of Osborne, Alonso & Bryant's (2001) model may have been due to the large home range size of the great bustard in relation to the resolution of the imagery and modelling scale used (1 km2).
Secondly, it is possible that the model simply identified cereals and grasslands that comprise the land use occupied by great bustards rather than the preferred habitat. Thirdly, variation in both habitat availability and species’ behaviours could limit useable models to small geographical extents where the fit is likely to be better (Osborne, Alonso & Bryant 2001). In this study we addressed these issues by building national scale models for three bird species (great bustard, little bustard Tetrax tetrax (L.) and calandra lark Melanocorypha calandra (L.)) with broadly similar habitat requirements but very different home range sizes. Our research focused on birds that occupy agricultural steppes and dry grasslands in Spain; 81% of the bird species breeding in this habitat are classified as species of European Conservation Concern and 76% of them have shown recent population declines (Suárez, Naveso & De Juana 1997).
In building large-scale models for these species, as opposed to a generic model for agricultural steppe habitats, we would expect subtly different distribution patterns, choice of predictor variables, and habitat–response curves to emerge. This requires a flexible modelling environment where the researcher predefines as little as possible. We demonstrate how, through modern statistical techniques, it is possible to examine differences in habitat selection that experiments cannot address because of the practical problems imposed by large spatial scales (Ormerod, Pienkowski & Watkinson 1999). This is an example of the use of statistical control in lieu of experimental control on large spatial data sets.
To illustrate the practical use of distribution models, we show how combining our predictive models for the three agricultural steppe species allows us to assess and advise on the suitability of the present distribution of protected areas for conservation. By adopting the Habitats Directive 92/43/EEC, the governments of the European Community committed themselves to the creation of the Natura 2000 ecological network with the aim of conserving an extensive range of European habitat types and wildlife species. Although in some areas extensive farming systems already have some kind of protection, there are still large areas with no legal safeguards (Viada & Naveso 1996; Beaufoy 1998). An example is presented for two Spanish regions, Castilla La Mancha and Castilla León, where we modelled bird distributions to identify candidate areas for protection.
Spain is the third largest country in Europe, with a total surface area of 505 957 km2. Our study area, peninsular Spain, occupies four-fifths (493 486 km2) of the total area of the Iberian Peninsula (Fig. 1). The relief is organized around the Meseta Castellana, which is divided into northern (altitude 800–850 m a.s.l.) and southern (500–700 m) subplateaux by the Central System (2300 m). An internal ring of mountains surrounds this morphologic nucleus, the Sierra Morena at the south-west and the Iberian System at the north-east, isolating the Guadalquivir and Ebro Valleys, respectively, from the central plateau. On the external periphery of the central plateau there are other important and higher mountain systems, mainly the Cantabrian Mountains (2600 m), Pyrenees and Cordillera Bética (Sierra Nevada) (both around 3400 m).
Spain is probably the most biogeographically complex country in Europe due to its geology, variable climate and location between the Eurosiberian and Mediterranean faunal zones. The vegetation reflects climatic diversity, varying from green Spain in the north and west, with its lush, extensive deciduous forests and its rich grassy plains, to Mediterranean Spain characterized by xerophytic, untilled scrubland together with sparse woodland. Steppe-like habitats constitute one of the main characteristic landscapes of the Spanish Mediterranean Region, resulting from human activities such as deforestation, sheep grazing and burning (De Juana et al. 1988).
We gathered breeding distribution data for three bird species of different body sizes (as a correlate of home range size): great bustard (4–12 kg), little bustard (680–975 g) and calandra lark (40–50 g). Presence and absence records accurate to 1 km or better were assembled from Spanish researchers for 1993–2001, the time-span necessary to achieve good coverage of the entire country. We supplemented these through fieldwork in spring 2000 and 2001 to improve both the geographical coverage and sampling across the range of possible values for each environmental variable. Our sample points were surveyed by car or bicycle, by stopping for 15 min every 1·5 km along secondary routes and recording the species present. Censuses were conducted during the first 3 h after sunrise or before sunset and geographical positions were recorded using a global positioning system (GPS). Note that we selected only bird data that may be regarded as true presence–absence records; they were all gathered through systematic surveys that recorded absences (missing in so-called ‘presence-only’ data). In the analyses that follow, we are assuming that the combined data (1993–2001) indicate the likely distribution patterns throughout the late 1990s.
Some 7500 bird records were gathered and rasterized to 1-km2 resolution, yielding 1234, 1346 and 1450 pixels with great bustards, little bustards and calandra larks, respectively. We matched these with an equal number of pixels recording absence, randomly drawn from the database of surveyed points. Maintaining equal weights on the presence and absence data sets helps in the interpretation of model performance as the results of the analyses used here may be affected by prevalence (Hosmer & Lemeshow 1989; Fielding & Bell 1997; Manel, Williams & Ormerod 2001).
Our assumption was that the distributions of agricultural steppe birds may be modelled from habitat type (land cover and land use), topography (because it affects visibility and vulnerability) and human disturbance (Osborne, Alonso & Bryant 2001). Habitat was characterized using a 12-month time-series of normalized difference vegetation indices (NDVI), calculated from AVHRR data. Raw imagery for 1999 was received by the Natural Environment Research Council Satellite Receiving Station at Dundee, UK, and processed by the Remote Sensing Group at the Plymouth Marine Laboratory, Plymouth, UK. We used the 1999 imagery as indicative of the vegetation present when the bird data were collected. For each month, every satellite pass recorded was extracted from the archive and manually checked for quality. Any navigation problems were corrected and erroneous data were removed from individual images. All individual cloud-free images for each month that met the quality control requirements were then used to calculate the NDVI and composited to create a single monthly maximum value composite file (Holben 1986; Marçal & Wright 1997). NDVI is based on the reflectance difference that green vegetation displays between the visible region and the near infrared region of the electromagnetic spectrum in channels 1 and 2 of the AVHRR images. Values of NDVI lie between −1 and +1, but only the positive values correspond with vegetated zones. The value of NDVI can vary depending on land use, season and climate.
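The index and the compositing step described above can be sketched as follows. This is a minimal illustration, not the processing chain used at Dundee or Plymouth: it assumes the standard NDVI definition (channel 2 minus channel 1, divided by their sum), and the function names and the use of None for rejected pixels are hypothetical.

```python
def ndvi(red, nir):
    """NDVI from channel 1 (visible) and channel 2 (near infrared) reflectances."""
    total = nir + red
    if total == 0:
        return None  # undefined when both reflectances are zero
    return (nir - red) / total


def monthly_max_composite(passes):
    """Maximum value composite: per-pixel maximum NDVI over one month's passes.

    `passes` is a list of images; each image is a list of per-pixel NDVI
    values, with None marking cloud-contaminated or rejected pixels
    (an assumed convention for this sketch).
    """
    n_pixels = len(passes[0])
    composite = []
    for i in range(n_pixels):
        valid = [img[i] for img in passes if img[i] is not None]
        composite.append(max(valid) if valid else None)
    return composite


# Three passes over two pixels; the second pixel is usable in one pass only.
month = monthly_max_composite([[0.2, None], [0.5, None], [0.1, 0.3]])
```

Taking the maximum favours the clearest view of each pixel, because residual cloud and haze tend to lower NDVI.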
Osborne, Alonso & Bryant (2001) used sample monthly values along the 12-month time-series to characterize vegetation types. However, in scaling-up to the whole of Spain small differences in the timing of seasons and agricultural production in different regions affect this approach and we sought a more general solution. Andres, Salas & Skole (1994) and Olsson & Eklundh (1994) suggest using temporal Fourier analysis to summarize vegetation time-series, while Eastman & Fulk (1993) applied standardized principal components analysis (PCA). PCA is attractive because it decomposes the time-series into a sequence of spatial and temporal components that may often be interpreted as corresponding to particular environmental features or events. By including all the months in the definition of each component, small differences in timing between regions become masked. Typically the first component indicates the characteristic value of the variable, whereas subsequent components represent change elements of decreasing magnitude (Eastman & Fulk 1993). In contrast to its usual use in ecology, PCA applied to image time-series is not used primarily for data reduction and we replaced our original 12-monthly variables with 12 components for inclusion in the predictive models (Table 1).
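As a rough illustration of standardized PCA on an image time-series, the sketch below builds the month-by-month correlation matrix (standardizing each month is what makes the PCA "standardized") and extracts only the first component by power iteration. It is an assumption-laden toy, not the IDRISI routine: the function names are hypothetical, only the leading component is computed, and months with zero variance are not handled.

```python
import math


def correlation_matrix(series):
    """series[m][i] is the NDVI of pixel i in month m; returns month-by-month correlations."""
    n_pix = len(series[0])
    z = []
    for row in series:
        mean = sum(row) / n_pix
        sd = math.sqrt(sum((v - mean) ** 2 for v in row) / n_pix)
        z.append([(v - mean) / sd for v in row])  # standardize each month
    n_months = len(series)
    return [[sum(a * b for a, b in zip(z[j], z[k])) / n_pix
             for k in range(n_months)] for j in range(n_months)]


def first_component(corr, iterations=500):
    """Leading eigenvector and eigenvalue of the correlation matrix by power iteration."""
    n = len(corr)
    vec = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        new = [sum(corr[j][k] * vec[k] for k in range(n)) for j in range(n)]
        norm = math.sqrt(sum(v * v for v in new))
        vec = [v / norm for v in new]
    eigenvalue = sum(vec[j] * sum(corr[j][k] * vec[k] for k in range(n))
                     for j in range(n))
    return vec, eigenvalue


# Toy series: three months, four pixels, identical spatial pattern each month.
toy = [[0.1, 0.2, 0.3, 0.4], [0.3, 0.5, 0.7, 0.9], [0.15, 0.25, 0.35, 0.45]]
vec, eigenvalue = first_component(correlation_matrix(toy))
percent_explained = 100.0 * eigenvalue / len(toy)
```

When every month shares the same spatial pattern, as in the toy data, the first component carries all of the variance with equal weights on each month, which mirrors the "characteristic value" role described for component 1.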
Table 1. Predictor variables used for modelling the occurrence of agricultural steppe birds in Spain
Variable: Description
Principal components (PC) 1–12: Standardized principal components obtained from the normalized difference vegetation index for each month based on a maximum value composite of AVHRR imagery at 1-km2 resolution
Altitude: Mean altitude within a 5 × 5 array of 200-m pixels
Topographic variability 5 (TOPOV5): Variation in altitude in a 5 × 5 pixel array of 200-m pixels, where altitude is measured to 5-m vertical resolution. Calculated as TOPOVx = (n − 1)/(p − 1), where n = number of different altitude classes in the array, p = number of pixels in the array (i.e. 25) and x is the vertical resolution
Topographic variability 10 (TOPOV10): As for TOPOV5 but with 10-m vertical resolution
Road density (ROADDEN): Proportion of 200-m pixels in a 5 × 5 array containing roads
Road distance (ROADDIST): Distance in km to the nearest 200-m pixel containing roads. Calculated at 200-m resolution and averaged to 1 km2
Town density (TOWNDEN): Proportion of 200-m pixels in a 5 × 5 array containing buildings or large built structures such as airfields
Town distance (TOWNDIST): Distance in km to the nearest 200-m pixel containing buildings or large built structures such as airfields. Calculated at 200-m resolution and averaged to 1 km2
River density (RIVDEN): Proportion of 200-m pixels in a 5 × 5 array containing rivers
River distance (RIVDIST): Distance in km to the nearest 200-m pixel containing rivers. Calculated at 200-m resolution and averaged to 1 km2
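The TOPOVx formula and the density measures in Table 1 can be sketched for a single 5 × 5 array as below. The function names are hypothetical, and assigning altitude classes by integer division of the altitude by the vertical resolution is an assumption about how "measured to x-m vertical resolution" was implemented.

```python
def topov(altitudes_m, vertical_resolution_m):
    """TOPOVx = (n - 1)/(p - 1) for one array of 200-m pixel altitudes.

    n is the number of distinct altitude classes and p the number of pixels.
    Classes are formed by integer division (an assumption of this sketch).
    """
    classes = {int(a // vertical_resolution_m) for a in altitudes_m}
    n = len(classes)
    p = len(altitudes_m)
    return (n - 1) / (p - 1)


def density(flags):
    """Proportion of pixels in the array flagged as containing a feature (e.g. roads)."""
    return sum(1 for f in flags if f) / len(flags)


# 25 pixels rising in 5-m steps from 800 m to 920 m.
slope = [800 + 5 * k for k in range(25)]
topov5 = topov(slope, 5)    # every pixel falls in its own 5-m class
topov10 = topov(slope, 10)  # neighbouring pixels share 10-m classes
```

The index runs from 0 for perfectly flat ground (one class) to 1 when every pixel falls in a different class, so the coarser vertical resolution gives lower values on the same terrain.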
A digital terrain model (DTM) was built at 200 m resolution in the Instituto de Recursos Naturales y Ordenación del Territorio (INDUROT, University of Oviedo, Oviedo, Spain) for all topographic variables. Line data (rivers, roads and housing) for the whole of Spain were obtained at 1: 200 000 scale from the Centro Nacional de Información Geográfica, Madrid, Spain. These vector data were rasterized to 200-m resolution in IDRISI (Eastman 2000) to provide a matrix of 5 × 5 cells per 1 km2 to be used in the modelling. Predictor variables were created at 1-km2 resolution by summarizing the information in the higher resolution layers into measures of average density, distance or spread as appropriate (Table 1). The topographic variability measures TOPOV5 and TOPOV10 (Table 1) were alternatives designed to assess responses to different vertical resolutions in elevation data. As they showed strong collinearity, only one was selected for modelling with each species on the basis of its rank correlation with the presence–absence data (TOPOV5 for great bustard; TOPOV10 for little bustard and calandra lark). Simple correlations were run between all variables before model building to retain only one variable per pair where correlations exceeded 0·8.
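The screening for collinear predictors can be sketched as a greedy filter. This is only one possible reading of the procedure: Pearson correlation is assumed for the "simple correlations", and keeping the earlier-listed variable of each correlated pair is a hypothetical tie-breaking rule, whereas for TOPOV5 and TOPOV10 the text chose between them by rank correlation with the presence–absence data.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def drop_collinear(variables, threshold=0.8):
    """Keep a variable only if its |r| with every variable already kept is <= threshold.

    `variables` maps a name to its list of values; order decides which member
    of a correlated pair survives (an assumption of this sketch).
    """
    kept = []
    for name in variables:
        if all(abs(pearson(variables[name], variables[k])) <= threshold
               for k in kept):
            kept.append(name)
    return kept


# Made-up values for five pixels, purely for illustration.
candidates = {
    "TOPOV5": [0, 1, 2, 3, 4],
    "TOPOV10": [0, 1, 2, 3, 5],
    "ROADDEN": [3, 0, 4, 1, 2],
}
retained = drop_collinear(candidates)
```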
In the absence of interactions with other species or environmental factors, we might expect bird responses to vegetation, terrain and disturbance to follow one of two forms of relationship (Fig. 2). For the vegetation indices and topography, an optimum value is likely, whereas for disturbance birds may simply prefer the least disturbed areas (i.e. there is an asymptote beyond which an area is considered safe). Both forms of response may be modelled using the polynomial X + X², which allows for cases where the data range along the x-axis is too short to exceed the linear range (bold lines in Fig. 2) and optionally provides for a single curve in the data (faint lines). For presence–absence data, a generalized linear model (GLM; McCullagh & Nelder 1989) with a logit link function and binomial error term is appropriate, and we achieved success in building exploratory models in this way (S. Suárez-Seoane, P. E. Osborne & J. C. Alonso, unpublished data). However, the idealized form of the functions in Fig. 2 may rarely be observed in nature because of complex interactions. We found that the less restrictive modelling environment provided by generalized additive models (GAMs; Hastie & Tibshirani 1990) produced better predictions and simpler models over the large area of Spain (Pearce & Ferrier 2000). In GAMs the shape of the response curve is modelled as a series of smoothing splines (in the same manner as a scatterplot smoother) dictated by the data. Furthermore, this freedom from an assumed functional form is enormously advantageous in ecology because observed differences in response curves actually reflect species-specific habitat selection.
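For reference, the two model forms contrasted above can be written out as follows. The symbols (p for the probability of presence, the coefficients β and α, and the smooth functions f_j) are notation introduced here for illustration and do not appear in the original text.

```latex
% GLM with a logit link and the quadratic form in one predictor X
\operatorname{logit}(p) = \ln\frac{p}{1-p} = \beta_0 + \beta_1 X + \beta_2 X^2

% When \beta_2 < 0 the fitted response has a single optimum at
X^{*} = -\frac{\beta_1}{2\beta_2}

% GAM: each predictor X_j enters through a smooth function f_j estimated from the data
\operatorname{logit}(p) = \alpha + \sum_{j} f_j(X_j)
```

The GLM fixes the shape of each response in advance, whereas in the GAM the shape of each f_j is left to the smoother.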
We used S-plus 2000 (Venables & Ripley 1999) and the GRASP (Generalized Regression Analysis and Spatial Prediction) interface (Lehmann, Leathwick & Overton 2001) to fit cubic splines with four degrees of freedom for each predictor, using a logit link and binomial error structure. Parsimonious models were generated using backwards selection with a chi-squared value of 0·05 (Pearce & Ferrier 2000) for the variable to remain in the equation. For each selected variable we tested whether the smoothed term was significant over a linear model and replaced non-significant smooths with linear terms to prevent over-fitting to the data. Terms were then dropped one by one from the final equation and their contribution to the model assessed using a likelihood ratio test (Venables & Ripley 1999). We assessed the predictive performance of the final model through 10-fold cross-validation. The data were divided into 10 groups of points drawn at random from across the geographical range, then each group was dropped in turn and predictions made for the excluded group based on the remaining 90% of data points. This technique is similar to jack-knifing (leave-one-out assessment) (Osborne & Tigar 1992) but is more robust because dropping 10% of data points perturbs the model more and gives a better reflection of its performance on new data (Verbyla & Litvaitis 1989; Fielding & Bell 1997). Both the final model (i.e. the ‘fit’) and the cross-validated model (i.e. prediction success) were assessed using the area under the receiver operating characteristic (ROC) curve (AUC) (Beck & Shultz 1986; Fielding & Bell 1997; Osborne, Alonso & Bryant 2001).
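The cross-validation and AUC assessment can be sketched independently of the fitting software. In the sketch below, `fit` is a hypothetical stand-in for the GAM fitting step: it takes training predictors and labels and returns a prediction function. The AUC is computed from its rank interpretation (the probability that a randomly chosen presence scores higher than a randomly chosen absence), which is one standard way to obtain it and not necessarily the routine used in the study.

```python
import random


def auc(scores, labels):
    """Area under the ROC curve from pairwise comparisons; ties count one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def cross_validated_scores(xs, labels, fit, k=10, seed=0):
    """Predict every point from a model fitted without that point's fold."""
    order = list(range(len(xs)))
    random.Random(seed).shuffle(order)  # random groups across the data set
    scores = [None] * len(xs)
    for fold in range(k):
        held_out = order[fold::k]
        held_out_set = set(held_out)
        train = [i for i in order if i not in held_out_set]
        predict = fit([xs[i] for i in train], [labels[i] for i in train])
        for i in held_out:
            scores[i] = predict(xs[i])
    return scores


# Trivial stand-in model: the predictor value itself is used as the score.
xs = list(range(20))
labels = [0] * 10 + [1] * 10
cv_scores = cross_validated_scores(xs, labels, lambda X, y: (lambda x: float(x)))
cv_auc = auc(cv_scores, labels)
```

Comparing the AUC of the fitted values with the AUC of `cv_scores` gives the "fit" versus "prediction success" contrast described above.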
Outputs from the GAMs were produced using IDRISI 32.11 (Eastman 2000) and consisted of a habitat suitability index (HSI) for each species with a scale from 0 to 1. The method for doing this is described in Appendix 1. To define an objective cut point above which to consider the species as present, we classified the HSI at all cut points between 0·0 and 1·0 with an interval of 0·1 into predicted presence and absence, and compared the result with the actual data. We chose to split the data at the cut point that maximized the percentage of correct presences and absences (Pereira & Itami 1991; Fielding & Bell 1997; Brito, Crespo & Paulo 1999). Important and priority areas for the conservation of steppe birds were identified by combining the HSI surfaces for the three species. First, important areas were defined by applying the cut points to the models and drawing a map indicating where three, two or only one species was predicted to occur. Priority areas proposed for special conservation were identified by defining areas with HSI values for all species of 0·9, 0·8, 0·7 and 0·6 or better.
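The cut-point search and the overlay of species surfaces can be sketched as below. The function names are hypothetical, treating an HSI equal to the cut point as a predicted presence is an assumption, and the step is left as a parameter defaulting to the 0·1 interval stated in the text.

```python
def best_cut_point(hsi, observed, step=0.1):
    """Return (cut point, % correct) maximizing correctly classified presences and absences."""
    n_steps = round(1.0 / step)
    best = None
    for j in range(n_steps + 1):
        cut = j * step
        correct = sum(1 for h, y in zip(hsi, observed)
                      if (h >= cut) == (y == 1))
        pct = 100.0 * correct / len(hsi)
        if best is None or pct > best[1]:  # first maximum wins on ties
            best = (cut, pct)
    return best


def species_count(hsi_by_species, cuts):
    """Per pixel, the number of species whose HSI reaches that species' cut point."""
    names = list(hsi_by_species)
    n_pixels = len(hsi_by_species[names[0]])
    return [sum(1 for s in names if hsi_by_species[s][i] >= cuts[s])
            for i in range(n_pixels)]


def priority_pixels(hsi_by_species, level):
    """Per pixel, True where every species has an HSI of `level` or better."""
    names = list(hsi_by_species)
    n_pixels = len(hsi_by_species[names[0]])
    return [all(hsi_by_species[s][i] >= level for s in names)
            for i in range(n_pixels)]


# Made-up HSI values for three pixels, purely for illustration.
surfaces = {
    "great bustard": [0.70, 0.50, 0.95],
    "little bustard": [0.60, 0.50, 0.90],
    "calandra lark": [0.55, 0.70, 0.92],
}
cuts = {"great bustard": 0.62, "little bustard": 0.58, "calandra lark": 0.60}
important = species_count(surfaces, cuts)
priority = priority_pixels(surfaces, 0.9)
```

With equal numbers of presences and absences, as in the matched samples used here, maximizing the overall percentage correct weights the two kinds of error equally.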
Because the form of the response curve in a GAM is not constrained by a fixed equation, it indicates how species respond to their environments under conditions of competing species and interactions. To generate response curves we calculated the terms for each selected variable for each case in the data set. When added together with the constant, these equal the logit of the response. We averaged these for each variable to produce a mean logit response (to scale the response curves correctly). For each variable of interest in turn, we then replaced the average term by estimated terms for a series of i-values, Vi, between the maximum and minimum for the variable, and calculated the response using the equation:
response = 1/[exp(−logit expression) + 1]
where logit expression = Σ mean terms for variables not being examined + constant + Vi.
A plot of the response against the values Vi shows how the species respond to the variable when all other terms are held constant at their average values.
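The response-curve calculation above amounts to an inverse-logit transform of a sum of terms, and can be sketched as follows. The function name, the dictionary layout and the numbers in the example are hypothetical; the estimated terms Vi are taken as given and would in practice come from the fitted GAM.

```python
import math


def response_curve(mean_terms, constant, variable, term_values):
    """Response for each estimated term V_i of `variable`, other terms held at their means.

    mean_terms maps each selected variable to its mean fitted term;
    term_values holds the estimated terms V_i for the variable of interest.
    """
    other = sum(t for name, t in mean_terms.items() if name != variable)
    return [1.0 / (math.exp(-(other + constant + v)) + 1.0)
            for v in term_values]


# Made-up terms: the other variable contributes -0.5 and the constant is 0.
curve = response_curve({"ALTITUDE": 0.2, "PC1": -0.5}, 0.0, "ALTITUDE",
                       [-1.5, 0.5, 2.5])
```

In the example the middle value gives a logit of zero and hence a response of 0·5, with the response rising as the term for the variable of interest increases.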
Results and interpretation
PRINCIPAL COMPONENTS ANALYSIS OF TEMPORAL NDVI CURVES
Figure 3 and Table 2 show the factor loadings and the variance explained for the 12 vegetation components extracted by standardized PCA of the NDVI time-series. Component 1 showed high and consistent loadings over the year and represented the weighted average pattern in NDVI. This is typical of PCA applied to image time-series and showed that the major source of variability is spatial rather than temporal (Eastman & Fulk 1993). Component 2 showed that the next most important source of variation was the annual cycle in green biomass, with a strong difference between winter and summer productivity. A different annual cycle was apparent in component 3, which contrasted the spring with other times of year, and subsequent components similarly contrasted individual months or seasons throughout the year.
Table 2. Percentage of the original variance explained by the vegetation components extracted by standardized PCA of the 12-month NDVI time-series
For all three species, the GAM equations included terms for vegetation, disturbance and topography (Table 3). The great bustard model included three linear and 11 smooth terms, and was dominated by contributions from altitude and vegetation components 1 and 3. Three linear and seven smooth terms were significant for the little bustard but none particularly dominated the equation. Vegetation components 1 and 2 and altitude made the largest contributions. For the calandra lark the balance of the equation was different, with only five smooth terms relative to 10 linear variables. Again, vegetation components 1 and 3 dominated the model, followed by topographic variation. The models for all species fitted the data well, with ROC scores of 0·9 or better, which reduced by only small amounts when cross-validated (Table 3). This suggested that the models were robust and could accurately predict the occurrence of species elsewhere within the range of predictor variables used (e.g. in neighbouring Portugal).
Table 3. Significance of each selected variable in the predictive models. The terms are the results of F-tests for the change in likelihood ratio after dropping the variable from the model. All results significant at P < 0·05 except †, which was not significant when expressed as a linear term but retained because of its significance as a smooth term. NT, not tested; L+, positive linear term; L–, negative linear term; S, smooth term. The vegetation variables PC 8 and PC 11 were not selected in any model
The three species displayed very different predicted breeding distributions (Fig. 4). The calandra lark was the most widely distributed species, its presence across Mediterranean peninsular Spain corresponding to the main steppe areas (Suárez et al. 1991; Suárez, Hérranz & Yanes 1996). These are the northern and southern subplateaux, plateaux of Iberian and Central Systems, Ebro and Guadalquivir Valleys, and the steppes of the south-east peninsula and Extremadura. These seven core areas are relatively well connected by suitable habitat, as defined by our HSI.
The little bustard had a more scattered distribution, with core areas in the Ebro Valley, Madrid-Castilla La Mancha and the Extremadura steppes. Other suitable but more fragmented areas were Castilla León, the Guadalquivir Valley and southern steppes. It was remarkable that the model correctly identified the small suitable areas in Galicia (north-west Spain) as data were sparse for this region. The identified areas corresponded to Terra Chá and other scattered patches in Lugo and Coristanco-Bergantiños in La Coruña where a few remaining little bustards are thought to breed. La Limia, traditionally considered a good area for breeding, appeared to be unsuitable, probably because of recent landscape changes (i.e. intensification and land extensification; Arcos et al. 1998).
The predicted great bustard distribution was generally coincident with the map shown in Alonso & Alonso (1996). The bird was mainly present in the two subplateaux, showing smaller suitable areas in La Serena (Badajoz) to northern Córdoba, and the plateaux of the Iberian and Central Systems. Other marginal areas appeared to be the Guadalquivir and Ebro Valleys, where great bustard habitat was highly fragmented.
DATA DOMAIN MAPS
Errors in the distribution maps (Fig. 4) are most likely when predictions are made beyond the data domain used in the analysis. Figure 5 shows all locations where the values of one or more predictor variables were outside the data domain. The maps showed that we mainly undersampled the high mountains of the Pyrenees, the Picos de Europa and the Sierra Nevada, but also the south-west near the Doñana National Park. Based on 411 522 pixels being modelled, one variable lay beyond the data domain in 3·6% of cases for great bustard, 3·8% for little bustard and 2·8% for calandra lark. Fewer than 0·1% of cases had three or more variables out of range for the three species. Over much of Spain, and especially where the species might be expected to occur, our sampling was adequate in both geographical and environmental space.
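A data domain check of this kind can be sketched as below. It assumes that the data domain of each predictor is simply the minimum-to-maximum range observed at the surveyed pixels, which the text does not state explicitly; the function names and example values are hypothetical.

```python
def data_domain(samples):
    """Minimum and maximum of each predictor across the surveyed pixels."""
    names = samples[0].keys()
    return {n: (min(s[n] for s in samples), max(s[n] for s in samples))
            for n in names}


def out_of_domain_count(pixel, domain):
    """Number of predictors whose value at this pixel lies outside the sampled range."""
    return sum(1 for name, value in pixel.items()
               if not (domain[name][0] <= value <= domain[name][1]))


# Made-up survey values and two map pixels, purely for illustration.
surveyed = [
    {"ALTITUDE": 300, "ROADDIST": 0.2},
    {"ALTITUDE": 1100, "ROADDIST": 3.5},
    {"ALTITUDE": 750, "ROADDIST": 1.0},
]
domain = data_domain(surveyed)
lowland = out_of_domain_count({"ALTITUDE": 640, "ROADDIST": 2.0}, domain)
high_peak = out_of_domain_count({"ALTITUDE": 2900, "ROADDIST": 9.0}, domain)
```

Mapping the per-pixel count gives the kind of surface shown in Fig. 5, where non-zero values flag predictions that extrapolate beyond the sampled conditions.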
Each species showed a preference for a particular altitude range with large overlaps but also distinct optima (Fig. 6). Little bustards showed a broad peak across the 300–700-m range with a long tail into higher altitudes, whereas great bustards showed a narrow symmetrical peak around 800 m. Calandra larks preferred higher ground, around 1100 m. Looking at topographic variation, both bustards favoured some undulations whereas calandra larks steeply declined in response to a small increase in variation over flat ground. The response to vegetation component 3 was similar for all species, with a marked decline in the suitability of habitats that did not show the sharp drop in the temporal curve in April depicted in Fig. 3. In contrast, responses to vegetation component 1 differed among species, particularly for the great bustard, which showed an optimum.
IMPORTANT AREAS FOR STEPPE CONSERVATION
Optimum cut points obtained for great bustard (0·62), little bustard (0·58) and calandra lark (0·60) correctly classified 89·2%, 82·3% and 81·6% of species’ presences, respectively (Fig. 7). Absences were predicted slightly better than presences for great bustard and calandra lark, and exactly equally for the little bustard (Table 4). These cut points were used to generate Fig. 8, which shows the areas where the three species were predicted to co-occur. These corresponded to the main agricultural steppe areas in Spain, plus a few isolated areas such as in Galicia. The species were almost wholly absent from other areas of Spain, and they can therefore be considered as good indicators of agricultural steppe habitats in general.
Table 4. Number of presences and absences correctly predicted at the optimum cut point for each species
Rows: % of presences correctly predicted; % of absences correctly predicted; overall percentage of points correctly predicted
The largest continuous extent of high-quality habitat (i.e. where the HSI for all three species was greater than 0·6) was predicted to occur in Castilla La Mancha, Madrid and Castilla León (i.e. the northern and southern subplateaux). Other key but smaller areas were La Serena (Badajoz), southern Cáceres province, Genil Valley in Granada, western Murcia and some patches in the Ebro Valley (especially at the border between Zaragoza and Teruel). Note that the Guadalquivir Valley did not contain any high-quality areas according to our predictions.
Figures 9 and 10 illustrate the potential of the models as a tool for identifying priority areas for conservation. The example focuses on the two main Spanish agricultural steppe areas in Castilla La Mancha and Castilla León and compares existing dry cropland special protection areas (SPAs; Table 5) with the occurrence of high-quality habitat predicted by the models.
Table 5. The main special protection areas (SPAs) for agricultural steppe lands in Castilla León and Castilla La Mancha marked on Figs 9 and 10
Castilla León: Lagunas de Villafáfila; La Nava-Campos Norte; Tierra de Campiñas; Llanuras del Guareña; Tierra del Pan; La Nava-Campos Sur
Castilla La Mancha: Area Esteparia del Este de Albacete; Zona Esteparia de El Bonillo; Campo de Calatrava; Areas Esteparias del Campo de Montiel; Estepas cerealistas de La Campiña; Llanuras de Oropesa, Lagartera y Calera y Chozas; Area Esteparia de La Mancha Norte
High-quality habitat appeared to be patchily distributed in Castilla León but the network of existing SPAs covers a geographically representative selection of good areas. In Castilla La Mancha, however, larger areas of good habitat exist but without adequate European legal protection, the network of existing SPAs omitting the best areas to the west and south of Cuenca altogether.
Regression models have played an important role in data analysis by providing prediction and classification rules, and in identifying important predictor variables. The simple linear model, however, often fails in real life because effects are not linear and it has frequently been replaced by GLMs that achieve non-linearity through transformations. GAMs go one stage further; rather than using predefined functions as approximations to the data being modelled, the data themselves dictate the form of the function through spline smoothing. The GAM approach (as used here) has several advantages in ecology. First, by retaining the familiar regression structure, model building is relatively straightforward and the contributions of different predictors may be assessed. (This contrasts markedly with ‘black-box’ approaches such as neural nets that may not produce interpretable output.) Secondly, by providing a close fit to the original data, GAMs may produce simpler models than an equivalent GLM with several polynomial terms. Thirdly, because the data themselves characterize the non-linear functions in a GAM, these functions are realistic representations of the species’ responses to the environment and thus allow us to examine habitat selection. This is unreliable in a GLM because the researcher has predefined the modelled functions. The unconstrained way that GAMs fit models may not, however, always be ideal, for example when the form of species’ responses is already known and closely follows simple curves as in Fig. 2. Moisen & Edwards (1999) first used GAMs to find the form of the species’ response curve but then modelled that form using GLMs (Venables & Ripley 1999). This approach is certainly advantageous when the sample data are sparse because GAMs cannot reliably predict beyond the range of the data in the models (Frescino, Edwards & Moisen 2001).
Guisan & Zimmermann (2000) provide detailed comparison of the various approaches used in distribution modelling, while Hastie, Tibshirani & Friedman (2001) discuss many modern solutions to data modelling, including GAMs.
Our results show that a relatively simple set of predictor variables modelled using GAMs can accurately predict the occurrences of agricultural steppe birds at 1-km resolution over the whole of Spain. The inclusion of coarse resolution satellite imagery did not lead to noticeably better predictions for birds with larger home ranges. This may in part have been due to our inclusion of other predictors based on variation at 200-m resolution (e.g. topographic variability) although these were summarized to 1 km before modelling. No effect of spatial scale was apparent, however, for the species we modelled. Nor did the models simply identify all steppe-like habitats, because there were substantial differences between the predicted patterns of occurrence for the three species (Fig. 4). These results strengthen the findings of Osborne, Alonso & Bryant (2001) and suggest a general utility for modelling bird distributions with AVHRR data combined with other habitat variables.
Among the predictors most often selected were vegetation measures derived from a PCA of a time-series of monthly NDVI images. Ten out of 12 vegetation components were selected for at least one model, including the lower order components 10 and 12 that explained only a small fraction of the original variance. The practice of retaining lower order components as predictor variables may be unfamiliar to ecologists who have traditionally been taught that their low signal to noise ratio renders them worthless. As Eastman & Fulk (1993) found, however, these components often feature local patterns in spatial data sets and may actually have better predictive power than higher order components. PCA may be used in remote sensing for land cover discrimination (Townshend 1984) and for removing noise (e.g. stripes) from images (Eastman 2000). The crucial point is that information held on lower order components is not spatially random and can therefore be of value in predictive models.
Leathwick & Austin (2001) suggest that GAMs may be improved by constraining the range of the predictor variables to a limited number of cases (e.g. 100) above the maximum and below the minimum for locations where the species occurred. This is likely to sharpen predictive power within the most difficult data domain (i.e. where the species is either present or absent) and areas where the species is known not to occur may simply be masked out. Here we were trying to build nationwide predictions and attempted to sample across the full data range through additional fieldwork. Even so, the data domain maps in Fig. 5 indicate that we cannot be confident about predictions in some areas.
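The range restriction described above can be illustrated with a short Python sketch for a single predictor. It reflects one reading of the description and is not the cited authors' code: the function name constrain_range and the parameter n_extra are hypothetical, and the choice to keep the records lying closest to the occupied range is an assumption.

```python
def constrain_range(values, present, n_extra=100):
    """Return sorted indices of the records retained for modelling.

    values  : predictor value at each sampled location
    present : True where the species was recorded, False otherwise
    n_extra : number of records kept beyond each end of the occupied range
              (assumption: the ones closest to that range are kept)
    """
    occupied = [v for v, p in zip(values, present) if p]
    if not occupied:
        return []  # no presences, so no occupied range can be defined
    lo, hi = min(occupied), max(occupied)

    # All records inside the range spanned by the presences are kept.
    inside = [i for i, v in enumerate(values) if lo <= v <= hi]

    # Records below the minimum, nearest to it first.
    below = sorted((i for i, v in enumerate(values) if v < lo),
                   key=lambda i: values[i], reverse=True)[:n_extra]

    # Records above the maximum, nearest to it first.
    above = sorted((i for i, v in enumerate(values) if v > hi),
                   key=lambda i: values[i])[:n_extra]

    return sorted(inside + below + above)
```

With several predictors, one option would be to apply the function to each predictor in turn and keep only the records retained every time; whether that matches the intended procedure is not stated in the text.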
Different predictor variables might also produce better models, especially if they are more functionally related to the processes affecting the species’ occurrences (Austin & Meyers 1996). For example, altitude featured in the models for all three steppe birds but has no direct functional significance. Rather, it is presumably a proxy variable for climatic variables which themselves affect plant growth and the invertebrate food available to the birds. For a given altitude, we would expect rather different plant communities and insect fauna on north- and south-facing slopes, so altitude itself is an imperfect proxy for the ecological process being modelled. We are currently exploring the use of climatic predictors directly in large-scale models of bird distributions. Also, interaction terms might be added to GAMs. In our exploratory analyses with GLMs (S. Suárez-Seoane, P. E. Osborne & J. C. Alonso, unpublished data) interaction terms were necessary to achieve similar predictive performance to the GAMs here, suggesting that the response curves were inadequately modelled with polynomial equations in linear models. Interaction terms are, however, sometimes difficult to interpret ecologically and, given the predictive power of the models reported here, we did not include them.
The predicted distribution maps (Fig. 4), table of significant predictor variables (Table 3) and the response curves (Fig. 6) together indicate how three apparently co-occurring species subdivide environmental and geographical space. For the calandra lark and the little bustard, vegetation component 1 was the most important predictor. This variable is an evenly weighted average of the monthly NDVI values and therefore represents an overall index of ‘greenness’. Both species decreased in occurrence as overall greenness increased, while the great bustard preferred locations part way along the scale. We interpret this to mean that the calandra lark and little bustard are able to tolerate more arid areas. Indeed, great bustards show a preference for higher herbaceous cover, probably because it offers them more concealment. Other variables reveal different affinities. For example, component 3 (which represents the agricultural cycle) was more important in the models for great bustard and calandra lark than for the little bustard. The former two species prefer wide areas of unobstructed cereal fields, whereas little bustards select habitats with a higher diversity of strata, preferring mosaics of pastures, legumes, vineyards and fallow land (Tellería et al. 1988; Martínez 1994; Lane, Alonso & Martín 2001). Subtle differences in interspecific habitat use were also apparent in the topographic variables. Each species showed a peak response to altitude but at different elevations, the great bustard using the most limited range. Calandra larks preferred flat terrain and their occurrence dropped off quite sharply on undulating ground. In contrast, both bustards favoured terrain with gentle undulations. Although these findings are not necessarily new (Alonso & Alonso 1990; Martínez 1998; Purroy 1997), this is the first time that they have been identified objectively from the partial fits of predictor variables in large-scale distribution models.
Indeed, confirmation of suspected responses is a strength of modelling with GAMs and gives confidence to the final distribution models.
IMPORTANT AND PRIORITY AREAS FOR CONSERVATION
Agricultural mosaics of low intensity cultivation and dry grassland constitute one of the most important landscapes of the Iberian Peninsula (De Juana et al. 1988), not only for socio-economic reasons but also for both ecological and cultural benefits (González Bernáldez 1991). The major threats to agricultural steppe birds are intensification and the opposite trend towards land abandonment, which can have equally serious effects (Suárez-Seoane, Osborne & Baudry 2002). Other risks are reafforestation of marginal lands and overgrazing (the increased sheep numbers of the late 1980s and early 1990s appear to be associated with a dramatic decline in bird populations, particularly the little bustard; Beaufoy 1998). Despite the fact that around 52% and 60% of the world populations of great and little bustards, respectively, occur in Spain, agricultural steppes are not a priority habitat in the Habitats Directive. The Natura 2000 sites are intended to include a selection of the most representative and best conserved areas, which will ensure that the habitat types and species are maintained in a favourable conservation status (Beaufoy 1998). While for many scientific purposes the habitat suitability indices generated by the models are most informative, some simplification through thresholding (Fig. 8) is often advantageous for communicating the value of the results. The derived maps in Figs 9 and 10 are a useful first approach for proposing important areas and special protection areas for birds as the European Union requests (79/409/CEE). They clearly indicate inconsistencies in the designation of areas in Castilla León and Castilla La Mancha. Furthermore, they suggest precise locations for fieldwork to verify the importance of the areas, and they define boundaries in an objective way that may be helpful in discussing conservation needs with landowners.
Spanish bird data were kindly provided by volunteers in the SACRE Program (co-ordinated by Ramon Martín and Juan Carlos del Moral from SEO-BirdLife), the Juntas of Castilla León and Castilla La Mancha, J. Bustamante, J. Seoane, E. de Juana, C. Martínez, J. Estrada, A. Folch, S. Mañosa, J. Bonfil, F. González, X. Vázquez Pumariño, G. Martínez, J. Rubio, C. Astraín, A. Etxeberría, B. Campos, M. López and J. Serradilla. B. Llamas assisted with our fieldwork; P. García-Manteca helped to build the DTM in INDUROT (Universidad de Oviedo); F. Gómez Correa measured several thousands of coordinates from maps and I. Prieto Sarro helped to project them; C. Viada and A. Onrubia from SEO-BirdLife provided recent data on Spanish SPAs and Important Bird Areas. Our modelling procedures benefited from discussions with Anthony Lehmann, John Leathwick, Simon Berry, Simon Ferrier, Jennie Pearce and Trevor Hastie. Neil Lonie and Luke Tudor are thanked for extracting and processing the satellite imagery, which was funded by the Natural Environment Research Council. Susana Suárez-Seoane was funded through a Marie Curie postdoctoral fellowship within the European Commission's Environment and Climate programme (ENV4-CT98-5130). The Carnegie Trust funded the colour figure.
Mapping the output from a GAM within a GIS is more complicated than mapping a GLM because the analysis is based on splines. GRASP (Lehmann, Leathwick & Overton 2001) provides a direct means for doing this using ArcView and our method is essentially the same for IDRISI. In both cases the analyses are performed using S-Plus and involve creating a look-up table that is exported to the GIS.
1. In IDRISI, calculate the minimum and maximum value for each predictor variable (i.e. coverage or image).
2. Use a spreadsheet to calculate 50 equally spaced values between these limits. Label each variable with the same name as used during the analysis.
3. Import this spreadsheet file to S-Plus. Issue the command predict(modelobject, filename, type = "terms"), where modelobject is the name of the final GAM model and filename is the name of the spreadsheet file. The result is the calculated value of the logit term for the 50 possible values for each variable. Crucially, the sum of these plus the constant (output at the bottom of the table) equals the logit part of the model.
4. In IDRISI, perform an equal interval reclassification of each predictor image or coverage between the maximum and minimum values from (1) above.
5. Now use the IDRISI command Assign to assign the values output from S-Plus to the newly reclassed images from (4). The resultant images will now show the logit response for each variable.
6. To calculate the response surface, simply sum the images from (5) together with the constant and apply the usual transform HSI = 1/[1 + exp(–image sum)].
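The arithmetic behind the six steps can be sketched in Python for a single pixel. This is an illustration under stated assumptions, not the S-Plus and IDRISI workflow itself: the function names (equally_spaced, nearest_index, hsi) and the layout of the look-up table are hypothetical, the per-variable logit terms are assumed to have been exported already from the fitted GAM, and each pixel value is matched to the nearest of the equally spaced look-up values, which only approximates an equal interval reclassification.

```python
import math


def equally_spaced(vmin, vmax, n=50):
    """Step 2: n equally spaced values from vmin to vmax inclusive."""
    step = (vmax - vmin) / (n - 1)
    return [vmin + k * step for k in range(n)]


def nearest_index(value, vmin, vmax, n):
    """Steps 4 and 5: index of the look-up value nearest to a pixel value.

    Values outside [vmin, vmax] are clamped to the first or last entry.
    """
    if vmax == vmin:
        return 0
    pos = (value - vmin) / (vmax - vmin) * (n - 1)
    k = int(pos + 0.5)  # round to the nearest look-up position
    return min(max(k, 0), n - 1)


def hsi(pixel, lookups, constant):
    """Step 6: sum the logit terms and the constant, then back-transform.

    pixel    : dict mapping predictor name -> value at this pixel
    lookups  : dict mapping predictor name -> (vmin, vmax, terms), where
               terms holds the logit term for each look-up value
               (hypothetical structure for this sketch)
    constant : the model constant reported with the terms
    """
    logit = constant
    for name, (vmin, vmax, terms) in lookups.items():
        k = nearest_index(pixel[name], vmin, vmax, len(terms))
        logit += terms[k]
    return 1.0 / (1.0 + math.exp(-logit))
```

Applying hsi to every pixel of the predictor images would give the habitat suitability surface. In a GIS the same result is obtained image by image, as in steps 4 to 6, rather than pixel by pixel.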