The modelling approaches for (2) and (3) derive from different sampling motivations. In (2), biologists wish to contrast used or consumed resource units such as plots of land, denning or nesting sites, prey or food items, with characteristics of resource units that have not been used or where use has not been recorded. Plants provide the clearest example of this view, where individuals are either present or truly absent at any given point on the landscape, within a given time-frame. Models provide predictions of the relative probability of a resource unit being used, given its characteristics. This differs from the motivation behind (3), where all resource units within the sampling domain are assumed to be available to be used, but some are used more frequently than others. Radiotelemetry studies of species such as grizzly bears Ursus arctos provide an example of this view, where bears might potentially be recorded at any point within their home range, but some locations are used more frequently than others. The difference between these sampling motivations is subtle, but explains the historical development of different approaches for similar problems.
Describing the distribution of the presence-only records
This first group of modelling techniques, termed profile techniques, seeks to characterize environmental conditions associated with the presence records without reference to other data points. Environmental envelope techniques are the most widely applied (e.g. Busby 1986; Caughley et al. 1987; Lindenmayer et al. 1991; Law 1994; Pearce & Lindenmayer 1998; Walther, Wisz & Rahbek 2004). Chief among these techniques have been BIOCLIM (Busby 1986, 1991) and HABITAT (Walker & Cocks 1991). Environmental envelopes enclose presence records within a multidimensional envelope in environmental space. The various techniques use different classification algorithms, but often provide similar results. Predictions are typically summarized as the degree of classification within subenvelopes.
A recent variation on this approach has been the development of support vector machines (SVM) for one-class problems (e.g. Guo et al. 2005). SVMs seek to identify an environmental envelope or hyperspace containing the data points, in which the envelope is optimized with respect to the number of points in the envelope and to the number of outliers. The distance between a point and the centre of the hyperspace determines membership of the hyperspace. The advantage of this approach over BIOCLIM, for example, is that the SVM hyperspace can be any shape, whereas BIOCLIM uses hyperboxes to enclose the presence data (Guo et al. 2005). HABITAT is also more flexible than BIOCLIM, defining the environmental envelope using a convex hull and the relative density of observations within environmental space. SVM may therefore be considered a refinement of the HABITAT approach.
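The hyperbox form of envelope mentioned above can be sketched in a few lines. This is a minimal illustration only, not the published algorithm: it assumes each presence record is a tuple of environmental values, builds a min-max box per variable, and tests whether a candidate site falls inside it. The function names and the sample values are hypothetical.

```python
def fit_envelope(presences):
    """Return per-variable (min, max) bounds that enclose all presence records."""
    columns = list(zip(*presences))
    return [(min(col), max(col)) for col in columns]


def in_envelope(bounds, site):
    """True if every environmental value at the site lies within its bounds."""
    return all(lo <= value <= hi for (lo, hi), value in zip(bounds, site))


# Hypothetical presence records:
# (mean annual temperature in deg C, annual rainfall in mm)
presences = [(12.0, 800.0), (14.5, 950.0), (13.2, 700.0), (15.1, 1020.0)]
bounds = fit_envelope(presences)

inside = in_envelope(bounds, (13.0, 900.0))   # within both ranges
outside = in_envelope(bounds, (18.0, 900.0))  # temperature above the box
```

Because the box is defined by the most extreme records, a single outlying presence widens the envelope for every candidate site.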
Multivariate association methods such as DOMAIN (Carpenter, Gillison & Winter 1993) also require only presence data. DOMAIN defines the degree of similarity among presence sites in terms of environmental conditions. The method can be used to determine either environmental envelopes or a continuous map of similarity.
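A similarity-based score of this kind might be sketched as follows. The sketch assumes a Gower-type distance, with each variable scaled by its range across the presence records, and scores a candidate site by its similarity to the closest presence record. These choices, the function names, and the data are illustrative assumptions rather than a reproduction of the published software.

```python
def gower_similarity(a, b, ranges):
    """1 minus the mean range-standardized absolute difference between a and b."""
    distance = sum(abs(x - y) / r for x, y, r in zip(a, b, ranges)) / len(ranges)
    return 1.0 - distance


def similarity_score(site, presences):
    """Similarity of a site to its most similar presence record."""
    columns = list(zip(*presences))
    # Assumes every variable varies among the presence records (range > 0).
    ranges = [max(col) - min(col) for col in columns]
    return max(gower_similarity(site, p, ranges) for p in presences)


# Hypothetical presence records: (temperature in deg C, rainfall in mm)
presences = [(10.0, 500.0), (20.0, 1000.0)]

score_at_record = similarity_score((10.0, 500.0), presences)
score_midway = similarity_score((15.0, 750.0), presences)
score_near = similarity_score((12.0, 600.0), presences)
```

Mapping the score for every grid cell gives a continuous surface of similarity; applying a threshold to it gives an envelope.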
At a finer scale, utilization distributions (UD) can be used to characterize the distribution of animals. The UD is a probability density function that quantifies an individual's or group's relative use of space (van Winkle 1975). Marzluff et al. (2004) have extended this approach by modelling the intensity of use relative to environmental covariates.
Profile techniques summarize environmental characteristics at presence locations, and typically each record has equal weight within the model. Because of this, these techniques are highly sensitive to biases in the presence records. Some approaches, such as BIOCLIM, can also be highly sensitive to the inclusion of outliers. Elith & Burgman (2002) provide a discussion of the pros and cons of geographical and climatic envelope-based techniques. Predictions from presence-only approaches are generally coarse, but may be useful at meso-scales to describe poorly understood species when species records, environmental predictors, and biological understanding are scarce.
Contrasting the distribution of presence vs. pseudo-absence
Many studies have sought to apply presence–absence techniques to presence-only data by generating pseudo-absence data from background areas from which species data are missing. These sites may be selected without replacement from within the study region either randomly (Stockwell & Peterson 2002), randomly with case-weighting to reduce the effective sample size of pseudo-absences (Ferrier & Watson 1996; Ferrier et al. 2002), or by using environmentally weighted random sampling (Zaniewski, Lehmann & Overton 2002). Pseudo-absences are assumed to represent true absences, although because sites were not searched some pseudo-absences might represent presence locations (Graham et al. 2004). Generalized linear models and generalized additive models have been the most widely applied statistical methods (e.g. Ferrier et al. 2002). However, other approaches such as tree-based methods (e.g. Ferrier & Watson 1996) and genetic algorithms (e.g. GARP; Stockwell & Peters 1999) also have been considered.
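The simplest of these schemes, random selection without replacement, might be sketched as follows. The sketch assumes the study region is a grid of cells identified by (row, column), and that cells holding presence records are excluded from the background before sampling. The function name, grid size, and seed are hypothetical.

```python
import random


def sample_pseudo_absences(all_cells, presence_cells, n, seed=0):
    """Draw n background cells at random, without replacement,
    from cells that hold no presence record."""
    presence = set(presence_cells)
    background = [cell for cell in all_cells if cell not in presence]
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    return rng.sample(background, n)


# Hypothetical 10 x 10 study region with three presence cells
all_cells = [(row, col) for row in range(10) for col in range(10)]
presence_cells = [(1, 1), (4, 7), (8, 2)]

pseudo_absences = sample_pseudo_absences(all_cells, presence_cells, n=20)
```

Excluding recorded presences does not remove contamination: a background cell may still be occupied if it was never searched.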
Regression models have generally performed better than tree-based methods or genetic algorithms in predicting species presence (Ferrier & Watson 1996). Tree-based methods are expected to be highly sensitive to biases within the sample data (Hastie et al. 2001), and the underlying model used to make predictions in GARP is largely inaccessible and difficult to interpret (Elith & Burgman 2002).
When using presence-only data it is generally not possible to calculate probabilities of presence; instead we aim to predict the relative likelihood of presence. There are two reasons for this: (a) separate samples of presence and pseudo-absence data have been selected where sampling fractions are not known, and (b) the pseudo-absence data contain an unknown number of presences, and are thus a contaminated sample of absences. To understand this we examine the logistic function and its assumptions. The logistic regression model assumes that a sample is selected, and that this sample contains observations of either the presence (y = 1) or the absence (y = 0) of a species. For each observation there is a set of habitat measurements x. From this the probability of occurrence [P(y = 1|x)] can be estimated:
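$$P(y = 1 \mid \mathbf{x}) = \frac{\exp(\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p)}{1 + \exp(\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p)} \qquad \text{(eqn 1)}$$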
This assumes that presence and absence observations were recorded from a sample of resource units in which the presence of the species at a resource unit was not known prior to sampling. Thus the sample contains presence and absence sites in approximate proportion to their occurrence on the landscape. In the absence of habitat information, the probability of occurrence then can be estimated directly from the proportion of observations in the sample at which the species was present. For example, if in a sample of 100 observations, 20 contain the species, the probability of occurrence is 0·2 [= 20/(20 + 80)]. However, with presence-only data, we sample the presence locations independently and then select a sample of pseudo-absence locations, and so the proportion of presences within the sample does not represent the true prevalence of the species in the population, but rather the relative proportion chosen by the researcher. For example, suppose we have a sample of 20 presence records and we select independently a set of 80 ‘pseudo-absence’ records. In this case the apparent probability of occurrence is also 0·2 [= 20/(20 + 80)]. However, if we select 200 pseudo-absence locations, then the apparent probability of occurrence falls to 0·09 [= 20/(20 + 200)], although nothing about the species has changed.
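The dependence of the apparent prevalence on the researcher's choice of sample sizes can be checked directly. This short sketch reproduces the arithmetic of the example above; the function name is hypothetical, and the third sample size (2000) is an added illustration.

```python
def sample_prevalence(n_presence, n_pseudo_absence):
    """Proportion of presences in a combined presence/pseudo-absence sample."""
    return n_presence / (n_presence + n_pseudo_absence)


# The same 20 presence records combined with different numbers of pseudo-absences
prevalence_by_design = {
    n_pseudo: sample_prevalence(20, n_pseudo) for n_pseudo in (80, 200, 2000)
}

for n_pseudo, prevalence in prevalence_by_design.items():
    print(f"{n_pseudo} pseudo-absences: apparent prevalence = {prevalence:.3f}")
```

The value is set entirely by the sampling design, which is why the intercept of a model fitted to such data cannot be read as a baseline probability of occurrence.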
When samples for y = 1 and y = 0 are selected in advance (a case–control design), obtaining probabilities of occurrence requires modifying the logistic model to account for the probability that a location has been sampled. We do this by correcting the model using P1 and P0, the proportions of occupied and unoccupied locations, respectively, that were selected from the total number of occupied and unoccupied locations in the landscape:
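$$P(y = 1 \mid \mathbf{x}, \text{sampled}) = \frac{\exp[\beta_0 + \ln(P_1/P_0) + \beta_1 x_1 + \dots + \beta_p x_p]}{1 + \exp[\beta_0 + \ln(P_1/P_0) + \beta_1 x_1 + \dots + \beta_p x_p]} \qquad \text{(eqn 2)}$$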
In practice we rarely know what proportion of the used and unused locations we have selected in our samples, and so P0 and P1 are unknown. Model predictions using the uncorrected logistic function are therefore only relative predictions. Alternatively, we can interpret model coefficients in terms of odds ratios, where the odds that a species will be present given covariate pattern x are compared to the odds in a reference habitat, usually one in which the values for x1 to xp are set to zero (Keating & Cherry 2004). Thus:
$$\text{odds ratio} = \exp(\beta_1 x_1 + \dots + \beta_p x_p) \qquad \text{(eqn 3)}$$
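The invariance of the odds ratio to the sampling fractions can be illustrated numerically. The sketch below assumes the usual case–control form, in which sampling adds ln(P1/P0) to the intercept of the logistic model. The coefficient values, covariates, and sampling fractions are hypothetical.

```python
import math


def logistic(eta):
    """Logistic function of the linear predictor eta."""
    return 1.0 / (1.0 + math.exp(-eta))


def odds(p):
    """Odds corresponding to probability p."""
    return p / (1.0 - p)


def odds_ratio(beta0, betas, x, reference, offset=0.0):
    """Odds at covariate pattern x relative to the reference pattern,
    with an optional sampling offset added to the intercept."""
    eta_x = beta0 + offset + sum(b * v for b, v in zip(betas, x))
    eta_ref = beta0 + offset + sum(b * v for b, v in zip(betas, reference))
    return odds(logistic(eta_x)) / odds(logistic(eta_ref))


beta0, betas = -2.0, [0.8, -0.5]      # hypothetical coefficients
x, reference = [1.5, 2.0], [0.0, 0.0]

uncorrected = odds_ratio(beta0, betas, x, reference)
# Hypothetical sampling fractions: P1 = 0.5 of occupied and P0 = 0.01 of unoccupied sites
corrected = odds_ratio(beta0, betas, x, reference, offset=math.log(0.5 / 0.01))
expected = math.exp(sum(b * v for b, v in zip(betas, x)))
```

Both calls return the same value, exp(β1x1 + β2x2), because the sampling offset shifts numerator and denominator odds by the same factor. The predicted probabilities themselves do change with the offset.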
A further complication of this sampling scheme is that the process of generating pseudo-absences randomly from the landscape of interest means that these locations are actually an unknown mixture of presence and absence locations, unless the species is very rare on the landscape. Keating & Cherry (2004) discuss the difficulties of deriving probabilities of occurrence in case–control designs under these circumstances. However, unless the level of contamination (proportion of presences within the absence sample) is very high, the model may provide acceptable predictions of the relative likelihood of occurrence, or odds ratios. Based on simulations, Lancaster & Imbens (1996) obtained unbiased estimates of the coefficients βi with contamination rates less than 20%. They also provide an algorithm for dealing with situations where greater contamination rates exist. This approach seeks to calculate the predicted probability of species presence where presence locations are contrasted with control sites, which are an unknown mixture of occupied and unoccupied locations. The implementation of this approach is complex, not available in standard statistical packages, and frequently fails to converge to a unique solution (Keating & Cherry 2004). Barry, Elith & Pearce (unpublished data) provide a worked example of this approach for habitat studies.
Contrasting the distribution of presence sites with available sites
A slightly different approach has been applied in studies of wide-ranging animals. These studies do not refer to the presence or absence of a species, but rather to how well a habitat is ‘used’, usually determined through radiotelemetry studies (Frair et al. 2004). In these studies, the landscape is considered to be available to the species of interest and potentially used to some extent, but some habitats are occupied more frequently than others within a given time period. These models describe the relative probability of use for different resource units (e.g. a pixel) over the study area, as described by habitat characteristics. The distinction between this approach and the pseudo-absence approach is subtle, because in practice the sampling schemes are similar. However, the underlying conceptual difference between contrasting unoccupied-vs.-occupied locations, and used-vs.-available locations has resulted in the development of a wide range of alternative modelling approaches.
Four approaches have been used to model presence-availability. The first of these, ecological niche factor analysis (ENFA), implemented in the BIOMAPPER package (Hirzel, Hausser & Perrin 2004), is similar to profile techniques. ENFA uses factor analysis to quantify the environmental conditions of the presence sites by comparing them to the environmental conditions of the entire region of interest, and predictions are provided as a habitat suitability index (Hirzel et al. 2002; Dettki, Löfstrand & Edenius 2003; Reutter et al. 2003; Brotons et al. 2004; Chefaoui, Hortal & Lobo 2005). ENFA considers the density of points within subenvelopes of data and is therefore an improvement on presence-only approaches. This technique is generally optimistic regarding species distribution, which may be an advantage when a species does not occupy all suitable habitats on the landscape (Hirzel, Helfer & Metral 2001; Brotons et al. 2004). The two-class SVM model uses a similar approach to ENFA, except that it does not assume a particular probability distribution for the data (Guo et al. 2005).
A second approach to modelling presence-availability involves using case–control logistic regression where used resource units are contrasted with random locations within an activity area available to individuals. There are different sampling designs available to conduct this, where cases may be matched or unmatched with controls (Collett 1991; Arthur et al. 1996; Manly et al. 2002). Examples of this approach include contrasting wood turtle Clemmys insculpta locations with paired random locations (Compton, Rhymer & McCollough 2002) and contrasting superb parrot Polytelis swainsonii nest trees with paired random trees (Manning, Lindenmayer & Barry 2004). Models estimated using case–control logistic regression are based on the contrasts between used and control resource units and can be interpreted as odds ratios or relative likelihoods of occurrence (Keating & Cherry 2004). The discussion in the previous section about the contamination of controls also applies here.
A third approach proposed by Manly et al. (2002) uses logistic regression to estimate relative likelihoods using an exponential model:
$$w(\mathbf{x}) = \exp(\beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p) \qquad \text{(eqn 4)}$$
This model has been used widely in resource selection studies (e.g. Campos et al. 1997; Johnson et al. 2002; Nielsen et al. 2002; Boyce et al. 2003), rather than the logistic function, because it avoids the problem of different denominators encountered in the logistic model. However, as Manly et al. (2002: 101–102) point out, this approach assumes a particular sampling scheme. In particular, this approach requires that one sample of presence locations and one sample of available locations be taken, and that any single location selected that occurs in both the presence and the available samples be included only in the available sample. McDonald (2003) shows that the duplicate records can be removed from the available sample rather than the observed sample unless the number of duplicates is high. Manly et al. (2002) show how, with known sampling frequencies of presence and available samples, probabilities of occurrence can be calculated, although Keating & Cherry (2004) question this model. However, in practice sampling probabilities are unknown, and irrespective of the validity of the model formulation, model predictions provide relative likelihoods of occurrence (i.e. the RSF). When interpreted as relative likelihoods, it is not necessary that the predictions are constrained to lie below 1, a concern raised by Keating & Cherry (2004).
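The point that relative likelihoods need not lie below 1 can be seen in a small sketch. It assumes the exponential form w(x) = exp(β1x1 + … + βpxp), with no intercept, and uses hypothetical coefficients and resource units. The rescaling by the largest score is one common convenience, not a step prescribed by the text.

```python
import math


def rsf(betas, x):
    """Exponential resource selection function w(x); note there is no intercept."""
    return math.exp(sum(b * v for b, v in zip(betas, x)))


betas = [0.9, 0.4]  # hypothetical coefficients
units = {"A": [2.0, 1.0], "B": [0.5, 0.0], "C": [-1.0, 0.5]}  # hypothetical units

scores = {name: rsf(betas, x) for name, x in units.items()}

# Optional rescaling so the best unit scores 1; ratios between units are unchanged
top = max(scores.values())
scaled = {name: score / top for name, score in scores.items()}
```

Unit A scores well above 1, which is unproblematic when only the ranking and the ratios between units are interpreted.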
A fourth approach is to use the logistic regression algorithm to approximate a logistic discrimination model. Here we use the logistic model to estimate a function that discriminates between two distributions of habitat covariates, one set associated with locations where the species is present, f_{y=1}(x), and another set associated with random (available) locations, f_{y=0}(x) (Keating & Cherry 2004). We sample independently from each distribution, with probability π1 of a sampled observation (from the joint distribution of presence and available sites) being a presence record, and π2 of it being an available record. We can assume (Seber 1984: 308) that the probability of a species being present at a location with covariates x, given that it was sampled, is:
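$$P(y = 1 \mid \mathbf{x}, \text{sampled}) = \frac{\exp[\beta_0 + \log(\pi_1/\pi_2) + \beta_1 x_1 + \dots + \beta_p x_p]}{1 + \exp[\beta_0 + \log(\pi_1/\pi_2) + \beta_1 x_1 + \dots + \beta_p x_p]} \qquad \text{(eqn 5)}$$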
We can combine the sampling constant log(π1/π2) with the intercept term β0. Because we have no information on the sampling proportions, we can calculate only the relative probability of occurrence (dropping the intercept term). This approach is suitable for discriminating between random sites and sites at which the species has been observed. Naturally the discriminant function cannot discriminate between sites at which a species was present and sites at which it was absent (from a contaminated sample of occupied and unoccupied locations) (Keating & Cherry 2004). Again, predictions need not be constrained to lie below 1, because predictions are relative likelihoods rather than probabilities of occurrence.
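How the sampling constant enters only the intercept can be checked numerically in a simple case. The sketch assumes a single covariate that is normally distributed with a common standard deviation at presence and at available locations, for which the log density ratio is exactly linear in x. All parameter values and function names are hypothetical.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posterior_presence(x, pi1, mu1, mu0, sd):
    """Bayes' rule: probability that a sampled record at x is a presence record."""
    pi2 = 1.0 - pi1
    numerator = pi1 * normal_pdf(x, mu1, sd)
    return numerator / (numerator + pi2 * normal_pdf(x, mu0, sd))


def logistic_form(x, pi1, mu1, mu0, sd):
    """The same probability written as a logistic function of x."""
    pi2 = 1.0 - pi1
    beta1 = (mu1 - mu0) / sd ** 2
    beta0 = (mu0 ** 2 - mu1 ** 2) / (2.0 * sd ** 2)
    eta = beta0 + math.log(pi1 / pi2) + beta1 * x  # sampling constant joins the intercept
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical covariate means at presence (mu1) and available (mu0) locations
mu1, mu0, sd = 3.0, 1.0, 1.5
differences = [
    abs(posterior_presence(x, pi1, mu1, mu0, sd) - logistic_form(x, pi1, mu1, mu0, sd))
    for pi1 in (0.2, 0.5)
    for x in (-1.0, 0.0, 2.0, 4.5)
]
```

The two forms agree for every sampling proportion tried, and changing π1 alters only the constant added to β0. The slope, and hence the relative ranking of sites, is unaffected.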
The logistic discrimination model is very similar to the exponential model suggested by Manly et al. (2002), and in practice its application differs only because resource units that appear in the used sample also can appear in the sample of available units (Johnson et al. 2006). The logistic discrimination model does not require as many assumptions as the exponential model: assumptions that Keating & Cherry (2004) suggest might sometimes be violated. Seber (1984: 309) suggests that the logistic discrimination model may be relatively robust to observations occurring in both the presence and the available sample.