Location-only and use-availability data: analysis methods converge



This Special Feature arose from a session on a topic of the same name that took place during The Wildlife Society meeting in Kona, Hawaii, from 5 to 10 November 2011. The purpose of that session and this Special Feature is to compare methods for predictive modelling of species geographical distributions and the modelling of habitat (resource) selection by animals. The predictive modelling of species geographical distributions and the modelling of habitat selection based on the environmental conditions at sites where animals are known to occur are essentially the same problem. Presence-only and used-available data both consist of a sample of locations with known presence of a species or an individual. A separate sample of locations from a study area, with unknown presence (pseudo-absence), is also assumed to exist. The probability or relative probability of presence of a species or individual is modelled and estimated across a certain time implicitly defined by the sampling mechanism, for example, by the time period during which museum specimens or radiotelemetry data were collected. A number of modelling methods have appeared in the literature over the last couple of decades. Many of these methods were made feasible by the availability of geographical information systems (GIS), global positioning system (GPS) radiotelemetry and public online data access initiatives (e.g. the Global Biodiversity Information Facility). The papers in this Special Feature are intended to present the state of the methodological art in their subject area, with particular attention paid to contrasting the advantages and disadvantages of alternative methods of analysis for these data.

In this editorial for the Special Feature, we begin with a review of some early history of methods for analysing the selection of resources by animals and then briefly describe the methods covered in the Special Feature and some other methods that are relevant but not covered.

The early history of resource selection studies

In the 1980s, studies of resource selection by animals were being carried out, but the data were usually analysed using standard statistical methods and were not usually thought of as requiring any specialized methods. However, there was considerable interest in those and earlier years in modelling natural selection on animal and plant populations. Fitness functions were of special interest, defined as a function w(x1, x2,…, xp) of certain variables X1 to Xp such that if f(x1, x2,…, xp) is the relative frequency of individuals with X1 = x1, X2 = x2,…, Xp = xp before a selective event then the expected frequency after the event is proportional to w(x1, x2,…, xp) f(x1, x2,…, xp). Essentially, the fitness function gives the relative probabilities, w(x1, x2,…, xp), by which the event ‘selects’ individuals for later generations.

These fitness functions were discussed by O'Donald (1968, 1970, 1971) using quadratic functions of the X variables. Using quadratic functions can give impossible negative values of the fitness function. However, it is easy to show that if distributions are normal before and after selection, then the fitness function is an exponential function of the X variables. This then suggested that exponential fitness functions are appropriate for general use.
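The argument can be sketched for a single variable X as follows. The symbols μ0, σ0, μ1, σ1, b and c are introduced here only for illustration and are not taken from the papers cited; the sketch assumes that the distribution after selection is proportional to the fitness function times the distribution before selection.

```latex
% Before selection: X ~ N(\mu_0, \sigma_0^2) with density f(x).
% After selection:  X ~ N(\mu_1, \sigma_1^2) with density g(x).
% Since g(x) \propto w(x) f(x), the fitness function is proportional to g(x)/f(x).
\begin{align*}
w(x) &\propto \frac{g(x)}{f(x)}
      = \frac{\sigma_0}{\sigma_1}
        \exp\!\left\{ \frac{(x-\mu_0)^2}{2\sigma_0^2}
                    - \frac{(x-\mu_1)^2}{2\sigma_1^2} \right\} \\
     &\propto \exp\!\left( b\,x + c\,x^2 \right),
\qquad
b = \frac{\mu_1}{\sigma_1^2} - \frac{\mu_0}{\sigma_0^2},
\quad
c = \frac{1}{2\sigma_0^2} - \frac{1}{2\sigma_1^2}.
\end{align*}
```

When the variance is unchanged by selection (σ0 = σ1 = σ), c = 0 and the fitness function reduces to exp(bx) with b = (μ1 − μ0)/σ². In either case the function is positive for every x, which avoids the negative values that a quadratic fitness function can produce.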

It was in 1987 that Lyman McDonald noticed that the ideas behind fitness functions could be applied to study resource selection by animals. With natural selection, there is interest in the animals that are ‘selected’ for survival; while with resource selection, there is interest in habitat or food units that are selected by animals. The idea of a fitness function was developed into the idea of a resource selection function and the first paper that used this idea was by McDonald, Manly & Raley (1990), with the function just called a selection function. Figure 1 illustrates the basic idea that when an approximately normal distribution of lengths of available insects is changed to a different normal distribution for length of individuals selected as food, then the relative probability of a resource unit (in this case an insect) being selected is proportional to an exponential function of the length of the insect.

Figure 1.

Relative probability of selection of insects by Tree Swallows plotted as a function of prey length. The graph is superimposed on the distribution of insect lengths in the samples of available and used prey. This is a reproduction of Fig. 2 in the paper by McDonald, Manly & Raley (1990).

These ideas were developed further. Another early paper on the use of resource selection functions is that of Thomas, Manly & McDonald (1992), which set out a unified theory for the study of resource selection using samples of available and used resource units. This work eventually led to the first edition of the book Resource Selection by Animals: Statistical Design and Analysis for Field Studies (Manly, McDonald & Thomas 1993).

One thing that has changed greatly since those early days is the nature of the available data. In the early 1990s, it was common for the available data sets to be quite small. For example, one of the examples in the Manly, McDonald & Thomas (1993) book involved 117 observations of moose or moose tracks in four types of habitat. This can be compared with a study by Sawyer et al. (2006) that analysed 39 641 locations of 77 mule deer collected from 1998 to 2003. Large data sets were not common in 1993 but are now routinely available. This is one reason why it is important for those involved with the collection and analysis of data on the use of resources by animals to be aware of the methods discussed in this Special Feature.

The early history of modelling of species geographical distributions

Coincident with the development of habitat selection functions for individual animals has been an increasing but independent array of methods developed for understanding the distribution of species (Buckland & Elston 1993; Boyce & McDonald 1999; MacKenzie et al. 2005; Elith et al. 2006; Johnson et al. 2006; Phillips, Anderson & Schapire 2006). Many of these methods are derivatives of traditional regression approaches, including general linear models, generalized additive models and multivariate adaptive regression splines (Elith & Leathwick 2009; Franklin & Miller 2009). The most common regression approach for understanding patterns in presence-only data involves use of logistic regression (Elith, Leathwick & Hastie 2008). With this approach, randomly chosen ‘pseudo-absences’ are commonly added to complete the calculations (Keating & Cherry 2004).
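The logistic regression approach with pseudo-absences can be illustrated with a minimal simulation sketch. Everything in it is an assumption made for illustration: a single covariate x that is standard normal across the study area, an exponential selection function exp(beta * x), and the hypothetical helper names simulate_used_available and fit_logistic. It is not the procedure of any paper cited here.

```python
import math
import random


def simulate_used_available(beta, n_pool, n_used, n_avail, seed=0):
    """Simulate one covariate x for used and background (pseudo-absence) points.

    Assumed setup: x ~ N(0, 1) across the study area, and units are used
    with relative probability exp(beta * x).
    """
    rng = random.Random(seed)
    pool = [rng.gauss(0.0, 1.0) for _ in range(n_pool)]
    weights = [math.exp(beta * x) for x in pool]
    # Used points: weighted draws (with replacement) from the pool.
    used = rng.choices(pool, weights=weights, k=n_used)
    # Background points: simple random draws whose true status is unknown.
    available = rng.sample(pool, n_avail)
    return used, available


def fit_logistic(x, y, iterations=25):
    """Fit logit P(y = 1) = b0 + b1 * x by Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            r = yi - p              # score contribution
            w = p * (1.0 - p)       # information contribution
            g0 += r
            g1 += r * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2 x 2 system (information matrix) * step = score.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


if __name__ == "__main__":
    used, available = simulate_used_available(
        beta=1.0, n_pool=20000, n_used=2000, n_avail=5000, seed=1
    )
    x = used + available
    y = [1] * len(used) + [0] * len(available)   # 1 = used, 0 = pseudo-absence
    b0, b1 = fit_logistic(x, y)
    print("intercept:", b0, "slope:", b1)
```

Under these assumptions the fitted slope should lie close to the simulated beta, whereas the intercept mainly reflects the ratio of used to background sample sizes and is not interpretable as a prevalence.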

A burgeoning field is the analysis of location-only data by machine learning techniques (De'ath & Fabricius 2000; Elith et al. 2006; Elith, Leathwick & Hastie 2008; Olden, Lawler & Poff 2008). Researchers have turned to these methods because of their ability to handle complex nonlinear interactions, correlation, high dimensionality and non-stationarity (Olden, Lawler & Poff 2008; Evans et al. 2011). These methods include decision tree-based approaches (De'ath & Fabricius 2000), artificial neural networks (Spitz & Lek 1999), evolutionary computation such as genetic algorithms (Stockwell 1999), support vector machines (Guo, Kelly & Graham 2005) and, probably the most common method, maximum entropy (usually as implemented in the software MaxEnt; Phillips, Anderson & Schapire 2006; Elith et al. 2011). A likelihood-based alternative to MaxEnt is MaxLike (Royle et al. 2012). Phillips, Anderson & Schapire (2006) described parallels between logistic regression and maximum entropy.

With maximum entropy, as with most presence-only methods, the quantity to be estimated is the probability of a species' presence Pr(y = 1), conditioned on environmental characteristics z: Pr(y = 1|z). Maximum entropy relies on the assumption that the environmental characteristics of the unobserved presences have the same moments as those of the observed presences. As a consequence, because the observed moments of the environmental characteristics may not equal the true moments, unconditional calculations of maximum entropy can lead to overfitting of the data (Phillips, Anderson & Schapire 2006; also see Yackulic et al. 2013). The most common solution to such overfitting is to approximate a joint probability distribution for both species location data and environmental characteristic data, which requires the availability of both presence and absence data (Phillips, Anderson & Schapire 2006). By definition, presence-only data lack information on species absence. Thus, lacking data on the absence of a species, practitioners of both regression and machine learning approaches attempt to resolve this problem through use of pseudo-absences or background absences for completing the calculations (Keating & Cherry 2004; Elith et al. 2006; Phillips, Anderson & Schapire 2006; Ward et al. 2009). These background absences are drawn, often at random and with replacement, from the region of interest. Many argue that these background draws do not adequately condition subsequent calculations (e.g. Warton & Shepherd 2010; Yackulic et al. 2013). Thus, the result of many of these presence-only methods is not the calculation of a species' probability of presence but rather an index of species presence. Alternative methods for estimating the probability of presence directly from the data are becoming increasingly available (Lele & Keim 2006; Royle et al. 2012, and papers in this Special Feature).

Methods covered by papers in the Special Feature

In the first paper that follows, Warton & Aarts (2013) deliver the most important message of this collection of papers, ‘The problems of analysing used-available data and presence-only data are equivalent…’, an observation made by several of the authors in this issue. McDonald (2013) points out that ‘Papers that analyse previously-collected or historical locations of an organism (e.g. museum samples, historical reports) tend to appear in the ecological literature and generally call their data presence-only… Papers that analyse an organism's locations collected during field studies (e.g., pellet surveys, telemetry) tend to appear in the wildlife literature and generally call their data use-availability’. McDonald (2013) also notes that close inspection of Aarts, Fieberg & Matthiopoulos (2012) and Fithian & Hastie (2012) reveals that the basic assumptions and estimated parameters under both methods are identical. Warton & Aarts (2013) go on to point out that individuals working to provide solutions to problems for either application, whether the spatial distribution of a species or resource selection by animals, can learn from the other's experience and literature, a position that we enthusiastically endorse.

Two papers in the Special Feature emphasize the use of point process models in the analysis of use-available and location-only data. Johnson et al. (2013) consider telemetry data in the study of resource selection by animals to be a realization of a space–time point process. Under the point process paradigm, the times of the relocations are also considered to be random rather than fixed. They show that point process models are a generalization of weighted distribution functions, the basis of resource selection functions (RSF). We are unaware whether point process models have been widely used in the study of species distributions and the location-only literature. If not, this method of analysis of used-available data may pay important dividends for individuals interested in modelling species distribution.

McDonald (2013) generalizes the used-available likelihood given in Johnson et al. (2006) to point process models. Resource selection functions are ratios of density functions as defined using the theory of weighted distributions. He argues ‘As simple ratios, RSFs must be positive and cannot be bounded above. Proper link functions must provide proportionality over their entire range. Given these conditions, the exponential link is the most logical and appropriate link function for estimating a RSF from use-availability data. These conditions exclude certain link functions, such as the logistic’. He also argues that RSFs require fewer assumptions and are more useful than functions that attempt to estimate the true probability of use. This position is somewhat at odds with that of others, namely Lele et al. (2013) and Rota et al. (2013) – pun intended because of the use of logistic models in the latter two papers.
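For readers less familiar with weighted distributions, the relationship described above can be written compactly. The notation (f_a, f_u, w and the coefficients β) is a generic textbook form chosen here for illustration and is not copied from McDonald (2013).

```latex
% f_a(x): density of covariates x over available units
% f_u(x): density of covariates x over used units
% w(x):   resource selection function (RSF)
\begin{align*}
f_u(x) &= \frac{w(x)\, f_a(x)}{\int w(u)\, f_a(u)\, du},
&
w(x) &\propto \frac{f_u(x)}{f_a(x)}, \\
w(x) &= \exp\!\left(\beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p\right).
\end{align*}
```

Multiplying w(x) by any positive constant leaves f_u(x) unchanged, so use-availability data of this kind identify w(x) only up to proportionality. This is why the exponential form is written without an intercept and is interpreted as a relative, not an absolute, probability of use.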

Rota et al. (2013) and others have promoted case–control models (Lancaster & Imbens 1996) allowing for contaminated controls (i.e. available units may have been selected by study animals) to estimate the absolute probability of a sample unit being used from use-availability data. They show by simulations that recent computational advances can obtain stable estimates of resource selection probability functions (RSPF). However, their methods require large sample sizes, particularly at low prevalence of use, limiting application primarily to analysis of modern GPS radiotelemetry data collected on animals. Application to modelling species distribution with location-only data with relatively small sample sizes will remain elusive.

Lele et al. (2013) emphasize that selection should be defined as strictly a binary decision with outcomes of use or non-use of a resource unit only when the unit is encountered. They continue, ‘This makes the probability of selection a fundamentally different metric than probabilities of use, choice, and occupancy….’ and define ‘… selection of a resource unit by an animal as the act of using a resource unit if it is encountered. The resource selection probability function… models this and is defined as the probability that a resource unit of type x is selected (or, becomes part of the use set) when encountered’. As usual, research into the latest and best way to understand a process (e.g. resource selection by animals) proceeds by breaking the process into finer elements and modelling those elements. Thus, research is never done! Clearly, if these issues exist in the used-available world, they must also exist in some form in the location-only world.

Aarts et al. (2013) explore the effect of habitat availability on modelling resource selection by animals, an issue long recognized in the study of resource selection with used-available data, but to our knowledge, little explored by individuals studying species distribution with location-only data. Given that the problems of analysing used-available data and presence-only data are equivalent, the most important contribution of Aarts et al. may be in identifying the problem more clearly for the study of species distribution. They go on to explore the utility of a variety of existing and new methods that enable the influence of habitat availability (pseudo-absence) to be explicitly estimated.

Utilization distributions (UDs) are widely applied in animal use studies, but applications appear to be absent from the species distribution location-only literature. Hooten et al. (2013) consider the relationship of resource utilization functions (RUF) to resource selection functions and show that RUFs can serve as approximations to RSFs with modification and particular assumptions. In particular, they show that modified RUFs may provide less biased parameter estimates when the data are subject to location error, a situation that may be present in many species distribution location-only data sets.

We would be remiss if we did not mention other recent papers resulting from the original symposium that appear elsewhere. Nielson & Sawyer (2013) consider the problem of modelling resource selection using data collected on GPS radio-tagged animals where relocations are collected at fine spatiotemporal scales. Their approach, along with others, for example Aarts et al. (2013), is to model intensity of use by relating the relative frequency of relocations in sampled units to the habitat characteristics of those units. Simplicity and ease of model fitting are the most attractive characteristics of their methods.

Methods not covered by papers in the Special Feature

Envelope models and other similarity measures were among the first methods developed for species distribution modelling with presence-only data. These approaches employ the ranges, means and other distributional characteristics of environmental variables associated with species presence to make predictions for locations where presence information is lacking. Examples of these so-called profile methods include environmental (usually bioclimatic) envelope methods such as BIOCLIM (Busby 1991), environmental distance methods such as Penrose (Nielsen & Woolf 2002) and Mahalanobis (Clark, Dunn & Smith 1993; Farber & Kadmon 2003), fuzzy classification (Robertson, Villet & Palmer 2004) and environmental niche factor analysis (Hirzel et al. 2002). Most profile methods have fallen out of favour as maximum entropy approaches have come to dominate (Elith et al. 2006; Phillips, Anderson & Schapire 2006).
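The idea behind a range-based envelope can be shown with a minimal sketch. It is written in the spirit of envelope methods in general and is not a reimplementation of BIOCLIM or any other package; the function names fit_envelope and inside_envelope, the quantile-trimming rule and the example values are all hypothetical.

```python
def fit_envelope(presence_rows, lower_q=0.0, upper_q=1.0):
    """Return one (low, high) pair per environmental variable.

    presence_rows: list of tuples, one tuple of covariate values per
    presence record. With the default quantiles the envelope is the full
    observed range; narrower quantiles trim extreme records.
    """
    n_vars = len(presence_rows[0])
    bounds = []
    for j in range(n_vars):
        values = sorted(row[j] for row in presence_rows)
        last = len(values) - 1
        low = values[round(lower_q * last)]
        high = values[round(upper_q * last)]
        bounds.append((low, high))
    return bounds


def inside_envelope(site, bounds):
    """True if every covariate of the site lies within the fitted envelope."""
    return all(low <= value <= high for value, (low, high) in zip(site, bounds))


if __name__ == "__main__":
    # Hypothetical presence records: (mean temperature, annual precipitation).
    presence = [
        (10.0, 500.0),
        (12.0, 650.0),
        (15.0, 700.0),
        (18.0, 800.0),
        (20.0, 900.0),
    ]
    bounds = fit_envelope(presence)
    print(bounds)
    print(inside_envelope((14.0, 600.0), bounds))
    print(inside_envelope((25.0, 600.0), bounds))
```

A rule of this kind uses presence records only and returns a yes or no classification. It treats each variable independently and makes no use of background locations, which is one way in which it differs from the regression and maximum entropy approaches discussed earlier.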

Finally, another approach for modelling wildlife habitat relationships not sufficiently covered in the Special Feature is machine learning (ML; Breiman 2001; Hastie et al. 2009). The diversity of ML approaches (Fielding 1999; Stephens et al. 2007) provides several advantages, such as speed, ease of use, complementarity, and predictive power and performance. For instance, these methods can deal with complex relationships between predictors often arising within large quantities of data, can process non-linear relationships between predictors, and can accommodate abundant and noisy data (Hochachka et al. 2007; Drew et al. 2011). Though the wider use of ML approaches has occurred principally outside of wildlife applications (Hastie et al. 2009; Drew et al. 2011), their demonstrated utility with presence-only data (Elith et al. 2006; Araujo & New 2007; Hardy et al. 2011; Huettmann et al. 2011) and an increasing diversity of tools and algorithms available in R and other statistical platforms suggest that this approach will have a constituency of adherents.


Any use of trade, product or firm names is for descriptive purposes only and does not imply endorsement by the U.S. Government.