# Location-only and use-availability data: analysis methods converge

Article first published online: 24 OCT 2013

DOI: 10.1111/1365-2656.12145

© 2013 The Authors. Journal of Animal Ecology © 2013 British Ecological Society

#### How to Cite

McDonald, L., Manly, B., Huettmann, F., Thogmartin, W. (2013), Location-only and use-availability data: analysis methods converge. Journal of Animal Ecology, 82: 1120–1124. doi: 10.1111/1365-2656.12145

#### Publication History

- Issue published online: 24 OCT 2013
- Article first published online: 24 OCT 2013
- Accepted manuscript online: 13 SEP 2013 10:48AM EST
- Manuscript Received: 6 SEP 2013
- Manuscript Accepted: 6 SEP 2013

### Keywords:

- habitat selection;
- location-only data;
- resource selection;
- species geographical distributions;
- used-available data

### Introduction

- Introduction
- The early history of resource selection studies
- The early history of modelling of species geographical distributions
- Methods covered by papers in the Special Feature
- Methods not covered by papers in the Special Feature
- Acknowledgements
- References

This Special Feature arose from a session on a topic of the same name that took place during The Wildlife Society meeting in Kona, Hawaii, from 5 to 10 November 2011. The purpose of that session and this Special Feature is to compare methods for predictive modelling of species geographical distributions and the modelling of habitat (resource) selection by animals. The predictive modelling of species geographical distributions and the modelling of habitat selection based on the environmental conditions at sites where animals are known to occur are essentially the same problem. Presence-only and used-available data both consist of a sample of locations with known presence of a species or an individual. A separate sample of locations from a study area, with unknown presence (pseudo-absence), is also assumed to exist. The probability or relative probability of presence of a species or individual is modelled and estimated across a certain time implicitly defined by the sampling mechanism, for example, by the time period during which museum specimens or radiotelemetry data were collected. A number of modelling methods have appeared in the literature over the last couple of decades. Many of these methods were made feasible by the availability of geographical information systems (GIS), global positioning system (GPS) radiotelemetry and public online data access initiatives (e.g. the Global Biodiversity Information Facility). The papers in this Special Feature are intended to present the state of the methodological art in their subject area, with particular attention paid to contrasting the advantages and disadvantages of alternative methods of data analysis.

In this editorial for the Special Feature, we begin with a review of some early history of methods for analysing the selection of resources by animals and then briefly describe the methods covered in the Special Feature and some other methods that are relevant but not covered.

### The early history of resource selection studies

In the 1980s, studies of resource selection by animals were being carried out, but data were usually analysed using standard statistical methods and the data were not usually thought of as requiring any specialized methods. However, there was a lot of interest in those and earlier years in modelling natural selection on animal and plant populations. Fitness functions were of special interest, defined as a function $w(x_1, x_2, \ldots, x_p)$ of certain variables $X_1$ to $X_p$ such that if $f(x_1, x_2, \ldots, x_p)$ is the relative frequency of individuals with $X_1 = x_1$, $X_2 = x_2$, …, $X_p = x_p$ before a selective event then the expected frequency after the event is $w(x_1, x_2, \ldots, x_p)\, f(x_1, x_2, \ldots, x_p)$. Essentially, the fitness function gives the relative probabilities, $w(x_1, x_2, \ldots, x_p)$, by which the event ‘selects’ individuals for later generations.

These fitness functions were discussed by O'Donald (1968, 1970, 1971) using quadratic functions of the *X* variables. Using quadratic functions can give impossible negative values of the fitness function. However, it is easy to show that if distributions are normal before and after selection, then the fitness function is an exponential function of the *X* variables. This then suggested that exponential fitness functions are appropriate for general use.
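The normal-to-exponential step can be sketched for a single variable $x$. The symbols $\mu_0, \sigma_0$ (before selection), $\mu_1, \sigma_1$ (after selection), $g$ and $\beta_0, \beta_1, \beta_2$ are labels introduced here for illustration only. If the frequency before selection is $f(x)$, normal with mean $\mu_0$ and variance $\sigma_0^2$, and the frequency after selection is $g(x)$, normal with mean $\mu_1$ and variance $\sigma_1^2$, then the fitness function is proportional to their ratio:

$$
w(x) \propto \frac{g(x)}{f(x)}
= \frac{\sigma_0}{\sigma_1}\exp\!\left\{-\frac{(x-\mu_1)^2}{2\sigma_1^2} + \frac{(x-\mu_0)^2}{2\sigma_0^2}\right\}
= \exp\!\left(\beta_0 + \beta_1 x + \beta_2 x^2\right),
$$

$$
\beta_1 = \frac{\mu_1}{\sigma_1^2} - \frac{\mu_0}{\sigma_0^2},
\qquad
\beta_2 = \frac{1}{2\sigma_0^2} - \frac{1}{2\sigma_1^2}.
$$

When the two variances are equal, $\beta_2 = 0$ and the fitness function reduces to $w(x) \propto \exp(\beta_1 x)$, which is positive for every $x$, unlike a quadratic in $x$.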

It was in 1987 that Lyman McDonald noticed that the ideas behind fitness functions could be applied to study resource selection by animals. With natural selection, there is interest in the animals that are ‘selected’ for survival; while with resource selection, there is interest in habitat or food units that are selected by animals. The idea of a fitness function was developed into the idea of a resource selection function and the first paper that used this idea was by McDonald, Manly & Raley (1990), with the function just called a selection function. Figure 1 illustrates the basic idea that when an approximately normal distribution of lengths of available insects is changed to a different normal distribution for length of individuals selected as food, then the relative probability of a resource unit (in this case an insect) being selected is proportional to an exponential function of the length of the insect.
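The insect-length idea can be checked numerically with a minimal sketch. The means and standard deviations below are hypothetical values chosen for illustration, not data from McDonald, Manly & Raley (1990). The code takes the ratio of the "selected" to the "available" normal densities and confirms that its logarithm is linear in length when the two standard deviations are equal.

```python
import numpy as np


def normal_pdf(x, mean, sd):
    """Density of a normal distribution evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))


# Hypothetical insect lengths (mm): available versus selected as food.
avail_mean, avail_sd = 10.0, 2.0
used_mean, used_sd = 12.0, 2.0

lengths = np.linspace(4.0, 18.0, 50)

# Log of the relative probability of selection: log(used density / available density).
log_w = np.log(normal_pdf(lengths, used_mean, used_sd)) - np.log(
    normal_pdf(lengths, avail_mean, avail_sd)
)

# Fit a quadratic in length; with equal standard deviations the quadratic
# coefficient should vanish and the slope should equal
# used_mean / sd**2 - avail_mean / sd**2.
b2, b1, b0 = np.polyfit(lengths, log_w, 2)
expected_slope = used_mean / used_sd**2 - avail_mean / avail_sd**2

print("quadratic term:", b2)
print("slope:", b1, "expected:", expected_slope)
```

With unequal standard deviations the same code returns a non-zero quadratic coefficient, so the selection function is then an exponential of a quadratic in length.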

These ideas were developed further and another early paper on the use of resource selection functions is one by Thomas, Manly & McDonald (1992) on a unified theory for the study of resource selection using samples of available and used resource units, eventually leading to the first edition of the book *Resource Selection by Animals: Statistical Design and Analysis for Field Studies* (Manly, McDonald & Thomas 1993).

One thing that has changed very much since those early days is the nature of the available data. In the early 1990s, it was common for the available data sets to be quite small. For example, one of the examples in the Manly, McDonald & Thomas (1993) book involved 117 observations of moose or moose tracks in four types of habitat. This can be compared with a study by Sawyer *et al*. (2006) that involved analysing the results obtained from 39 641 locations of 77 mule deer from 1998 to 2003. Large data sets were not common in 1993 but are now routinely available. This is one reason why it is important for those involved with the collection and analysis of data on the use of resources by animals to be aware of the methods discussed in this Special Feature.

### The early history of modelling of species geographical distributions

Coincident with the development of habitat selection functions for individual animals has been an increasing but independent array of methods developed for understanding the distribution of species (Buckland & Elston 1993; Boyce & McDonald 1999; MacKenzie *et al*. 2005; Elith *et al*. 2006; Johnson *et al*. 2006; Phillips, Anderson & Schapire 2006). Many of these methods are derivatives of traditional regression approaches, including generalized linear models, generalized additive models and multivariate adaptive regression splines (Elith & Leathwick 2009; Franklin & Miller 2009). The most common regression approach for understanding patterns in presence-only data involves use of logistic regression (Elith, Leathwick & Hastie 2008). With this approach, randomly chosen ‘pseudo-absences’ are commonly added to complete the calculations (Keating & Cherry 2004).
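The pseudo-absence approach can be sketched on simulated data. This is an illustration under stated assumptions, not the procedure of any cited paper: a single covariate is standard normal across the study area, and selection is proportional to `exp(true_slope * x)`, so used locations follow a normal distribution shifted by `true_slope`. The function `fit_logistic` and all variable names are hypothetical. Under these assumptions the logistic slope estimates the selection coefficient, while the intercept depends on the ratio of sample sizes and has no direct interpretation.

```python
import numpy as np


def fit_logistic(x, y, n_iter=25):
    """Fit a logistic regression (intercept plus slopes) by Newton-Raphson."""
    design = np.column_stack([np.ones(len(x)), x])
    beta = np.zeros(design.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(design @ beta)))
        gradient = design.T @ (y - p)
        hessian = design.T @ (design * (p * (1.0 - p))[:, None])
        beta = beta + np.linalg.solve(hessian, gradient)
    return beta


rng = np.random.default_rng(1)
true_slope = 1.0  # hypothetical selection coefficient

# Pseudo-absences: random draws of the covariate from the study area.
background = rng.normal(0.0, 1.0, size=5000)
# Presences: the covariate at used locations, shifted by the selection coefficient.
used = rng.normal(true_slope, 1.0, size=5000)

x = np.concatenate([used, background])
y = np.concatenate([np.ones(used.size), np.zeros(background.size)])

intercept, slope = fit_logistic(x, y)
print("estimated slope:", round(float(slope), 2), "true slope:", true_slope)
```

The fitted values from such a model are best read as relative, not absolute, probabilities of presence, because the pseudo-absences are not confirmed absences.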

A burgeoning field is the analysis of location-only data by machine learning techniques (De'ath & Fabricius 2000; Elith *et al*. 2006; Elith, Leathwick & Hastie 2008; Olden, Lawler & Poff 2008). Researchers have turned to these methods because of their ability to handle complex nonlinear interactions, correlation, high dimensionality and non-stationarity (Olden, Lawler & Poff 2008; Evans *et al*. 2011). These methods include decision tree-based approaches (De'ath & Fabricius 2000), artificial neural networks (Spitz & Lek 1999), evolutionary computation such as genetic algorithms (Stockwell 1999), support vector machines (Guo, Kelly & Graham 2005) and, probably the most common method, maximum entropy (usually as implemented in the software MaxEnt; Phillips, Anderson & Schapire 2006; Elith *et al*. 2011). A likelihood-based alternative to MaxEnt is MaxLike (Royle *et al*. 2012). Phillips, Anderson & Schapire (2006) described parallels between logistic regression and maximum entropy.

With maximum entropy, as with most presence-only methods, the quantity to be estimated is the probability of a species' presence Pr(*y* = 1), conditioned on environmental characteristics *z*: Pr(*y* = 1|*z*). Maximum entropy relies on the assumption that the environmental characteristics of the unobserved presences have the same moments as those of the observed presences. As a consequence, because the observed moments of the environmental characteristics may not equal the true moments, unconditional calculations of maximum entropy can lead to overfitting of the data (Phillips, Anderson & Schapire 2006; also see Yackulic *et al*. 2013). The most common solution to such overfitting is to approximate a joint probability distribution for both species location data and environmental characteristic data, which requires the availability of both presence and absence data (Phillips, Anderson & Schapire 2006). By definition, presence-only data lack information on species absence. Thus, lacking data on the absence of a species, practitioners of both regression and machine learning approaches attempt to resolve this problem through use of pseudo- or background absences for completing the calculations (Keating & Cherry 2004; Elith *et al*. 2006; Phillips, Anderson & Schapire 2006; Ward *et al*. 2009). These background absences are drawn, often at random and with replacement, from the region of interest. Many argue that these background draws do not adequately condition subsequent calculations (e.g. Warton & Shepherd 2010; Yackulic *et al*. 2013). Thus, the result of many of these presence-only methods is not the calculation of a species' probability of presence but rather an index of species presence. Alternative methods for estimating the probability of presence directly from the data are becoming increasingly available (Lele & Keim 2006; Royle *et al*. 2012, and papers in this Special Feature).

### Methods covered by papers in the Special Feature

In the first paper that follows, Warton & Aarts (2013) deliver the most important message of this collection of papers, ‘The problems of analysing used-available data and presence-only data are equivalent…’, an observation made by several of the authors in this issue. McDonald (2013) points out that ‘Papers that analyse previously-collected or historical locations of an organism (e.g. museum samples, historical reports) tend to appear in the ecological literature and generally call their data presence-only… Papers that analyse an organism's locations collected during field studies (e.g., pellet surveys, telemetry) tend to appear in the wildlife literature and generally call their data use-availability.’ McDonald (2013) also notes that close inspection of Aarts, Fieberg & Matthiopoulos (2012) and Fithian & Hastie (2012) reveals that the basic assumptions and estimated parameters under both methods are identical. Warton & Aarts (2013) go on to point out that individuals working on either application, the spatial distribution of a species or resource selection by animals, can learn from the other's experience and literature, a position that we enthusiastically endorse.

Two papers in the Special Feature emphasize the use of point process models in the analysis of used-available and location-only data. Johnson *et al*. (2013) consider telemetry data in the study of resource selection by animals to be a realization of a space–time point process. Under the point process paradigm, the times of the relocations are also considered to be random rather than fixed. They show that point process models are a generalization of weighted distribution functions, the basis of resource selection functions (RSF). We are unaware whether point process models have been widely utilized in the study of species distributions and the location-only literature. If not, this method of analysis of used-available data may pay important dividends for individuals interested in modelling species distribution.

McDonald (2013) generalizes the used-available likelihood given in Johnson *et al*. (2006) to point process models. Resource selection functions are ratios of density functions as defined using the theory of weighted distributions. He argues ‘As simple ratios, RSFs must be positive and cannot be bounded above. Proper link functions must provide proportionality over their entire range. Given these conditions, the exponential link is the most logical and appropriate link function for estimating a RSF from use-availability data. These conditions exclude certain link functions, such as the logistic’. He also argues that RSFs require fewer assumptions and are more useful than functions which attempt to estimate true probability of use. This position is somewhat at odds with others, that is, Lele *et al*. (2013) and Rota *et al*. (2013) – pun intended because of the use of logistic models in the latter two papers.
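The weighted-distribution idea behind this argument can be written compactly. The symbols $f_a$, $f_u$ and $\beta_1, \ldots, \beta_p$ are labels chosen here for illustration and are not taken from the papers. If $f_a(x)$ is the density of available resource units and $f_u(x)$ the density of used units, a resource selection function $w(x)$ links them as

$$
f_u(x) = \frac{w(x)\, f_a(x)}{\int w(t)\, f_a(t)\, dt},
\qquad \text{so that} \qquad
w(x) \propto \frac{f_u(x)}{f_a(x)}.
$$

Multiplying $w(x)$ by any positive constant leaves $f_u(x)$ unchanged, so only relative values of $w$ can be recovered from use-availability data. The exponential form $w(x) = \exp(\beta_1 x_1 + \cdots + \beta_p x_p)$ is positive and unbounded above, in line with the conditions quoted above, whereas a logistic form is bounded above by 1 and is the kind of function used when the aim is an absolute probability of use.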

Rota *et al*. (2013) and others have promoted case–control models (Lancaster & Imbens 1996) allowing for contaminated controls (i.e. available units may have been selected by study animals) to estimate, from use-availability data, the absolute probability of a sample unit being used. They show by simulations that recent computational advances can obtain stable estimates of resource selection probability functions (RSPF). However, their methods require large sample sizes, particularly at low prevalence of use, limiting application primarily to analysis of modern GPS radiotelemetry data collected on animals. Application to modelling species distribution with location-only data with relatively small sample sizes will remain elusive.

Lele *et al*. (2013) emphasize that selection should be defined as strictly a binary decision with outcomes of use or non-use of a resource unit only when the unit is encountered. They continue, ‘This makes the probability of selection a fundamentally different metric than probabilities of use, choice, and occupancy….’ and define ‘… selection of a resource unit by an animal as the act of using a resource unit if it is encountered. The resource selection probability function… models this and is defined as the probability that a resource unit of type *x* is selected (or, becomes part of the use set) when encountered’. As usual, research into the latest and best way to understand a process (e.g. resource selection by animals) proceeds by breaking the process into finer elements and modelling those elements. Thus, research is never done! Clearly, if these issues exist in the used-available world, they must also exist in some form in the location-only world.

Aarts *et al*. (2013) explore the effect of habitat availability on modelling resource selection by animals, an issue long recognized in the study of resource selection with used-available data but, to our knowledge, little explored by individuals studying species distribution with location-only data. Given that the problems of analysing used-available data and presence-only data are equivalent, the most important contribution of Aarts *et al*. may be in identifying the problem more clearly for the study of species distribution. They go on to explore the utility of a variety of existing and new methods that enable the influence of habitat availability (pseudo-absence) to be explicitly estimated.

Utilization distributions (UDs) are widely applied in animal use studies, but applications appear to be absent from the species distribution location-only literature. Hooten *et al*. (2013) consider the relationship of resource utilization functions (RUF) to resource selection functions and show that RUFs can serve as approximations to RSFs with modification and particular assumptions. In particular, they show that modified RUFs may provide less biased parameter estimates when the data are subject to location error, a situation that may be present in many species distribution location-only data sets.

We would be remiss if we did not mention other recent papers resulting from the original symposium which appear elsewhere. Nielson & Sawyer (2013) consider the problem of modelling resource selection using data collected on GPS radio tagged animals where relocations are collected at fine spatiotemporal scales. Their approach, along with others, for example Aarts *et al*. (2013), is to model intensity of use relating the relative frequency of relocations in sampled units to the habitat characteristics of those units. Simplicity and ease of model fitting are the most attractive characteristics of their methods.

### Methods not covered by papers in the Special Feature

Envelope models and other similarity measures were among the first methods developed for species distribution modelling with presence-only data. These approaches employ the ranges, means and other distributional characteristics of environmental variables associated with species presence to predict presence at locations where presence information is lacking. Examples of these so-called profile methods include environmental (usually bioclimatic) envelope methods such as BIOCLIM (Busby 1991), environmental distance methods such as Penrose (Nielsen & Woolf 2002) and Mahalanobis (Clark, Dunn & Smith 1993; Farber & Kadmon 2003), fuzzy classification (Robertson, Villet & Palmer 2004) and environmental niche factor analysis (Hirzel *et al*. 2002). Most profile methods have fallen out of favour as maximum entropy approaches have come to dominate (Elith *et al*. 2006; Phillips, Anderson & Schapire 2006).
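Two of the profile ideas described above, a percentile envelope and a Mahalanobis distance, can be sketched as follows. This is a simplified illustration, not the BIOCLIM software or the implementation of any cited paper. The function names, the percentile limits and the simulated temperature and rainfall values are all hypothetical.

```python
import numpy as np


def fit_envelope(presence_env, lower_pct=5.0, upper_pct=95.0):
    """Per-variable percentile limits of the environment at presence sites."""
    lo = np.percentile(presence_env, lower_pct, axis=0)
    hi = np.percentile(presence_env, upper_pct, axis=0)
    return lo, hi


def in_envelope(sites_env, lo, hi):
    """True where every environmental variable lies inside the envelope."""
    return np.all((sites_env >= lo) & (sites_env <= hi), axis=1)


def mahalanobis_sq(sites_env, presence_env):
    """Squared Mahalanobis distance of each site from the presence centroid."""
    mean = presence_env.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(presence_env, rowvar=False))
    diff = sites_env - mean
    return np.einsum("ij,jk,ik->i", diff, cov_inv, diff)


rng = np.random.default_rng(0)
# Hypothetical presence records: temperature (deg C) and annual rainfall (mm).
presence_env = np.column_stack(
    [rng.normal(15.0, 2.0, 200), rng.normal(800.0, 100.0, 200)]
)
# Two candidate sites: one typical of the presences, one hot and dry.
candidate_sites = np.array([[15.0, 800.0], [30.0, 200.0]])

lo, hi = fit_envelope(presence_env)
inside = in_envelope(candidate_sites, lo, hi)
dist_sq = mahalanobis_sq(candidate_sites, presence_env)

print("inside envelope:", inside)
print("squared Mahalanobis distance:", dist_sq)
```

Both functions use presence records only. Neither yields a probability of presence: the envelope gives a yes/no classification and the distance gives a similarity index.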

Finally, another approach for modelling wildlife habitat relationships not sufficiently covered in the Special Feature is Machine Learning (ML; Breiman 2001; Hastie *et al*. 2009). The diversity of ML approaches (Fielding 1999; Stephens *et al*. 2007) provides several advantages, such as speed, ease of use, complementarity, and predictive power and performance. For instance, these methods can deal with complex relationships between predictors often arising within large quantities of data, can process non-linear relationships between predictors, and can accommodate abundant and noisy data (Hochachka *et al*. 2007; Drew *et al*. 2011). Though the wider use of ML approaches has occurred principally outside of wildlife applications (Hastie *et al*. 2009; Drew *et al*. 2011), its demonstrated utility with presence-only data (Elith *et al*. 2006; Araujo & New 2007; Hardy *et al*. 2011; Huettmann *et al*. 2011) and an increasing diversity of tools and algorithms available in R and other statistical platforms suggest that this approach will have a constituency of adherents.

### Acknowledgements

Any use of trade, product or firm names is for descriptive purposes only and does not imply endorsement by the U.S. Government.

### References

- Aarts *et al*. (2013) Quantifying the effect of habitat availability on species distributions. Journal of Animal Ecology, 82, 1135–1145.
- Aarts, Fieberg & Matthiopoulos (2012) Comparative interpretation of count, presence-absence and point methods for species distribution models. Methods in Ecology and Evolution, 3, 177–187.
- Araujo & New (2007) Ensemble forecasting of species distributions. Trends in Ecology & Evolution, 22, 42–47.
- Boyce & McDonald (1999) Relating populations to habitats using resource selection functions. Trends in Ecology and Evolution, 14, 268–272.
- Boyce *et al*. (2002) Evaluating resource selection functions. Ecological Modelling, 157, 281–300.
- Breiman (2001) Statistical modeling: the two cultures (with comments and a rejoinder by the author). Statistical Science, 16, 199–231.
- Buckland & Elston (1993) Empirical models for the distribution of wildlife. Journal of Applied Ecology, 30, 478–495.
- Busby (1991) BIOCLIM – a bioclimatic analysis and prediction system. Nature Conservation: Cost Effective Biological Surveys and Data Analysis (eds C.R. Margules & M.P. Austin), pp. 64–68. CSIRO, East Melbourne, Victoria, Australia.
- Clark, Dunn & Smith (1993) A multivariate model of female black bear habitat use for a geographic information system. The Journal of Wildlife Management, 57, 519–526.
- De'ath & Fabricius (2000) Classification and regression trees: a powerful yet simple technique for ecological data analysis. Ecology, 81, 3178–3192.
- Drew, Wiersma & Huettmann (eds) (2011) Predictive Species and Habitat Modeling in Landscape Ecology. Springer, New York.
- Elith & Leathwick (2009) Conservation prioritization using species distribution modeling. Spatial Conservation Prioritization: Quantitative Methods and Computational Tools (eds A. Moilanen, K.A. Wilson & H. Possingham), pp. 70–93. Oxford University, Oxford.
- Elith, Leathwick & Hastie (2008) A working guide to boosted regression trees. Journal of Animal Ecology, 77, 802–813.
- Elith *et al*. (2006) Novel methods improve prediction of species distributions from occurrence data. Ecography, 29, 129–151.
- Elith *et al*. (2011) A statistical explanation of MaxEnt for ecologists. Diversity and Distributions, 17, 43–57.
- Evans *et al*. (2011) Modeling species distribution and change using random forest. Predictive Species and Habitat Modeling in Landscape Ecology (eds C.A. Drew, Y.F. Wiersma & F. Huettmann), pp. 139–159. Springer, New York.
- Farber & Kadmon (2003) Assessment of alternative approaches for bioclimatic modeling with special emphasis on Mahalanobis distance. Ecological Modelling, 160, 115–130.
- Fielding (1999) Machine Learning Methods for Ecological Applications. Springer, New York.
- Fithian & Hastie (2012) Statistical models for presence-only data: finite-sample equivalence and addressing observer bias. arXiv:1207.6950v2 [stat.AP], 15 Oct 2012. Stanford University.
- Franklin & Miller (2009) Statistical models – modern regression. Mapping Species Distributions: Spatial Inference and Prediction (ed J. Franklin), pp. 113–208. Cambridge University Press, Cambridge.
- Guo, Kelly & Graham (2005) Support vector machines for predicting distribution of sudden oak death in California. Ecological Modelling, 182, 75–90.
- Hardy *et al*. (2011) Predicting the distribution and ecological niche of unexploited snow crab (*Chionoecetes opilio*) populations in Alaskan waters: a first open-access ensemble model. Integrative and Comparative Biology, 51, 608–622. doi: 10.1093/icb/icr102.
- Hastie *et al*. (2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd edn. Springer, New York.
- Hirzel *et al*. (2002) Ecological niche-factor analysis: how to compute habitat-suitability maps without absence data? Ecology, 83, 2027–2036.
- Hochachka *et al*. (2007) Data-mining discovery of pattern and process in ecological systems. The Journal of Wildlife Management, 71, 2427–2437. doi: 10.2193/2006-503.
- Hooten *et al*. (2013) Reconciling resource utilization and resource selection functions. Journal of Animal Ecology, 82, 1146–1154.
- Huettmann *et al*. (2011) Predictions of 27 Arctic pelagic seabird distributions using public environmental variables, assessed with colony data: a first digital IPY and GBIF open access synthesis platform. Marine Biodiversity, 41, 141–179. doi: 10.1007/s12526-011-0083-2.
- Johnson *et al*. (2006) Resource selection functions based on use-availability data: theoretical motivation and evaluation methods. The Journal of Wildlife Management, 70, 347–357.
- Johnson *et al*. (2013) Estimating animal resource selection from telemetry data using point process models. Journal of Animal Ecology, 82, 1155–1164.
- Keating & Cherry (2004) Use and interpretation of logistic regression in habitat-selection studies. The Journal of Wildlife Management, 68, 774–789.
- Lancaster & Imbens (1996) Case-control studies with contaminated controls. Journal of Econometrics, 71, 145–160.
- Lele & Keim (2006) Weighted distributions and estimation of resource selection probability functions. Ecology, 87, 3021–3028.
- Lele *et al*. (2013) Selection, use, choice, and occupancy: clarifying concepts in resource selection studies. Journal of Animal Ecology, 82, 1183–1191.
- MacKenzie *et al*. (2005) Occupancy Estimation and Modeling: Inferring Patterns and Dynamics of Species Occurrence. Elsevier, San Diego, California, USA.
- Manly, McDonald & Thomas (1993) Resource Selection by Animals: Statistical Design and Analysis for Field Studies, 1st edn. Chapman and Hall, London.
- McDonald, Manly & Raley (1990) Analysing foraging and habitat use through selection functions. Studies in Avian Biology, 13, 325–331.
- McDonald (2013) The point process use-availability or presence-only likelihood and comments on analysis. Journal of Animal Ecology, 82, 1174–1182.
- Nielsen & Woolf (2002) Habitat-relative abundance relationship for bobcats in southern Illinois. Wildlife Society Bulletin, 30, 222–230.
- Nielson & Sawyer (2013) Estimating resource selection with count data. Ecology and Evolution, 3, 2233–2240.
- O'Donald (1968) Measuring the intensity of natural selection. Nature, 220, 197–198.
- O'Donald (1970) Change of fitness by selection for a quantitative character. Theoretical Population Biology, 1, 219–232.
- O'Donald (1971) Natural selection for quantitative characters. Heredity, 27, 137–153.
- Olden, Lawler & Poff (2008) Machine learning methods without tears: a primer for ecologists. The Quarterly Review of Biology, 83, 171–193.
- Phillips, Anderson & Schapire (2006) Maximum entropy modeling of species geographic distributions. Ecological Modelling, 190, 231–259.
- Robertson, Villet & Palmer (2004) A fuzzy classification technique for predicting species distributions: applications using invasive alien plants and indigenous insects. Diversity and Distributions, 10, 461–474.
- Rota *et al*. (2013) A re-evaluation of a case-control model with contaminated controls for resource selection studies. Journal of Animal Ecology, 82, 1165–1173.
- Royle *et al*. (2012) Likelihood analysis of species occurrence probability from presence-only data for modelling species distributions. Methods in Ecology and Evolution, 3, 545–554.
- Sawyer *et al*. (2006) Winter habitat selection of mule deer before and during development of a natural gas field. The Journal of Wildlife Management, 70, 396–403.
- Spitz & Lek (1999) Environmental impact prediction using neural network modelling. An example in wildlife damage. Journal of Applied Ecology, 36, 317–326.
- Stephens *et al*. (2007) A call for statistical pluralism answered. Journal of Applied Ecology, 44, 461–463.
- Stockwell (1999) The GARP modelling system: problems and solutions to automated spatial prediction. International Journal of Geographic Information Science, 13, 143–158.
- Thomas, Manly & McDonald (1992) A unified theory for the study of resource selection (availability and use) by wildlife populations. Wildlife 2001: Populations (eds D.R. McCullough & R.H. Barrett), pp. 56–64. Elsevier, London.
- Ward *et al*. (2009) Presence-only data and the EM algorithm. Biometrics, 65, 554–563.
- Warton & Shepherd (2010) Poisson point process models solve the pseudo-absence problem for presence-only data in ecology. Annals of Applied Statistics, 4, 1383–1402.
- Warton & Aarts (2013) Advancing our thinking in presence-only and used-available analysis. Journal of Animal Ecology, 82, 1125–1134.
- Yackulic *et al*. (2013) Presence-only modelling using MAXENT: when can we trust the inferences? Methods in Ecology and Evolution, 4, 236–243.