## Introduction

In the wildlife and ecology literature, the terms presence-only and use-availability are equivalent. Papers that analyse previously collected or historical locations of an organism (e.g. museum samples, historical reports) tend to appear in the ecological literature and generally call their data *presence-only* (Pearce & Boyce 2006; Warton & Shepherd 2010; Fithian & Hastie 2012). Papers that analyse an organism's locations collected during field studies (e.g. pellet surveys, telemetry) tend to appear in the wildlife literature and generally call their data *use-availability* (Hobbs & Hanley 1990; Aebischer, Robertson & Kenward 1993; Cherry 1996; Johnson *et al*. 2006; Beyer *et al*. 2010). Close inspection of these papers reveals that the basic assumptions and estimated parameters under both methods are identical, a fact noted by Aarts, Fieberg & Matthiopoulos (2012) and Fithian & Hastie (2012). The analyses implied by both terms ultimately involve two independent samples: one containing locations where an organism has been and the other containing locations where an organism could have been. Furthermore, both terms imply analyses that relate characteristics of the environment to the relative probability that an organism is located in a particular habitat. Consequently, the terms ‘presence-only data’ and ‘use-availability data’ should be treated as synonyms. This article uses the term use-availability because it clearly implies two samples and is therefore more descriptive.

Presence-only data, or the used sample from use-availability data, consist of locations (in *n*-dimensional space, but commonly 2-D geographic space) where organisms have been located and observed in the past (Manly, McDonald & Thomas 1993; Johnson *et al*. 2006; Pearce & Boyce 2006; Warton & Shepherd 2010; Fithian & Hastie 2012). Broader definitions of ‘use’ that take into account the time spent or activity at a location are possible (Hebblewhite, Merrill & McDonald 2005; Buskirk & Millspaugh 2006). Indeed, this type of analysis can be used to analyse the locational characteristics of any event (e.g. locations of terrorist activity, earthquakes, human infrastructure development, etc.). As the names imply, presence-only or use-availability data do not contain the converse information on where organisms have not been located. Instead, these studies rely on data collected from locations where the organism *could have been*. The objective of the analysis is to identify characteristics of the environment that influence where organisms were located (Johnson *et al*. 2006; Fithian & Hastie 2012). Mathematically, this objective amounts to estimating the relative probability of an organism using a particular habitat. In the wildlife literature, achieving this objective is generally called estimating *resource selection* (Manly *et al*. 2002) or *habitat use* (Johnson *et al*. 2006).

The analysis of presence-only and use-availability data has become common (Elith *et al*. 2006; Fithian & Hastie 2012). In the early 1990s, Manly, McDonald & Thomas (1993) devoted a chapter to analysis of use-availability data and made the analysis accessible to the average researcher. Since then, modelling of use-availability and presence-only data has seen increased application. Warton & Shepherd (2010) reported that 343 publications between 2005 and 2008 in the ISI Web of Science contained the term ‘presence-only data’. A search of Google Scholar in April 2013 for the terms ‘presence-only’ and ‘use-availability’ revealed an approximate fivefold increase since 1991 in the number of papers containing these terms, and a twofold increase between 2001–2005 and 2006–2010.

Estimating the relative probability that an organism used a habitat fundamentally involves estimating two densities (Fithian & Hastie 2012). One density is for the characteristics of used or presence points; the other is for the characteristics of available or background points. The fact that two densities are involved is easy to overlook because the densities themselves are estimated simultaneously and ‘behind the scenes’ as part of the likelihood. Furthermore, these two densities are not typically interesting; rather, interest lies in the relative heights of these densities for a fixed set of characteristics. That is, researchers are typically interested in the ways that these two distributions differ, not in the densities themselves. A useful way to express the differences between these distributions is to take their ratio over all unique sets of characteristics. Johnson *et al*. (2006) defined the set of all such ratios to be the *resource selection function*, or RSF, although they were not the first to use that term. Manly, McDonald & Thomas (1993) were among the first to use the term RSF, although they did not explicitly write the RSF as a ratio of densities.
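The ratio described above can be written compactly. The notation below is an assumption introduced here for illustration and is not taken from the cited papers: $f^{u}$ and $f^{a}$ denote the densities of the characteristics $\mathbf{x}$ at used and available points, and $w$ denotes the RSF. The second line shows the exponential form of the RSF, with a hypothetical coefficient vector $\boldsymbol{\beta}$.

```latex
% RSF as a ratio of the used and available densities (assumed notation)
w(\mathbf{x}) \;\propto\; \frac{f^{u}(\mathbf{x})}{f^{a}(\mathbf{x})},
\qquad \text{equivalently} \qquad
f^{u}(\mathbf{x}) \;=\; \frac{w(\mathbf{x})\, f^{a}(\mathbf{x})}
                             {\int w(\mathbf{z})\, f^{a}(\mathbf{z})\, d\mathbf{z}} .

% Exponential form of the RSF (no intercept is needed because w is relative)
w(\mathbf{x}) \;=\; \exp\!\left(\beta_{1} x_{1} + \cdots + \beta_{p} x_{p}\right)
```

Because only ratios of $w$ at different values of $\mathbf{x}$ are interpreted, any constant multiple of $w$ describes the same selection pattern.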

Several analyses estimate an RSF (Neu, Byers & Peek 1974; Aebischer, Robertson & Kenward 1993; MacKenzie *et al*. 2002). Manly *et al*. (2002) and Johnson *et al*. (2006) approached the RSF estimation problem from a finite sampling point of view. A finite sampling point of view, in general, implies that a finite population of sampled entities exists, and that it is theoretically possible to observe every single one. Manly *et al*. (2002) and Johnson *et al*. (2006) defined habitat units to be discrete geographic regions with positive area (e.g. pixels or quadrats), and they proceeded to estimate either the resource selection *probability* function (RSPF) or the RSF. In most cases, the RSPF is not estimable because it requires knowledge of the exact sampling mechanisms. Estimates of the RSF can always be produced. The RSF estimates the probability of selecting a habitat unit with a particular set of characteristics relative to the probability of selecting another unit with different characteristics. Keating, Cherry & Lubow (2004) pointed out that the RSPF defined by Manly *et al*. (2002) was not constrained to the interval [0,1] and questioned the validity of the entire method despite the fact that the RSPF is less than 1 in the vast majority of cases. Johnson *et al*. (2006) extended the finite population method by directly estimating the RSF (rather than the RSPF) and empirically demonstrated the utility of the method. Since 2004, the finite population method and data type have been the focus of several papers (Johnson *et al*. 2006; Lele & Keim 2006; Lele 2009; Baddeley *et al*. 2010; Warton & Shepherd 2010; Aarts, Fieberg & Matthiopoulos 2012).

Warton & Shepherd (2010) approached the problem of RSF estimation from an infinite population point of view. In this view, an organism's locations are true mathematical points that have no area. Warton & Shepherd (2010) proposed modelling such point locations as an *inhomogeneous Poisson process* (IPP), and in so doing were able to apply methods from other bodies of statistical theory (specifically, spatial statistics). At first, the IPP view seems to require only the sample of used locations (hence the name presence-only). However, the IPP likelihood involves the integral of the intensity surface over the entire study area (Cressie 1993; Warton & Shepherd 2010; Fithian & Hastie 2012). To estimate this integral, a second sample is required that disregards the used locations. So, in practical terms, both the finite population approach and the infinite population approach require two samples.
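To make the role of the second sample concrete, the standard IPP log-likelihood can be sketched as follows. The symbols are assumptions introduced for illustration: $\lambda(s;\boldsymbol{\beta})$ is the intensity at location $s$, $A$ is the study area with area $|A|$, $s_{1},\dots,s_{n}$ are the used locations, and $s^{*}_{1},\dots,s^{*}_{m}$ are background locations, here assumed to be drawn uniformly over $A$.

```latex
% IPP log-likelihood, up to an additive constant (assumed notation)
\ell(\boldsymbol{\beta})
  \;=\; \sum_{i=1}^{n} \log \lambda(s_{i};\boldsymbol{\beta})
  \;-\; \int_{A} \lambda(s;\boldsymbol{\beta})\, ds

% Monte Carlo approximation of the integral from m uniform background points
\int_{A} \lambda(s;\boldsymbol{\beta})\, ds
  \;\approx\; \frac{|A|}{m} \sum_{j=1}^{m} \lambda(s^{*}_{j};\boldsymbol{\beta})
```

The first term uses only the used locations, whereas the approximation of the second term uses only the background locations, which is why two samples are needed in practice.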

An additional comment about one aspect of the theoretical justification for both the finite and infinite population approaches is appropriate. Animals in general select some type, size or shape of discrete habitat unit in the wild. Animals cannot actually select a single point; they must select some region surrounding a point. A theoretical problem arises under both the finite and infinite population approaches because the true size and shape of the regions being selected (i.e. the habitat units) cannot be known unless the animal's cognitive processes can be measured. Uncertainty surrounding the true habitat units is a theoretical problem for the finite population view because the true size and shape of habitat units are needed simply to define the population. If the true size and shape of habitat units are unknown, the hypothesized population of habitat units is an approximation of the true population of habitat units and estimated relative probabilities of selection are approximations to true relative probabilities. Likewise, the infinite population view is an approximation to reality simply because animals cannot actually be located at a single massless point. Because both approaches are approximations to reality, one approach cannot always be favoured. Both approaches have positive and negative characteristics. One positive characteristic of the finite population approach is that different sizes and shapes of habitat units can be posited and studied. A positive characteristic of the infinite population approach is its connection to other bodies of statistical theory and the additional flexibility this affords. From a practical point of view, the approximations made by both approaches do not impede implementation of the method.

The purpose of this article is twofold. First, the use-availability likelihood derivation of Johnson *et al*. (2006) is generalized to the infinite population formulation. Fithian & Hastie (2012) derive the same result for fixed samples from a case–control perspective after employing Bayes' rule. The derivation presented here is different because it defines the two-sample likelihood and maximizes it using a Lagrange multiplier method. Both derivations make the connection between analysis of use-availability data and presence-only data, and the different techniques provide additional perspective for both. The second purpose of this article is to present simple examples of the analysis in the hope that some of the controversy surrounding this analysis (Keating, Cherry & Lubow 2004; Lele & Keim 2006) can be put to rest. Specifically, it is hoped that readers will realize that RSPFs are not generally useful and that the exponential function is an appropriate form for RSFs. The derivation also confirms that standard logistic regression software maximizes the use-availability likelihood if the link function is exponential, a fact noted by Manly *et al*. (2002), Johnson *et al*. (2006) and Fithian & Hastie (2012). Furthermore, the logistic link function proposed by Lele & Keim (2006) is inappropriate, even for estimating the RSPF, because it cannot produce a function that is everywhere proportional to the RSF.
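As a rough numerical illustration of the logistic regression claim above, the following Python sketch simulates a used sample and an available sample under an exponential RSF and fits an ordinary logistic regression to the pooled data. Everything in it is an assumption made for illustration rather than part of the derivation in this article: the single covariate, the standard normal availability density, the coefficient value of 1, the sample sizes, and the helper `fit_logistic`. With a standard normal availability density and RSF exp(βx), the used density is normal with mean β and unit variance, which is what the simulation draws from.

```python
import numpy as np


def fit_logistic(X, y, n_iter=25):
    """Fit logistic regression by Newton-Raphson.

    X must already contain an intercept column. Hypothetical helper
    written for this sketch; any standard logistic routine would do.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))         # fitted probabilities
        grad = X.T @ (y - p)                        # score vector
        hess = (X * (p * (1 - p))[:, None]).T @ X   # observed information
        beta = beta + np.linalg.solve(hess, grad)   # Newton step
    return beta


rng = np.random.default_rng(0)
true_beta = 1.0                 # assumed RSF coefficient: w(x) = exp(true_beta * x)
n_used, n_avail = 5000, 5000    # assumed sample sizes

# Available sample: covariate values at locations the organism could have used.
avail = rng.normal(0.0, 1.0, n_avail)
# Used sample: density proportional to exp(true_beta * x) * N(0, 1),
# which is the N(true_beta, 1) density.
used = rng.normal(true_beta, 1.0, n_used)

# Pool the samples: y = 1 for used points, y = 0 for available points.
x = np.concatenate([used, avail])
y = np.concatenate([np.ones(n_used), np.zeros(n_avail)])
X = np.column_stack([np.ones_like(x), x])

intercept, slope = fit_logistic(X, y)
print(f"estimated RSF coefficient: {slope:.3f} (simulated value {true_beta})")
```

In this setup the fitted slope should fall close to the simulated coefficient. The intercept depends on the relative sizes of the two samples and is not interpreted as part of the RSF.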