## Introduction

The study of resource selection is essential for describing relationships between animals and their environment, understanding factors that determine the distribution of species and managing wildlife populations. Resource selection studies are often motivated by a need to understand what factors increase (or decrease) the probability an animal will use a sample unit. A use-availability design is a common sampling design in resource selection studies. We define ‘use’ as physical presence within a sample unit, which is often used synonymously with the term ‘presence’. We define ‘sample unit’ as the basic unit from which data are collected. In a resource selection context, sample units can range from trees a woodpecker may forage on to resource patches of similar vegetation.

Under a use-availability sampling design, resource attributes (denoted *x*) are recorded from a random set of sample units that were used by an animal (denoted *z* = 1), and resource attributes are also recorded at a random set of sample units considered available to an animal. ‘Available’ sample units are synonymously called ‘background’ (Royle *et al*. 2012), ‘contaminated controls’ (Lancaster & Imbens 1996) or ‘pseudo-absences’ (Phillips, Anderson & Schapire 2006), though in practice it is unknown whether such sample units were used. Although these data are often referred to as ‘use-availability’ data (sensu Manly *et al*. 2002), some authors synonymously use the term ‘presence-only’ data. Estimating the absolute probability that a sample unit is used (i.e. a resource selection probability function; RSPF) from such data is difficult because the number of used sample units is not proportional to the occurrence of used sample units in the population of interest.

A common solution to this problem is to treat available sample units as if they were true absences. For example, Manly *et al*. (2002, p. 100) advocate fitting a logistic regression model to use-availability data. The resulting parameter estimates can then be substituted into a log-linear function that is assumed proportional to the absolute probability of use:

$$w(x) = \exp(\beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p)$$

This function is commonly referred to as a resource selection function (RSF), because it is assumed proportional to the absolute probability of use. Machine learning algorithms such as Maxent (Phillips, Anderson & Schapire 2006; Phillips & Dudík 2008) and Random Forests (Cutler *et al*. 2007) are also commonly used to construct RSFs from use-availability data. Machine learning methods focus primarily on maximizing predictive capability (Elith *et al*. 2006) rather than parametric estimation and can estimate highly complex relations between resource attributes and the relative probability a sample unit is used. We note that while some of the techniques outlined above, such as Maxent, are frequently referred to as species distribution models, they address problems identical to those encountered in resource selection studies, namely, which environmental variables are associated with the spatial distributions of species. For more detailed reviews of RSFs (and species distribution models), see Guisan & Zimmermann (2000), Manly *et al*. (2002), Guisan & Thuiller (2005), and Pearce & Boyce (2006).

An important problem with treating available sample units as true absences is an inability to estimate the absolute probability a sample unit is used. The resulting RSF is assumed proportional to the absolute probability of use, though such proportionality is not guaranteed (Keating & Cherry 2004; Royle *et al*. 2012). Additionally, relative probabilities may be meaningless if baseline probabilities are close to 0 or 1. For example, even if a sample unit is 5 times more likely to be used when a particular attribute is present, if the baseline probability of use is 0·0001, an animal is still highly unlikely to use that sample unit.
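The arithmetic behind this example can be sketched in a few lines of Python. Only the 5-fold ratio and the 0·0001 baseline come from the text; the function name and the second baseline (0.15) are hypothetical values chosen purely for contrast.

```python
def absolute_probability(baseline, relative_ratio):
    """Scale a baseline probability of use by a relative ratio, capped at 1."""
    return min(1.0, baseline * relative_ratio)

ratio = 5.0  # identical relative effect in both scenarios

# Baseline taken from the text: use stays very unlikely despite the 5-fold increase.
rare = absolute_probability(0.0001, ratio)

# Hypothetical baseline (an assumption): the same ratio now implies frequent use.
common = absolute_probability(0.15, ratio)

print(f"baseline 0.0001 -> absolute probability {rare:.4f}")
print(f"baseline 0.15   -> absolute probability {common:.2f}")
```

The same relative effect therefore carries very different management implications depending on a baseline that an RSF alone does not supply.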

Given the shortcomings described above, practitioners tasked with managing wildlife and conserving biodiversity should prefer to build RSPFs that produce unbiased estimates of the absolute probability a sample unit is used. Recall that under a use-availability study design, resource attributes, *x*, are recorded at a random set of used locations, *z* = 1. The central statistical problem is then estimating Pr(*x*|*z* = 1). Applying Bayes' rule, we get:

$$\Pr(x \mid z = 1) = \frac{\Pr(z = 1 \mid x)\,\Pr(x)}{\Pr(z = 1)} \tag{1}$$

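Notice that the right-hand side of equation 1 contains the term Pr(*z* = 1|*x*). This can be modelled via the logit link as:

$$\Pr(z = 1 \mid x) = \frac{\exp(\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p)}{1 + \exp(\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p)}$$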

and is the RSPF that is typically of interest to practitioners. Notice also that the denominator of equation 1 denotes the average probability any available sample unit is used, commonly referred to as ‘prevalence’. This equation, and the associated likelihood function, has been obtained by several authors (Lele & Keim 2006; Dorazio 2012; Royle *et al*. 2012). Maximizing the likelihood function with respect to the parameters involves approximating Pr(*z* = 1) with large samples of available sample units (e.g. Lele & Keim (2006) suggest recording resource attributes at ≥ 10 000 available sample units). Although the maximum likelihood estimator associated with equation 1 provides unbiased estimates of RSPF parameters, problems persist. Recording resource attributes from enough available sample units to adequately approximate prevalence may be difficult, particularly if a large spatial area is considered available and resource attributes are measured in person on the ground. Additionally, Lele (2009) described numerical maximization difficulties with the maximum likelihood estimator proposed by Lele & Keim (2006).
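To illustrate why many available sample units are needed, the following minimal Python sketch approximates prevalence by averaging a logit-link RSPF over a random sample of available units. The single standard-normal covariate, the coefficient values and the function names are all assumptions made for illustration; none of them come from the studies cited above.

```python
import math
import random

def rspf(x, beta0, beta1):
    """Logit-link probability of use, Pr(z = 1 | x), for one covariate."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))

def approximate_prevalence(available_x, beta0, beta1):
    """Approximate Pr(z = 1) as the mean of Pr(z = 1 | x) over available units."""
    return sum(rspf(x, beta0, beta1) for x in available_x) / len(available_x)

rng = random.Random(1)       # fixed seed so the sketch is repeatable
beta0, beta1 = -1.0, 1.5     # hypothetical RSPF parameters (assumption)

# Hypothetical available samples: covariate drawn from a standard normal.
small_sample = [rng.gauss(0.0, 1.0) for _ in range(100)]
large_sample = [rng.gauss(0.0, 1.0) for _ in range(10_000)]

prev_small = approximate_prevalence(small_sample, beta0, beta1)
prev_large = approximate_prevalence(large_sample, beta0, beta1)

print(f"prevalence from 100 available units:    {prev_small:.3f}")
print(f"prevalence from 10 000 available units: {prev_large:.3f}")
```

Rerunning the sketch with different seeds should show the small-sample approximation varying more between runs than the large-sample one, which is the practical burden described above.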

Instead, one can obtain maximum likelihood estimates (MLEs) of RSPF parameters using a partial likelihood estimator derived from equation 1. Lancaster & Imbens (1996) proposed this model in the context of case–control sampling (hereafter called the case–control model with contaminated controls), Lele (2009) proposed the same model in the context of resource selection studies, and this model is also the ‘observed’ likelihood described by Ward *et al*. (2009). The primary difference between the case–control model with contaminated controls proposed by Lancaster & Imbens (1996) and Lele (2009) and the full likelihood derived from equation 1 is that prevalence is treated as a parameter in the case–control model with contaminated controls. Although Lele (2009) demonstrated that MLEs of RSPF parameters obtained by maximizing this likelihood are unbiased, misconceptions about the model persist and have likely precluded its widespread implementation. Keating & Cherry (2004) encountered difficulties fitting the case–control model with contaminated controls, including failure of optimization algorithms to converge to a unique solution when using categorical covariates or when starting values were far from actual values, and a lack of commercial software for fitting this model. Unfortunately, the difficulties encountered by Keating & Cherry (2004) have led others to dismiss this model as unstable and difficult to implement (e.g. Johnson *et al*. 2006; Pearce & Boyce 2006; Li, Guo & Elkan 2011). Another common misconception is that prevalence cannot be estimated from use-availability data (Elith *et al*. 2011).
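One way to see how prevalence can enter such a likelihood as a parameter is the sketch below. It applies Bayes' rule to a pooled sample of used and available units: a unit with attribute *x* belongs to the used sample with odds equal to (number used / (number available × prevalence)) × Pr(*z* = 1|*x*). This is a derivation under the sampling design described above with a single covariate; it is not necessarily the exact parameterization of Lancaster & Imbens (1996) or Lele (2009), and all function names and data values are hypothetical.

```python
import math

def rspf(x, beta0, beta1):
    """Logit-link probability of use, Pr(z = 1 | x), for one covariate."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))

def prob_from_used_sample(x, beta0, beta1, prevalence, n_used, n_avail):
    """Probability that a pooled unit with attribute x came from the used sample."""
    odds = (n_used / (n_avail * prevalence)) * rspf(x, beta0, beta1)
    return odds / (1.0 + odds)

def log_likelihood(used_x, avail_x, beta0, beta1, prevalence):
    """Log-likelihood of sample membership, with prevalence as a free parameter."""
    n_used, n_avail = len(used_x), len(avail_x)
    ll = 0.0
    for x in used_x:    # units known to be used
        ll += math.log(prob_from_used_sample(x, beta0, beta1, prevalence, n_used, n_avail))
    for x in avail_x:   # available units, whose true use status is unknown
        ll += math.log(1.0 - prob_from_used_sample(x, beta0, beta1, prevalence, n_used, n_avail))
    return ll

# Hypothetical covariate values (assumption, for illustration only).
used_x = [0.9, 1.4, 0.3, 2.1]
avail_x = [-1.2, 0.1, 0.8, -0.4, 1.7, -0.9]

ll_value = log_likelihood(used_x, avail_x, beta0=-1.0, beta1=1.5, prevalence=0.3)
print(f"log-likelihood at the trial values: {ll_value:.3f}")
```

In practice this function would be handed to a numerical optimizer, with prevalence constrained to lie between 0 and 1.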

Solutions to all of these problems have been proposed in the literature, but widespread use of the case–control model with contaminated controls has been hindered by poor linkages among relevant advancements, divergent terminology and, consequently, continued misconceptions. For example, Lele & Keim (2006) describe the circumstances under which parameters associated with categorical covariates can be estimated. However, they do not reference the problems encountered by Keating & Cherry (2004), and thus, their solution may have gone widely unnoticed. Similarly, Royle *et al*. (2012) dispel the notion that prevalence cannot be estimated from use-availability data. However, Keating & Cherry (2004) refer to prevalence as the ‘unconditional probability of use’, and Lele (2009) simply refers to prevalence as ‘*α*’ (noting the constraint *α* ∈ (0, 1)). Thus, it may be unclear to many readers that the advancement made by Royle *et al*. (2012) even applies to the models considered by Keating & Cherry (2004) and Lele (2009). Finally, there are few linkages among relevant literature. For example, Lele (2009) neither cites Lancaster & Imbens (1996) for the original formulation of the case–control model with contaminated controls, nor suggests that the model he proposes is the same one evaluated by Keating & Cherry (2004). Thus, many practitioners may fail to notice that Lele (2009) provides solutions to many of the problems encountered by Keating & Cherry (2004).

Here, we address commonly held misconceptions regarding Lancaster & Imbens' (1996) and Lele's (2009) case–control model with contaminated controls. Using simulations, we demonstrate that parameters associated with categorical covariates and prevalence can be estimated from use-availability data. We also show that modern computational advances can be used to obtain stable estimates of RSPF parameters. We go beyond demonstrating the basic feasibility of the case–control model with contaminated controls and evaluate model behaviour over a variety of realistic field conditions, which can help guide future studies. We also provide R and WinBUGS code (Appendix S1, Supporting information) to make the model accessible to potential users. By demonstrating the basic feasibility of this model, using simulations to help guide study design and providing model code, we hope to encourage widespread application of a promising model in studies of resource selection.