### Introduction

The study of resource selection is essential for describing relationships between animals and their environment, understanding factors that determine the distribution of species and managing wildlife populations. Resource selection studies are often motivated by a need to understand what factors increase (or decrease) the probability an animal will use a sample unit. A use-availability design is a common sampling design in resource selection studies. We define ‘use’ as physical presence within a sample unit, which is often used synonymously with the term ‘presence’. We define ‘sample unit’ as the basic unit from which data are collected. In a resource selection context, sample units can range from trees a woodpecker may forage on to resource patches of similar vegetation.

Under a use-availability sampling design, resource attributes (denoted *x*) are recorded from a random set of sample units that were used by an animal (denoted *z* = 1), and resource attributes are also recorded at a random set of sample units considered available to an animal. ‘Available’ sample units are synonymously called ‘background’ (Royle *et al*. 2012), ‘contaminated controls’ (Lancaster & Imbens 1996) or ‘pseudo-absences’ (Phillips, Anderson & Schapire 2006), though in practice it is unknown whether such sample units were used. Although these data are often referred to as ‘use-availability’ data (sensu Manly *et al*. 2002), some authors synonymously use the term ‘presence-only’ data. Estimating the absolute probability that a sample unit is used (i.e. a resource selection probability function; RSPF) from such data is difficult because the number of used sample units is not proportional to the occurrence of used sample units in the population of interest.

A common solution to this problem is to treat available sample units as if they were true absences. For example, Manly *et al*. (2002, p. 100) advocate fitting a logistic regression model to use-availability data. The resulting parameter estimates can then be substituted into a log-linear function that is assumed proportional to the absolute probability of use:

$$w(x) = \exp(\beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p)$$

This function is commonly referred to as a resource selection function (RSF), because it is assumed proportional to the absolute probability of use. Machine learning algorithms such as Maxent (Phillips, Anderson & Schapire 2006; Phillips & Dudík 2008) and Random Forests (Cutler *et al*. 2007) are also commonly used to construct RSFs from use-availability data. Machine learning methods focus primarily on maximizing predictive capability (Elith *et al*. 2006) rather than parametric estimation and can estimate highly complex relations between resource attributes and the relative probability a sample unit is used. We note that while some of the techniques outlined above, such as Maxent, are frequently referred to as species distribution models, they address problems identical to those encountered in resource selection studies, namely what environmental variables are associated with the spatial distributions of species. For more detailed reviews of RSFs (and species distribution models), see Guisan & Zimmermann (2000), Manly *et al*. (2002), Guisan & Thuiller (2005), and Pearce & Boyce (2006). An important problem with treating available sample units as true absences is an inability to estimate the absolute probability a sample unit is used. The resulting RSF is assumed proportional to the absolute probability of use, though such proportionality is not guaranteed (Keating & Cherry 2004; Royle *et al*. 2012). Additionally, relative probabilities may be meaningless if baseline probabilities are close to 0 or 1. For example, even if a sample unit is 5 times more likely to be used when a particular attribute is present, if the baseline probability of use is 0·0001, an animal is still highly unlikely to use that sample unit.

Given the shortcomings described above, practitioners tasked with wildlife management and ensuring biodiversity should prefer to build RSPFs that produce unbiased estimates of the absolute probability a sample unit is used. Recall that under a use-availability study design, resource attributes, *x*, are recorded at a random set of used locations, *z* = 1. The central statistical problem is then estimating Pr(*x*|*z* = 1). Applying Bayes' rule, we get:

$$\Pr(x \mid z = 1) = \frac{\Pr(z = 1 \mid x)\,\Pr(x)}{\Pr(z = 1)} \qquad \text{(eqn 1)}$$

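Notice that the right-hand side of equation 1 contains the term Pr(*z* = 1|*x*). This can be modelled via the logit link as:

$$\Pr(z = 1 \mid x) = \frac{\exp(\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p)}{1 + \exp(\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p)}$$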

and is the RSPF that is typically of interest to practitioners. Notice also that the denominator of equation 1 denotes the average probability any available sample unit is used, commonly referred to as ‘prevalence’. This equation, and the associated likelihood function, has been obtained by several authors (Lele & Keim 2006; Dorazio 2012; Royle *et al*. 2012). Maximizing the likelihood function with respect to the parameters involves approximating Pr(*z* = 1) with large samples of available sample units (e.g. Lele & Keim (2006) suggest recording resource attributes at ≥ 10 000 available sample units). Although the maximum likelihood estimator associated with equation 1 provides unbiased estimates of RSPF parameters, problems persist. Recording resource attributes from enough available sample units to adequately approximate prevalence may be difficult, particularly if a large spatial area is considered available and resource attributes are measured in person on the ground. Additionally, Lele (2009) described numerical maximization difficulties with the maximum likelihood estimator proposed by Lele & Keim (2006).
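The approximation of prevalence described above can be sketched in a few lines. This is a minimal illustration only, assuming a single resource attribute and a logit-link RSPF; the function names, parameter values and attribute values are hypothetical and are not taken from the original analysis.

```python
import math


def rspf(x, beta0, beta1):
    """Logit-link RSPF: absolute probability that a unit with attribute x is used."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))


def approx_prevalence(available_x, beta0, beta1):
    """Approximate Pr(z = 1) by averaging the RSPF over the available sample units."""
    return sum(rspf(x, beta0, beta1) for x in available_x) / len(available_x)


def use_weight(x, available_x, beta0, beta1):
    """Ratio Pr(z = 1 | x) / Pr(z = 1) by which equation 1 reweights Pr(x)."""
    return rspf(x, beta0, beta1) / approx_prevalence(available_x, beta0, beta1)


# Hypothetical attribute values recorded at three available sample units.
available_x = [-1.0, 0.0, 1.0]
prevalence = approx_prevalence(available_x, beta0=0.0, beta1=1.0)
print(prevalence)
```

In practice the average would be taken over a far larger available sample, which is the source of the field-effort burden noted above.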

Instead, one can obtain maximum likelihood estimates (MLEs) of RSPF parameters using a partial likelihood estimator derived from equation 1. Lancaster & Imbens (1996) proposed this model in the context of case–control sampling (hereafter called the case–control model with contaminated controls); Lele (2009) proposed the same model in the context of resource selection studies; and it is also the ‘observed’ likelihood described by Ward *et al*. (2009). The primary difference between the case–control model with contaminated controls proposed by Lancaster & Imbens (1996) and Lele (2009) and the full likelihood derived from equation 1 is that prevalence is treated like a parameter in the case–control model with contaminated controls. Although Lele (2009) demonstrated that MLEs of RSPF parameters obtained by maximizing this model with respect to the parameters are unbiased, widespread misconceptions persist, which have likely precluded widespread implementation. Keating & Cherry (2004) encountered difficulties fitting the case–control model with contaminated controls, including failure of optimization algorithms to converge to a unique solution when using categorical covariates or when starting values were far from actual values, and a lack of commercial software for fitting this model. Unfortunately, the difficulties encountered by Keating & Cherry (2004) have led others to dismiss this model as unstable and difficult to implement (e.g. Johnson *et al*. 2006; Pearce & Boyce 2006; Li, Guo & Elkan 2011). Another common misconception is that prevalence cannot be estimated from use-availability data (Elith *et al*. 2011).
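One way to see how prevalence can enter as a parameter is to write out the probability that a record in the pooled data came from the used sample. Under equation 1, attributes of used units follow Pr(*z* = 1|*x*)Pr(*x*)/Pr(*z* = 1), whereas attributes of available units follow Pr(*x*), so Pr(*x*) cancels when the two samples are pooled. The sketch below is derived from that argument and is an assumption about the general form of the model, not a transcription of the code in Appendix S1; the single-covariate RSPF and all function names are hypothetical.

```python
import math


def rspf(x, beta0, beta1):
    """Logit-link RSPF for a single resource attribute x."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))


def prob_from_used_sample(p_use, prevalence, n_used, n_avail):
    """Probability that a pooled record with use probability p_use is a 'used' record."""
    ratio = (n_used / n_avail) * p_use / prevalence
    return ratio / (1.0 + ratio)


def neg_log_lik(params, used_x, avail_x):
    """Negative log-likelihood with prevalence treated as a free parameter in (0, 1)."""
    beta0, beta1, prevalence = params
    n_used, n_avail = len(used_x), len(avail_x)
    nll = 0.0
    for x in used_x:  # records known to be used
        q = prob_from_used_sample(rspf(x, beta0, beta1), prevalence, n_used, n_avail)
        nll -= math.log(q)
    for x in avail_x:  # available records, whose use status is unknown
        q = prob_from_used_sample(rspf(x, beta0, beta1), prevalence, n_used, n_avail)
        nll -= math.log(1.0 - q)
    return nll


print(neg_log_lik((0.0, 0.0, 0.5), used_x=[0.0], avail_x=[0.0]))
```

A function of this kind could be passed to a general-purpose optimizer or embedded in an MCMC sampler; the choice of fitting algorithm is separate from the form of the likelihood.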

Solutions to all of these problems have been proposed in the literature, but widespread use of the case–control model with contaminated controls suffers from poor linkages among relevant advancements, a divergent terminology and thus continued misconceptions. For example, Lele & Keim (2006) describe the circumstances under which parameters associated with categorical covariates can be estimated. However, they do not reference the problems encountered by Keating & Cherry (2004), and thus, their solution may have gone widely unnoticed. Similarly, Royle *et al*. (2012) dispel the notion that prevalence cannot be estimated from use-availability data. However, Keating & Cherry (2004) refer to prevalence as the ‘unconditional probability of use’, and Lele (2009) simply refers to prevalence as ‘*α*’ (noting the constraint *α* ∈ (0, 1)). Thus, it may be unclear to many readers that the advancement made by Royle *et al*. (2012) even applies to the models considered by Keating & Cherry (2004) and Lele (2009). Finally, there are few linkages among relevant literature. For example, Lele (2009) neither cites Lancaster & Imbens (1996) with the original formulation of the case–control model with contaminated controls, nor suggests the model he proposes is the same one evaluated by Keating & Cherry (2004). Thus, many practitioners may fail to notice that Lele (2009) provides solutions to many of the problems encountered by Keating & Cherry (2004).

Here, we address commonly held misconceptions regarding Lancaster & Imbens (1996) and Lele's (2009) case–control model with contaminated controls. Using simulations, we demonstrate that parameters associated with categorical covariates and prevalence can be estimated from use-availability data. We also show that modern computational advances can be used to obtain stable estimates of RSPF parameters. We go beyond demonstrating the basic feasibility of the case–control model with contaminated controls and evaluate model behaviour over a variety of realistic field conditions, which can help guide future studies. We also provide R and WinBUGS code (Appendix S1, Supporting information) to make the model accessible to potential users. By demonstrating the basic feasibility of this model, using simulations to help guide study design and providing model code, we hope to encourage widespread application of a promising model in studies of resource selection.

### Discussion

Our results demonstrate that the case–control model with contaminated controls originally proposed by Lancaster & Imbens (1996) and subsequently proposed by Lele (2009) is a stable and unbiased method for estimating the parameters of RSPFs from use-availability data. We overcame all of the previously reported shortcomings of this model, including sensitivity of optimization algorithms to starting values and an inability to estimate prevalence and parameters associated with categorical covariates. Keating & Cherry (2004) reported failure of optimization algorithms to converge to a unique value if starting values were far from MLEs. However, this result was a function of the optimization algorithm rather than a flaw in the model itself. If the likelihood surface contains local maxima, gradient-based optimization algorithms may converge on local maxima rather than global maxima if starting values are far from data-generating values. Modern computational advances, such as Lele, Dennis & Lutscher's (2007) data-cloning algorithm, help overcome these optimization issues. Data cloning relies on Markov chain Monte Carlo (MCMC) techniques often used for Bayesian estimation. As a result, data cloning will converge on MLEs, even if starting values are far from MLEs and local maxima exist (Gelman *et al*. 2004; Lele, Dennis & Lutscher 2007). Our results also address the commonly held misconceptions that neither categorical covariates (Keating & Cherry 2004) nor prevalence (Elith *et al*. 2011) can be estimated from use-availability data. Our simulations indicate that the case–control model with contaminated controls produces unbiased estimates of both categorical covariate parameters (the *β*_{2} parameter) and prevalence (π). Although Lele & Keim (2006) described the conditions under which categorical covariate parameters can be estimated, and Royle *et al*. (2012) dispel the notion that prevalence cannot be estimated from use-availability data, we explicitly link these solutions to the problems encountered by Keating & Cherry (2004).
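The idea behind data cloning can be illustrated with a toy example. The sketch below does not fit the case–control model with contaminated controls and does not use MCMC; it substitutes a conjugate beta-binomial model, for which the posterior is available in closed form, to show that treating *k* copies of the data as independent pulls the posterior mean towards the MLE and makes *k* times the posterior variance approach the large-sample variance of the MLE. The data values and the uniform prior are hypothetical.

```python
def cloned_posterior(successes, trials, k, a=1.0, b=1.0):
    """Beta posterior mean and variance after cloning binomial data k times."""
    alpha = a + k * successes
    beta = b + k * (trials - successes)
    total = alpha + beta
    mean = alpha / total
    var = alpha * beta / (total ** 2 * (total + 1.0))
    return mean, var


successes, trials = 3, 20            # hypothetical data
mle = successes / trials             # binomial MLE
mle_var = mle * (1.0 - mle) / trials # large-sample variance of the MLE

for k in (1, 10, 100, 1000):
    mean, var = cloned_posterior(successes, trials, k)
    # As k grows, mean approaches mle and k * var approaches mle_var.
    print(k, mean, k * var)
```

With one clone the prior still influences the posterior mean; with many clones its influence becomes negligible, which is why the result does not depend on the prior or, in an MCMC setting, on the starting values.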

We believe the case–control model with contaminated controls offers several advantages when modelling resource selection of animals. Absolute probabilities are more intuitive to interpret than relative probabilities. Indeed, probabilistic interpretations are so intuitive that many software programs that construct RSFs from use-availability data (e.g. Maxent; Phillips, Anderson & Schapire 2006) produce output scaled between 0 and 1 (which is often erroneously interpreted as absolute probabilities). The case–control model with contaminated controls offers managers the ability to estimate the absolute probability a sample unit is used, facilitating straightforward comparisons between species and studies. Furthermore, models commonly used to estimate the parameters of RSFs, such as the exponential model (Manly *et al*. 2002, p. 100) or Maxent (Phillips, Anderson & Schapire 2006), produce resource selection ‘indices’, which may not be proportional to the absolute probability of use (Keating & Cherry 2004; Royle *et al*. 2012). In contrast, we demonstrated the case–control model with contaminated controls produces unbiased estimates of RSPF parameters. Finally, this model facilitates estimation of RSPF parameters with modest sample size requirements relative to alternative methods (e.g. Lele & Keim 2006; Royle *et al*. 2012), particularly if resource variables at available sample units are to be measured in the field. We thus believe the case–control model with contaminated controls will provide a practical method for estimating the parameters of RSPFs from field data.

Our simulations revealed potential sources of bias in the case–control model with contaminated controls. We expected some bias at high prevalence, since this leads to many available sample units that were actually used (i.e. ‘contaminated controls’). In practice, we do not expect contamination rates at the level explored in simulated data (π = 0·75) to be a problem, since common species (those with high prevalence) are more efficiently sampled using different protocols. For example, estimating the probability that a common species uses a sample unit may be more efficient by simply surveying a random selection of sample units and recording detection/nondetection. Indeed, a use-availability design is likely most efficient when the species of interest is relatively rare or difficult to detect, such that few observations would be made from a selection of sample units made without regard to use.

Our application of the case–control model with contaminated controls to hellbender use-availability data highlights the utility of this model when applied to a real data set. Recovery of Ozark hellbenders, like many rare habitat specialists, depends on conservation of specific resources that may naturally occur at low densities. In such circumstances, conservation planning can benefit from tools designed to identify habitat characteristics of high conservation priority, as well as species prevalence. For example, our application of this model was useful for identifying resource characteristics likely to be important to hellbenders as well as their rarity in a biologically relevant spatial extent (i.e. a river). Our estimates of the relation between probability of use and coarse substrate and distance to cover are consistent with Bodinof *et al*. (2012). However, our implementation had the advantage of estimating the absolute probability a hellbender would use a particular section of stream as a function of substrate and distance to cover. Estimating absolute probabilities of use is particularly useful for species that occur at low or high prevalence, since relative probabilities may be uninformative in this context. Indeed, we found that Ozark hellbenders were approximately 2·6 times as likely to use sections of stream that contain coarse substrate (because the odds ratio of using coarse substrate = ). However, the low prevalence estimated by our model indicates that they are still unlikely to use any portion of the NFWR. These findings emphasize the importance of identifying patches of densely arranged coarse substrate in NFWR as a conservation strategy for Ozark hellbenders.
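The contrast between relative and absolute measures can be made concrete with a short calculation. The sketch below applies an odds ratio of 2·6, used here purely as an illustrative value in line with the approximate figure quoted above, to two hypothetical baseline probabilities; neither baseline is an estimate from the hellbender analysis.

```python
def apply_odds_ratio(baseline_prob, odds_ratio):
    """Absolute probability after multiplying the baseline odds by odds_ratio."""
    odds = baseline_prob / (1.0 - baseline_prob) * odds_ratio
    return odds / (1.0 + odds)


# Hypothetical baseline probabilities of use: one rare, one common.
for baseline in (0.01, 0.5):
    p = apply_odds_ratio(baseline, 2.6)
    print(baseline, p, p / baseline)
```

At a low baseline the odds ratio and the ratio of probabilities nearly coincide, yet the absolute probability of use remains small; at a high baseline the same odds ratio corresponds to a much smaller relative change.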

In addition to estimating probabilities of use within a study area, parameters estimated using the case–control model with contaminated controls can also be used to predict the absolute probability of use at new sample units. This represents a major advantage of the case–control model with contaminated controls relative to modelling approaches that estimate the parameters of RSFs, since predictions of absolute probability of use are straightforward to interpret and compare across species. Accordingly, all of the tools commonly used to evaluate predictive performance (e.g. AUC, Fielding & Bell 1997; *k*-fold cross-validation, Boyce *et al*. 2002) can be used to validate RSPFs. Evaluating the predictive performance of a model with independent data is often the most useful way to evaluate that model's generality.
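As a small illustration of one such tool, AUC can be computed directly from predicted probabilities as the proportion of (used, unused) pairs in which the used unit receives the higher prediction. The sketch assumes an independent validation set in which use and non-use are known; the predicted values are hypothetical.

```python
def auc(pred_used, pred_unused):
    """Rank-based AUC: share of used/unused pairs ranked correctly (ties count half)."""
    wins = 0.0
    for p in pred_used:
        for q in pred_unused:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pred_used) * len(pred_unused))


# Hypothetical predicted probabilities of use at validation units.
pred_used = [0.9, 0.8]
pred_unused = [0.1, 0.85]
print(auc(pred_used, pred_unused))
```

Because AUC depends only on the ranking of predictions, it does not by itself assess whether predicted absolute probabilities are well calibrated.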

Our implementation of the case–control model with contaminated controls assumes independence of used and available samples. Assuming independence of used samples may be problematic if observations of space use are highly correlated. However, certain sampling protocols may help alleviate spatial autocorrelation in used samples. For example, ensuring that successive locations are adequately spaced in time may help alleviate concerns with spatial autocorrelation (Swihart & Slade 1985). Hellbender locations were separated by at least 24 h, which was assumed to be an adequate period for successive locations to be spatially independent. If spatial correlation is believed to be present in used samples, models that allow spatially correlated errors can be used. An autologistic model (Augustin, Mugglestone & Buckland 1996) may prove particularly useful in this context, since an autologistic model and the case–control model with contaminated controls rely on the same underlying RSPF. Another way to address spatial correlation in used samples is to model resource selection at the level of the individual animal and scale individual estimates to the population level (Marzluff *et al*. 2004; Thomas, Johnson & Griffith 2006). Spatial correlation represents a form of pseudoreplication (Hurlbert 1984), leading to overly precise, but unbiased, parameter estimates (Kutner *et al*. 2005). Thus, when population-level estimates are based on individual-level mean responses, spatial autocorrelation becomes irrelevant because individual-level means remain unbiased.

A critically important step in modelling use-availability data is defining what resources (or sample units) are available. In principle, all used resources represent a subset of available resources (Buskirk & Millspaugh 2006). Depending on the scale of the study, availability may be defined based on movement paths or home ranges of individual animals, up to the distributional limits of a species (Buskirk & Millspaugh 2006; Thomas & Taylor 2006). Additionally, availability is often defined by study site, political boundaries or by the limits of GIS coverage (e.g. when defining the ‘background’ in Maxent), though such arbitrary definitions can strongly affect inference regarding general patterns of resource selection (Johnson 1980). Definitions of what is available to an animal will necessarily differ to reflect study goals, though we recommend definitions that are biologically meaningful to a species rather than definitions based on convenience (e.g. conveniently available GIS layers).

Sample size should be considered when estimating the absolute probability of use from use-availability data. Even at sample sizes considered large for some field studies (*n*_{1} = *n*_{a} = 500), the case–control model with contaminated controls exhibited nontrivial bias at high prevalence. A one-size-fits-all sample size recommendation is potentially problematic, since biases may operate as a function of underlying parameters such as prevalence or strength of resource selection. Nonetheless, we recommend samples no smaller than 500 or 1000 used sample units. We encourage potential users to conduct prospective simulations to guide appropriate sampling design when using this model, including exploration of various nonlinear response functions (e.g. quadratic, threshold) and link functions (e.g. probit link).
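A prospective simulation starts by generating use-availability data under known parameter values. The following is a minimal sketch of one way to do so, assuming a single standard-normal resource attribute and a logit-link RSPF; the attribute distribution, parameter values and function names are hypothetical and do not reproduce the simulation design used in this study.

```python
import math
import random


def rspf(x, beta0, beta1):
    """Logit-link RSPF for a single resource attribute x."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))


def simulate_use_availability(n_used, n_avail, beta0, beta1, seed=0):
    """Return attribute values for a used sample and an available sample."""
    rng = random.Random(seed)
    used = []
    # Draw units from the landscape and keep each as 'used' with probability rspf(x).
    while len(used) < n_used:
        x = rng.gauss(0.0, 1.0)
        if rng.random() < rspf(x, beta0, beta1):
            used.append(x)
    # Available units are drawn at random, irrespective of whether they are used.
    avail = [rng.gauss(0.0, 1.0) for _ in range(n_avail)]
    return used, avail


used, avail = simulate_use_availability(500, 500, beta0=-2.0, beta1=1.5, seed=1)
print(sum(used) / len(used), sum(avail) / len(avail))
```

Repeating the simulation across a grid of sample sizes, prevalence values and selection strengths, and fitting the model to each data set, would show where bias becomes appreciable before field data are collected.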

Given the relatively large-sample requirements, the case–control model with contaminated controls will probably be most useful when applied to data collected from animals fitted with radiotelemetry or satellite GPS technology. However, we note that this model is not restricted to such data. This model may also be suitable for large-scale survey efforts that generate reliable presence points, but do not generate reliable absences. For example, the case–control model with contaminated controls may prove useful for modelling breeding bird survey data, which include reliable detections of breeding birds, but have been plagued by uncertain absences.

Our results tie together pieces of a disparate literature and demonstrate the unbiased nature of the case–control model with contaminated controls. We address the misconceptions that have prevented widespread use of this model and discuss how they can be overcome. Further, we identify conditions when the case–control model with contaminated controls may not be appropriate, helping guide the appropriate application of this model. Although presented in a resource selection context, this model can be extended to any context where a researcher wishes to compare a group with a known feature to the population as a whole. By demonstrating the unbiased nature of the case–control model with contaminated controls, we hope to spur further research into a model that promises to be a powerful tool in studies of resource selection.