Advances and applications of occupancy models

Summary

  1. The past decade has seen an explosion in the development and application of models aimed at estimating species occurrence and occupancy dynamics while accounting for possible non-detection or species misidentification.
  2. We discuss some recent occupancy estimation methods and the biological systems that motivated their development. Collectively, these models offer tremendous flexibility, but simultaneously place added demands on the investigator.
  3. Unlike many mark–recapture scenarios, investigators utilizing occupancy models have the ability, and responsibility, to define their sample units (i.e. sites), replicate sampling occasions, time period over which species occurrence is assumed to be static and even the criteria that constitute ‘detection’ of a target species. Subsequent biological inference and interpretation of model parameters depend on these definitions and the ability to meet model assumptions.
  4. We demonstrate the relevance of these definitions by highlighting applications from a single biological system (an amphibian–pathogen system) and discuss situations where the use of occupancy models has been criticized. Finally, we use these applications to suggest future research and model development.

Introduction

Since the seminal work by MacKenzie et al. (2002, 2003), there has been an explosion in the development and application of models aimed at estimating species occurrence and occupancy dynamics while accounting for possible non-detection or species misidentification (MacKenzie et al. 2006, 2009; Miller et al. 2011). Since 2002, well over 1000 papers have cited occupancy models (Google Scholar) with a plethora of studies investigating ecological questions and processes such as species distribution modelling (e.g. Royle, Nichols & Kéry 2005; Kéry, Guillera-Arroita & Lahoz-Monfort 2013), habitat relationships (e.g. Ball, Doherty & McDonald 2005), metapopulation dynamics (e.g. Ferraz et al. 2007), invasive species dynamics (e.g. Bled, Royle & Cam 2011; Yackulic et al. 2012), multispecies relationships (e.g. competition or predation, MacKenzie et al. 2006; Miller et al. 2012a) and community dynamics (Zipkin, Dewan & Royle 2009). The studies have involved numerous vertebrate taxa as well as applications to plants (e.g. Kéry 2004), invertebrates (e.g. Govindan, Kéry & Swihart 2012) and pathogens (e.g. Gómez-Díaz et al. 2010). Uses of occupancy models have also extended to applications in human medicine, palaeontology (taxonomic ranges based on fossil data, Liow 2013) and even political science (probabilities of incidents reflecting political unrest).

Various extensions of the original static and dynamic models have been proposed to accommodate multiple occupied states (Royle 2004; Royle & Link 2005; Nichols et al. 2007; MacKenzie et al. 2009), estimate community-level metrics and dynamics (Dorazio & Royle 2005; Kéry & Royle 2009), simultaneously model habitat and occupancy dynamics (Martin et al. 2010; MacKenzie et al. 2011; Miller et al. 2012a), estimate species occurrence at multiple spatial or temporal scales (Nichols et al. 2008; Kendall 2009; McClintock et al. 2010b; Mordecai et al. 2011; Pavlacky et al. 2012), and model occupancy dynamics as a function of the occupancy states of nearby (neighbouring) sites (Royle & Dorazio 2008; Bled, Royle & Cam 2011; Yackulic et al. 2012). Other model development has been aimed at relaxing model assumptions by allowing heterogeneous detection probabilities (e.g. MacKenzie et al. 2006; Royle 2006), including abundance-induced heterogeneity in detection probability (Royle & Nichols 2003; Royle 2004; Royle & Dorazio 2008), dealing with lack of independence among repeated detection surveys at a sampling unit (Nichols et al. 2008; Hines et al. 2010; Guillera-Arroita 2011), accommodating misidentification or false-positive detections (Miller et al. 2011, 2013), and developing methods to address violations of the closure assumption (Rota et al. 2009; Kendall et al. 2013).

Occurrence data have the advantage of being relatively easy to collect, and the availability of various free software packages has contributed to proliferation of the use of occupancy models (e.g. Program PRESENCE, Hines 2006; Program MARK, White & Burnham 1999; the R package unmarked, Fiske & Chandler 2011; OpenBUGS, Lunn et al. 2009). Still, the ability to draw strong biological inference from occurrence data depends upon the clear articulation of study objectives and an appropriate sampling design. Investigators utilizing occupancy models have the ability, and responsibility, to define the following terms based on their biological questions, logistical constraints and other study specifics: sample units (i.e. sites), the time period over which occurrence is assumed to be static, replicate surveys and even the criteria that constitute ‘detection’. Several recent papers have been critical of the use of occupancy models when the above terms are ambiguous (Efford & Dawson 2012) or defined in a manner that leads to violation of critical model assumptions (e.g. Kendall & White 2009; Guillera-Arroita 2011). While the importance of study design has always been emphasized in the occupancy arena (MacKenzie & Royle 2005; MacKenzie et al. 2006; Bailey et al. 2007; Guillera-Arroita & Lahoz-Monfort 2012), most occupancy design papers concentrate on developing cost-efficient designs for a generic application (e.g. MacKenzie & Royle 2005; Guillera-Arroita, Ridout & Morgan 2010) or evaluate the relative merits of different detection methods (e.g. Nichols et al. 2008). To provide general recommendations on appropriate sample size and optimal allocation of effort among sites and surveys, these papers have necessarily assumed that investigators appropriately define key components to address their biological objectives.
Specific recommendations about how to define these key components are difficult to provide because objectives, logistical constraints and other key determinants of study design differ among studies. Such factors determine how well the study system corresponds to model assumptions and thus the strength of inference provided by that application of occupancy models.

In this paper, we discuss some of the newer occupancy models and the biological systems that motivated their development. Collectively, these models offer tremendous flexibility and exciting new ways for practitioners to address biological questions related to species occurrence. We emphasize that biological inference and interpretation of model parameters depend upon the study system and the ability to meet model assumptions. To illustrate the importance of defining key occupancy components, we show how these definitions can vary within a given biological system based on different study objectives. A clear objective that specifies exactly what biological aspect of the system is being represented by ‘occupancy’ should lead naturally to reasonable options for elements of the field design (e.g. sample unit, season length, etc.). This, in turn, will lead to analyses that yield strong, defensible biological inferences about the system of interest. Failure to specify a clear objective will result in weaker inferences.

Occupancy models: key components and assumptions

Occupancy is usually defined as the probability that the focal taxon occupies, or uses, a sample unit during a specified period of time during which the occupancy state is assumed to be static. A typical occupancy study design involves identifying the complete set of sample units of interest (of size S) and selecting a sample (of size s) in a manner that allows investigators to generalize conclusions based upon the sample to the specified population of units. During designated points in time that relate to the time-scale at which occupancy states are likely to change, the s units are repeatedly surveyed within a relatively short time period (during which the occupancy state at each unit is static), and the observed occupancy state is recorded during each survey of each unit, or ‘site’. This design clearly resembles Pollock's robust design in mark–recapture models (Pollock 1982; Kendall, Pollock & Brownie 1995), but within the occupancy literature, primary periods are often referred to as ‘seasons’ and secondary sessions as ‘surveys or visits’. Within each season, the occupancy state of each unit does not change (i.e. closure assumption); hence, the repeated surveys provide multiple opportunities to observe the true occupancy state for a given season. Between seasons, the occupancy state may change at the sites: occupied sites may become unoccupied (i.e. local extinction) and unoccupied sites may be colonized. Model parameters under this simple dynamic model (MacKenzie et al. 2003) include the following:

  • ψi1 = the probability that unit i is occupied by the target species during the first season.
  • pijt = the probability of detecting the species at an occupied unit i during the jth independent survey of the site during season t.
  • εit = the probability that occupied unit i in season t becomes unoccupied in season t + 1 (local extinction).
  • γit = the probability that an unoccupied unit i in season t is occupied by the target species in season t + 1 (colonization).

Model likelihoods for the data can be developed that explicitly incorporate the biological and sampling processes either by accounting for all possibilities when there is ambiguity in the true occupancy state (due to imperfect detection, i.e. integrating across the occupancy states), or using a hierarchical modelling approach. Parameter estimates may be obtained using maximum likelihood or Bayesian methods of inference.
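
As a minimal sketch of the first approach (summing over the possible true occupancy states), the following Python function computes the probability of one unit's detection history under the simple dynamic model. It assumes constant parameters across units, seasons and surveys; the function names and the numerical values are hypothetical and chosen only for illustration.

```python
from itertools import product


def season_likelihood(surveys, occupied, p):
    """Pr(one season's 0/1 survey outcomes | true occupancy state)."""
    if not occupied:
        # No false positives: an unoccupied unit can only yield non-detections.
        return 1.0 if not any(surveys) else 0.0
    prob = 1.0
    for y in surveys:
        prob *= p if y == 1 else 1.0 - p
    return prob


def dynamic_history_probability(history, psi1, p, eps, gam):
    """Pr(detection history) with constant psi1, p, eps (extinction), gam (colonization).

    history is a list of seasons; each season is a list of 0/1 survey outcomes.
    """
    # unocc / occ: joint probability of the data so far and the current true state.
    unocc = (1.0 - psi1) * season_likelihood(history[0], False, p)
    occ = psi1 * season_likelihood(history[0], True, p)
    for surveys in history[1:]:
        # Between-season transition, then weight by the new season's data.
        next_unocc = unocc * (1.0 - gam) + occ * eps
        next_occ = unocc * gam + occ * (1.0 - eps)
        unocc = next_unocc * season_likelihood(surveys, False, p)
        occ = next_occ * season_likelihood(surveys, True, p)
    return unocc + occ


# Hypothetical parameter values, for illustration only.
params = dict(psi1=0.6, p=0.5, eps=0.2, gam=0.1)

# Never detected in season 1, detected once in season 2.
print(dynamic_history_probability([[0, 0], [0, 1]], **params))

# Sanity check: the 16 possible two-season, two-survey histories sum to 1.
total = sum(
    dynamic_history_probability([list(h[:2]), list(h[2:])], **params)
    for h in product([0, 1], repeat=4)
)
print(round(total, 12))
```

Multiplying such probabilities over independent units gives the likelihood, which could then be maximized numerically or embedded in a Bayesian analysis.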

There are several critical assumptions for the model described above, including (i) the closure assumption mentioned above; (ii) the probability of initial occupancy (season 1, ψ1) and subsequent vital rate parameters (εt, γt) are constant among sites, or differences are modelled using covariates (e.g. usually via a logit link); (iii) no unmodelled heterogeneity in detection probabilities; (iv) survey outcomes are independent of one another; and (v) species are not misidentified or falsely detected when a site is unoccupied (MacKenzie et al. 2006). If these assumptions are not met, estimators may be biased, precision may be overstated, and inferences about factors influencing model parameters may be incorrect. For greater detail on the basic modelling approaches, readers should see MacKenzie et al. (2002, 2003, 2006).

Motivation for new model development

During the past decade, several new models have been developed to expand upon the basic dynamic model or relax its associated assumptions. In this section, we discuss some of these models and refer readers to other extensions presented in this session.

Beyond Two States

Soon after the publication of the initial occupancy models, there was a desire to extend models to include more than one occupied state. There are many biological systems where it is advantageous to classify different subcategories of species occurrence (e.g. breeding/non-breeding or relative abundance classes) or simultaneously model the joint dynamics of habitat and species occurrence. Motivated by these biological systems, several dynamic multistate models have been developed to provide inferences about the probabilities of sites being in any one of the occupancy states and making dynamic transitions among them. These models allow for ambiguity not only in the presence or absence of the species (for example) from the field observations, but also in assignment of the correct subcategory. We begin by describing a general model and then highlight the flexibility of this class of model using a diverse group of innovative biological applications.

Motivated by the case of anuran call index data, Royle (2004) and Royle & Link (2005) developed models for assessing patterns in multiple occurrence states at a single point in time. Nichols et al. (2007) developed a reparameterized version of the Royle & Link (2005) model to assess reproductive success at a unit, given species occurrence. MacKenzie et al. (2009) provided a general multistate framework, within which these previous developments represented special cases (parameterizations), and then extended these methods to allow estimation of parameters governing the dynamic processes responsible for change in these occupancy states between seasons. For generality, we present both the conditional binomial parameterization and the multinomial parameterization of MacKenzie et al. (2009), but we acknowledge that most applications have utilized the conditional form. The conditional binomial approach may be biologically more reasonable when progression from one occupancy state to another is considered as a series of steps (e.g. species is present or absent at a unit and then given presence, breeding or no breeding occurred). Numerically, this parameterization can be more stable, particularly when covariates are incorporated. Importantly, in all these cases, the observed occupancy states are defined hierarchically, such that the lowest observed state (non-detection) has the greatest ambiguity about the true occupancy state (all states, occupied and unoccupied, are possible), but there is no uncertainty regarding the true state at a unit where the highest state is observed.

We develop the model first in terms of three possible states (unoccupied and two occupied states) to be consistent with many model applications and then mention how the model can be extended to more states to accommodate other extensions such as habitat–occupancy dynamics and species co-occurrence models.

Let φ[m] be the probability that a unit is in occupancy state m, where $\sum_{m} \varphi^{[m]} = 1$. In the case of three mutually exclusive occupancy states, we can write the probability of a unit being unoccupied as 1 − φ[1] − φ[2]. Most applications of the multistate occupancy model have used a conditional binomial parameterization where the probability of the higher state is conditional on species occurrence. Here, the probability of occupancy is defined as ψ = φ[1] + φ[2], and the probability a unit is in state 2 (the highest state) can be written as conditional on occupancy, R = φ[2]/ψ. Thus, the initial state probability vector for the first season of sampling can be defined as ϕ0 = [1−φ[1]−φ[2] φ[1] φ[2]] = [1−ψ ψ(1−R) ψR]. A transition probability matrix ϕt is used to describe change in the true state of a unit between seasons t and t + 1. This matrix can be written in terms of parameters $\varphi_t^{[m,n]}$, the probability of a unit transitioning from state m at time t to state n at time t + 1, or as the product of two conditional probabilities: for example $\psi_{t+1}^{[m]}$, the probability of a unit being occupied at time t + 1, given that it was in state m at time t, and $R_{t+1}^{[m]}$, the probability a unit is in the highest state (2) at time t + 1, given the unit was in state m at time t and is occupied at time t + 1.

\[
\boldsymbol{\phi}_t =
\begin{bmatrix}
\varphi_t^{[0,0]} & \varphi_t^{[0,1]} & \varphi_t^{[0,2]} \\
\varphi_t^{[1,0]} & \varphi_t^{[1,1]} & \varphi_t^{[1,2]} \\
\varphi_t^{[2,0]} & \varphi_t^{[2,1]} & \varphi_t^{[2,2]}
\end{bmatrix}
=
\begin{bmatrix}
1-\psi_{t+1}^{[0]} & \psi_{t+1}^{[0]}\,\bigl(1-R_{t+1}^{[0]}\bigr) & \psi_{t+1}^{[0]}\,R_{t+1}^{[0]} \\
1-\psi_{t+1}^{[1]} & \psi_{t+1}^{[1]}\,\bigl(1-R_{t+1}^{[1]}\bigr) & \psi_{t+1}^{[1]}\,R_{t+1}^{[1]} \\
1-\psi_{t+1}^{[2]} & \psi_{t+1}^{[2]}\,\bigl(1-R_{t+1}^{[2]}\bigr) & \psi_{t+1}^{[2]}\,R_{t+1}^{[2]}
\end{bmatrix}
\]

Conditional on the true state of a unit at a given time, detection probabilities are defined for each parameterization (Table 1).

Table 1. Detection probabilities associated with dynamic multistate occupancy models. In the multinomial parameterization (A), $p_{t,j}^{l,m}$ is the probability of observing a unit in occupancy state l during survey j in season t, given the unit is in true occupancy state m. In the conditional binomial parameterization (B), $p_{t,j}^{[m]}$ is the probability of detecting the target species during survey j in season t, given the unit is in true state m, and δt,j is the probability of correctly classifying the true state as state ‘2’, given the species was detected during survey j in season t

True state    Observed state 0        Observed state 1                        Observed state 2
(A)
0             1                       0                                       0
1             $p_{t,j}^{0,1}$         $p_{t,j}^{1,1}$                         0
2             $p_{t,j}^{0,2}$         $p_{t,j}^{1,2}$                         $p_{t,j}^{2,2}$
(B)
0             1                       0                                       0
1             $1-p_{t,j}^{[1]}$       $p_{t,j}^{[1]}$                         0
2             $1-p_{t,j}^{[2]}$       $p_{t,j}^{[2]}(1-\delta_{t,j})$         $p_{t,j}^{[2]}\delta_{t,j}$

Following MacKenzie et al. (2009), the probability of the observed detection histories h collected over T seasons can be determined succinctly using matrix notation, that is, $\Pr(\mathbf{h} \mid \boldsymbol{\theta}) = \boldsymbol{\phi}_0 \left[ \prod_{t=1}^{T-1} D(\mathbf{p}_{h,t})\, \boldsymbol{\phi}_t \right] \mathbf{p}_{h,T}$, where ph,t is the detection probability vector for the portion of the full detection history observed in season t and D (ph,t) is a diagonal matrix with the elements of ph,t on the main diagonal (top left to bottom right) and zero elsewhere. Using a matrix formulation, ambiguity in the true occupancy state from the observed data is resolved through the matrix multiplication, which is essentially just the sum of the probabilities of the various possible outcomes. Assuming the detection histories are independent for each unit, the joint probability for the data (and the model likelihood) is

\[
L(\boldsymbol{\theta} \mid \mathbf{h}_1, \ldots, \mathbf{h}_s) = \prod_{i=1}^{s} \Pr(\mathbf{h}_i \mid \boldsymbol{\theta})
\]

where θ denotes all the parameters in the model. Alternatively, the same underlying modelling structure can be developed within a state-space or hierarchical model (MacKenzie et al. 2009).
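
The matrix calculation described above (initial state vector, then alternating detection weighting and state transition, ending with the final season's detection vector) can be sketched in Python as follows. This is an illustration only: it assumes three states, the conditional binomial parameterization with the detection structure of Table 1(B), and parameters that are constant across seasons and surveys. All function names and numerical values are hypothetical.

```python
from itertools import product

STATES = range(3)  # 0 = unoccupied, 1 = occupied (lower state), 2 = occupied (highest state)


def initial_vector(psi, R):
    """Initial state probabilities [1 - psi, psi(1 - R), psi R]."""
    return [1.0 - psi, psi * (1.0 - R), psi * R]


def transition_matrix(psi_next, R_next):
    """Row m holds the probabilities of moving from state m to states 0, 1, 2."""
    return [
        [1.0 - psi_next[m], psi_next[m] * (1.0 - R_next[m]), psi_next[m] * R_next[m]]
        for m in STATES
    ]


def detection_vector(surveys, p, delta):
    """Pr(observed states in one season | true state m), for m = 0, 1, 2."""
    table = [
        [1.0, 0.0, 0.0],
        [1.0 - p[1], p[1], 0.0],
        [1.0 - p[2], p[2] * (1.0 - delta), p[2] * delta],
    ]
    vec = [1.0, 1.0, 1.0]
    for y in surveys:
        vec = [vec[m] * table[m][y] for m in STATES]
    return vec


def multistate_history_probability(history, psi, R, psi_next, R_next, p, delta):
    """Initial vector, then (detection weighting, transition) per season, then final detection."""
    state = initial_vector(psi, R)
    phi = transition_matrix(psi_next, R_next)
    for surveys in history[:-1]:
        d = detection_vector(surveys, p, delta)
        weighted = [state[m] * d[m] for m in STATES]  # row vector times D(p)
        state = [sum(weighted[m] * phi[m][n] for m in STATES) for n in STATES]
    last = detection_vector(history[-1], p, delta)
    return sum(state[m] * last[m] for m in STATES)


# Hypothetical values, for illustration only.
args = dict(psi=0.7, R=0.4, psi_next=[0.2, 0.8, 0.9], R_next=[0.3, 0.4, 0.6],
            p={1: 0.5, 2: 0.8}, delta=0.6)

# Species detected without evidence of state 2 in season 1; state 2 observed in season 2.
print(multistate_history_probability([[0, 1], [2, 0]], **args))

# Sanity check: the 81 possible two-season, two-survey histories sum to 1.
total = sum(
    multistate_history_probability([list(h[:2]), list(h[2:])], **args)
    for h in product(STATES, repeat=4)
)
print(round(total, 12))
```

Because the all-zero rows of the detection table never exclude a state, a history of non-detections retains contributions from every true state, which is the ambiguity the matrix product sums over.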

Most avian applications of the above model have focused on evaluating factors influencing occupancy and reproductive success of nesting raptor species. Martin et al. (2009) used the method to explore potential negative impacts of recreational activities on Golden Eagles (Aquila chrysaetos) in Denali National Park, Alaska, and found that while there was some evidence of reduced colonization (i.e. $\psi_{t+1}^{[0]}$ was lower at highly accessible sites), conditional reproduction was better modelled as a function of prey abundance. MacKenzie et al. (2012) applied the model to potential nesting territories for California Spotted Owls (Strix occidentalis occidentalis) in California. This species is relatively long-lived and exhibits high fidelity to nesting territories. In this case, successive territory occupancy should be correlated with adult survival, and probability of reproductive success should be highly correlated with per capita reproductive rates, making it possible to investigate population dynamics without marking individuals. A similar multievent approach was also taken by Lorentzen, Choquet and Steen (2012) to estimate survival and hatching success at occupied nests of colonial seabird species.

Interestingly, several authors have applied the model to estimate habitat and occupancy dynamics by redefining the true states as unsuitable habitat (and thus unoccupied), suitable habitat occupied by no or only few individuals of the target species or suitable habitat occupied by some or many individuals of the target species. Such an approach was applied to study the use of water holes by African elephants (Loxodonta africana) in Hwange National Park, Zimbabwe (Martin et al. 2010), and the occurrence of larvae of a suite of plains fishes in the Arikaree River, eastern Colorado, USA (Falke et al. 2012). The aim of these studies was to examine how habitat suitability and factors that affect habitat suitability (e.g. rainfall, snowpack) influence the distribution and abundance of target species. Actively manipulating the availability of surface water (habitat) is one option that managers have to influence these species' distributions, and the resulting parameter estimates can be used to predict responses to potential actions under differing environmental conditions to identify optimal management decisions.

Applications such as these prompted the development of more flexible models to separate the fundamental components of habitat and species occurrence dynamics to better understand ecological processes. MacKenzie et al. (2011) developed a model that permits variable levels of species occurrence probabilities among multiple habitat types, where the species occurrence may also influence the habitat dynamics (e.g. overgrazing). Likewise, Miller et al. (2012a) conceptualized a model that included combinations of habitat (suitable/unsuitable) and predator and prey occurrence. Both papers modelled the dynamics of target amphibian species that rely on ephemeral habitats for long-term persistence, hence the emphasis on modelling habitat and species dynamics simultaneously. These studies provide examples of the flexibility of occupancy models, specifically investigating multiple simultaneous effects on focal species dynamics. At the same time, multiple effects necessitate the development of complex models and require that investigators carefully consider only those models that are relevant for their biological system. We will return to these considerations in later sections.

False-Positive Detections

Arguably the most vital assumption of the occupancy models mentioned thus far is that species are not misidentified or falsely detected when a site is unoccupied. While the possibility of false-positive errors had been acknowledged in various studies (e.g. Simons et al. 2007; Shea et al. 2011), the problem had been largely ignored in model development until recently (but see Royle & Link 2006). A series of experimentally based papers on birds and anurans highlighted the pervasiveness of the problem in aural detections, noting that false-positive errors existed for nearly all species and observers (Simons et al. 2007; McClintock et al. 2010a; Miller et al. 2012b). Moreover, there was limited ability to reduce these errors with additional, targeted training (Miller et al. 2012b). Using standard dynamic occupancy models, low levels of false-positive errors (< 5% of all detections) caused severe overestimation of site occupancy, colonization and local extinction probabilities, as well as spurious relationships between these parameters and explanatory variables (Royle & Link 2006; McClintock et al. 2010a; Miller et al. 2011, 2013). It should also be noted that biases introduced by species misidentification are not limited to analyses that account for imperfect detection. If a species may be misidentified, but is detected perfectly at a sample unit otherwise, ‘false presences’ will result and occupancy estimates are likely biased.

Miller et al. (2011, 2013) developed models that accommodate possible false-positive detections, provided a subset of the detections is certain (i.e. a species may be present and not detected, but detections have no false-positive errors; also see Hanks, Hooten & Baker (2011) for a similar Bayesian hierarchical model). The models resemble the multistate models described above and are analogous to the multievent models employed in mark–recapture studies to deal with state uncertainty (Pradel 2005). Initial occupancy and rate parameters are identical to those of the multistate model, where the true occupancy state of a unit i in season t, mit, is one of K discrete occupancy states. In the simple case of only two true states (occupied, mit = 1, or unoccupied mit = 0), possible observations on survey j are non-detection (denoted ytj = 0), uncertain detection (meaning detection with the possibility of a false positive; ytj = 1) and certain detection (ytj = 2). The expected probability of recording an observed state y, given the true state m is given in Table 2. Notice that if a detection is considered ‘certain’ (y = 2), the unit is assumed to be occupied (i.e. if a detection is considered ‘certain’, the species is detected without error). In addition to this model permitting both types of detections at any survey, Miller et al. (2011) also developed a complementary model for cases where a subset of sample units is surveyed by different methods on different sample occasions, with one method admitting possible false positives and the other method being certain (detections included no false positives).

Table 2. The probability of recording an observed state y, given the true state m, using occupancy models that allow for ‘false-positive’ detections. $p_{t,j}^{10}$ is the probability of incorrectly detecting the species during survey j of season t at an unoccupied unit (possible only for uncertain detections), $p_{t,j}^{11}$ is the probability of detecting the species during survey j of season t at an occupied unit, and bt,j is the probability that a detection is classified as certain during survey j of season t, given that the unit is occupied and the species is detected

True state    Observed state y = 0    Observed state y = 1              Observed state y = 2
0             $1-p_{t,j}^{10}$        $p_{t,j}^{10}$                    0
1             $1-p_{t,j}^{11}$        $p_{t,j}^{11}(1-b_{t,j})$         $p_{t,j}^{11}b_{t,j}$
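
To illustrate how these observation probabilities enter a likelihood, the sketch below computes the probability of a single-season history of observed states (0 = non-detection, 1 = uncertain detection, 2 = certain detection) for one unit. It assumes two true states and constant parameters; the names p10, p11 and b mirror the caption definitions, while the function names and numerical values are hypothetical.

```python
from itertools import product


def observation_probs(occupied, p10, p11, b):
    """Pr(observed state y = 0, 1, 2 | true state) for one survey."""
    if not occupied:
        # False positives are possible, but never as 'certain' detections.
        return [1.0 - p10, p10, 0.0]
    return [1.0 - p11, p11 * (1.0 - b), p11 * b]


def false_positive_history_probability(history, psi, p10, p11, b):
    """Sum over the two possible true states of a unit within one season."""
    prob_unocc = 1.0 - psi
    prob_occ = psi
    for y in history:
        prob_unocc *= observation_probs(False, p10, p11, b)[y]
        prob_occ *= observation_probs(True, p10, p11, b)[y]
    return prob_unocc + prob_occ


# Hypothetical values, for illustration only.
args = dict(psi=0.5, p10=0.1, p11=0.6, b=0.5)

# One uncertain detection: both true states remain possible.
print(false_positive_history_probability([0, 1], **args))
# One certain detection: only the occupied state contributes.
print(false_positive_history_probability([2, 0], **args))

# Sanity check: the 9 possible two-survey histories sum to 1.
total = sum(false_positive_history_probability(list(h), **args)
            for h in product(range(3), repeat=2))
print(round(total, 12))
```

Setting p10 to zero in this sketch recovers the standard single-season model in which every detection implies occupancy.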

False-positive models are relatively new and have only been applied to the ecological studies that helped motivate their development, namely anuran studies that rely heavily on aural detections (Miller et al. 2011) and a study investigating occupancy dynamics of wolf packs in Montana by combining hunter observations and radiotelemetry information (Miller et al. 2013). We believe that any study that relies on indirect animal detection, such as animal sign (e.g. tracks, scats; Karanth et al. 2011; Molinari-Jobin et al. 2012) or interviews of local experts (e.g. Zeller et al. 2011), will benefit from these new models. Additionally, the models could be applied in studies that utilize computer algorithms (e.g. Waddle, Thigpen & Glorioso 2009) or laboratory assays (e.g. McClintock et al. 2010b) to determine species identification from survey results. Given the severe bias that can result from ignoring false-positive detections, we hope that any study that may suspect such errors would use these models to formally test whether $p^{10} = 0$. Utilizing a combination of design modifications to lower the prevalence of false positives and model-based approaches to deal with problems that remain should reduce the bias caused by false-positive detections.

Multiscale Models

The work described above details methods that allow for the expansion of the number of occupied states, but another arena of rapid development has focused on expanding the number of hierarchical scales of species occurrence. Motivation for these models includes relaxing model assumptions, such as independence among surveys and closure, and differentiating between species occurrence at local and larger scales. Again, we begin by describing a single-season multiscale model and then highlight variations of this model theme using a diverse group of innovative biological applications.

Early multiscale occupancy models were developed to address lack of independence, or correlation, among surveys (Nichols et al. 2008; Hines et al. 2010). Often multiple detection devices are deployed at the same location within a sample unit to detect multiple species, or individuals of various life-history phases for a given species, or to compare the efficiency of multiple detection devices (see citations within Nichols et al. 2008). If detections from each device are used as surveys, lack of independence may exist if individual animals detected by one device are more likely to be detected by another device (i.e. detections among surveys are not independent). Nichols et al. (2008) exploited this dependence to permit inference about species occurrence at two hierarchical scales, the small scale of the location at which sampling devices were deployed and the larger scale of the sample unit within which the devices were located. The basic sampling design is identical to the general framework described above, but L different surveys (e.g. detection devices, observers or timed observations) are collocated in each sampled unit and sampled at T occasions or subunits. The occupancy state of the unit is assumed static over this time period, but the species local availability (e.g. presence at the specific location of the detection devices) may change over time or subunits. Model parameters under this model include the following:

  • ψi = the probability that sample unit i is occupied by the target species
  • θit = the probability the species is locally present (available for detection) at occasion or subunit t, given the unit i is occupied.
  • pitj = the probability of detecting the species with survey j, given that it is locally present at occasion or subunit t.

The two occupancy parameters, ψi and θit, permit the modelling of occupancy at two different scales (spatial or temporal): ψi corresponds to species occurrence at the larger scale, while θit refers to the presence of the target species at the local scale, conditional on species presence in the sample unit (the larger scale). The product ψiθit represents the unconditional probability of small-scale occupancy, indicating presence of individual(s) of the species at the local spatial or temporal scale, and ψi (1− θit) represents the probability that the species is present in the unit, but unavailable for detection at occasion or subunit t.
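
A minimal sketch of how ψ, θ and p combine into the probability of one unit's detection history under this multiscale structure is given below. It assumes constant parameters and conditional independence among occasions or subunits given unit-level occupancy; the function name and numerical values are hypothetical.

```python
from itertools import product


def multiscale_history_probability(history, psi, theta, p):
    """history: list over T occasions/subunits, each a list of L 0/1 survey outcomes."""
    occupied = psi
    for surveys in history:
        # Species locally present: each survey detects it with probability p.
        present = theta
        for y in surveys:
            present *= p if y == 1 else 1.0 - p
        # Species locally absent (unavailable): only non-detections are possible.
        absent = (1.0 - theta) if not any(surveys) else 0.0
        occupied *= present + absent
    never_detected = not any(any(surveys) for surveys in history)
    unoccupied = (1.0 - psi) if never_detected else 0.0
    return occupied + unoccupied


# Hypothetical values, for illustration only.
args = dict(psi=0.5, theta=0.4, p=0.5)

# Two subunits with two surveys each; detection only at the second subunit.
print(multiscale_history_probability([[0, 0], [1, 0]], **args))

# Sanity check: the 16 possible histories for T = 2, L = 2 sum to 1.
total = sum(
    multiscale_history_probability([list(h[:2]), list(h[2:])], **args)
    for h in product([0, 1], repeat=4)
)
print(round(total, 12))
```

In this sketch, a subunit with no detections contributes both the 'present but undetected' and the 'locally absent' terms, which is how the model separates availability from detection when L > 1.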

Pavlacky et al. (2012) employed this approach to estimate occupancy and local availability for two avian species thought to differ in characteristics that make them rare. They used a common sampling approach where sample units (1 km² plots) were chosen in a probabilistic manner and then multiple point count stations were systematically placed at equal distances (250 m) from one another within chosen units. Each fixed-radius (125 m) point count station was surveyed using a time-to-detection method during the breeding season. Lark sparrows (Chondestes grammacus) were fairly scarce among sample units ($\hat{\psi}$ = 0·2), but were locally common when they occurred ($\hat{\theta}$ = 0·35). Conversely, brown creepers (Certhia americana) occupied more units across the landscape ($\hat{\psi}$ = 0·3), but were locally rare ($\hat{\theta}$ = 0·1), identifying this species as more susceptible to future declines at the regional scale. In a similar example, Mordecai et al. (2011) investigated factors influencing occurrence and temporal availability of Louisiana waterthrush (Seiurus motacilla) at point count locations along stream transects in West Virginia, USA. While Pavlacky et al. (2012) assumed conditional independence among their point count stations, Mordecai et al. (2011) accounted for possible spatial dependence among point counts by including a random intercept for aggregates (transects) of point count stations.

A severe form of spatial dependence may occur for species that are detected along transects or trails, when units are only surveyed once (L = 1) and trail segments are used as spatial replicates. Such designs are often employed for large carnivore species, such as tigers, and two different modelling approaches have been proposed to deal with this type of problem (Hines et al. 2010; Guillera-Arroita et al. 2011). One approach describes the detection process as a continuous point process, where detections occur randomly along a continuous axis (Poisson process) or potential clustering in detections is accounted for via a Markov-modulated Poisson process (Guillera-Arroita et al. 2011). Another approach is to discretize the trail or transect into spatial subunits of equal length and then model spatial dependence as a first-order Markov process by defining two parameters for local occurrence (Hines et al. 2010):

  • θit = the probability the species is present (available) at subunit t within an occupied unit i, given the species was not present at the previous subunit.
  • θ'it = the probability the species is present (available) at subunit t within an occupied unit i, given the species was present at the previous subunit.

Detection histories for each sampled unit consist of detection–nondetection data from each successive spatial subunit, for example, hi = 01011 denotes a unit where 5 successive spatial subunits were surveyed. Assuming constancy in model parameters among units, the probability statement associated with this history is Pr(hi = 01011) = ψ[(1−θ1)θ2 + θ1(1−p1) θ′2]p2[(1− θ′3)θ4 + θ′3(1−p3)θ′4]p4θ′5p5. Note the terms in square brackets account for the ambiguity associated with the non-detection of the species in the first and third subunits. Every detection history can be modelled in this manner, and the likelihood under this model can be expressed as follows: $L(\psi, \theta, \theta', p \mid h_1, \ldots, h_s) = \prod_{i=1}^{s} \Pr(h_i)$. Covariates can be used to model variation in any of the model parameters, but with only a single survey at each spatial subunit, there is limited ability to distinguish between factors influencing local occurrence (availability) and those influencing the conditional detection probability. To date, most applications have assumed that local occurrence parameters are constant over subunits and modelled detection probability as a function of covariates (e.g. Karanth et al. 2011). If repeated surveys are conducted at each subunit, better differentiation of factors influencing local occurrence and detection probability is possible, while still accommodating spatial dependence in occupancy (availability) among spatial subunits (J.E. Hines and L.L. Bailey, unpublished data).
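
The probability statement for hi = 01011 can be checked numerically with a short recursion over subunits that tracks whether the species is locally present or absent. The sketch below assumes parameters that are constant across subunits (so θ1 = θ2 = θ4 = θ, and θ′2 = θ′3 = θ′4 = θ′5 = θ′) and, as in the expression above, uses θ for the first subunit; function names and numerical values are hypothetical.

```python
from itertools import product


def markov_history_probability(history, psi, theta, theta_prime, p):
    """Pr(history of 0/1 outcomes over successive subunits), first-order Markov availability."""
    def detect(y):
        return p if y == 1 else 1.0 - p

    # absent / present: joint probability of the data so far and the local state.
    absent = (1.0 - theta) if history[0] == 0 else 0.0
    present = theta * detect(history[0])
    for y in history[1:]:
        to_absent = absent * (1.0 - theta) + present * (1.0 - theta_prime)
        to_present = absent * theta + present * theta_prime
        absent = to_absent if y == 0 else 0.0
        present = to_present * detect(y)
    unoccupied = (1.0 - psi) if not any(history) else 0.0
    return psi * (absent + present) + unoccupied


def closed_form_01011(psi, theta, theta_prime, p):
    """The bracketed expression from the text, with constant parameters."""
    first = (1.0 - theta) * theta + theta * (1.0 - p) * theta_prime
    third = (1.0 - theta_prime) * theta + theta_prime * (1.0 - p) * theta_prime
    return psi * first * p * third * p * theta_prime * p


# Hypothetical values, for illustration only.
args = dict(psi=0.6, theta=0.3, theta_prime=0.7, p=0.5)

recursive = markov_history_probability([0, 1, 0, 1, 1], **args)
print(recursive, closed_form_01011(**args))

# Sanity check: the 32 possible five-subunit histories sum to 1.
total = sum(markov_history_probability(list(h), **args)
            for h in product([0, 1], repeat=5))
print(round(total, 12))
```

The recursion generalizes to any history length, whereas the bracketed closed form must be rewritten for each distinct history.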

The applications mentioned above have all involved surveys of spatial subunits within a larger unit, but it is easy to imagine the same general framework for temporal subunits or occasions within a longer time period of interest. For example, previous avian studies often involve a single visit to each unit, with detections recorded in multiple, successive time periods (e.g. species recorded in three 5-min periods during a 15-min point count). Even if a unit is visited multiple times within a season, the closure assumption is violated by the non-random occurrence (availability) of a species during the season. Investigators have addressed this issue by (i) creating extra replication (surveys) for each occasion t, then applying the standard dynamic occupancy model (MacKenzie et al. 2003) to directly estimate temporal availability, θit, and deriving a ‘large-scale’ parameter corresponding to the longer time period of interest (Rota et al. 2009), or (ii) modelling species arrival and departure times directly for species with staggered entry and exit times during the period of interest (Kendall et al. 2013).

Like the multistate models described in a previous section, multiscale models offer flexibility that is often necessary to address important model assumptions, but they are not meant as a fix for poor study design. Their application is essential to control for biases in certain biological and sampling scenarios, but the additional model complexity may lead to poorer precision and weaker inference.

Model flexibility and study design: disease system example

Occurrence data have the advantage of being relatively easy to collect, and historic records can often be converted to detection–nondetection data. The availability of numerous data sets and a variety of flexible occupancy models have led to many occupancy-based papers in the literature. For many of these applications in which the underlying ecological and data collection processes were well approximated by occupancy models, reasonable inferences were obtained. However, when data collection and study system do not correspond well to the processes for which occupancy models were developed, reasonable inferences are not necessarily expected. It is the responsibility of investigators utilizing any model based on detection–nondetection data to clearly define the following terms as applied to their study objectives: sample units (i.e. sites), the time period over which occurrence is assumed to be static, replicate surveys and even the criteria that constitute ‘detection’. Still, many practitioners overlook the importance of defining key terms (e.g. site, survey, season) with respect to their biological question(s) and focus instead on the practical trade-offs related to the optimal number of surveys per site. Most occupancy design papers have followed suit, concentrating on developing cost-efficient designs for a generic application (e.g. MacKenzie & Royle 2005). But while there are some aspects of study design that can be usefully treated in a generic manner, other aspects require a tailoring of design to ecological and sampling specifics. Such specifics typically involve first specifying study objectives, which lead to specific definitions of key model terms. Conditional on these objectives and definitions, appropriate data are collected and model(s) are developed to correspond to the underlying processes of interest.

In the following section, we focus on a single biological system (a host–pathogen system) to illustrate how different biological hypotheses (objectives) result in dramatically different study designs. In each study, we emphasize the definition of key occupancy components and the associated model assumptions that are most relevant to the biological questions being addressed.

Disease System: Background

Many amphibian declines world-wide have been attributed to the emerging infectious disease chytridiomycosis, caused by the fungal pathogen Batrachochytrium dendrobatidis (hereafter Bd, e.g. Berger et al. 1998; Muths et al. 2003). Bd is transmitted between individuals and the environment via an aquatic flagellated zoospore (Berger et al. 2005). When the load of zoospores on an individual is high enough, it can alter electrolyte transport across the epidermis, disrupting ion homeostasis, and lead to cardiac arrest (Voyles et al. 2009). Susceptibility of amphibians to chytridiomycosis is variable among species; vulnerable species often decline rapidly, while resistant species may function as a reservoir for the pathogen.

Despite numerous papers focusing on amphibian–Bd interactions, few have considered imperfect detection of either the host or the pathogen (but see Adams et al. 2010; Miller et al. 2012c). The following occupancy-based examples are a mixture of published works specific to the Bd disease system and proposed designs, some motivated by different ecological systems.

Pathogen Prevalence in a Single Host Population

Often the most fundamental parameter in disease studies is prevalence, the proportion of infected individuals in a defined population of organisms (i.e. disease frequency). Studies of Bd dynamics have focused on both prevalence and the infection intensity, defined as the abundance of Bd found on infected individuals (e.g. Briggs, Knapp & Vredenburg 2010; Miller et al. 2012c). In these studies, the ‘area of interest’ is usually a single amphibian population, where a subset of individuals (sample units) is randomly selected (presumably). Multiple surveys are obtained from all or a subset of captured individuals, and the manner in which these surveys are collected defines the time period over which prevalence is assumed to be static. Typically, Bd is detected on a captured individual by gently rubbing the surface of the skin with a cotton swab. Multiple PCR samples (surveys) are prepared from each swab and analysed using quantitative PCR (qPCR) techniques, yielding detection–non-detection information and a quantitative measure of zoospore equivalents for each survey (Hyatt et al. 2007). Under this sampling scenario, the resulting estimates of occupancy (prevalence) represent the probability of Bd occurrence among individuals in the population and apply to the time period over which individuals were captured, often only a single visit. As with most disease assays, the sensitivity of the PCR is <1 (Hyatt et al. 2007), and while most authors acknowledge this fact, they attempt to account for non-detection by simply aggregating results for multiple PCR surveys rather than estimating (and correcting for) detection probability (but see Miller et al. 2012c).
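To make the cost of simple aggregation concrete, the sketch below contrasts the expected 'naive' prevalence (the share of individuals with at least one positive PCR run) with the likelihood of a basic occupancy model with constant prevalence ψ and constant per-run detection probability p. This is a minimal illustration under those simplifying assumptions, not the analysis used in any of the cited studies; the function names and the numerical values are hypothetical.

```python
import math


def expected_naive_prevalence(psi, p, n_runs):
    """Expected share of individuals with at least one positive PCR run.

    An infected individual is missed on all n_runs with probability
    (1 - p) ** n_runs, so aggregation underestimates psi whenever p < 1.
    """
    return psi * (1.0 - (1.0 - p) ** n_runs)


def log_likelihood(psi, p, positives_per_individual, n_runs):
    """Log-likelihood of a constant-psi, constant-p occupancy model.

    positives_per_individual -- number of positive runs for each swabbed
    individual (hypothetical data layout). The binomial coefficient is
    omitted because it does not depend on psi or p.
    """
    total = 0.0
    for d in positives_per_individual:
        if d > 0:
            # At least one detection: the individual must be infected.
            total += math.log(psi) + d * math.log(p) + (n_runs - d) * math.log(1.0 - p)
        else:
            # No detections: infected but always missed, or uninfected.
            total += math.log(psi * (1.0 - p) ** n_runs + (1.0 - psi))
    return total


if __name__ == "__main__":
    # Illustrative values only: true prevalence 0.5, sensitivity 0.7, 2 runs.
    print(expected_naive_prevalence(0.5, 0.7, 2))
```

Maximizing `log_likelihood` over ψ and p (for example with a numerical optimizer) gives prevalence estimates that are corrected for imperfect assay sensitivity, whereas the aggregated proportion converges to ψ[1 − (1 − p)^K] for K runs per individual.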

Several previous studies utilized occupancy models with this type of data to estimate prevalence and address biological questions related to the individual characteristics (e.g. species, life stage) that may influence the probability of pathogen occurrence or detection (Gomez-Dias et al. 2010; Cooch et al. 2012). These applications rely on the basic assumptions outlined in the previous Occupancy Models section. Many of these assumptions are likely met for the sampling design described above (e.g. closure assumption, no false detections), and others can be addressed by modelling heterogeneity in pathogen occurrence and detection as a function of covariates specific to the individual (unit). However, in many disease systems, the detection of the pathogen is likely a function of the intensity of the pathogen on/in the host. In cases where no index of infection intensity is available, utilizing an approach that models detection probability as a function of the latent distribution of pathogen abundance should reduce the bias in prevalence caused by heterogeneity in Bd detection among individuals (Royle & Nichols 2003; Lachish et al. 2012). Many Bd studies now estimate an index of infection intensity (zoospore equivalents) for each survey. Miller et al. (2012c) described two analytical methods to accommodate the relationship between pathogen detection and infection intensity: an ad hoc approach using closed population abundance estimators (Huggins 1991) and a hierarchical Bayesian estimator that extended previous occupancy models to account for observational error in the detection of Bd and sampling error in measuring the associated Bd zoospore equivalents.
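The link between detection and latent abundance in the Royle & Nichols (2003) approach can be written compactly. The notation below is ours and is offered as a sketch: N_i is the latent pathogen abundance on individual i, r is the probability of detecting a single unit of that abundance in one survey, and the Poisson distribution is one common choice for the latent distribution rather than a requirement.

```latex
\begin{align}
  N_i &\sim \mathrm{Poisson}(\lambda), \\
  p_i &= 1 - (1 - r)^{N_i}, \\
  \psi &= \Pr(N_i > 0) = 1 - e^{-\lambda}.
\end{align}
```

Under this formulation, individuals carrying more of the pathogen are detected with higher probability, and prevalence is obtained as a derived quantity from the mean of the latent abundance distribution.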

An important assumption in these scenarios/applications is that the sampled units (captured individuals) are a random sample of the population of interest. Studies of other disease systems have emphasized that estimates of prevalence can be biased if infected and uninfected individuals have different capture probabilities (reviewed in Cooch et al. 2012). Establishing whether such detection differences exist among individuals usually requires multiple detections and observations of the disease state for individuals over time and typically utilizes mark–recapture methods (Cooch et al. 2012). Such studies are rare for amphibian–Bd systems, and none has found differences in individual capture probability associated with Bd occurrence, but none has properly accounted for uncertainty in individual infection state (Murray et al. 2009; Pilliod et al. 2010).

Estimating the Proportion of Infected Host Populations

Understanding factors that influence a pathogen's distribution and determining when and how it is transmitted among seemingly isolated host populations is a major theme in disease ecology and geographical epidemiology. Numerous studies have found that Bd is widely distributed geographically; however, the utility of these studies is limited due to the opportunistic nature of the sampling (Muths, Pedersen & Pedersen 2009). A more robust sampling design might define an ‘area of interest’ as a collection of perhaps-isolated amphibian populations within a specified region (e.g. amphibian populations at wetlands within a national park). A sample of these populations (units) is drawn in a manner that allows for generalization to the entire collection of amphibian populations. Populations are visited once to capture and swab individuals, often of different species, without replacement: these individual swabs can be viewed as surveys of the unit. Typically in these large-scale studies, a single PCR assay is conducted on each individual swab, or multiple swabs may be pooled, due to budget constraints. Inevitably, the number of surveys (PCR assays) varies among units, though a maximum is usually defined.

Adams et al. (2010) used this approach to investigate factors influencing spatial patterns in Bd occurrence in amphibian populations in Oregon and northern California, USA. They expected the processes that resulted in the presence of Bd in amphibian populations (e.g. variables associated with Bd thermal tolerances) would differ from the processes that result in the presence of Bd among individuals within an infected population (e.g. species, life-history stage, date). The authors recognized several limitations of applying occupancy models to this design. First, there is a finite number of surveys, determined by the number of amphibians available in the local population, and sampling individuals without replacement creates dependence among surveys. They used simulations to determine that bias resulting from such dependence was negligible for most populations in their study (i.e. for populations >50 individuals with true prevalence values for infected populations > 0·05). Additionally, the probability estimates represent a combination of at least two processes: the probability the pathogen is present on the individual in an infected population (i.e. true prevalence) and the probability of detecting the pathogen using standardized field and laboratory techniques, given it is present on the individual. While these authors believed that the probability of detecting the pathogen on an infected individual was quite high, the resulting detection probability should be interpreted as an index of prevalence (Miller et al. 2012c).
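The dependence created by sampling without replacement can be illustrated by comparing the probability of missing Bd entirely in an infected population under the two sampling schemes. The sketch below is our own illustration and does not reproduce the simulations of Adams et al. (2010); the function names, the population sizes and the assumption of a constant assay detection probability are all hypothetical.

```python
from math import comb


def miss_prob_independent(n_pop, n_infected, n_swabs, assay_p=1.0):
    """Pr(no positive swab) if swabs were independent surveys.

    Each swab is positive with probability prevalence * assay_p, the
    compound 'detection probability' estimated by the occupancy model.
    """
    per_swab = (n_infected / n_pop) * assay_p
    return (1.0 - per_swab) ** n_swabs


def miss_prob_without_replacement(n_pop, n_infected, n_swabs, assay_p=1.0):
    """Pr(no positive swab) when n_swabs distinct individuals are swabbed."""
    if n_swabs > n_pop:
        raise ValueError("cannot swab more individuals than the population holds")
    total = 0.0
    for k in range(min(n_infected, n_swabs) + 1):
        # Hypergeometric probability that k of the swabbed animals are infected.
        hyper = comb(n_infected, k) * comb(n_pop - n_infected, n_swabs - k) / comb(n_pop, n_swabs)
        # All k infected animals must then be missed by the assay.
        total += hyper * (1.0 - assay_p) ** k
    return total


if __name__ == "__main__":
    for n_pop, n_infected, n_swabs in [(10, 2, 3), (1000, 100, 20)]:
        print(n_pop,
              miss_prob_independent(n_pop, n_infected, n_swabs),
              miss_prob_without_replacement(n_pop, n_infected, n_swabs))
```

In this sketch the two probabilities differ noticeably when a large fraction of a small population is swabbed, and converge as the population grows relative to the number of swabs, which is the general pattern behind treating the dependence as negligible for larger populations.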

Investigating Factors Influencing Host–Pathogen Dynamics

Little is known about the long-term dynamics of Bd once established in an area of interest. The fungus has low mobility and is considered vulnerable outside of a host (Piotrowski, Annis & Longcore 2004). Recent work has shown differential susceptibility among amphibian species suggesting that some non-target species may function as reservoirs, or vectors, leading to the persistence or spread of Bd among habitats and populations (Briggs, Knapp & Vredenburg 2010). Even among susceptible species, it is apparent that Bd is not invariably lethal and that the pathogen persists in an enzootic state in hosts that survived infection during an initial epizootic (Briggs, Knapp & Vredenburg 2010; Pilliod et al. 2010). Understanding amphibian–Bd dynamics after an epizootic and determining whether amphibians may persist or recolonize affected areas requires the ability to sample Bd at sites that have few or no amphibians.

In this case, a sampling design might define an ‘area of interest’ as a collection of habitats (e.g. ponds) that may serve as amphibian breeding locations within a specified region. A sample of these habitats (units) is surveyed multiple times to detect both Bd and target amphibians during a time period where the occurrence of both pathogen and host is considered static. Captured amphibians can be sampled for Bd in the manner described above, but if amphibians are not detected, a water sample could be used to detect Bd in the environment (Kirshtein et al. 2007). Here, target amphibian detection informs the amphibian state (occupied) at the unit and serves as a survey for the detection of the pathogen. Both swabs and water samples are considered independent surveys of Bd at a unit, but the survey methods are likely to have different Bd detection probabilities (Kirshtein et al. 2007; Hyman & Collins 2012), and these can be accommodated in the modelling.
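Method-specific detection probabilities enter the model through the probability of each unit's survey history. The following sketch shows this for the pathogen alone, ignoring the amphibian state for brevity; it is a simplified illustration under that assumption, not a model from the cited papers, and the function name, method labels and parameter values are hypothetical.

```python
def bd_history_probability(surveys, psi, p_by_method):
    """Pr(survey history) for Bd at one unit with method-specific detection.

    surveys     -- list of (method, detected) pairs, e.g. ("swab", 1)
    psi         -- probability Bd occurs at the unit
    p_by_method -- dict mapping method name to its detection probability
    Surveys are treated as independent given Bd occurrence.
    """
    prob_if_present = 1.0
    any_detection = False
    for method, detected in surveys:
        p = p_by_method[method]
        prob_if_present *= p if detected else (1.0 - p)
        any_detection = any_detection or bool(detected)
    # Without any detection, the unit may simply not harbour Bd.
    prob_if_absent = 0.0 if any_detection else 1.0
    return psi * prob_if_present + (1.0 - psi) * prob_if_absent


if __name__ == "__main__":
    # Illustrative values only.
    detection = {"swab": 0.8, "water": 0.4}
    print(bd_history_probability([("swab", 1), ("water", 0)], 0.6, detection))
```

A full treatment would add the amphibian occupancy state and allow swab surveys only when amphibians are captured, as described in the text.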

To our knowledge, no such study exists for an amphibian–Bd system, but McClintock et al. (2010b) outlined a study design for such a system. In this case, occupancy corresponds to the spatial prevalence of the pathogen across habitats. More interesting are the host–pathogen dynamics that resemble a dynamic species co-occurrence model that may incorporate neighbourhood autologistic effects (McClintock et al. 2010b; Yackulic et al. 2012, in press). Alternatively, similarities could also be drawn to existing habitat–occupancy dynamic models with two types of habitats: those with and without the pathogen (MacKenzie et al. 2011). Relevant biological hypotheses would involve differences in the occurrence and dynamics of the target amphibian species conditional on the habitat type, and importantly, the occurrence of the target species may influence the dynamics of the pathogen (habitat).

Summary

Each of the studies above addresses different biological questions related to a common host–pathogen system. Accordingly, the definition of a sample unit varies from a single individual to a patch of potential breeding habitat. Investigators have more control over sample unit selection in the latter case, and the inference to all sample units is more defensible. Surveys are defined as individual swabs or groups of swabs, water samples or combinations of these methods. The time period over which these surveys are obtained, ranging from a single visit to an entire breeding season, defines the time period over which the pathogen state, or amphibian and pathogen state, is assumed to be static among units. The longer time periods separating such ‘primary sampling occasions’ (MacKenzie et al. 2003) then define the periods to which the occupancy–dynamic rate parameters apply. It is imperative to align the biological hypotheses of the study with a comparable sample design such that selected models are believed to correspond reasonably well to the processes that generated the data; thus, model assumptions are likely to be met.

Conclusions

The applications presented in this paper are intended to demonstrate the breadth of flexibility possible with current occupancy models. We have chosen to focus on those models that involve a single to a few target species, but we acknowledge that there is a rich literature involving the occurrence of multiple species (e.g. Dorazio & Royle 2005; Kéry & Royle 2009; Zipkin, Dewan & Royle 2009). Ecologists recognize that multiple effects are likely relevant to most studies of occupancy dynamics, and a desire to include these effects motivated the development of the models described here. Cries for even more flexible models to deal with system complexity place added responsibility not only on model developers (biostatisticians), but also on ecologists to clarify and communicate their hypotheses about the dynamics motivating the desire for increased model complexity. We view this added responsibility as a good thing, but simply note that the flexibility of occupancy models requires ecologists to try to restrict the investigated model set to those combinations of effects that represent plausible a priori hypotheses.

We recognize that there are relatively few papers that focus on the design of occupancy studies, due in part to the flexibility of existing models. It is difficult, if not impossible, to develop generic recommendations for all aspects of study design. Any attempt to do so would require using language that is so vague (to be inclusive) that it would lose its utility. What is general and conserved is the process that one should go through when designing an occupancy study. Typically, this process involves first specifying study objectives, which then direct definitions of key model terms. Conditional on these objectives and definitions, the process then entails collection of appropriate data and utilization or development of models to correspond to the underlying processes of interest. We hope that our abbreviated demonstration of this process using a single ecological system will assist other practitioners when considering the use of occupancy models to address hypotheses related to occupancy dynamics.

Many methodological extensions have been developed to address interesting biological questions, but some extensions have emerged from study designs where assumptions of simpler modelling approaches are not met. Such extensions should not necessarily be seen as encouraging the particular design used in that study as the best way to address those (or similar) objectives. Whenever possible, one should opt for the simplest and most appropriate study design to address the biological question(s) of interest, instead of relying on model-based solutions to correct for certain aspects of the study design after the data have been collected. In the latter situation, inferences will be more model dependent than when potential issues are identified and dealt with during the study design phase.
