1. A Bayesian analysis of site-occupancy data containing covariates of species occurrence and species detection probabilities is usually completed using Markov chain Monte Carlo methods in conjunction with software programs that can implement those methods for any statistical model, not just site-occupancy models. Although these software programs are quite flexible, considerable experience is often required to specify a model and to initialize the Markov chain so that summaries of the posterior distribution can be estimated efficiently and accurately.
2. As an alternative to these programs, we develop a Gibbs sampler for Bayesian analysis of site-occupancy data that include covariates of species occurrence and species detection probabilities. This Gibbs sampler is based on a class of site-occupancy models in which probabilities of species occurrence and detection are specified as probit-regression functions of site- and survey-specific covariate measurements.
3. To illustrate the Gibbs sampler, we analyse site-occupancy data of the blue hawker, Aeshna cyanea (Odonata, Aeshnidae), a common dragonfly species in Switzerland. Our analysis includes a comparison of results based on Bayesian and classical (non-Bayesian) methods of inference. We also provide code (based on the R software program) for conducting Bayesian and classical analyses of site-occupancy data.
The class of site-occupancy models developed independently by MacKenzie et al. (2002) and Tyre et al. (2003) is widely used in the analysis of presence-absence data collected in surveys of natural populations. These models extend conventional types of binary-regression models to account for errors in detection of individuals, which are common in surveys of animal or plant populations. Site-occupancy models use repeated surveys within sample locations or other measures of survey effort to resolve the ambiguity of an observed zero, which can occur if a species is absent at a sample location or if a species is present but undetected. Therefore, the probabilities of species presence (occurrence) and species detection given presence are estimated together when site-occupancy models are fitted to presence-absence data (more correctly, detection/non-detection data).
The collection and analysis of site-occupancy data may be used to address a variety of ecological inference problems that require accurate predictions of species occurrence. For example, metapopulation models (Hanski & Gilpin 1997) are often specified in terms of patch occupancy (site occupancy). In this context, the proportion of area occupied (PAO) by a species in a collection of sites may be relevant. Similarly, species distribution models (Scott et al. 2002; Elith & Leathwick 2009) are used to predict the spatial pattern of species occurrences over a species’ geographic range or over a subset of that range that has scientific or operational relevance. In both examples, a quantitative (functional) relationship between species occurrence probability and one or more aspects of its environment must be estimated accurately (i.e. free of bias from detection errors). Given sufficient data, site-occupancy models can be used to estimate this relationship accurately (MacKenzie et al. 2006) and to predict species occurrence probability at sampled or unsampled locations (Kéry et al. 2010). Other species distribution models that do not account for the effects of detection errors (e.g. binary-regression models) generally produce biased predictions of species occurrence probability.
Classical methods, such as maximum likelihood, can be used to estimate the parameters of site-occupancy models, and software exists for calculating these estimates [see programs presence (http://www.mbr-pwrc.usgs.gov/software/presence.html) and unmarked (Fiske & Chandler 2011)]. Once computed, the maximum likelihood estimates (MLEs) of the parameters can be used to predict species occurrence probability at sampled or unsampled locations, although it may be challenging to obtain accurate estimates of the uncertainty of these predictions. For example, parametric bootstrapping can be used to estimate the uncertainty of the predictions (Laird & Louis 1987), but this approach generally requires substantial computational effort.
In a Bayesian analysis, a model's parameters and its predictions are treated identically in the sense that all inferences are based on the posterior distribution of the model's parameters (Gelman et al. 2004). Inferences about predictions account for uncertainty in the model's parameters because the distribution of these predictions is obtained by averaging (marginalizing) over the posterior distribution of the parameters. Furthermore, these inferences are valid regardless of sample size because they do not rely on asymptotic approximations, unlike classical (non-Bayesian) methods. For these reasons, Bayesian methods of estimation and inference provide an attractive and useful alternative for ecological problems that require predictions of species occurrence.
The probability density function of the posterior distribution of a site-occupancy model's parameters cannot be expressed in closed form owing to analytically intractable integrals. Therefore, stochastic simulation methods, such as Markov chain Monte Carlo (MCMC), are typically used to estimate summaries of the posterior distribution (Geyer 2011). Software to implement these methods is available and includes the programs winbugs (http://www.mrc-bsu.cam.ac.uk/bugs/winbugs/contents.shtml), openbugs (http://www.openbugs.info) and jags (http://mcmc-jags.sourceforge.net), all of which have been used to conduct Bayesian analyses of site-occupancy data (MacKenzie et al. 2006; Royle & Dorazio 2008; Kéry 2010; Link & Barker 2010; Dorazio et al. 2011; Kéry & Schaub 2012). These programs are popular largely because they only require users to specify the underlying assumptions of a model. The technical details of constructing and implementing an MCMC algorithm are handled by the software with either limited or no control by the user. While this division of labour may seem desirable, considerable experience is often required to specify a model and to initialize the Markov chain so that the software constructs an appropriate algorithm. Model specification includes several choices – parameterization (hierarchically centred or not), priors and hyperparameter values, and functions to link the probabilities of species occurrence and detection to the effects of covariates on these probabilities. Initializing the Markov chain is not difficult and the software can even assign some parameters without user input; however, users must be careful not to assign parameter values that have low (or zero) posterior probability.
For example, a site-specific parameter for species occurrence must not be initialized at zero (absence) if the species is detected during one or more surveys of the site; doing so generates an error message that can be difficult to interpret without experience. Model specification and initialization are particularly challenging when attempting to analyse site-occupancy data for multiple species with existing software (Dorazio et al. 2010, 2011). Appendix S1 of Kéry & Schaub (2012) describes some commonly encountered problems and workarounds when using winbugs.
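The initialization rule described above can be made concrete with a minimal sketch. The function names and the use of NaN to mark surveys that were not conducted are assumptions made for illustration; they are not part of winbugs, openbugs or jags.

```python
import numpy as np

def detected_sites(Y):
    """Flag sites with at least one detection.

    Y is an n x J array of 0/1 detections; NaN marks a survey that was not
    conducted (an assumed convention for this sketch).
    """
    Y = np.asarray(Y, dtype=float)
    return np.nansum(Y, axis=1) > 0

def initial_occupancy(Y):
    """Starting values for z: 1 where the species was detected, 0 elsewhere."""
    return detected_sites(Y).astype(int)

def invalid_initial_sites(z_init, Y):
    """Indices of sites initialized as absent (z = 0) despite a detection."""
    z_init = np.asarray(z_init)
    return np.flatnonzero(detected_sites(Y) & (z_init == 0))

# Hypothetical detection histories for three sites and three surveys.
Y = np.array([[0, 1, 0],
              [0, 0, np.nan],
              [0, 0, 0]])
z0 = initial_occupancy(Y)                   # site 0 must start as occupied
bad = invalid_initial_sites([0, 0, 0], Y)   # an all-absent start is invalid at site 0
```

Starting every site at z = 1 would also be valid, because presence at a site without detections always has positive posterior probability when detection is imperfect.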
Given the potential for difficulties with existing software, it would seem useful to have an MCMC algorithm developed specifically for the analysis of site-occupancy data. Gibbs sampling algorithms are available for relatively simple site-occupancy models wherein species occurrence probability is constant and species detection probability is constant within surveys (Royle & Dorazio 2008, p. 107; Link & Barker 2010, pp. 177–178), but MCMC algorithms have not been developed for more complex models that contain the effects of site-specific covariates of occurrence and site- or survey-specific covariates of detection. For these models, a common choice of prior distribution and parameterization (multivariate normal priors of logit-scale parameters) leads to conditional posterior distributions that do not have familiar forms and must be sampled using specialized algorithms that require tuning (e.g. Metropolis–Hastings). These algorithms are inherently less efficient than Gibbs sampling because only a fraction of the proposed samples is accepted and tuning is usually needed to obtain desirable acceptance rates.
In this paper, we show that a Bayesian analysis of site-occupancy data can be carried out accurately and efficiently using Gibbs sampling when the model is specified using probit-scale parameters and uniform or multivariate normal priors. To illustrate this Gibbs sampling algorithm, we analyse site-occupancy data of the blue hawker, Aeshna cyanea (Odonata, Aeshnidae), a common dragonfly species in Switzerland. These data were analysed by Kéry et al. (2010) using the method of maximum likelihood and a logit-scale parameterization of the site-occupancy model. Here, we compare the results of using Bayesian and classical (non-Bayesian) methods of inference. We also provide the code used in this analysis, which was written using the R software program (R Development Core Team 2012).
Materials and methods
Site-Occupancy Models as Probit Regressions of Occurrence and Detection Probabilities
In the standard sampling protocol for collecting site-occupancy data, J > 1 independent surveys are conducted at each of n representative sample locations (sites), noting whether a species is detected or not detected during each survey. Let yij denote a binary random variable that indicates detection (yij = 1) or non-detection (yij = 0) during the jth survey of site i. Without loss of generality, we assume J is constant among all n sites to simplify the description of the model. In practice, site-specific differences in J pose no real difficulties and are relatively easy to implement. The standard sampling protocol yields an n × J matrix Y of detection/non-detection data.
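Site-occupancy models of detection/non-detection data may be represented as hierarchical models of the following form:
\[
\begin{aligned}
z_i &\sim \text{Bernoulli}(\psi_i), \\
y_{ij} \mid z_i &\sim \text{Bernoulli}(z_i \, p_{ij}),
\end{aligned}
\]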
where ψi = Pr(zi = 1) denotes the probability of species presence (occurrence) at site i and where pij = Pr(yij = 1|zi = 1) denotes the conditional probability of detecting the species during the jth survey of site i given that the species is present at site i (Royle & Kéry 2007; Royle & Dorazio 2008).
Suppose covariates thought to be informative of species occurrence have been measured at each of the n sample sites and the measurements are included in an n × (r + 1) matrix X of r regressors. (The first column of X is a vector of ones to accommodate an intercept parameter in the model of occurrence probability.) A probit-regression formulation $\psi_i = \Phi(x_i^T \beta)$, wherein Φ denotes the standard normal cumulative distribution function, is used to specify the effects β of the regressors xi on species occurrence probability at site i. (A superscript T is used to indicate the transpose of a matrix or vector.) Similarly, suppose covariates thought to be informative of species detection probability have been measured during each of J surveys conducted at each site. These measurements may be included in an n × J × (q + 1) array W of q regressors, and a probit-regression formulation $p_{ij} = \Phi(w_{ij}^T \alpha)$ may be used to specify the effects α of the regressors wij on the probability of detecting one or more individuals present at site i during the jth survey.
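To make the probit formulation concrete, the following sketch simulates data from the hierarchical model, assuming ψi = Φ(xiTβ) and pij = Φ(wijTα). The function name, the parameter values and the simulated covariates are hypothetical and chosen only for illustration.

```python
import numpy as np
from scipy.stats import norm

def simulate_occupancy(X, W, beta, alpha, rng):
    """Simulate latent occupancy z (length n) and detections Y (n x J).

    X: n x (r + 1) occurrence regressors (first column of ones)
    W: n x J x (q + 1) detection regressors (first slice of ones)
    """
    psi = norm.cdf(X @ beta)              # psi_i = Phi(x_i' beta)
    p = norm.cdf(W @ alpha)               # p_ij = Phi(w_ij' alpha), shape n x J
    z = rng.binomial(1, psi)              # z_i ~ Bernoulli(psi_i)
    Y = rng.binomial(1, z[:, None] * p)   # y_ij | z_i ~ Bernoulli(z_i * p_ij)
    return z, Y, psi, p

rng = np.random.default_rng(1)
n, J = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=n)])
W = np.stack([np.ones((n, J)), rng.normal(size=(n, J))], axis=2)
z, Y, psi, p = simulate_occupancy(
    X, W, beta=np.array([0.0, 1.0]), alpha=np.array([-0.5, 0.5]), rng=rng
)
```

Sites with z = 0 produce only zeros in Y, which is the source of the ambiguity of an observed zero that repeated surveys are meant to resolve.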
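The joint posterior density for this model is
\[
\pi(\beta, \alpha, z \mid Y, X, W) = C \, \pi(\beta, \alpha) \prod_{i=1}^{n} [\Phi(x_i^T\beta)]^{z_i} [1 - \Phi(x_i^T\beta)]^{1 - z_i} \prod_{j=1}^{J} [z_i \Phi(w_{ij}^T\alpha)]^{y_{ij}} [1 - z_i \Phi(w_{ij}^T\alpha)]^{1 - y_{ij}},
\]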
where C denotes the normalizing constant for the posterior distribution and π(β,α) specifies the joint prior density of the parameters β and α. (Henceforth, C will be used generically to denote the normalizing constant of a distribution.) It is entirely feasible to develop an MCMC algorithm based on the above joint posterior; however, the conditional posterior distributions (full conditionals) of β and α are not familiar forms and must be sampled using specialized algorithms that require tuning (e.g. Metropolis–Hastings). For example, assuming mutually independent priors for these parameters (wherein π(β,α) = π(β)π(α)) leads to the following full conditional densities:
\[
\pi(\beta \mid \cdot) = C \, \pi(\beta) \prod_{i=1}^{n} [\Phi(x_i^T\beta)]^{z_i} [1 - \Phi(x_i^T\beta)]^{1 - z_i} \tag{1}
\]
\[
\pi(\alpha \mid \cdot) = C \, \pi(\alpha) \prod_{i:\, z_i = 1} \prod_{j=1}^{J} [\Phi(w_{ij}^T\alpha)]^{y_{ij}} [1 - \Phi(w_{ij}^T\alpha)]^{1 - y_{ij}} \tag{2}
\]
Fortunately, the difficulties of sampling from these distributions can be avoided by recognizing that eqn 1 is simply the kernel of a probit-regression model of n binary outcomes zi and eqn 2 is simply the kernel of a probit-regression model of mJ binary outcomes yij, where $m = \sum_{i=1}^{n} z_i$ is the number of sites occupied by one or more individuals. Therefore, we may adopt an approach proposed by Albert & Chib (1993) and use parameter-expanded data augmentation (Liu & Wu 1999) to modify the model for the purposes of simplifying the analysis and the MCMC algorithm.
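To be specific, we establish a connection between probit-regression models of the binary random variables (zi and yij) and linear regression models of latent normal (Gaussian) random variables (vi and uij) as follows. Let $v_i \sim \text{Normal}(x_i^T\beta, 1)$ denote a normal random variable, and assume zi = 1 if vi > 0 and zi = 0 if vi ≤ 0. These assumptions imply $\Pr(z_i = 1) = \Phi(x_i^T\beta)$ (Albert & Chib 1993), our probit-regression model of zi. Similarly, let $u_{ij} \sim \text{Normal}(w_{ij}^T\alpha, 1)$ denote a normal random variable, and assume yij = 1 if uij > 0 and zi = 1, and assume yij = 0 if uij ≤ 0 and zi = 1 or if zi = 0. These assumptions imply $\Pr(y_{ij} = 1 \mid z_i = 1) = \Phi(w_{ij}^T\alpha)$, our conditional probit-regression model of yij. A succinct description of these modelling assumptions is:
\[
\begin{aligned}
v_i &\sim \text{Normal}(x_i^T\beta, 1), & z_i &= I(v_i > 0), \\
u_{ij} &\sim \text{Normal}(w_{ij}^T\alpha, 1), & y_{ij} &= z_i \, I(u_{ij} > 0),
\end{aligned}
\]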
where I(a) denotes the indicator function, which equals 1 if argument a is true and 0 otherwise. The joint posterior density of this parameter-expanded site-occupancy model is
\[
\pi(\beta, \alpha, z, v, u \mid Y, X, W) = C \, \pi(\beta, \alpha) \prod_{i=1}^{n} \phi(v_i \mid x_i^T\beta, 1) \, I\big(z_i = I(v_i > 0)\big) \prod_{j=1}^{J} \phi(u_{ij} \mid w_{ij}^T\alpha, 1) \, I\big(y_{ij} = z_i \, I(u_{ij} > 0)\big), \tag{3}
\]
where φ(·|μ,σ2) denotes the probability density function of a Normal(μ,σ2) distribution. Although the joint posterior of this model cannot be sampled directly, posterior summary statistics (means, quantiles, etc.) can be estimated accurately and efficiently using Gibbs sampling, as described in the following section.
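The latent-variable construction of Albert & Chib (1993) can be checked numerically. The sketch below draws a latent normal variable with unit variance and confirms by Monte Carlo that the event v > 0 has probability Φ(η), where η stands for a linear predictor such as xiTβ. The value of η and the number of draws are arbitrary choices made for illustration.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
eta = 0.4                                          # hypothetical value of x_i' beta
v = rng.normal(loc=eta, scale=1.0, size=200_000)   # v_i ~ Normal(eta, 1)
z = (v > 0).astype(int)                            # z_i = I(v_i > 0)

empirical = z.mean()                               # Monte Carlo estimate of Pr(z_i = 1)
theoretical = norm.cdf(eta)                        # Phi(eta), the probit probability
```

The two quantities should agree up to Monte Carlo error, which is why a probit regression of z can be replaced by a linear regression of v once v is treated as an additional unknown.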
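The full conditional distributions needed to apply Gibbs sampling to the joint posterior density (eqn 3) all have familiar forms and are easily sampled. For example, if uniform priors are used for β and α to specify prior indifference about the magnitude of these parameters, the full conditional distributions needed for Gibbs sampling are as follows:
1 $v_i \mid \cdot \sim \text{Normal}(x_i^T\beta, 1)$, truncated to $(0, \infty)$ if $z_i = 1$ and to $(-\infty, 0]$ if $z_i = 0$.
2 $\beta \mid \cdot \sim \text{Normal}\big((X^TX)^{-1}X^Tv, \; (X^TX)^{-1}\big)$, where $v = (v_1, \ldots, v_n)^T$.
3 For each site with $z_i = 1$, $u_{ij} \mid \cdot \sim \text{Normal}(w_{ij}^T\alpha, 1)$, truncated to $(0, \infty)$ if $y_{ij} = 1$ and to $(-\infty, 0]$ if $y_{ij} = 0$.
4 $\alpha \mid \cdot \sim \text{Normal}\big((\tilde{W}^T\tilde{W})^{-1}\tilde{W}^T\tilde{u}, \; (\tilde{W}^T\tilde{W})^{-1}\big)$, where $\tilde{W}$ is an mJ × (q+1) matrix formed from the mJ observations of wij and where $\tilde{u}$ is an mJ-vector formed from the mJ values of uij. In other words, only the values of wij and uij at occupied sites (wherein zi = 1) are needed to update α. For this reason, it is not necessary to update uij if zi = 0 (as shown in the previous step).
5 $z_i = 1$ if $y_i$ contains at least one detection; otherwise $z_i \mid \cdot \sim \text{Bernoulli}(\tilde{\psi}_i)$ with
\[
\tilde{\psi}_i = \frac{\psi_i \prod_{j=1}^{J} (1 - p_{ij})}{1 - \psi_i + \psi_i \prod_{j=1}^{J} (1 - p_{ij})},
\]
where $\psi_i = \Phi(x_i^T\beta)$, $p_{ij} = \Phi(w_{ij}^T\alpha)$ and yi = (yi1,…,yiJ)T.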
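One cycle of the sampler under uniform priors can be sketched as follows: truncated-normal draws for the latent variables v and u, normal draws for β and α, and Bernoulli draws for z at sites without detections. This is a minimal Python illustration written for this section, not the R implementation of Appendices S1 and S2. The function names, the all-ones starting values for z, the simulated data and the assumption of equal J with no missing surveys are all choices made here for illustration.

```python
import numpy as np
from scipy.stats import norm

def rtruncnorm(mean, positive, rng):
    """Draw from Normal(mean, 1) truncated to (0, inf) where `positive` is True
    and to (-inf, 0] elsewhere, by inverting the normal CDF."""
    p0 = norm.cdf(-mean)                                  # Pr(untruncated draw <= 0)
    lo = np.where(positive, p0, 0.0)
    hi = np.where(positive, 1.0, p0)
    u = np.clip(rng.uniform(lo, hi), 1e-12, 1 - 1e-12)    # avoid infinite quantiles
    return mean + norm.ppf(u)

def draw_flat_prior(D, latent, rng):
    """Draw from Normal((D'D)^-1 D'latent, (D'D)^-1)."""
    cov = np.linalg.inv(D.T @ D)
    return rng.multivariate_normal(cov @ (D.T @ latent), cov)

def gibbs_occupancy(Y, X, W, n_iter, rng):
    n, J = Y.shape
    detected = Y.sum(axis=1) > 0
    z = np.ones(n, dtype=int)                  # every site occupied: a valid start
    beta = np.zeros(X.shape[1])
    alpha = np.zeros(W.shape[2])
    beta_draws = np.empty((n_iter, beta.size))
    alpha_draws = np.empty((n_iter, alpha.size))
    pao = np.empty(n_iter)
    for t in range(n_iter):
        v = rtruncnorm(X @ beta, z == 1, rng)             # latent v given z and beta
        beta = draw_flat_prior(X, v, rng)                 # beta given v
        occ = z == 1
        W_occ = W[occ].reshape(-1, W.shape[2])            # mJ x (q + 1) matrix
        y_occ = Y[occ].reshape(-1)
        u = rtruncnorm(W_occ @ alpha, y_occ == 1, rng)    # latent u at occupied sites
        alpha = draw_flat_prior(W_occ, u, rng)            # alpha given u
        psi = norm.cdf(X @ beta)
        q = np.prod(1.0 - norm.cdf(W @ alpha), axis=1)    # Pr(no detections | present)
        z = np.where(detected, 1, rng.binomial(1, psi * q / (psi * q + 1.0 - psi)))
        beta_draws[t], alpha_draws[t], pao[t] = beta, alpha, z.mean()
    return beta_draws, alpha_draws, pao

# Simulated data with hypothetical parameter values.
rng = np.random.default_rng(42)
n, J = 150, 4
X = np.column_stack([np.ones(n), rng.normal(size=n)])
W = np.stack([np.ones((n, J)), rng.normal(size=(n, J))], axis=2)
z_true = rng.binomial(1, norm.cdf(X @ np.array([0.3, 1.0])))
Y = rng.binomial(1, z_true[:, None] * norm.cdf(W @ np.array([0.0, 0.5])))
beta_draws, alpha_draws, pao = gibbs_occupancy(Y, X, W, n_iter=300, rng=rng)
```

Every draw is accepted, so no tuning is involved. The vector `pao` holds the proportion of sampled sites occupied in each draw, which can never fall below the naive proportion of sites with detections.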
If prior distributions for β and α are assumed to be normal (thereby allowing either vague or informative priors to be specified), the full conditional distributions of these parameters are still normal but the means and covariances of these distributions are modified to accommodate the prior information. Specifically, steps 2 and 4 above are replaced by
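2 $\beta \mid \cdot \sim \text{Normal}\big((\Sigma_\beta^{-1} + X^TX)^{-1}(\Sigma_\beta^{-1}\mu_\beta + X^Tv), \; (\Sigma_\beta^{-1} + X^TX)^{-1}\big)$
4 $\alpha \mid \cdot \sim \text{Normal}\big((\Sigma_\alpha^{-1} + \tilde{W}^T\tilde{W})^{-1}(\Sigma_\alpha^{-1}\mu_\alpha + \tilde{W}^T\tilde{u}), \; (\Sigma_\alpha^{-1} + \tilde{W}^T\tilde{W})^{-1}\big)$
where $\mu_\beta$ and $\Sigma_\beta$ denote the prior mean and covariance matrix for β, and where $\mu_\alpha$ and $\Sigma_\alpha$ denote the prior mean and covariance matrix for α. As before, $\tilde{W}$ and $\tilde{u}$ contain the values of wij and uij at occupied sites only.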
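The update under a normal prior is the standard conjugate result for a linear regression with unit error variance, and it can be sketched with a single helper that serves both β (with design matrix X and latent vector v) and α (with the occupied-site regressors and latent values). The function name and the simulated inputs are hypothetical.

```python
import numpy as np

def draw_coefficients(D, latent, prior_mean, prior_cov, rng):
    """Draw regression coefficients given latent normal responses with unit
    variance and a Normal(prior_mean, prior_cov) prior.

    Returns the draw together with the conditional posterior mean and covariance.
    """
    prior_prec = np.linalg.inv(prior_cov)
    post_cov = np.linalg.inv(prior_prec + D.T @ D)
    post_mean = post_cov @ (prior_prec @ prior_mean + D.T @ latent)
    return rng.multivariate_normal(post_mean, post_cov), post_mean, post_cov

rng = np.random.default_rng(7)
D = np.column_stack([np.ones(50), rng.normal(size=50)])
latent = D @ np.array([1.0, -1.0]) + rng.normal(size=50)

# Least-squares solution, which is the conditional mean under a uniform prior.
ols = np.linalg.solve(D.T @ D, D.T @ latent)

# A vague prior nearly reproduces the uniform-prior mean ...
_, vague_mean, _ = draw_coefficients(D, latent, np.zeros(2), 1e6 * np.eye(2), rng)
# ... whereas a very tight prior pulls the mean to the prior mean.
_, tight_mean, _ = draw_coefficients(D, latent, np.array([2.0, 2.0]), 1e-8 * np.eye(2), rng)
```

As the prior covariance grows, the prior precision vanishes and the update reduces to the uniform-prior case.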
Example: Blue Hawker Data
Sampling methods and design
Kéry et al. (2010) provide a detailed description of the study area and methods of data collection. Briefly, the blue hawker was surveyed throughout Switzerland for the revision of the Red List of Swiss dragonflies. These surveys included sites that were known to have target (i.e. rare) species and also sites that were less well known in terms of dragonfly species occurrence. Each site corresponds to a 1-ha quadrat of the Swiss topographical system.
Surveys were conducted during each of 2 years (1999 and 2000) during the known flight periods of the dragonflies in Switzerland. Individual sites were surveyed between 1 and 22 times per year. In 1999, 1522 sites were surveyed; in the following year, 1403 sites were surveyed. Of the total number of distinct sites surveyed, 12·8% (328 of 2572) were sampled in both years.
After fitting several site-occupancy models to the blue hawker data, Kéry et al. (2010) compared values of Akaike’s Information Criterion (AIC) to select a parsimonious model for predicting site-specific occurrences of this species. In this model, occurrence probability was formulated as a logit-linear function of the effects of elevation and its square and cube; detection probability was formulated as a logit-linear function of the effects of elevation, Julian survey date and the squares of these two covariate measurements.
For purposes of comparison, we analysed the blue hawker data using the same set of regressors included in the parsimonious model of Kéry et al. (2010). Prior to the analysis, we centred and scaled the elevation and date measurements to have zero mean and unit variance. We also excluded from the analysis observations from 14 sites that lacked elevation measurements. The remaining data included observations from 1516 sites in 1999 and 1395 sites in 2000.
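The preprocessing described above can be sketched as follows. The elevation and date values are invented for illustration and are not the blue hawker data; the use of the n − 1 divisor when scaling is also an assumption, because the text does not state which divisor was used.

```python
import numpy as np

def standardize(x):
    """Centre and scale to zero mean and unit sample variance (n - 1 divisor,
    an assumption)."""
    x = np.asarray(x, dtype=float)
    return (x - np.nanmean(x)) / np.nanstd(x, ddof=1)

# Hypothetical values for six sites surveyed twice each.
elev = np.array([250.0, 480.0, np.nan, 900.0, 1500.0, 2100.0])      # metres
date = np.array([[150, 190], [160, 200], [155, 195],
                 [170, 210], [165, 205], [180, 220]], dtype=float)  # Julian days

keep = ~np.isnan(elev)              # drop sites without an elevation measurement
elev_s = standardize(elev[keep])
date_s = standardize(date[keep])    # standardized over all retained surveys

n, J = date_s.shape
# Occurrence regressors: intercept, elevation, elevation^2, elevation^3
X = np.column_stack([np.ones(n), elev_s, elev_s**2, elev_s**3])
# Detection regressors: intercept, elevation, elevation^2, date, date^2
E = np.repeat(elev_s[:, None], J, axis=1)
W = np.stack([np.ones((n, J)), E, E**2, date_s, date_s**2], axis=2)
```

Standardizing before forming the polynomial terms keeps the columns of X and W on comparable scales, which helps when inverting XTX in the sampler.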
In the Bayesian analysis of the blue hawker data, we used maximum likelihood estimates of the site-occupancy model's parameters to initialize the Markov chain. We used M = 10 000 successive draws of the Gibbs sampler to estimate posterior means and quantiles of the model parameters, to predict species occurrence probability as a function of elevation and to predict species detection probability as a function of elevation and survey date. We also used draws from the Gibbs sampler to predict species occurrence status (presence or absence) at sample sites where blue hawkers were not detected for the purposes of estimating PAO in 1999 and in 2000. The Monte Carlo standard errors of posterior means and quantiles were computed using the subsampling bootstrap method (Flegal & Jones 2010, 2011) with overlapping batch means of size .
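A Monte Carlo standard error for a posterior mean can be estimated with overlapping batch means, as in the sketch below. This covers only the posterior mean, not the subsampling bootstrap used for quantiles. The function name is hypothetical, and the default batch size of ⌊√M⌋ is an assumption made for this sketch rather than a value taken from the analysis.

```python
import numpy as np

def mcse_obm(draws, batch_size=None):
    """Monte Carlo standard error of the mean of `draws`, estimated with
    overlapping batch means."""
    x = np.asarray(draws, dtype=float)
    M = x.size
    b = int(np.floor(np.sqrt(M))) if batch_size is None else batch_size  # assumed default
    csum = np.concatenate([[0.0], np.cumsum(x)])
    batch_means = (csum[b:] - csum[:-b]) / b      # M - b + 1 overlapping batch means
    var_hat = M * b / ((M - b) * (M - b + 1)) * np.sum((batch_means - x.mean()) ** 2)
    return np.sqrt(var_hat / M)

# Hypothetical independent draws; for these the standard error is close to 1 / sqrt(M).
rng = np.random.default_rng(11)
draws = rng.normal(size=10_000)
se = mcse_obm(draws)
```

For correlated output from a Markov chain, the batch means absorb the autocorrelation, so the estimate is larger than the naive standard deviation divided by √M.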
The blue hawker was detected at 22·0% (334) of the 1516 sites surveyed in 1999 and at 21·9% (305) of the 1395 sites surveyed in 2000. These naive estimates of site occupancy appear to be substantially biased given the estimates of PAO adjusted for detection errors. For example, Bayesian posterior means for the proportion of occupied sites were 0·646 in 1999 and 0·624 in 2000, and in each year the 95% credible interval for PAO failed to include the naive estimate (Table 1). Imperfect detection of blue hawkers is, of course, the reason for the higher estimates of site occupancy. Estimated detection probabilities of blue hawkers varied with elevation and survey date (Fig. 1) and ranged from 0 to 0·86.
Table 1. Parameter estimates of site-occupancy model fitted to the blue hawker (Aeshna cyanea) data
Maximum likelihood estimates
Lower and upper limits of 95% confidence intervals (based on ±1·96 asymptotic standard errors) and 95% credible intervals are given in the columns labelled 2·5% and 97·5%, respectively. Monte Carlo standard errors are given in parentheses.
Estimates of the site-occupancy model's parameters obtained by classical and Bayesian methods are quite similar (Table 1), which is not surprising given the relatively large sample size and the non-informative priors assumed for the parameters. Occurrence probabilities of blue hawker appear to differ significantly over the range of elevations observed in the sample (Fig. 2). Posterior-predicted mean occurrence probabilities are highest at lower elevations and decline to near zero with increases in elevation.
In this paper, we develop a class of Bayesian site-occupancy models in which probabilities of species occurrence and detection are specified as probit-regression functions of site- and survey-specific covariate measurements. By using probit-regression functions, we were able to add latent parameters to the model for the purposes of developing a Gibbs sampler for Bayesian analysis of site-occupancy data. This Gibbs sampler allows summaries of the posterior and model-based predictions to be estimated more efficiently than software based on MCMC algorithms with lower acceptance rates. In addition, the Gibbs sampler can be implemented in any computing language. We developed an implementation (see Appendices S1 and S2) using the R software program (R Development Core Team 2012), which is freely available and widely used. Our implementation includes code to calculate MLEs of the model's parameters, which are used in classical (non-Bayesian) analyses. The code also accommodates missing values in the matrix of detection/non-detection data that occur commonly in site-occupancy surveys owing to unequal effort among sample sites. The code therefore can be used to complement non-Bayesian analyses obtained with other site-occupancy software [programs presence or unmarked (Fiske & Chandler 2011)].
It might be possible to apply our approach to Bayesian site-occupancy models that use other functions to link the probabilities of species occurrence and detection to linear combinations of regression parameters. For example, defining the latent variables vi and uij to have logistic distributions would lead to logit link functions; defining vi and uij to have extreme value distributions would lead to complementary log-log link functions; and so on. The choice of link function is largely subjective, and similar results should be obtained with any link function when probabilities of occurrence and detection are not close to zero or one (where the tails of normal, logistic and extreme value distributions differ).
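The correspondence between the distribution of the latent variable and the implied link function can be checked by simulation. The sketch below does this for logistic and extreme value (Gumbel) latent errors. It verifies only the link functions; it does not show that the resulting full conditional distributions are as easy to sample as in the probit case. The value of the linear predictor is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
eta = 0.25          # hypothetical linear predictor
M = 200_000

# Logistic latent errors: Pr(v > 0) should match the inverse-logit of eta.
v_logistic = eta + rng.logistic(size=M)
logit_mc = np.mean(v_logistic > 0)
logit_exact = 1.0 / (1.0 + np.exp(-eta))

# Extreme value (Gumbel) latent errors: Pr(v > 0) should match the inverse
# complementary log-log of eta, 1 - exp(-exp(eta)).
v_gumbel = eta + rng.gumbel(size=M)
cloglog_mc = np.mean(v_gumbel > 0)
cloglog_exact = 1.0 - np.exp(-np.exp(eta))
```

In both cases the Monte Carlo proportion should agree with the closed-form probability up to simulation error.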
One of the advantages of conducting a Bayesian analysis of site-occupancy data is the ability to account for uncertainty in predictions and in estimates of derived parameters, such as PAO. The results of our Bayesian analysis of the blue hawker data are qualitatively similar to the results of the classical analysis reported by Kéry et al. (2010). For example, the PAO (for both years combined) estimated by Kéry et al. (2010) was 0·629 (1839/2925), which is approximately equal to the midpoint of our year-specific estimates (Table 1). However, a comparison of credible intervals from the Bayesian analysis allows us to conclude that the PAO of sites sampled in 1999 is not significantly different from the PAO of sites sampled in 2000. Classical and Bayesian predictions of blue hawker occurrence probability as a function of elevation are also quite similar (cf. fig. 1 of Kéry et al. (2010) and Fig. 2); however, the former lacks estimates of uncertainty whereas the Bayesian predictions include a confidence envelope based on 95% credible intervals. Kéry et al. (2010) also present a map of the potential distribution of blue hawker throughout Switzerland in 1999–2000 by using site-occupancy-based predictions of blue hawker occurrence at unsampled locations. Ideally, the uncertainty of these predictions also should be mapped. A Bayesian analysis could easily provide these estimates of uncertainty.
The blue hawker dataset was kindly provided by Marc Kéry and authorized for use by the Swiss Biodiversity Monitoring program of the Swiss Federal Office for the Environment (FOEN). The Swiss dragonfly Red List project, for which data were collected in 1999 and 2000, was funded by the FOEN. Data were extracted from the database of the Centre suisse de cartographie de la faune (CSCF) by the project coordinator, Christian Monnerat. The review comments of Bill Link and two anonymous referees improved the manuscript. Any use of trade, product or firm names is for descriptive purposes only and does not imply endorsement by the US Government.