Efficient Bayesian analysis of occupancy models with logit link functions

Abstract Occupancy models (Ecology, 2002; 83: 2248) were developed to infer the probability that a species under investigation occupies a site. Bayesian analysis of these models can be undertaken using statistical packages such as WinBUGS, OpenBUGS, JAGS, and more recently Stan, however, since these packages were not developed specifically to fit occupancy models, one often experiences long run times when undertaking an analysis. Bayesian spatial single‐season occupancy models can also be fit using the R package stocc. The approach assumes that the detection and occupancy regression effects are modeled using probit link functions. The use of the logistic link function, however, is algebraically more tractable and allows one to easily interpret the coefficient effects of an estimated model by using odds ratios, which is not easily done for a probit link function for models that do not include spatial random effects. We develop a Gibbs sampler to obtain posterior samples from the posterior distribution of the parameters of various occupancy models (nonspatial and spatial) when logit link functions are used to model the regression effects of the detection and occupancy processes. We apply our methods to data extracted from the 2nd Southern African Bird Atlas Project to produce a species distribution map of the Cape weaver (Ploceus capensis) and helmeted guineafowl (Numida meleagris) for South Africa. We found that the Gibbs sampling algorithm developed produces posterior samples that are identical to those obtained when using JAGS and Stan and that in certain cases the posterior chains mix much faster than those obtained when using JAGS, stocc, and Stan. Our algorithms are implemented in the R package, Rcppocc. The software is freely available and stored on GitHub (https://github.com/AllanClark/Rcppocc).


| INTRODUC TI ON
Occupancy models are an important statistical technique that was developed to make use of detection/nondetection data to infer the probability that a species under investigation occupies a site. When an occupancy study is undertaken, n s sites are visited a number of times to estimate the occupancy probability ( ) and conditional detection probability (p) of a species associated with each site in a region. The method can be viewed as an extension of logistic regression and allows one to estimate the occupancy probability at A number of methods have been used to fit occupancy models to data. These include maximum likelihood (MacKenzie et al., 2002); penalized maximum likelihood (Hutchinson, Valente, Emerson, Betts, & Dietterich, 2015;Moreno & Lele, 2010), Bayesian methods that employ WinBUGS, OpenBUGS, JAGS, or Stan as well as approximate methods such as those developed by Clark, Altwegg, and Ormerod (2016). Recently Dorazio and Rodriguez (2012) and Johnson et al. (2013) developed Gibbs algorithms to obtain posterior samples for the parameters of a nonspatial and spatial single-season occupancy (SSO) model, respectively. Both approaches assume that detection and occupancy processes are modeled using probit link functions, which enables the use of data augmentation (Tanner & Wong, 1987) to obtain closed form expressions of the conditional posterior distributions of the parameters of the occupancy model.
Given that the probit and logistic functions are very similar and only differ in respect of the tails of the functions, analysis undertaken using either of the functions should produce similar occupancy and conditional detection probabilities (Dorazio & Rodriguez, 2012). However, the use of the logistic link function is algebraically more tractable and allows one to easily interpret the coefficient effects of an estimated model by using odds ratios, which is not easily done for a probit link function. This observation is particularly true for the nonspatial SSO model since no spatial random effects are included in this model; however, when spatial random effects are included in the model, the interpretation of the regression effects can be difficult (Boehm, Reich, & Bandyopadhyay, 2013).
The paper commences with a brief discussion of the link between logistic regression and occupancy models. Thereafter, we discuss the formulation of various popular Bayesian spatial occupancy models and develop a Gibbs sampling algorithm for a particular spatial occupancy model when the regression effects of the occupancy and detection processes are modeled using logit link functions. Before concluding, we analyze two detection/nondetection data sets of South African bird species to illustrate the methods developed in the paper. An R package (Rcppocc) has been developed to fit SSO models using Gibbs sampling which can be obtained at: https://github.com/ AllanClark/Rcppocc.

| Logistic regression and occupancy models
Assume that n s sites are surveyed a number of times and detection/nondetection data are collected at all sites. Denote the observed data as a ragged matrix y = [y ij ] where y ij = 1 if the species under investigation has been observed at site i during survey j and y ij = 0 otherwise. Let the vector z represents the true species occupancy at the sites considered such that z i = 1 if the species occupies site i and z i = 0 if it does not occupy site i . The SSO model can be represented using the following hierarchical model, for all surveys j = 1, … , V i (Royle & Dorazio, 2008). The variable i denotes the probability occupancy probability at site i , while p ij = Pr(y ij = 1|z i = 1) denotes the conditional probability of detecting the species during the jth survey of site i given that the species is present at site i . In what follows we assume that the conditional detection and occupancy regression effects ( and ) are modeled using logit link functions such that logit( i ) = x T i and logit(p ij ) = w T ij , where x T i and w T ij are row vectors in design matrices, X (occupancy) and W (detection), respectively (as defined in Clark et al., 2016).
The joint posterior distribution of the parameters of the model is where ( ) and ( ) are the prior distributions of and , respectively. A directed acyclic graph of the above problem is displayed in F I G U R E 1 A directed acyclic graph illustrating the dependencies between the parameters and observed data for an SSO model. Shaded nodes represents observed data while all latent parameters are represented using unshaded nodes. Deterministic relationships are represented using double arrows while all stochastic relationships are represented using a single arrow . . , ns the species has not been observed. The first two conditional distributions have the following form, Notice that Equations 1 and 2 are of the same form as the posterior distributions of the regression effects of a logistic regression model and therefore we adapt a Gibbs sampling scheme for logistic regression models to address the problem of obtaining posterior samples for the parameters of an occupancy model.
In a logistic regression context, Polson, Scott, and Windle (2013) show that posterior samples of the regression effects can be obtained by sampling from the conditional distributions of Pólya-Gamma random variables and multivariate Gaussian distributions in turn. Their method is similar to that of Albert and Chib (1993) who developed a Gibbs algorithm to undertake probit regression, the only difference being that the sampling from truncated Gaussian distributions is replaced by sampling from Pólya-Gamma distributions. The sampling methods developed by Polson et al. (2013) are exact since their Pólya-Gamma sampling method is uniformly ergodic and converges to the correct posterior distribution (Choi, & Hobert, 2013). In the Supporting information (Appendix S1), we discuss the existing Gibbs algorithms used for undertaking logistic regression and demonstrate the use of the Pólya-Gamma (PG) method by developing two Gibbs sampling algorithms for the parameters of SSO models. In Table 1, we summarize the Gibbs algorithms for an SSO model when using the PG method but provide the details regarding the algorithm in the Supporting Information (Appendix S2 and Appendix S3). We use the notation "a ∼ PG(b,c)" to indicate that the random variable a is a Pólya-Gamma random variable with parameters b and c. Take note that the algorithm is identical to that developed by Dorazio and Rodriguez (2012) except that the sampling from truncated Gaussian distributions is replaced by sampling from Pólya-Gamma distributions.
The paper by Gelfand et al. (2005) lead to the development of the R package hSDM (Vieilledent et al., 2014) in which the hSDM.siteocc.iCAR function can be used to fit a particular spatial occupancy model to detection/nondetection data. A region under investigation is subdivided into n s grid cells each which are surveyed a number of occasions. The model is formulated using Bernoulli latent random variables z = (z 1 , … ,z n s ) T . Formally, . In a Bayesian context, such a probit model is formulated by defining a latent Gaussian (1) TA B L E 1 The Gibbs algorithm for undertaking a SSO model using the "PG" method (See the Supporting Information (Appendix S3) for the details pertaining to the parameter matrices of the conditional posterior distributions.) random variable, z i with mean 0 and variance equal to 1 such that Pr(z i = 1) = Pr(z s > 0). In the first of their models, they allow z s to be spatially correlated such that Often the spatial random effects and fixed effects of a model are collinear when spatially varying covariates are included as fixed effects (Hanks, Schliep, Hooten, & Hoeting, 2015;Hodges & Reich, 2010;Hughes & Haran, 2013;Reich, Hodges, & Zadnik, 2006). The suggested solution to this problem was to include spatial random effects in the model specification that are orthogonal to the fixed effects and is known as restricted spatial regression (RSR). The second spatial model developed by Johnson et al. (2013) uses this method and redefines z i as The spatial random effects are modeled as Q is a n × n ICAR precision matrix (Besag & Kooperberg, 1995) obtained using surveyed and unsurveyed locations, is a spatial precision parameter and i 1 and i 2 are known constants. Kelsall and Wakefield (1999) have suggested setting these parameters to 0.5 and 0.005, respectively, such that the prior mean of is 1,000. The

| Applications
To demonstrate our methods, we used detection/nondetection data extracted from the 2nd Southern African Bird Atlas Project (SABAP2) database to produce a species distribution map of the Cape weaver (Ploceus capensis) and helmeted guineafowl (Numida meleagris) for South Africa. SABAP2 divides Southern Africa into a continuous grid of 5′ × 5′ and relies on citizen scientists to collect checklists of bird species for each grid cell. Birders are requested to spend at least 2 hr on each checklist in which they undertake intense birding and record all species they observe and the order in which they are observed. For this analysis, we aggregated the data to quarter-degree grid cells. We used data that span South Africa and contained a minimum of three and a maximum of fifty surveys during 2016 (January-December) in the analysis. Covariate information at unsurveyed locations was included in the analyses to obtain occupancy estimates that span South Africa. All covariates were centered and standardized.
F I G U R E 2 A directed acyclic graph illustrating the dependencies between the parameters and observed data for a spatial SSO model. Shaded nodes represents observed data while all latent parameters are represented using unshaded nodes. Deterministic relationships are represented using double arrows while all stochastic relationships are represented using a single arrow The climate variables ( Figure S5 in the Supporting Information Appendix S6) all form part of a data set used by Huntley et al. (2006) to model bird distributions in Southern Africa. The variables included two measures of annual temperature that related to thermal sums above 0 and 5 degree centigrade; two measures related to the mean temperature of the coldest and warmest month, respectively; the ratio of potential to realized evapotranspiration as well as two measures that relates to the intensity of the dry and wet season, respectively. The climate variables are highly correlated with two of the variables having variance inflation factors in excess of 3,000 (Tables S2 and S3 in the Supporting Information Appendix S5). Because of this fact, it was decided to extract two principal components from the design matrix that consisted of the centered and standardized climate variables. These principal components explain 90% of the variation in the design matrix (Table S4 in the Supporting Information Appendix S5) and can tentatively be interpreted as a temperature related factor and a climate intensity factor, respectively.
We follow Hughes and Haran (2013) and retain 10% of the eigenvalues ( i i = 1, … ,n) of . In a similar context, Johnson et al. (2013) suggest selecting a RSR model with i ≥ 0.5 which suggests including at most 237 eigenvectors into the spatial portion of the model. any analysis to test the sensitivity of our results to the prior specification of . All MCMC sampling was undertaken using the R packages, stocc, jagsUI (Kellner, 2014) in combination with JAGS 4.2.0 (Plummer, 2003), rstan in combination with Stan 2.17.3 as well as the authors' code. 1 All calculations were performed on a Windows 10 Pro desktop computer which had an Intel(R) Core(TM) i7-6900 processor with 64 GB of RAM. One chain of 70,000 iterations was run. The first 20,000 samples were discarded as burn-in samples, while the remaining samples were retained. Experimentation and an examination of the Geweke convergence diagnostic statistics (Geweke, 1992) and trace plots obtained by running three parallel chains using Rcppocc displayed that the MCMC chains converged using these numbers of iterations. The posterior samples were not thinned (Link & Eaton, 2012).

| RE SULTS
From the analysis of both data sets, we observe that the Gibbs algo-     Africa except for the North West regions of South Africa. Figure 3b,d displays the difference between the estimated occupancy probabilities TA B L E 4 Posterior summaries of the parameters of the Bayesian nonspatial and spatial occupancy models (posterior mean, Monte Carlo standard error, standard deviation, 2.5% and 97.5% quantiles) obtained when using Rcppocc and stocc, respectively ("Rcppocc-stocc").
The figures illustrate that we obtain similar estimates of the mean occupancy probabilities when using either estimation method with small discrepancies at the majority of the grid cells across South Africa.

| D ISCUSS I ON AND CON CLUS I ON S
Through several studies, Bayesian methods have been developed to undertake occupancy models. They, however, either use probit link functions to model the detection and occupancy processes of F I G U R E 3 Estimated occupancy probability for the Cape weaver and helmeted guineafowl estimated using Rcppocc (a and c). The difference between the estimated occupancy probabilities obtained when using Rcppocc and stocc for the Cape weaver and helmeted guineafowl, respectively (b and d). The grid cells where the species have been detected at least once are displayed in (b) and (d) Similar to Broms, Johnson, Altwegg, and Conquest (2014) and Johnson et al. (2013), we show that the ICAR model produced posterior samples with significantly larger autocorrelations than the RSR model when using stocc. As an example, the autocorrelations of the occupancy regression effects as well as the spatial precision param-

CO N FLI C T O F I NTE R E S T
The authors have no conflict of interests to declare.

AUTH O R CO NTR I B UTI O N
Below, Allan Ernest Clark is denoted as "AEC", while Res Altwegg is denoted as "RA". AEC and RA conceived and designed the paper.
AEC analyzed the data. AEC wrote and, AEC and RA reviewed the paper. AEC designed and coded the software used in the analysis.
AEC wrote computer code used to perform all analysis.
Notes 1 An R package has been developed to fit these models using MCMC.
All code can be obtained from https://github.com/AllanClark/Rcppocc. Appendix S7 in the Supporting Information includes a worked example explaining how to run RSR models using stocc, Stan, and Rcppocc.
2 ESR = the effective sample size per unit run time. The effective sample size for the ith parameter in the model is defined as where M is the number of retained samples, and i (j) is the j th lagged autocorrelation of parameter i (Holmes & Held, 2006). We use the coda package (Plummer, Best, Cowles, & Vines, 2006)