Bayesian model selection for spatial capture–recapture models

Abstract A vast amount of ecological knowledge generated over the past two decades has hinged upon the ability of model selection methods to discriminate among various ecological hypotheses. The last decade has seen the rise of Bayesian hierarchical models in ecology. Consequently, commonly used tools, such as the AIC, become largely inapplicable and there appears to be no consensus about a particular model selection tool that can be universally applied. We focus on a specific class of competing Bayesian spatial capture–recapture (SCR) models and apply and evaluate some of the recommended Bayesian model selection tools: (1) Bayes Factor—using (a) Gelfand‐Dey and (b) harmonic mean methods, (2) Deviance Information Criterion (DIC), (3) Watanabe‐Akaike's Information Criterion (WAIC) and (4) posterior predictive loss criterion. In all, we evaluate 25 variants of model selection tools in our study. We evaluate these model selection tools from the standpoint of selecting the “true” model and parameter estimation. In all, we generate 120 simulated data sets using the true model and assess the frequency with which the true model is selected and how well the tool estimates N (population size), a parameter of much importance to ecologists. We find that when information content is low in the data, no particular model selection tool can be recommended to help realize, simultaneously, both the goals of model selection and parameter estimation. But, in general (when we consider both the objectives together), we recommend the use of our application of the Bayes Factor (Gelfand‐Dey with MAP approximation) for Bayesian SCR models. Our study highlights the point that although new model selection tools are emerging (e.g., WAIC) in the applied statistics literature, those tools based on sound theory even under approximation may still perform much better.

µ = (µ p , µ s ), where µ p is the collection of scalar parameters and µ s is the collection of 11 all latent variables. We first fix these high-dimensional variables at their MAP estimate 12μ s assuming that the posterior distribution is well summarised by this estimate. The

13
Gelfand-Dey estimator of marginal likelihood of data m(Y) is then computed as (also 14 given as (2.7) in the main text): (A.1) where {(µ p (d) , µ s (d) ) : d = 1, . . . , N iter } is a set of MCMC draws from the poste- 16 rior π(µ p , µ s | Y), f (Y | µ p , µ s ) denotes the model likelihood and π(µ p ) denotes the 17 prior density of the parameters µ p and g(µ p ) denotes a tuning density. We begin with (µ p (d 0 ) , µ s (d 0 ) ) as an initial estimate of (µ p , µ s ) where This estimate of posterior mode of (µ p , µ s ) may not be optimal since in our high di-20 mensional parameter setting, an MCMC sample of a practical size may not be enough 21 to extensively explore the posterior surface. Below we describe a resampling algorithm 22 based on MCMC draws to obtain a better MAP estimate (μ p ,μ s ). 23 Iteration 1. Obtain d 1 such that ). 26 • Otherwise go to the next iteration.

30
• Otherwise go to the next iteration.
). 45 • Otherwise go to the next iteration.

46
(ii) If m is odd, obtain d m such that

50
• Otherwise go to the next iteration.

51
This procedure is continued iteratively to eventually give us the best MAP estimate 52 of the posterior mode (μ p ,μ s ). Observe that, (μ p ,μ s ) is closer to the posterior mode Thus the asymptotic (large posterior sample) convergence of (µ p (d 0 ) , µ s (d 0 ) ) to the pos-55 terior mode ensures that our improved MAP estimate also converges to the same point. The joint density of ijk )) and u = (u 1 , . . . , u M ) under M 1 63 is the following (also given as model likelihood (2.1) in the main text): least once in one of the detectors (i.e. y (1) i·· +y (2) i·· > 0). Whereas among the u 0 individuals, 70 some may be detected and some may not.

71
Marginalisation over z. Now as mentioned above, each z i , is assumed to have a z, where π(S) denotes the density of uniform distribution over the bounded state space V.
Observe that, when y i·· = 0, The joint density of Y * := (Y (1) , Y (2 * ) ) = ((y is the following (also given as model likelihood (2.2) in the main text): The corresponding integrated likelihood is obtained by integrating (B.6) with respect to 90 u 0 , z, S: After integrating the above likelihood (B.8) over z and S with respect to their priors, we Bernoulli distribution with parameter ψ, π(S) denotes the density of uniform distribution 101 over the bounded state space V.

102
The joint density of ijk )) under M 4 is the following (also 104 given as model likelihood (2.4) in the main text): The integrated likelihood in the absence of individual covariate on sex category u is 106 obtained by integrating (B.10) with respect to z, S, and is given by i.e., the parameters ψ, θ, φ, ω 0 , p 0 , σ, σ m , σ f are as follows: ψ * = logit(ψ), θ * = logit(θ), We denote the set of transformed parameters as µ * p = h(µ p ) and the transformed prior by π * (µ * p ) with domain as the Euclidean space.

123
The prior distribution on the parameters ψ, θ, φ, ω 0 , p 0 , σ, σ m and σ f are given in Table 1. The transformed prior on where | J | denotes the Jacobian of transformation and h −1 (·) denotes the inverse trans-126 formation of the parameter.
We have carried out the computation of Gelfand-Dey estimator using different choices of g(µ * p ) to assess the sensitivity of the resulting estimates (see Section 2.2.1). First, density of the multivariate normal distribution N (μ * p ,Σ) with meanμ * p and covariance matrixΣ is chosen.μ * p andΣ are computed by using the MCMC draws {µ p * (d) : d = 1, . . . , N iter }.

The mean parameterμ
pi . The entries of the covariance matrixΣ can be obtained by using the MCMC sample aŝ Additionally, we have used multivariate-t densities for g(µ * p ) with mean parameter µ * p and scale matrixΣ using different degrees of freedom: 10, 100, 500, 1000, 10000. A truncated normal density has also been tried for g(µ * p ) following the suggestion of Geweke (1999). The parameters of the truncated normal densityμ * p ,Σ are obtained as above.
Then for some α ∈ (0, 1), we define the setΘ as follows: and we take where χ 2 r (α) denotes 100 α% quantile of χ 2 r distribution with r denoting the dimension of 128 µ * p . We have carried out the computations for three values of α: 0.9, 0.95 and 0.99.

129
Note that, in the integrated likelihood (IL) approximation approach for computing 130 Bayes factors, we assume g(µ * p , L) = g 1 (µ * p ) g 2 (L). We take g 1 (µ * p ) from the above choices 131 in MAP approximation approach and take the prior density π(L) as the tuning density 132 g 2 (L).  Table 1): where y ij· = y where µ rest denotes all the remaining parameters other than σ * m . Here a normal 165 proposal is assumed for σ * m and is updated using a Metropolis algorithm. The where µ rest denotes all the remaining parameters other than σ * . Here a normal 174 proposal is assumed for σ * and is updated using a Metropolis algorithm. The 175 parameters of this proposal distribution at t th iteration are σ * (t−1) and τ 2 , 176 where σ * (t−1) is the (t − 1) th element in the Markov chain of σ * . 177 5. Update each u i ∈ u 0 (only for M 1 and M 2 ). When z i = 1, the full conditional distribution of u i (∈ u 0 ) is the Bernoulli distribution with parameter θ 0i . Under model M 1 , θ 0i takes the following form Note that the full conditional of u i is independent of any other u a 's (a = i), .
Note that the full conditional distribution of z i is independent of any other z j 's where µ rest denotes all the remaining parameters other than s i . Each s i can be is 1 or 0. We randomly draw one ID1 from this candidate set, say l * , according 218 to these weights.

219
(d) This is the swapping step. We find the ID2 who is linked with ID1 l * and 220 denote it by i * . Now we link ID2 i * with ID1 l and link ID2 i with ID1 l * .

221
Consequently, we have L i * = l and L i = l * . We also find the set of (real) where µ rest denotes all the remaining parameters other than L and π(L) de-231 notes the prior density of L.
where µ rest denotes all the remaining parameters other than ω * 0 . The parameter ω * can be expressed as the following where µ rest denotes all the remaining parameters other than p * 0 . The full conditional 248 of p * 0 is of a non-standard form and can be updated using a random walk Metropolis