One size does not fit all: Customizing MCMC methods for hierarchical models using NIMBLE

Abstract Improved efficiency of Markov chain Monte Carlo facilitates all aspects of statistical analysis with Bayesian hierarchical models. Identifying strategies to improve MCMC performance is becoming increasingly crucial as the complexity of models, and the run times to fit them, increases. We evaluate different strategies for improving MCMC efficiency using the open‐source software NIMBLE (R package nimble) using common ecological models of species occurrence and abundance as examples. We ask how MCMC efficiency depends on model formulation, model size, data, and sampling strategy. For multiseason and/or multispecies occupancy models and for N‐mixture models, we compare the efficiency of sampling discrete latent states vs. integrating over them, including more vs. fewer hierarchical model components, and univariate vs. block‐sampling methods. We include the common MCMC tool JAGS in comparisons. For simple models, there is little practical difference between computational approaches. As model complexity increases, there are strong interactions between model formulation and sampling strategy on MCMC efficiency. There is no one‐size‐fits‐all best strategy, but rather problem‐specific best strategies related to model structure and type. In all but the simplest cases, NIMBLE's default or customized performance achieves much higher efficiency than JAGS. In the two most complex examples, NIMBLE was 10–12 times more efficient than JAGS. We find NIMBLE is a valuable tool for many ecologists utilizing Bayesian inference, particularly for complex models where JAGS is prohibitively slow. Our results highlight the need for more guidelines and customizable approaches to fit hierarchical models to ensure practitioners can make the most of occupancy and other hierarchical models. By implementing model‐generic MCMC procedures in open‐source software, including the NIMBLE extensions for integrating over latent states (implemented in the R package nimbleEcology), we have made progress toward this aim.


| INTRODUC TI ON
Application of hierarchical statistical models for analyzing complex ecological data has grown rapidly over roughly the last twenty years Kéry & Royle, 2016;Royle & Dorazio, 2008).
Estimation and inference for hierarchical models, however, are not simple. A widely used method is Markov chain Monte Carlo (MCMC) in a Bayesian framework (Brooks, Gelman, Jones, & Meng, 2011;Ellison, 2004). Alternatives to MCMC include Laplace approximation (e.g., TMB Kristensen, Nielsen, Berg, Skaug, & Bell, 2016) and integrated nested Laplace approximation (e.g., INLA, Rue, Martino, & Chopin, 2009;Rue et al., 2017), but here, we focus on MCMC as a widely used, customizable approach. MCMC algorithms sample from the posterior distribution of parameters and latent (unknown) ecological states given the observed data and assumptions about the prior distribution of parameters. More simply, they explore the range of conditions that might explain the data. A major limitation of MCMC is that when a model has hundreds or thousands of latent states and parameters, which may be highly correlated in the posterior distribution, MCMC can require hours, days, or weeks to run. This limits research efficiency, but more importantly, it limits research quality by constraining the range of models that can be compared and the potential for using simulations to check estimation performance, cross-validation, or other layers of computational analysis .
MCMC is not a single algorithm but rather a large family of algorithms that can be combined flexibly. Statistical researchers have elaborated many MCMC sampling strategies for many kinds of models, and they have pursued theoretical results on MCMC mixing-how well the posterior distribution is explored-and how MCMC mixing scales with the size of a model or data (Gilks & Roberts, 1995;Yu & Meng, 2011). Though these theoretical results are typically limited to simple models and lack consideration of computational costs, these studies suggest that there is no universally best strategy (Gilks & Roberts, 1995;Turek, Valpine, Paciorek, & Anderson-Bergman, 2017;Yu & Meng, 2011). Instead, the success of customizing sampling strategies for particular models suggests that the best strategies may be problem-specific (Turek et al., 2017).
The recognition that different sampling strategies may work well for different models points to commonly used software tools as a hindrance to efficient MCMC. Tools such as WinBUGS and OpenBUGS (collectively "BUGS") and JAGS have revolutionized statistical practice in ecology and other fields by putting MCMC in the hands of nonspecialists, in part because the BUGS syntax is relatively easy to read and adapt (Lunn, Jackson, Best, Spiegelhalter, & Thomas, 2012;Plummer, 2015Plummer, , 2003Surhone, Tennoe, & Henssonow, 2010). Other software packages that do not use the BUGS language include Stan, which implements Hamiltonian Monte Carlo methods (HMC; Betancourt & Girolami, 2013;Monnahan, Thorson, & Branch, 2017), as well as numerous other packages that provide sampling strategies for general models or specialized strategies for narrower models, among which we note PyMC (Salvatier, Wiecki, & Fonnesbeck, 2016), MCMCpack (Martin, Quinn, & Park, 2011), spBayes (Finley, Banerjee, & Carlin, 2007), and MCMCglmm (Hadfield, 2010). However, these packages generally prescribe the MCMC methods to be used or offer a small range of choices for expert users. A comparatively new package, NIMBLE ("Numerical Inference for hierarchical Models using Bayesian and Likelihood Estimation," de Valpine et al., 2017), adopts nearly the same model language as BUGS and JAGS but makes it extensible and supports customization of sampling methods. Provided as R package nimble (NIMBLE Development Team, 2019), it provides a workflow in R with code generation of C++ for efficiency.
Beyond limiting the MCMC sampling strategies applied to a model, hierarchical modeling software often limits the way models can be written, which is important because different ways to write the same model can yield different MCMC performance. A simple example is centered and noncentered parameterizations (Papaspiliopoulos, Roberts, & Skld, 2007). A more complicated example occurs when one wants to analytically marginalize some latent states out of the model by direct summation or numerical integration while using MCMC to sample others. Summing over the latent states in a hidden Markov model for multistate or multi-event capture-recapture can yield orders-of-magnitude improvement in computational efficiency (Turek, Valpine, & Paciorek, 2016). Whereas BUGS and JAGS use a closed model language, NIMBLE supports extensibility of models, making such customizations possible.
In this study, we ask how different strategies for MCMC sampling, different kinds of model structures, and alternative ways to formulate equivalent models all impact MCMC efficiency for common models in ecology and evolution. We test whether efficiency is increased by (a) simplifying model structure, (b) block sampling (e.g., joint sampling of parameters), (c) different types of samplers, and (d) summing over latent states. Based on typical results from the statistical literature, we expect the MCMC efficiency of different strategies will be model-specific (Browne, Steele, Golalizadeh, & Green, 2009;Solonen et al., 2012), so we examine the interaction of these strategies with different models, focusing on occupancy and N-mixture models (MacKenzie et al., 2006;Royle & Kéry, 2007;Royle, 2004).
Just over a decade after occupancy models were introduced, they are being used to model species ranging from bees (M'Gonigle, Ponisio, Cutler, & Kremen, 2015) to tigers (Hines et al., 2010) with a great variety of model complexity (Bailey, MacKenzie, & Nichols, 2014;Denes, Silveira, & Beissinger, 2015). Estimating abundance and site occupancy is a critical challenge for most subdisciplines in ecology and evolution concerned with quantifying population dynamics including metapopulation, endangered species, and invasion biology. However, occupancy and N-mixture models can lead to high-dimensional MCMC algorithms that can mix slowly, requiring lengthy run times. Standard hierarchical modeling implementations of these models include latent states for the true occupancy state or number of individuals at each site in each closed season, as well as random effects at the level of species, sites, and/or observations. Together, these can yield hundreds or thousands of dimensions that require MCMC sampling.
To examine how to increase model estimation efficiency, we focus on software using the BUGS language including NIMBLE (NIMBLE Development Team, 2019) as well as JAGS (Plummer, 2015). Within NIMBLE, models can be extended with new functions and distributions, which provides enormous flexibility in how models are written. In addition, MCMC can be extended with new sampler configurations and entirely new samplers. Though it is out of the scope of this study to compare all available MCMC software, we focus on NIMBLE because it allows us to examine the efficiency of MCMC customizations in which we are interested, and JAGS to allow comparison to this widely used tool that uses nearly the same model language.

| MATERIAL S AND ME THODS
We focus on four models-three occupancy and one N-mixture, (Table 1, Appendix 1)-that are commonly employed in ecology and evolution. The efficiency of sampling strategies may depend on model structure, model size, and the data. To explore the effect of model structure, for each model we created a version with and without some component of hierarchical structure and the associated hyperparameters. To examine the effect of different ways to write the same model, for each case we created a model where we sampled latent states and an equivalent model where we integrate out the latent states to limit MCMC sampling to top-level parameters (Turek et al., 2016). To explore the effect of different sampling strategies, we created a variety of sampler configurations in NIMBLE, including some that use block sampling as well as NIMBLE's default samplers and a sampler configuration similar to that of JAGS. In one occupancy model, we simulated the data and were, therefore, able to include a scenario with high and low detectability of individuals to explore the effect of changing the parameter values and data on the efficiency of samplers. We next provide more details on each of these contrasts, after which we describe how we compare performance among MCMC methods.

| Model choices
We focus on three occupancy models including a single-species, mul-

| Model structure
For each model, we identified model terms for which an analyst might assume either there is or is not unexplained heterogeneity in parameters. Without heterogeneity, a single parameter is sufficient.
With heterogeneity, different parameters for different parts of the data are assumed to follow a shared distribution, typically with hyperparameters, yielding an additional hierarchical layer in the model.
Incorporating multiple sources of variation in this way is a common practice in Bayesian hierarchical modeling and indeed a primary motivation for it. However, it is also common to see pragmatic assumptions of where unexplained heterogeneity will not be modeled.  Kéry & Royle, 2016). The models including species-, year-, site-, or survey-specific parameters drawn from common distributions are "more hierarchical" in comparison with the models excluding those terms. Our a priori expectation is that sampling models with more hierarchical models will always be less efficient than their less hierarchical counterparts.

| Model size
For each model, we wrote custom distributions in NIMBLE to directly sum probabilities over discrete latent states, that is, to marginalize over them. However, the implications of this marginalization differed for each model. For the single-species, multiseason model, we use a hidden Markov model probability summation across the discrete latent state (occupied vs. unoccupied) across all times for a given site. Hidden Markov models are a general class of models for noisy data of system states that change stochastically, and they encompass many ecological models (Gimenez et al., 2007;Zucchini, MacDonald, & Langrock, 2017

| Data
With the single-species, multiseason example (Table 1, Appendix 1.1), we were able to modify the data because it is simulated. We simulated the data with high (p = .73) and low (p = .27) detectability.
We expect that because a lower detection probability will result in more nondetections, the latent states for more site-years will need to be sampled (Table 1), thereby decreasing efficiency.

| MCMC sampling strategies
We fit each model using a variety of MCMC sampling strategies.
Before summarizing these strategies, we briefly introduce the kinds of MCMC samplers involved, including three kinds of scalar samplers and two kinds of multivariate (block) samplers (Roberts & Sahu, 1997;Sargent, Hodges, & Carlin, 2000). Here, we use "parameter" to mean any estimated quantity, random effect, latent state, or posterior dimension being sampled.
Adaptive random-walk Metropolis-Hastings (ARWMH) samplers propose a new value for a parameter from a normal distribution centered on the current value, followed by accepting or rejecting that value according to the Metropolis-Hastings acceptance probability (Hastings, 1970;Metropolis, Rosenbluth, Rosenbluth, Teller, & Teller, 1953). The "adaptive" aspect updates the standard deviation of the proposal distribution to achieve an acceptance rate with good mixing (Haario, Saksman, & Tamminen, 1999;Roberts & Rosenthal, 2001).
While simple and sometimes slow mixing per iteration, ARWMH is computationally fast, allowing it to run many iterations. Slice samplers (Neal, 2003) explore a range of new values for a parameter based on the current value. They almost always result in a new value. They may mix better than ARWMH, but they have higher computational cost due to exploring potentially many values, each requiring associated model calculations. In practical implementations, slice samplers should only be used when the conditional distribution of the parameter being sampled is unimodal, which will commonly be the case. For discrete-valued parameters, one may achieve conjugate (Gibbs) sampling by trying every possible discrete value to determine the full conditional distribution by computation, which we call "computational Gibbs." This also incurs model computations for each candidate value, a reason that sampling categorical variables can be slow. Discrete unimodal parameters can also be sampled with slice samplers. We also mention regular Gibbs (or "conjugate") samplers, which draw a new value for a parameter from its conditional distribution when that distribution can be written analytically. That is only the case for certain fortunate combinations of prior and likelihood, which do not occur in the examples below.
Even generally efficient univariate samplers will mix slowly when the posterior has strong correlations among two or more parameters.
The two kinds of block samplers used here are multivariate adaptive random-walk Metropolis-Hastings samplers ("block_RW") and automated-factor slice samplers ("block_AFSS", Tibbits, Groendyke, Haran, & Liechty, 2014). The block_RW sampler is like the ARWMH sampler above but draws its proposal from a multivariate normal distribution. The adaptation of this sampler attempts to find a proposal covariance that yields good mixing. Using these samplers, we chose a set of sampling strategies for comparisons for each model. These included the default samplers for NIMBLE ("nimble") and JAGS ("jags"), the default JAGS strategy run in NIMBLE ("jags_like_nimble"), and blocking selected parameters using block_RW or block_AFSS while sampling remaining parameters using NIMBLE's default samplers.

| Block sampling in MCMC
To block parameters, we examined each model and formulate strategies based on possible correlations between the parameters (

| Prior distributions
For most parameters, we used uninformative priors of nearly flat normal distributions for the means of the distributions of the top-level parameters, and uniform priors over the interval [0,100] for standard deviations (see Appendix 1 for specific priors for each model

| RE SULTS
All chains converged sufficiently, and all posteriors from different methods for the same model scenario were in agreement (Appendix 2: Table A1-A8). Interestingly, the slowest mixing parameters were generally consistent across MCMC strategies (Appendix 3: Figures   A1-A6), suggesting different strategies did not have strong effects on the relative sampling efficiency of specific parameters. Across all occupancy and N-mixture models, efficiency was always much higher in the models without additional hierarchy (species-, year-, site-, survey-specific parameters). As expected, latent state integration and MCMC samplers did not have consistent effects on efficiency across models.

| Occupancy: Single-species, multiseason model
For the single-species, multiseason example, there were interactions between the model hierarchical structure, integrating over latent states, and sampling strategy ( Figure 1). With more hierarchical structure, integrating over latent states decreased efficiency compared to sampling latent states (1 min in comparison with 5.5 min to generate 1,000 effectively independent samples using default NIMBLE, Figure 1a

| Occupancy: Multispecies, single-season model
Model structure and size interacted with the sampling strategies to de-termine efficiency in the multispecies, single-season example ( Figure 2), but in ways that differed from the single-species, multiseason occupancy models. In this case, integrating over latent states improved efficiency regardless of model hierarchy (the difference between 2.4 hr and 24 min to generate 1,000 independent samples using default NIMBLE in the more hierarchical model, and 3.2 min and 11 s in the less hier-

| N-mixture model: Zero-inflated Poisson
In the N-mixture example, integrating over latent states is generally less efficient than sampling latent states, regardless of more vs. less hierarchy (the difference between just under an hour and over four hours to generate 1,000 independent samples using default NIMBLE in the more hierarchical model, Figure 4). This is likely because the summation   clear that the cost is much higher than any benefit of integrating over latent states. Our model efficiency comparisons for occupancy and N-mixture models suggest that integrating over latent states seems to be beneficial primarily when such an integration is simple and efficient.

| D ISCUSS I ON
The costs and benefits of block sampling will also be different for

CO N FLI C T O F I NTE R E S T
None Declared.

O PE N R E S E A RCH BA D G E S
This article has been awarded Open Materials, Open Data Badges.
All materials and data are publicly accessible via the Open Science figures correspond to variable names in the code, which is generally related to but not the same as the mathematical notation of each model's description.

Occupancy: Single-species, multiseason model
The single-species multiseason occupancy model (i.e., dynamic occupancy model, Royle & Kéry, 2007) uses simulations modified from Kery and Schaub (2012). We simulated 100 sites sampled 5 times per year for 15 years. Let z i,j denote the true occupancy of a species at the i th site in the j th year. z i,j is a Bernoulli random variable, z i,j ~ Bern(ψ i,j ). In the first year, z i,1 ~ Bern(ψ i,1 ). ψ i,1 is the occupancy probability at the i th site in the j th year.
The model with more hierarchical structure includes year-specific persistence and colonization probabilities, each from a prior distribution. Let ϕ j denote the logit probability the species persists at a site from years j to j + 1 (given that z i,j = 1) and γ j denote the logit probability that site i is colonized in year j + 1 (given that z i,j = 0). The priors for γ j and ϕ j , priors for their hyperparameters, and the calculation of ψ i,j are given by: Here, expit is the inverse of the logit function: expit (z) = 1/ (1 + e −z ). Note that, since z i,j is either 0 or 1, it is equivalent to write logit i,j+1 = j × z i,j + j × 1 − z i,j . Our notation for normal distributions uses mean and variance, that is, "N (mean, variance)".
We then let y i,j,k indicate whether the species was (y i,j,k = 1) or was not (y i,j,k = 0) detected in the k th visit to site i in year j. The logit probability of detection when a site is occupied in year j is p j . Similar to γ j and ϕ j , p j is year-specific and follows a normal distribution. These relationships and relevant priors are given as follows: The model with less hierarchical structure lacks the year-specific parameters; instead, there are single, time-independent γ, ϕ, and p parameters, so their subscripts j can be removed in the above equations. Each follows a normal prior with mean 0 and variance 10 6 . In this case, there is no need for µ γ , σ γ , µ ϕ , σ ϕ , µ p , or σ p .

Occupancy: Multispecies, single-season model
The multispecies, single-season occupancy example models bird community data in relation to variables about wildlife (A1) ∼ N 0,10 6 ∼ N 0,10 6 ∼ uniform 0,100 There are 70 sites, each sampled 3-4 times for detection/nondetection of 58 species. For species i and site j, the probability of occupancy is i,j and the true occupancy status is z i,j , that is, Occupancy probability was allowed to depend on study area, tree basal area (BA), and understory foliage cover (UFC).
Detection probability was allowed to depend on study area and date. Relationships with BA, UFC, and date included both linear and quadratic terms.
In the model with more hierarchical structure, the coefficients of BA, UFC, and study area are species-specific and drawn from common distributions with similar priors as above: Similarly, the species-specific detection probabilities are modeled with species-specific coefficients from distributions with priors, all as follows: The case with less hierarchical structure has no species-specific parameters, so all of the species share the same coefficients for the effect of habitat and management. This gives: As for the first example, the priors for the coefficients in this model are normal with mean 0 and standard deviation 1,000. The purpose of this model is not to be a scientific alternative to the more hierarchical model but rather to provide a useful case for comparison of MCMC methods.

Occupancy: Multispecies, multiseason model
The multispecies, multiseason occupancy example models data on wild bees in on-farm habitat restoration patches .
There are 31 sites, each sampled 2-7 times per year for 10 years for detection/nondetection of 49 bee species. For species i, site j, and year t, the latent occupancy state is z i,j,t . For the same indices and visit r, the detection probability is p i,j,k,r , and the detection/nondetection datum is y i,j,t,r . Thus, y i,j,t,r ∼ Bern p i,j,t,r * z i,j,t .
Occupancy probability is denoted i,j,t and defined by persistence and colonization probabilities. If a site is occupied in year t, the logit probability that it persists (continues to be occupied) in year t + 1 is i,j,t . If a site is unoccupied in year t, the logit probability that it is colonized (becomes occupied) in year t + 1 is i,j,t . The definition of i,j,t is then given by: Note that only one term of the right-hand side will be nonzero.
Probability of occupancy in year 1 is denoted i,j,1 and defined as the equilibrium occupancy calculated from the mean logit persistence and colonization probabilities. Specifically, Site-level intercept and slope parameters for the effects of habitat quality (floral resource diversity, frd j,t ) and the weighted proximity of other habitat patches (hedgerows and remnant natural habitat, HRwtprox j and RemnantWtProx j , respectively) were modeled as independent species-level random effects. Species-level covariates included species' body size (BodySize i ) and diet breadth (k i ). Interactions between site-level and species-level covariates were also included. The persistence components of the model including priors are as follows: The colonization components of the model including priors are as follows: Here i and A i denote species-specific intercepts of logit persistence and colonization probabilities, respectively. The [s] and B[s] parameters preceeding each of the explanatory variables represent the effect of those variables on persistence and colonization.
The detection probability of each species was allowed to vary over the season according to species-specific phenologies defined by a quadratic function of day of year (date j,t,r ) with species-specific coefficients drawn from across-species prior distributions (M' Gonigle et al., 2015). Specifically, p i,j,t,r was modeled as follows: where p[0] i , p[1] i , and p[2] i denote species-specific intercept, linear coefficient, and quadratic coefficient, respectively, for effect of day of year on detection probability of species i.
The case with less hierarchical structure does not have speciesspecific coefficients. Instead, all species share the same coefficients for the effect of local and landscape habitat variables. This is given by: As for the first example, the priors for the coefficients in this model are normal with mean 0 and standard deviation 10 3 .

Zero-inflated N-mixture model
The zero-inflated N-mixture example models the abundance of birds (great tits) using breeding bird survey data across Switzerland (Kéry & Royle, 2016, ch. 6.11.1). There are 267 routes, each in a 1 square kilometer quadrat in a grid, each surveyed 2-3 times in one year. Kéry and Royle (2016), this example features some informative priors.
i,j,k = + 1 *HRwtProx j + 2 *RemnantWtProx j + 3 *frd j,t + 4 *k i + 5 *BodySize i + 6 *frd j,t *HRwtProx j + 7 *frd j,t *RemnantWtProx j + 8 *k i *HRwtProx j + 9 *BodySize i *HRwtProx j + 10 *k i *RemnantWtProx j + 11 *BodySize i *RemnantWtProx j i,j,t = A + B 1 *HRwtProx j + B 2 *RemnantWtProx j + B 3 *frd j,t + B 4 *k i + B 5 *BodySize i + B 6 *frd j,t *HRwtProx j + B 7 *frd j,t *RemnantWtProx j + B 8 *k i *HRwtProx j + B 9 *BodySize i *HRwtProx j + B 10 *k i *RemnantWtProx j + B 11 *BodySize i *RemnantWtProx j logit p i,j,t,r = p 0 +p 1 *date j,t,r + p 2 * (date j,t,r ) 2 The number of birds available to be counted on each route is assumed to follow a zero-inflated Poisson distribution. The probability of a structural zero (due to unsuitable habitat) is ϕ. The unobserved state w i is 0 if the habitat is unsuitable, 1 if it is suitable. The number of birds available to be counted at site i is N i , which is 0 if w i = 0 and follows a Poisson with mean i if w i = 1. These relationships are written as follows: Covariates for i included site-level forest cover (%, forest i ), elevation (m, elev i ), and route length (km, length i ). In addition, a random effect of site on i allowed for unexplained route-to-route variation.
The i component of the model is as follows: The observed abundance of birds on the i th route during the j th survey is then modeled, including priors, as follows: where covariates for survey date (date i,j ), duration (min, dur i,j ), elevation, and their interactions are included as well as random effects for site and survey.
The model with less hierarchical structure has no random effects for site ( ,i , p(site),i ) or survey ( p(survey),i,j) on abundance and detection, nor their associated hyperparameters (Kery & Schaub, 2012, ch. 13.5.1).