MC(MC)MC: exploring Monte Carlo integration within MCMC for mark–recapture models with individual covariates



  1. Estimating abundance from mark–recapture data is challenging when capture probabilities vary among individuals.
  2. Initial solutions to this problem were based on fitting conditional likelihoods and estimating abundance as a derived parameter. More recently, Bayesian methods using full likelihoods have been implemented via reversible jump Markov chain Monte Carlo sampling (RJMCMC) or data augmentation (DA). The latter approach is easily implemented in available software and has been applied to fit models that allow for heterogeneity in both open and closed populations. However, both RJMCMC and DA may be inefficient when modelling large populations.
  3. We describe an alternative approach using Monte Carlo (MC) integration to approximate the posterior density within a Markov chain Monte Carlo (MCMC) sampling scheme. We show how this Monte Carlo within MCMC (MCWM) approach may be used to fit a simple, closed population model including a single individual covariate and present results from a simulation study comparing RJMCMC, DA and MCWM. We found that MCWM can provide accurate inference about population size and can be more efficient than both RJMCMC and DA. The efficiency of MCWM can also be improved by using advanced MC methods like antithetic sampling.
  4. Finally, we apply MCWM to estimate the abundance of meadow voles (Microtus pennsylvanicus) at the Patuxent Wildlife Research Center in 1982, allowing capture probabilities to vary as a function of body mass.


Individual variation is a key driver of evolution and an important consideration in modelling the demographics of many populations. However, individual heterogeneity presents a challenge in the analysis of mark–recapture data – particularly when the goal is to estimate abundance. In practice, differences in the behaviour of individuals in a population may be modelled as functions of individual covariates or random effects. In either case, the likelihood function will include integrals to account for all possible values of the unobserved effects. These integrals may be difficult to compute if multiple covariates/random effects are included or if a single individual covariate/random effect changes over time, which makes evaluating the true likelihood for the entire population problematic.

Intractable likelihoods pose a general problem in statistics, and several solutions have been proposed within the Bayesian framework. We explore Monte Carlo integration within Markov chain Monte Carlo sampling (MCWM) to obtain inference from mark–recapture models with individual heterogeneity. While we focus on modelling the effects of individual covariates, the same methods can be applied to models including random effects or a combination of the two.

One way to avoid the problem with intractable likelihoods is to estimate abundance with a conditional likelihood approach. Huggins (1989) and Alho (1990) presented methods for estimating the size of a closed population when the capture probability depends on an individual covariate. Likelihoods which condition on at least one capture are fit to the data from the marked individuals and used to estimate capture probability as a function of the covariate. Abundance is then estimated using a Horvitz–Thompson estimator. These methods were later extended to open population models by McDonald & Amstrup (2001). However, these models are restrictive and can only be used if the covariate is completely observed for the marked individuals (i.e. the covariate is constant or changes deterministically like age).
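The Horvitz–Thompson step can be sketched in a few lines of Python. This is only an illustration: the function name `horvitz_thompson` and its inputs (fitted per-occasion capture probabilities for the marked individuals) are assumptions, not part of the methods cited above. Each marked individual is weighted by the inverse of its estimated probability of being captured at least once in T occasions.

```python
def horvitz_thompson(p_hat, T):
    """Estimate abundance from fitted capture probabilities of marked animals.

    p_hat: per-occasion capture probability for each marked individual
           (hypothetical fitted values from a conditional likelihood).
    T:     number of capture occasions.
    """
    total = 0.0
    for p in p_hat:
        # Probability that this individual is captured at least once.
        pi_i = 1.0 - (1.0 - p) ** T
        total += 1.0 / pi_i
    return total


# Two marked animals, each with a 50% chance of capture on a single occasion.
print(horvitz_thompson([0.5, 0.5], T=1))
```

Individuals with a low probability of ever being captured receive large weights, which is why the estimate is sensitive to how well the covariate relationship is estimated.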

Alternatively, Bayesian inference via Markov chain Monte Carlo (MCMC) has been applied to fit models allowing for the effects of time-varying, individual covariates or other covariates that are only partially observed for the marked individuals. Dupuis (1995) applied Bayesian methods to model the effects of discrete covariates on survival of individuals in an open population (i.e. the multistate model). Following this, Pollock (2002) suggested that a Bayesian approach could be applied for the particular case of continuous, time-varying, individual covariates and noted that: ‘Bayesian methods automatically integrate out unobserved random variables using numerical integration or Markov chain Monte Carlo sampling methods’ (Pollock 2002, p. 97). Bonner & Schwarz (2006) applied Bayesian inference via MCMC to model the effects of time-dependent covariates on individual capture and survival probabilities in the Cormack–Jolly–Seber model. King et al. (2006) described a similar approach and provided methods of variable selection while Gimenez et al. (2006) incorporated semiparametric regression to allow for nonlinear effects of the covariate. Royle, Dorazio & Link (2007) and Royle (2009) later developed MCMC-based methods to make inference about the size of a closed population when capture probabilities vary among individuals. Their method is based on augmenting the observed data with a large number of zero capture histories representing a pool of individuals that may have been alive, but never captured and has become known as the data augmentation (DA) approach. This method is appealing because it provides a conceptually simple framework that can be applied to many models and is easily implemented in the BUGS language. More recently, Schofield & Barker (2011) and Royle & Dorazio (2012) have shown how the same methods may be applied to model open populations with individual heterogeneity. 
Alternatively, Bayesian inference regarding the size of an open or closed population with individual heterogeneity may be implemented with the reversible jump MCMC (RJMCMC) algorithm as described by King & Brooks (2008).

Our current work is motivated by our experiences applying DA and RJMCMC to a variety of mark–recapture data sets. Both DA and RJMCMC avoid the need for explicit integration by working with complete data likelihoods (CDL) in place of the observed data likelihood. These CDL are constructed by adding extra, unobserved random variables to the data that would simplify computation of the likelihood, if observed (Dempster, Laird & Rubin 1977; Gelman et al. 2003, section 7.2).

We have found that the chains constructed by these algorithms may be computationally inefficient in that they mix poorly and take a long time to generate a representative sample from the posterior distribution. This seems especially true when the models include time-dependent, individual covariates or other multidimensional covariates that make the likelihood difficult to evaluate numerically. All MCMC methods work by constructing a Markov chain that has the posterior distribution as its unique stationary distribution. Samples from the posterior distribution are generated by simulating sufficiently long realizations of the Markov chain, and these samples are used to estimate posterior summary statistics. The challenge with DA and RJMCMC is that a lot of time may be spent updating the extra variables added to the CDL when a small fraction of the population is captured and marked. Moreover, we have found that the chains can have high autocorrelation, meaning that large samples are needed to estimate posterior summary statistics accurately.

We explore the use of MCWM as an alternative to these algorithms for fitting mark–recapture models with individual covariates. We focus on a simple, closed population model with one individual covariate as an example of the method and provide results of a simulation study comparing MCWM, DA and RJMCMC. We also apply our method to data on meadow voles (Microtus pennsylvanicus) collected at the Patuxent Wildlife Research Center in 1981 and 1982 (Nichols et al. 1992) and compare the results with DA and RJMCMC. Although this data was collected using a robust design, we only consider the information from the final primary period and model capture probability as a function of a vole's average observed body mass. Previous analysis of this data has shown a significant, positive relationship between capture probability and body mass (Schofield & Barker 2011), and abundance estimates that ignore this heterogeneity would be biased.

Materials and methods

We describe MCWM and compare it with the alternative RJMCMC and DA algorithms for the following simple model. Suppose that the population of interest is closed and that the capture probability for each individual is a linear function of a normally distributed covariate on the logit scale. Assuming no behavioural effects, time effects or losses on capture, the number of times the ith individual is captured on T occasions, math formula, can be modelled as:

display math

where N is abundance,

display math

Further, suppose that math formula, math formula and math formula so that the only unknown parameters are μ and N. Let n denote the number of individuals captured at least one time and let math formula and math formula represent the observed data. The observed data likelihood is:

display math

where ϕ(z) represents the standard normal density function and math formula is the probability that a randomly selected individual is never captured. That is:

display math(eqn 1)
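For reference, one way to write the model and the quantities above is sketched below. The notation is assumed for illustration ($y_i$ for the number of captures, $x_i$ for the covariate, $p_i$ for the capture probability, $\beta_0$ and $\beta_1$ for the logistic coefficients, $\sigma$ for the covariate standard deviation and $Q(\mu)$ for the probability of never being captured), and the likelihood is given only up to factors that do not depend on $\mu$ or $N$.

```latex
\begin{align*}
y_i \mid p_i &\sim \mathrm{Binomial}(T, p_i), \qquad i = 1, \dots, N, \\
\mathrm{logit}(p_i) &= \beta_0 + \beta_1 x_i, \qquad x_i \sim \mathrm{N}(\mu, \sigma^2), \\
Q(\mu) &= \int_{-\infty}^{\infty} \{1 - p(x)\}^{T}\, \frac{1}{\sigma}\,
          \phi\!\left(\frac{x - \mu}{\sigma}\right) dx,
          \qquad p(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}, \\
L(\mu, N) &\propto \frac{N!}{(N - n)!}\, Q(\mu)^{N - n}
          \prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{T - y_i}\,
          \frac{1}{\sigma}\,\phi\!\left(\frac{x_i - \mu}{\sigma}\right).
\end{align*}
```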

To complete the Bayesian specification, we define prior distributions for the two unknown parameters. We assume independent priors for μ and N such that the posterior density satisfies:

display math

Specifically, we have selected a conjugate normal prior for μ, math formula with math formula fixed, and the Jeffreys prior for N, math formula, as recommended by Link (2013).

The posterior density is not tractable even for this simple model, and so, it is necessary to sample from the posterior distribution to make inference about μ and N. Supposing that it was in fact possible to evaluate math formula directly, the likelihood in eqn 1 could be computed explicitly, and values from the posterior distribution could be generated by a standard MCMC implementation. The full conditional distribution of N would follow a negative binomial distribution so that values of N could be generated directly (a so-called Gibbs sampling step). The full conditional distribution of μ would not be tractable, but values of μ could be generated from a slightly more complicated Metropolis–Hastings (MH) step. This involves proposing a new value for μ from some distribution conditional on the current value, denoted by q(·|μ), and accepting or rejecting this proposal according to the Hastings ratio (see Gilks, Richardson & Spiegelhalter 1996, pp. 5–8). Explicitly, let math formula and math formula represent the values of μ and N generated on the tth iteration. The next values would be generated in two steps by:

  • 1. Updating math formula given math formula, math formula, and math formula via an MH step:
      (a) Propose math formula
      (b) Accept math formula and set math formula with probability math formula where:
          display math
          Otherwise, set math formula.
  • 2. Updating math formula given math formula, math formula, and math formula via a Gibbs sampling step:
          display math

Under general conditions on q(·|μ), the distribution of math formula would converge to the posterior distribution as t→∞. If t were big enough then math formula could be considered as approximate (in some cases, exact) draws from the posterior distribution and used to estimate posterior summary statistics (see Gilks, Richardson & Spiegelhalter 1996 for further details). Of course, this algorithm cannot be implemented because math formula cannot be computed.
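If a value of Q(μ) were available, the Gibbs step for N could be sketched as below. The sketch assumes a prior proportional to 1/N, under which N − n would follow a negative binomial distribution with n successes and success probability 1 − Q(μ). This parameterization and the function names are assumptions made for illustration, and only the Python standard library is used.

```python
import math
import random


def draw_geometric(q, rng):
    """Number of failures before the first success when each trial fails with probability q."""
    u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
    return int(math.log(u) / math.log(q))


def gibbs_update_N(n, q, rng):
    """Draw N as n plus a negative binomial count of uncaptured individuals.

    n: number of individuals captured at least once.
    q: probability that a randomly selected individual is never captured.
    """
    never_captured = sum(draw_geometric(q, rng) for _ in range(n))
    return n + never_captured


rng = random.Random(2024)
print(gibbs_update_N(n=50, q=0.5, rng=rng))
```

The negative binomial draw is built as a sum of n geometric draws, so the step needs nothing beyond a value for q.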

Complete data likelihoods

Both RJMCMC and DA avoid the need to compute math formula directly by constructing posterior distributions from CDLs that do not include the integral in eqn 1. As mentioned above, these CDLs are formed by expanding the model to include additional, unobserved data that simplify the likelihood.

The CDL for RJMCMC is constructed by modelling the hypothetical data for all N individuals in the population. For the simple model, the additional random variables comprise the covariates for the N − n unobserved individuals denoted by math formula. The CDL for RJMCMC is:

display math

The posterior distribution is constructed by assigning priors to the parameters μ and N, exactly as above. Summary statistics including posterior means, standard deviations and credible intervals are then approximated by sampling values from the joint posterior distribution of μ, N and math formula.

The full conditional distribution of N for RJMCMC does not have a simple form and cannot be updated by Gibbs sampling. In fact, the update of N requires a reversible jump (RJ) step that is more complicated than the standard MH update because the dimension of math formula depends on N. In the RJ step, a new value for N is proposed as in an MH step, but a corresponding proposal for math formula must also be constructed by adding or deleting elements to obtain the correct number of covariates. The proposals for N and math formula are then accepted or rejected as a single unit. Further to this, the elements of math formula must be updated separately outside of the reversible jump step. The full conditionals for these values are not tractable, and these values must be updated through N − n separate MH steps (see Schofield & Barker 2011, for details).

As an alternative, the DA algorithm of Royle, Dorazio & Link (2007) constructs a CDL by modelling the hypothetical data for a fixed super-population of size M ≫ N. The additional data for our simple model comprises the covariates for the M − n unobserved individuals in the super-population, math formula, along with M − n binary variables indicating that unobserved individuals are part of the realized population, denoted by math formula. The CDL for DA is:

display math

Here, math formula is the probability that an individual in the super-population is part of the realized population. The posterior distribution is constructed by assigning prior distributions to μ and ψ. We assign μ a conjugate normal prior, as above, and approximate the Jeffreys prior for N by setting ψ ∼ Beta(0·0001, 1), as described by Link (2013). Samples are then drawn from the posterior distribution of μ, ψ and z with N treated as a derived quantity (math formula).

In comparison with the RJMCMC algorithm, all of the updates in the DA algorithm may be implemented with Gibbs or MH steps. However, the variables math formula and math formula must be updated for each unobserved individual on each iteration. These two values may be updated separately or together in a block MH step, but in either case, the complexity of DA depends on M. We have found that RJMCMC and DA may both take a long time to run and the resulting chains may have high autocorrelation when N is large and the distribution of the covariate is complex.
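To make the dependence on M concrete, the following Python sketch performs one sweep over the inclusion indicators and ψ, holding the covariates and μ fixed. It is a simplified illustration rather than the implementation used in this study: the names are hypothetical, the logistic coefficients are assumed known, and each indicator is drawn from the usual conditional probability for an individual with an all-zero capture history.

```python
import math
import random


def expit(u):
    return 1.0 / (1.0 + math.exp(-u))


def da_sweep(x_aug, psi, n, T, rng, beta0=0.0, beta1=1.0, a=0.0001, b=1.0):
    """One DA sweep: update z for every augmented individual, then psi.

    x_aug: current covariate values of the M - n never-captured individuals.
    Returns the new indicators, the new psi and the derived abundance N.
    """
    z = []
    for x in x_aug:  # the cost of this loop grows with M - n
        miss = (1.0 - expit(beta0 + beta1 * x)) ** T  # P(never captured | x)
        prob_in = psi * miss / (psi * miss + (1.0 - psi))
        z.append(1 if rng.random() < prob_in else 0)
    N = n + sum(z)          # abundance as a derived quantity
    M = n + len(x_aug)      # size of the super-population
    psi_new = rng.betavariate(a + N, b + M - N)
    return z, psi_new, N
```

A complete sampler would also update the covariate of every augmented individual and μ on each iteration, which adds further work proportional to M.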

Monte Carlo within MCMC

In short, MCWM is a generalization of the MH updater that uses Monte Carlo (MC) integration to approximate both the numerator and denominator of the Hastings ratio when the exact posterior density cannot be computed. For the simple example, this allows us to implement an approximation to the two step MCMC algorithm presented at the start of this section that avoids computations that depend on N or M as in RJMCMC and DA. We first show how MCWM can be applied to update μ for the simple model and then show that our solution also addresses the problem of updating N.

Consider the MH step for updating μ described earlier. Given the current value, math formula, a proposal is generated from some distribution, math formula. This value is then accepted with probability math formula where:

display math

In MCWM, the Hastings ratio, math formula, is replaced by an approximation:

display math

where math formula and math formula represent MC estimates of the full conditional density of math formula and math formula, as described below (a brief introduction to MC integration is also provided in the Supporting information). Approximating the Hastings ratio in this way introduces extra variability into the MH algorithm, and the posterior distribution is no longer a stationary distribution of the chain. However, Theorem 9 of Andrieu & Roberts (2009) shows that the stationary distribution of the chains generated by MCWM approximates the true posterior when the MC estimator is unbiased, and the size of the MC sample, denoted by K, is large. In essence, if the algorithm is run for enough iterations and the MC samples are large enough, then the MCWM updater will produce values that are approximately, but not exactly, distributed according to the full conditional, math formula.

The remaining challenge in implementing this algorithm is to develop an efficient MC estimator of math formula. The only term in math formula that cannot be computed directly is math formula, and so, it is sufficient to develop an MC estimator for this value alone. An unbiased estimator of Q(μ) can be obtained by generating K sets of N − n covariate values:

display math

and then setting:

display math

However, this requires generating N × K random variables so that the complexity of this estimator depends on N – exactly the problem we are trying to avoid. Instead, we propose a second MC estimator. Let math formula be a single random sample of size K and define:

display math

The posterior density can then be approximated by replacing math formula with:

display math

This produces a biased but consistent estimator of the posterior density. Nevertheless, we conjecture that it maintains the overall properties of MCWM described by Andrieu & Roberts (2009). We believe that samples produced by the MCWM algorithm using math formula as an estimator of math formula will still approximate draws from the true posterior distribution for large enough K, though this remains to be proved.
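A minimal Python sketch of this second estimator follows, as we read the description above. The logistic coefficients (0 and 1) and the unit standard deviation are placeholders chosen for illustration, and the function names are hypothetical. The estimate of Q(μ) is the average of (1 − p(x))^T over a single normal sample of size K, and it is raised to the power N − n wherever the corresponding power of Q(μ) appears in the posterior density.

```python
import math
import random


def capture_prob(x, beta0=0.0, beta1=1.0):
    """Capture probability, linear on the logit scale (placeholder coefficients)."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))


def q_hat(mu, K, T=5, sigma=1.0, rng=random):
    """MC estimate of the probability that an individual is never captured."""
    total = 0.0
    for _ in range(K):
        x = rng.gauss(mu, sigma)  # covariate drawn from N(mu, sigma^2)
        total += (1.0 - capture_prob(x)) ** T
    return total / K


def log_q_power(mu, N, n, K, rng=random):
    """Approximate log of Q(mu)^(N - n) using one MC sample of size K."""
    return (N - n) * math.log(q_hat(mu, K, rng=rng))


rng = random.Random(7)
print(q_hat(-1.0, K=1000, rng=rng))
```

The number of random variates generated is K regardless of N, which is the property that motivates this estimator.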

A further advantage of the second MC estimator is that it allows the Gibbs update of N to be performed without further computation. Recall that the update of N depends only on math formula – exactly the value estimated in our MCWM update of μ. If math formula is accepted, then we set math formula. Otherwise, we set math formula. Our full algorithm proceeds by:

  • 1. Updating math formula given math formula via MCWM:
      (a) Propose math formula
      (b) Compute MC estimates math formula and math formula, and the corresponding estimates math formula and math formula.
      (c) Accept math formula and set math formula with probability math formula where:
          display math
          Otherwise, set math formula.
  • 2. Updating math formula given math formula via Gibbs sampling:
          display math
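The two steps above can be sketched in Python as follows. This is an illustrative reading of the algorithm under stated assumptions, not the R implementation used in this study: the logistic coefficients, the covariate standard deviation, the prior standard deviation `tau`, the random-walk proposal and all function names are placeholders, and the prior for N is assumed proportional to 1/N. Capture-history terms are omitted from the full conditional of μ because they do not depend on μ when the logistic coefficients are fixed.

```python
import math
import random


def expit(u):
    return 1.0 / (1.0 + math.exp(-u))


def q_hat(mu, K, T, rng, sigma=1.0, beta0=0.0, beta1=1.0):
    """MC estimate of the probability of never being captured."""
    total = 0.0
    for _ in range(K):
        x = rng.gauss(mu, sigma)
        total += (1.0 - expit(beta0 + beta1 * x)) ** T
    return total / K


def log_cond_mu(mu, N, x_obs, q, sigma=1.0, tau=10.0):
    """Approximate log full conditional of mu, up to an additive constant."""
    n = len(x_obs)
    log_lik = -sum((x - mu) ** 2 for x in x_obs) / (2.0 * sigma ** 2)
    log_prior = -mu ** 2 / (2.0 * tau ** 2)
    return log_lik + log_prior + (N - n) * math.log(q)


def draw_N(n, q, rng):
    """n plus a negative binomial count (assumes a prior proportional to 1/N)."""
    return n + sum(int(math.log(1.0 - rng.random()) / math.log(q)) for _ in range(n))


def mcwm(x_obs, T, K, iters, rng, step=0.2):
    """Return a list of (mu, N) draws from the approximate sampler."""
    n = len(x_obs)
    mu, N = sum(x_obs) / n, 2 * n          # arbitrary starting values
    draws = []
    for _ in range(iters):
        mu_prop = rng.gauss(mu, step)      # symmetric random-walk proposal
        q_prop = q_hat(mu_prop, K, T, rng)
        q_curr = q_hat(mu, K, T, rng)      # re-estimated on every iteration
        log_r = (log_cond_mu(mu_prop, N, x_obs, q_prop)
                 - log_cond_mu(mu, N, x_obs, q_curr))
        if math.log(1.0 - rng.random()) < log_r:
            mu, q = mu_prop, q_prop
        else:
            q = q_curr
        N = draw_N(n, q, rng)              # reuses the MC estimate of Q
        draws.append((mu, N))
    return draws
```

Because the proposal is symmetric, the proposal densities cancel and only the two approximate full conditional densities enter the acceptance ratio.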


We propose two extensions of MCWM that seem to provide more efficient sampling for mark–recapture models. The first is to use related samples in computing the MC estimates of the posterior density in both the numerator and denominator of the Hastings ratio. Consider the simple model. The basic property of location-scale families can be used to generate math formula: if math formula, then math formula. In our implementation of the MCWM algorithm, we use a single sample of K independent standard normal random variates to estimate both math formula and math formula. Specifically, we generate math formula and define:

display math

The advantage is that the MC samples used in the numerator and denominator of math formula have the same quantiles with respect to their corresponding distributions. This ensures that extreme values do not occur in one of the MC samples alone and seems to improve mixing. The same procedure can also be applied using uniform random variates and the probability integral transformation if the distribution of math formula is not in a location-scale family.
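The effect of sharing the standard normal variates can be illustrated with a short Python sketch. The coefficients, the two values of μ and the function names are hypothetical; the sketch compares the spread of the estimated log ratio of the two non-capture probabilities when the same variates are used for both values of μ and when separate samples are drawn.

```python
import math
import random
import statistics


def expit(u):
    return 1.0 / (1.0 + math.exp(-u))


def q_from_z(mu, z, T=5, sigma=1.0, beta0=0.0, beta1=1.0):
    """MC estimate of Q(mu) built from standard normal variates z."""
    total = sum((1.0 - expit(beta0 + beta1 * (mu + sigma * zk))) ** T for zk in z)
    return total / len(z)


def log_q_ratio(mu_prop, mu_curr, K, rng, common=True):
    """Estimate log Q(mu_prop) - log Q(mu_curr), optionally with shared variates."""
    z_prop = [rng.gauss(0.0, 1.0) for _ in range(K)]
    z_curr = z_prop if common else [rng.gauss(0.0, 1.0) for _ in range(K)]
    return math.log(q_from_z(mu_prop, z_prop)) - math.log(q_from_z(mu_curr, z_curr))


rng = random.Random(3)
shared = [log_q_ratio(-0.9, -1.0, 100, rng, common=True) for _ in range(200)]
separate = [log_q_ratio(-0.9, -1.0, 100, rng, common=False) for _ in range(200)]
print(statistics.pstdev(shared), statistics.pstdev(separate))
```

With shared variates the errors in the two estimates largely cancel in the ratio, so less MC noise reaches the accept/reject decision.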

The second modification we have tested is to use antithetic sampling in constructing the MC estimates. Instead of generating K distinct values from the normal distribution, we generate a random normal sample of size K/2 (assuming K is even), math formula and then set math formula, k = 1,…,K/2. This induces negative correlation within the MC sample and reduces the variance of the MC estimator if the integrand is a monotone function of x (Givens & Hoeting 2012, pp. 187–188). This is true for the simple model above and for the model in 'Application' Section that treats math formula and math formula as unknown. Similar methods can also be applied for non-normal covariates and in higher dimensions. We refer to the MCWM algorithm combined with antithetic sampling as MCWM/AS.
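A small Python sketch of antithetic sampling for the non-capture probability is given below. The coefficients and function names are assumptions for illustration. Each standard normal variate is paired with its negative, and the variance of the resulting estimator is compared with that of a plain MC estimator based on the same total number of variates.

```python
import math
import random
import statistics


def expit(u):
    return 1.0 / (1.0 + math.exp(-u))


def q_from_z(mu, z, T=5, sigma=1.0, beta0=0.0, beta1=1.0):
    """MC estimate of Q(mu) built from standard normal variates z."""
    total = sum((1.0 - expit(beta0 + beta1 * (mu + sigma * zk))) ** T for zk in z)
    return total / len(z)


def q_plain(mu, K, rng):
    """Estimate from K independent variates."""
    return q_from_z(mu, [rng.gauss(0.0, 1.0) for _ in range(K)])


def q_antithetic(mu, K, rng):
    """Estimate from K/2 variates and their negatives (K assumed even)."""
    half = [rng.gauss(0.0, 1.0) for _ in range(K // 2)]
    return q_from_z(mu, half + [-zk for zk in half])


rng = random.Random(8)
plain = [q_plain(-1.0, 100, rng) for _ in range(300)]
anti = [q_antithetic(-1.0, 100, rng) for _ in range(300)]
print(statistics.pvariance(plain), statistics.pvariance(anti))
```

The integrand is monotone in the covariate, so an unusually large value from one variate tends to be offset by a small value from its mirror image.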

Simulation study

To demonstrate the properties of MCWM, we describe results from a small simulation study based on the simple model presented in 'Monte Carlo within MCMC' Section. We assumed a population of N = 1000 individuals and T = 5 capture occasions. We generated 100 data sets each for two different values of μ. In the first scenario, we set μ = −1 such that math formula and math formula. In the second scenario, we set μ = −3 such that math formula and math formula.

Samples from the posterior distribution conditional on each simulated data set were generated via RJMCMC, DA, MCWM and MCWM/AS. We also compared the effects of varying the size of the super-population for DA and the size of the MC sample for MCWM. We first ran RJMCMC for each data set and then applied DA with M equal to r times the largest value of N sampled during the RJMCMC algorithm for r = 1,2 and 4. Finally, we applied both MCWM and MCWM/AS with MC sample sizes of K = 100, 500 and 1000.

Each algorithm depends on choices regarding the updaters of μ, N and the augmented data (if applicable). We tried to implement the algorithms as would a relatively experienced user of MCMC. We applied Gibbs sampling steps when possible and otherwise used MH steps with standard proposal densities optimized through an adapting phase. Complete details of the different algorithms are provided in Table 1. All chains were started from the true parameter values to avoid effects of the initial values and were run for a total of 55 000 iterations with the first 5000 removed as burn-in. All code was written in R and vector calculations were used when possible. R packages containing this code are available as Data S1 and S2 in the Supporting Information, and from the first author upon request.

Table 1. Implementation choices for the variants of the Markov chain Monte Carlo (MCMC) algorithms

Algorithm   Parameter       Update method
RJMCMC      μ               Gibbs step: math formula, where math formula
RJMCMC      N               RJ step with proposal: math formula
RJMCMC      math formula    MH step with proposal math formula
DA          μ               Gibbs step: math formula
DA          ψ               Gibbs step: math formula
DA          math formula    MH step with proposal math formula
DA          math formula    Gibbs step: math formula
MCWM        μ               MCWM step with proposal math formula.
MCWM        N               Approximate Gibbs step: math formula

Note. The three sections of the table describe the updates for each parameter in reversible jump MCMC (RJMCMC; top), data augmentation (DA; middle) and Monte Carlo within MCMC (MCWM; bottom). The implementation of MCWM/AS was the same as MCWM except that antithetic sampling was used to estimate the posterior density in the MCWM update of μ.

For each of the two scenarios, we compared the efficiency of the different samplers and the accuracy of the estimated posterior summary statistics. Accuracy of the samplers was assessed by comparing the location and spread of the sampled values of N. Specifically, we compared the bias and mean-squared-error (MSE) of the posterior mean of N:

display math

where math formula represents the posterior mean estimated from the sth simulation. We also compared the estimated posterior standard deviation of N. Efficiency of the samplers was assessed by comparing the effective number of samples for N generated per second (the effective sample size of N divided by the runtime of the chain). Simply comparing the runtime for the different algorithms is inappropriate because the samples are not independent. A chain that runs quickly but has high autocorrelation may be less efficient than a slower chain that mixes better. The effective sample size of an MCMC sample is the number of independent draws that would be needed to provide the same information about the posterior distribution. This value is estimated by fitting an autoregressive time series model to the sampled chain and then computing the integrated autocorrelation as described by Liu (2008, pp. 125–126) and implemented in the coda package in R (Plummer et al. 2006). Results are presented in Figs 1 and 2. Complete numerical results are also provided in the Supporting information (Table S1).
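As a rough illustration of these measures, the Python sketch below approximates the effective sample size with a first-order autoregressive model, m(1 − ρ)/(1 + ρ), where m is the chain length and ρ is the lag-1 autocorrelation. This is a simplification and not the calculation performed by the coda package; the function names are hypothetical. Bias and MSE are computed with their usual definitions across simulated data sets.

```python
import random


def lag1_autocorr(chain):
    """Sample lag-1 autocorrelation of a chain."""
    m = len(chain)
    mean = sum(chain) / m
    num = sum((chain[i] - mean) * (chain[i + 1] - mean) for i in range(m - 1))
    den = sum((v - mean) ** 2 for v in chain)
    return num / den


def effective_sample_size(chain):
    """AR(1) approximation to the effective sample size."""
    rho = lag1_autocorr(chain)
    return len(chain) * (1.0 - rho) / (1.0 + rho)


def efficiency(chain, runtime_seconds):
    """Effective number of samples generated per second."""
    return effective_sample_size(chain) / runtime_seconds


def bias_and_mse(posterior_means, true_N):
    """Bias and mean-squared error of posterior means over simulated data sets."""
    errors = [m - true_N for m in posterior_means]
    bias = sum(errors) / len(errors)
    mse = sum(e * e for e in errors) / len(errors)
    return bias, mse


rng = random.Random(9)
chain = [0.0]
for _ in range(4999):
    chain.append(0.9 * chain[-1] + rng.gauss(0.0, 1.0))  # strongly autocorrelated
print(effective_sample_size(chain))
```

A chain with lag-1 autocorrelation near 0·9 carries roughly one-twentieth of the information of an independent sample of the same length, which is why runtime alone is a poor measure of efficiency.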

Figure 1.

Simulation results 1 – posterior summaries. Distributions of the error in the posterior mean (top) and the posterior standard deviations (bottom) of N for Scenario 1 (blue symbols) and Scenario 2 (red symbols) for the different Markov chain Monte Carlo (MCMC) implementations. Points in each plot represent the mean value over all 100 simulated data sets. These values are also provided numerically. Error bars connect the largest and smallest values over the 100 simulated data sets.

Figure 2.

Simulation results 2 – efficiency. Comparisons of the runtime in minutes (top) and log efficiency for sampling N (effective sample size/second) of the different Markov chain Monte Carlo (MCMC) implementations for Scenario 1 (blue symbols) and Scenario 2 (red symbols). The points represent the mean runtime/efficiency over the 100 replicate data sets. These values are also provided numerically. The error bars extend to the limits of the runtime/efficiency observed over the 100 simulated data sets.

Posterior summary statistics produced via RJMCMC and all variants of DA were almost identical for all of the 100 data sets in Scenario 1 (μ = −1). The bias of the posterior means for RJMCMC and DA ranged from −0·3 to 0·2, and the MSE ranged from 63·4 to 65·2. MCWM and MCWM/AS also produced good estimates of the posterior means. The bias of these implementations was slightly higher with smaller values of K, but with K = 1000 the bias was <0·4 and the MSE was 63·4. However, MCWM tended to overestimate the posterior variance. Mean posterior standard deviations from RJMCMC and DA ranged between 22·3 and 22·6, and MCWM overestimated the posterior standard deviation by approximately 1·7 times when K = 100 and 1·1 times when K = 1000. However, the problem was almost completely resolved by the use of antithetic sampling. MCWM/AS overestimated the posterior standard deviation by approximately 1·1 times when K = 100 and almost not at all when K = 1000.

The clear advantage of both MCWM and MCWM/AS was the gain in efficiency. The runtimes for the different variants of MCWM and MCWM/AS were similar to the runtimes for RJMCMC and DA with r = 1, but the chains mixed much more quickly. Even with K = 1000, MCWM and MCWM/AS were approximately 3·5 times as efficient as the most efficient DA algorithm and more than 100 times as efficient as the RJMCMC algorithm. Antithetic sampling had little effect on these results. On average, MCWM/AS did run slightly faster than MCWM, but the small difference was offset by the change in effective sample size.

Results for Scenario 2 (μ = −3) were qualitatively similar. The posterior summary statistics produced by RJMCMC and all variants of DA were close. Posterior means from these methods were biased by approximately 0·5% due to the influence of the selected prior for N that favours smaller values. Once again, MCWM overestimated the posterior mean of N when K = 100, and both MCWM and MCWM/AS also overestimated the posterior standard deviation for all values of K. However, the error was <2% on average for MCWM/AS with K = 1000. With K = 500, MCWM/AS continued to produce good estimates of the posterior mean and overestimated the standard deviation by only 4% on average.

Mean runtimes for DA and RJMCMC in Scenario 2 were between 1·2 and 1·7 times the mean runtimes in Scenario 1. In comparison, the mean runtimes of MCWM and MCWM/AS decreased slightly because the speeds of DA and RJMCMC depend on the upper bound on N, which increased from Scenario 1 to Scenario 2, while the speeds of MCWM and MCWM/AS depend on n, which decreased. Effective sample sizes for all algorithms decreased in Scenario 2, but MCWM and MCWM/AS were still more efficient than RJMCMC and all variants of the DA algorithm. With K = 1000, MCWM/AS was 22·0 times as efficient as RJMCMC and 13·0 times as efficient as the best version of DA. As before, reducing K to 500 affected the accuracy of the posterior summary statistics slightly but increased the efficiency even further so that MCWM/AS was 29·6 times as efficient as RJMCMC and 17·5 times as efficient as DA.

In summary, MCWM/AS with large values of K (500 or 1000) performed well in both scenarios. Posterior summary statistics were almost equal to those produced by DA and RJMCMC, but MCWM/AS was much more efficient. Decreasing K reduced the accuracy of the estimated posterior summary statistics, in particular the posterior standard deviation, but led to a further increase in efficiency. It was surprising that RJMCMC had such low efficiency, and we discuss this result further in 'Discussion' Section.


Application

As an example of these methods, we analysed data taken from a study of meadow voles (M. pennsylvanicus) conducted at the Patuxent Wildlife Research Center in 1981 and 1982 (Nichols et al. 1992). The experiment followed a robust design with six primary periods each comprising five capture occasions. We focus on the final primary period and assume that the population was closed over this time. The data from this period contain records of 77 voles of which 23 (30%) were captured once and 54 (70%) twice or more. The average number of captures per marked vole was 2·7. We consider the average observed body mass for each vole as a static individual covariate and ignore issues with censoring and rounding discussed by Schofield & Barker (2011).

The model we fit to this data is the same as the model described in 'Monte Carlo within MCMC' Section, except that we treat all parameters as unknown. This includes abundance, N, the coefficients of the logistic model for math formula, math formula and math formula, and the parameters of the normal distribution for math formula, μ and math formula. Once again, we specify a conjugate normal prior for μ and the improper Jeffreys prior for N. For the remaining parameters, we selected the half-t prior with three degrees of freedom for σ and independent t priors with three degrees of freedom for both math formula and math formula. These represent weakly informative priors with most mass near 0 but also with heavy tails.

In this model, the probability that an individual is never captured is a function of math formula and math formula. This requires that MC integration be used to estimate the posterior density in the update steps for each of these parameters. In our implementation, we update math formula as a single unit, and so our algorithm requires three separate MCWM steps per iteration of the MCMC algorithm along with the Gibbs update of N.

As in the simulation study, we compared (i) samples generated via MCWM and MCWM/AS with varying values of K, (ii) samples from DA with varying values of r and (iii) samples from RJMCMC. We again implemented all algorithms using standard updating procedures: Gibbs sampling where possible and MH updates with standard proposals otherwise. The algorithms were again implemented in R, and chains were run for a total of 500 000 iterations with a burn-in period of 50 000 iterations. All code is available from the first author. Plots of the results are provided in the top half of Figs 3 and 4. Numeric summaries are provided in the Supporting information (Table S2).

Posterior summary statistics from all implementations were almost identical. Even with K = 25, MCWM and MCWM/AS provided very accurate results. Runtimes for the different implementations were also similar, except that MCWM and MCWM/AS both took significantly longer when K was large (K = 1000). Once again, the RJMCMC implementation mixed slowly and had much lower efficiency than the other algorithms. However, MCWM and MCWM/AS provided no advantage over DA. The best DA implementation (r = 2) was in fact 1·1 times more efficient than the best MCWM implementation (MCWM/AS with K = 100).

The MCWM approach is intended to address computational problems that arise with DA and RJMCMC when the proportion of individuals captured is small (n much less than N), and so, we have repeated the analysis with a modified version of the meadow vole data constructed by artificially decreasing the capture probability for each marked individual. Specifically, we generated new data by (i) replicating the capture histories for each of the 77 marked voles 5 times, (ii) subsampling the captures in the resulting histories with probability 0·2 and (iii) removing histories with no remaining captures. The resulting data contained 159 histories with 122 (77%) individuals being captured once and only 39 (23%) twice or more. The average number of captures per marked individual was 1·3. Plots of the results are provided in the bottom half of Figs 3 and 4. Numeric summaries are provided in the Supporting information (Table S2).
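
The three steps used to construct the modified data can be sketched as follows. This is an illustrative Python version, not the code used in the analysis: capture histories are assumed to be lists of 0/1 indicators, the function name is hypothetical, and the individual covariate that would be carried along with each history is omitted.

```python
import random


def thin_capture_histories(histories, n_copies=5, keep_prob=0.2, rng=None):
    """Replicate each history, subsample its captures and drop empty histories."""
    rng = rng or random.Random()
    thinned = []
    for history in histories:
        for _ in range(n_copies):  # (i) replicate each marked individual
            # (ii) keep each original capture independently with keep_prob
            new = [1 if c == 1 and rng.random() < keep_prob else 0
                   for c in history]
            if any(new):  # (iii) remove histories with no remaining captures
                thinned.append(new)
    return thinned


# Example with three made-up histories over four occasions.
example = thin_capture_histories([[1, 0, 1, 1], [0, 1, 0, 0], [1, 1, 1, 1]],
                                 rng=random.Random(1))
```

Because captures are only ever removed, every retained history has at most as many captures as the original, which lowers the average number of captures per marked individual.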

Figure 3.

Application results 1 – posterior summaries. Comparison of the posterior distribution for the original meadow vole data (blue symbols) and the modified data (red symbols) for the different Markov chain Monte Carlo (MCMC) implementations. The estimated posterior mean for each implementation is represented by the point with 95% credible interval represented by the error bar. Values above the error bars indicate the estimated posterior standard deviation.

Figure 4.

Application results 2 – efficiency. Comparison of the runtime (top) and efficiency (bottom) of the different Markov chain Monte Carlo (MCMC) implementations in the analysis of the original data (blue symbols) and modified data (red symbols). The top plot compares the time taken in minutes. The bottom plot compares the efficiency for sampling N (effective sample size/second).

In this case, posterior means obtained from MCWM were comparable with the other methods, but the posterior standard deviation was overestimated when K was small. This was corrected completely by MCWM/AS, and estimated posterior summary statistics obtained from MCWM/AS were indistinguishable from the other methods.

Once again, the advantage of MCWM is clear. Whereas the runtime of RJMCMC increased 1·4 times and the runtime of DA increased between 1·8 and 3·0 times depending on r, the runtime of both MCWM and MCWM/AS increased by a factor of less than 1·1 for all values of K. As a result, MCWM and MCWM/AS with K = 100 were both approximately 2·5 times as efficient as RJMCMC and the fastest implementation of DA. Note that the efficiency of all of the algorithms, including MCWM and MCWM/AS, decreased significantly with the modified data. This simply reflects the fact that the autocorrelation of the Markov chains is higher when n is small.
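
Efficiency here is the effective sample size for N divided by the runtime in seconds. The Python sketch below shows one way such a figure could be computed from a chain. The estimator of effective sample size (truncating the sum of sample autocorrelations at the first non-positive lag) is a common heuristic assumed for illustration and is not necessarily the estimator used to produce Fig. 4.

```python
import random


def effective_sample_size(chain):
    """ESS = n / (1 + 2 * sum of autocorrelations up to the first non-positive lag)."""
    n = len(chain)
    mean = sum(chain) / n
    centred = [v - mean for v in chain]
    var = sum(c * c for c in centred) / n
    if var == 0.0:
        raise ValueError("effective sample size is undefined for a constant chain")
    rho_sum = 0.0
    for lag in range(1, n):
        rho = sum(centred[i] * centred[i + lag] for i in range(n - lag)) / (n * var)
        if rho <= 0.0:  # truncate once the autocorrelation dies out
            break
        rho_sum += rho
    return n / (1.0 + 2.0 * rho_sum)


def efficiency(chain, runtime_seconds):
    """Effective sample size per second of runtime."""
    return effective_sample_size(chain) / runtime_seconds


# Toy chains: independent draws versus a strongly autocorrelated AR(1) chain.
rng = random.Random(7)
iid_chain = [rng.gauss(0.0, 1.0) for _ in range(2000)]
ar_chain = [0.0]
for _ in range(1999):
    ar_chain.append(0.9 * ar_chain[-1] + rng.gauss(0.0, 1.0))
print(efficiency(iid_chain, 10.0), efficiency(ar_chain, 10.0))
```

For equal runtimes, the autocorrelated chain yields a much lower efficiency, which is the effect seen for all algorithms with the modified data.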


The examples presented in Sections 'Simulation study' and 'Application' provide an initial assessment of MCWM for fitting mark–recapture models with heterogeneity. As expected, MCWM performed nearly as well as the other algorithms when most individuals were marked and was more efficient when the proportion of marked individuals was small. Not only was the runtime for MCWM smaller in these situations because the computational complexity depends on observed sample size, rather than the size of the population or super-population, but the chains produced by MCWM also mixed more quickly. The disadvantage is that MCWM samples from an approximation to the posterior distribution, and the accuracy of the posterior summary statistics depends on the MC sample size (K). Posterior summary statistics will be biased if K is too small, but the algorithm will take a long time to run and sampling will be inefficient if K is too large. Selecting an appropriate value for K remains an important open question.

Although the examples presented involved scalar covariates, we intend these methods for modelling more complex data with high-dimensional covariates. When capture probabilities depend on a scalar covariate, the probability that an individual from the population is never captured, math formula, could be computed with numerical quadrature (see Appendix S1 in the Supporting information). Choquet & Gimenez (2010) and Gimenez & Choquet (2010) have used this approach to evaluate the likelihood for mark–recapture models with scalar individual random effects. However, quadrature methods with regular grids can be inefficient for computing integrals in high dimensions, essentially because the integrand may be close to zero at many of the grid points. In these cases, MC integration can be more efficient if the sampling distribution concentrates on the regions of the sample space where the integrand is nonzero (Liu 2008, p. 32). In future, we will apply MCWM to fit both closed and open population models with high-dimensional integrals, focusing primarily on data with time-varying, individual covariates as in Bonner & Schwarz (2006).
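
For a scalar covariate, the quadrature alternative can be illustrated with a simple trapezoid rule on a regular grid. The Python sketch below is not the method of Appendix S1; it assumes the same illustrative model as before (logistic capture probability with hypothetical coefficients `beta0` and `beta1`, constant over `n_occasions` occasions, and a normal covariate), and the grid settings are arbitrary.

```python
import math


def never_captured_quadrature(beta0, beta1, mu, sigma, n_occasions,
                              n_points=2001, half_width=8.0):
    """Trapezoid rule on a regular grid over mu +/- half_width * sigma."""
    lower = mu - half_width * sigma
    step = 2.0 * half_width * sigma / (n_points - 1)
    total = 0.0
    for i in range(n_points):
        x = lower + i * step
        p = 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))
        density = (math.exp(-0.5 * ((x - mu) / sigma) ** 2)
                   / (sigma * math.sqrt(2.0 * math.pi)))
        weight = 0.5 if i in (0, n_points - 1) else 1.0  # trapezoid end points
        total += weight * (1.0 - p) ** n_occasions * density
    return total * step


value = never_captured_quadrature(-1.0, 1.0, 0.0, 1.0, 5)
```

A regular grid with m points per dimension needs m^d evaluations of the integrand for a d-dimensional covariate, so the 2001 points used here would be infeasible to replicate across even a handful of dimensions.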

We believe that the methods presented will be most useful for modelling data from large populations in which the overall capture probability is low. Fitting these models with DA will require large super-populations and might lead to long runtimes. In these cases, MCWM may provide accurate inference in much shorter times allowing users to explore a range of models more easily. We also believe that MCWM could provide an alternative to DA and CDL methods used to model other complex ecological data [e.g. spatially explicit mark–recapture models (Royle, Young & Young 2008) or distance sampling models including individual covariates (Royle, Dawson & Bates 2004)].

We will also investigate further modifications that might improve the accuracy or efficiency of MCWM. Using antithetic sampling within the MCWM steps improved the accuracy of posterior summary statistics significantly, and further gains may be made by incorporating more advanced MC methods. For example, importance sampling could be used to estimate the probability that an individual is never captured. Defining an appropriate importance sampling distribution a priori will be difficult, but this could be chosen through an adaptive scheme. We also plan to explore two related algorithms that make use of MC integration within MCMC: the Grouped Independence Metropolis–Hastings (GIMH) algorithm (Beaumont 2003; Andrieu & Roberts 2009) and the Monte Carlo Metropolis–Hastings (MCMH) algorithm (Liang, Liu & Carroll 2010). Remarkably, both algorithms produce Markov chains that converge to the exact posterior distribution when the MC estimator of the posterior density (GIMH) or MH acceptance ratio (MCMH) is unbiased (Andrieu & Roberts 2009). Unfortunately, the only unbiased estimator of the posterior density we have found requires K samples of size N − n, which reintroduces the dependence on N (see Section 'Monte Carlo within MCMC'). Further study is needed to understand the properties of these algorithms if a biased but consistent estimator is used instead.
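
The gain from antithetic sampling can be illustrated with a small experiment. The Python sketch below compares a plain MC estimator of the probability of never being captured with an antithetic version that pairs each standard normal draw z with −z. The model form, the parameter values and the pairing scheme are assumptions for illustration and may differ from the MCWM/AS implementation used in our analyses.

```python
import math
import random
import statistics


def miss_prob(x, beta0, beta1, n_occasions):
    """Probability of no capture on any occasion given covariate x (assumed model)."""
    p = 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))
    return (1.0 - p) ** n_occasions


def plain_estimate(params, K, rng):
    """Average over K independent covariate draws."""
    beta0, beta1, mu, sigma, n_occasions = params
    draws = (rng.gauss(mu, sigma) for _ in range(K))
    return sum(miss_prob(x, beta0, beta1, n_occasions) for x in draws) / K


def antithetic_estimate(params, K, rng):
    """Average over K // 2 antithetic pairs mu + sigma * z and mu - sigma * z."""
    beta0, beta1, mu, sigma, n_occasions = params
    n_pairs = K // 2
    total = 0.0
    for _ in range(n_pairs):
        z = rng.gauss(0.0, 1.0)
        total += miss_prob(mu + sigma * z, beta0, beta1, n_occasions)
        total += miss_prob(mu - sigma * z, beta0, beta1, n_occasions)
    return total / (2 * n_pairs)


# Hypothetical parameter values: (beta0, beta1, mu, sigma, n_occasions).
params = (-1.0, 1.0, 0.0, 1.0, 5)
rng = random.Random(2024)
plain = [plain_estimate(params, 50, rng) for _ in range(400)]
anti = [antithetic_estimate(params, 50, rng) for _ in range(400)]
print(statistics.variance(plain), statistics.variance(anti))
```

The integrand is monotone in the covariate whenever the slope is nonzero, so the two members of each pair are negatively correlated and the antithetic estimator has smaller variance for the same number of function evaluations.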

Finally, the simulation study raised new questions about the RJMCMC and DA algorithms. In particular, the efficiency of both algorithms improved in some cases when the amount of data augmentation increased. Consider the DA algorithm. Conventional wisdom has suggested that M be as small as possible (though it must be big enough not to restrict the posterior distribution of N). To avoid penalizing the DA algorithm by selecting an arbitrary value, we originally set M equal to the largest value of N generated by the RJMCMC algorithm (r = 1). We later found that the efficiency of DA could be increased with a larger value of M as in simulation Scenario 1 with r = 2. Similarly, we were surprised to find that the efficiency of the RJMCMC algorithm was higher in Scenario 2, when only 25% of the population was captured, than in Scenario 1, when 75% of the population was captured.

The increase in the efficiency of DA seems to occur because the chains mix better, and the effective sample size is larger when M is bigger. We believe that this occurs because a larger super-population can generate more populations of size N. This allows the composition of the population to change more freely when unobserved individuals are drawn from the super-population on each MCMC iteration, and in turn allows for larger changes in the model parameters. Although the runtime increases when M increases, the computational cost of the vector calculations used to update z and math formula in our implementation of DA increases slowly with M and may be offset if the mixing improves sufficiently. Further research is needed to determine if the same result occurs with other software and if there is an optimal value for the size of the super-population.

The problem is harder to address for RJMCMC because the amount of augmentation is not pre-determined. More efficient variants of RJMCMC might be implemented using different proposal distributions for N. Our proposal distribution is based on King & Brooks (2008), except that we adapted the width of the uniform distribution to produce an acceptance rate near 50%. In some cases, the proposal distribution was very limited and N could not change by more than 1 or 2 on each iteration. Skewed distributions that allow for occasional large jumps might improve the efficiency, and further research is needed to identify optimal proposal distributions.


We thank Dr. Jim Nichols for allowing us to use the Patuxent meadow vole data in our example. SJB and MRS were partially funded by the NSF KY-EPSCOR fund (NSF Grant No. 0814194). MRS was also partially funded by NSF Grant No. 0934516.


  1. We use efficiency to refer to computational efficiency of the different sampling algorithms not statistical efficiency. One algorithm is more efficient than another if it requires less time to provide the same amount of information about the posterior distribution.

  2. The negative binomial may be parametrised as a distribution on either the number of trials or number of failures until a specified number of successes occurs. We consider the distribution of the number of trials until n successes are reached so that N ≥ n.