Estimation of capture probabilities using generalized estimating equations and mixed effects approaches


  • Md. Abdus Salam Akanda,

    Corresponding author
    1. Department of Mathematics, Research Center in Mathematics and Applications, University of Évora, Évora, Portugal
    2. Department of Statistics, Biostatistics & Informatics, University of Dhaka, Dhaka, Bangladesh
    • Correspondence

      Md. Abdus Salam Akanda, Department of Mathematics, Research Center in Mathematics and Applications, University of Évora, 7000-671 Évora, Portugal.

      Tel: +351965155536; Fax: +351266745393;


  • Russell Alpizar-Jara

    1. Department of Mathematics, Research Center in Mathematics and Applications, University of Évora, Évora, Portugal


Modeling individual heterogeneity in capture probabilities has been one of the most challenging tasks in capture–recapture studies. Heterogeneity in capture probabilities can be modeled as a function of individual covariates, but the correlation structure among capture occasions should be taken into account. The proposed generalized estimating equations (GEE) and generalized linear mixed modeling (GLMM) approaches can be used to estimate capture probabilities and population size for capture–recapture closed population models. An example is used for an illustrative application and for comparison with currently used methodology. A simulation study is also conducted to show the performance of the estimation procedures. Our simulation results show that the proposed quasi-likelihood approach based on GEE provides lower SE than partial likelihood approaches based on either generalized linear models (GLM) or GLMM for estimating population size in a closed capture–recapture experiment. Estimator performance is good if a large proportion of individuals are captured. For cases where only a small proportion of individuals are captured, the estimates become unstable, but the GEE approach outperforms the other methods.


Many estimation methods have been developed for the analysis of closed population capture–recapture data. For comprehensive material on the subject see, for instance, Otis et al. (1978), Seber (2002), Williams et al. (2002) and Amstrup et al. (2005). The most general capture–recapture closed population model, considered by Otis et al. (1978) was denoted by Mtbh where (h) is used to denote inherent individual heterogeneity, (t) time effect, and (b) behavioral response to capture. In this work, we are interested in estimating the population size and SE of a submodel of the type Mh, where individual heterogeneity can be modeled as a function of covariates. Development of capture–recapture models dealing with individual heterogeneity in capture probabilities has been one of the most challenging tasks. Failure to account for such heterogeneity has long been known to cause substantial bias in population estimates (Otis et al. 1978; Lee and Chao 1994; Hwang and Huggins 2005). Moreover, Link (2003) showed that without strong assumptions on the underlying distribution, estimates of population size under model Mh are fundamentally nonidentifiable.

The use of covariates (or auxiliary variables), if available, has been proposed as an alternative way to partially cope with the problem of heterogeneous capture probabilities (Pollock et al. 1984; Huggins 1989; Alho 1990). The idea is to model capture probabilities as a function of individual (e.g., age, sex, and weight) and environmental (e.g., temperature, rainfall, and location) covariates, using a generalized linear modeling (GLM) approach, such as logistic regression. The method of Huggins (1989, 1991), based on a conditional likelihood to estimate population size, has become very popular, but it assumes independence among capture occasions (Huggins and Hwang 2011).

In the analysis of capture–recapture data, Hwang and Huggins (2005) and Zhang (2012) examined the effect of heterogeneity on the estimation of population size by solving estimating equations, but these authors also assumed independence of capture occasions. Capture–recapture data are collected on the same individuals across successive capture occasions. One may view capture–recapture data as binary longitudinal or repeated measurements data (Huggins and Yip 2001). These repeated observations are often correlated over time. This dependency or correlation structure may be induced by incorporating individual heterogeneity. Failure to account for this dependency may provide biased estimates. Hwang and Huggins (2007) also state that the assumption of independence among capture occasions is often violated in practice, but the authors still rely on the assumption. Some dependencies among capture occasions can be dealt with through the modeling of behavioral effects, such as trap-happy and trap-shy effects, which are treated as special cases in the capture–recapture literature (Yang and Chao 2005; Pradel and Sanz-Aguilar 2012). One alternative approach is to use generalized estimating equations (GEE) to account for a working correlation structure among capture occasions (Liang and Zeger 1986) and to use observed individual characteristics to model heterogeneity in capture probabilities. A mixed effects modeling approach may also be used to model heterogeneity due to observed and unobserved individual characteristics in capture–recapture experiments, motivating the use of generalized linear mixed models (GLMM) (Pinheiro and Bates 2000). Some authors have previously introduced the use of GLMM (logit models with normal random effects) (e.g., Coull and Agresti 1999; Stoklosa et al. 2011).
An advantage of using GLMM for the estimation of capture probabilities is to accommodate not only the heterogeneity attributed to individual characteristics, but also the heterogeneity that cannot be explained by the observed individual characteristics.

Bayesian methods have also become popular in capture–recapture studies. An extensive Bayesian literature on capture–recapture closed population models includes Castledine (1981), Smith (1991), George and Robert (1992), Madigan and York (1997), Basu and Ebrahimi (2001), Ghosh and Norris (2005), King and Brooks (2008), and Gosky and Ghosh (2009, 2011). Bayesian statistical modeling requires the development of the likelihood function of the observed data, given a set of parameters, as well as the joint prior distribution of all model parameters. Bayesian methods allow for estimation of the unobserved random effects as well, but the performance of their estimates often depends on the chosen prior distributions. Often, the method of selecting prior distributions is subjective (Lee et al. 2003). A possible advantage of GEE over random-effects models and Bayesian methods relates to the ability of GEE to allow specific correlation structures to be assumed between capture occasions.

Here, we propose a GEE approach for estimating capture probabilities and population size in capture–recapture closed population studies. We also compare the results of population size estimates and their SE, when using the two estimation methodologies (i.e., GEE and GLMM). For illustrative purposes, we analyze a real data set that has already been discussed in the literature. Conditional arguments are used to obtain a Horvitz–Thompson-like estimator for estimating population size. A simulation study is also conducted to compare the performance of the estimation procedures. In the next section, we describe the notation and models that are used to estimate capture probabilities and population size.

Notation and Models

Consider a population consisting of N animals in a capture–recapture experiment over m capture occasions, j = 1,2,…,m. Let Yij be a binary outcome, equaling 1 if the ith animal is caught on the jth capture occasion and 0 otherwise. Let Yi = (Yi1,Yi2,…,Yim) be a random vector with the capture history of individual i. Let $T_i = \sum_{j=1}^{m} Y_{ij}$ be the number of times the ith animal has been caught in the course of the closed-population trapping study. Let ti be the occasion on which the ith individual is first captured. Heterogeneity in capture probabilities is often explained by an observed individual covariate xi, such as age, sex, or weight. For simplicity, we consider xi a single covariate, but the model can be easily generalized for xi to be a vector of covariates. Let the probability that the ith animal is captured on any trapping occasion j be

\[ p_i(\beta) = P(Y_{ij} = 1 \mid x_i) = h(X_i\beta), \tag{1} \]

where $X_i = (1, x_i)$ is the design matrix, $\beta = (\beta_0, \beta_1)$ is the vector of parameters associated with the covariates, and $h(u) = (1 + \exp(-u))^{-1}$ is the logistic function. This is an Mh model where variation in capture probabilities among individuals is explained by the covariate $x_i$. The probability of not capturing the ith individual on the jth occasion is $1 - p_i(\beta)$, and the variance of $Y_{ij}$ is $p_i(\beta)(1 - p_i(\beta))$ (Liang and Zeger 1986). Then, $T_i \sim \mathrm{Bin}(m, p_i(\beta))$ and $\pi_i(\beta) = 1 - (1 - p_i(\beta))^{m}$ is the probability of individual i being captured at least once, given the covariate $x_i$. Let the set of distinct individuals captured on at least one occasion be indexed by i = 1,2,…,n and let the uncaptured individuals be indexed by i = n + 1,…,N, without loss of generality. To estimate the population size, once an estimate of β is obtained ($\hat{\beta}$), the Horvitz–Thompson estimator $\hat{N} = \sum_{i=1}^{n} \pi_i(\hat{\beta})^{-1}$ may be used as in Huggins (1989).
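The chain from a fitted β to a population estimate can be illustrated with a minimal Python sketch. It assumes the logistic capture model and the Horvitz–Thompson form described above; the function names are illustrative and are not taken from the authors' software. As a rough check, it plugs in the rounded intercept-only estimates reported in Table 1 (−0.82 for PL GLM and −0.73 for QL GEE) together with n = 45 and m = 6 from the chipmunk example. Because those intercepts are rounded to two decimals, the output should only approximate the tabulated population sizes.

```python
import math

def logistic(u):
    """Logistic function h(u) = 1 / (1 + exp(-u))."""
    return 1.0 / (1.0 + math.exp(-u))

def capture_at_least_once(p, m):
    """pi = 1 - (1 - p)^m: probability of at least one capture in m occasions."""
    return 1.0 - (1.0 - p) ** m

def horvitz_thompson(beta, covariates, m):
    """Sum of 1 / pi_i over the captured individuals.

    beta       -- pair (beta0, beta1)
    covariates -- covariate value x_i for each captured individual
    m          -- number of capture occasions
    """
    beta0, beta1 = beta
    total = 0.0
    for x in covariates:
        p = logistic(beta0 + beta1 * x)
        total += 1.0 / capture_at_least_once(p, m)
    return total

# Intercept-only illustration with the chipmunk example: n = 45, m = 6.
n, m = 45, 6
x = [0.0] * n  # the covariate plays no role because beta1 = 0 here
n_hat_glm = horvitz_thompson((-0.82, 0.0), x, m)  # rounded PL GLM intercept
n_hat_gee = horvitz_thompson((-0.73, 0.0), x, m)  # rounded QL GEE intercept
print(round(n_hat_glm, 2), round(n_hat_gee, 2))
```

With a covariate model, the same function applies unchanged: each captured animal contributes the inverse of its own probability of being caught at least once.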

Generalized estimating equations approach

Let $V_i = A_i^{1/2} R_i(\alpha) A_i^{1/2}$ be the covariance matrix of $Y_i$, where $A_i = \mathrm{diag}[\mathrm{Var}(Y_{i1}), \mathrm{Var}(Y_{i2}), \ldots, \mathrm{Var}(Y_{im})]$ is an $m \times m$ diagonal matrix and $R_i(\alpha)$ is known as the working correlation structure among $Y_{i1}, Y_{i2}, \ldots, Y_{im}$, which describes the average dependency of individuals being captured from occasion to occasion. A GEE approach permits several types of working correlation structure $R_i(\alpha)$ (for details, see Diggle et al. 1994). For the description that follows, and for simplicity, we consider an independence working correlation structure, $R_i(\alpha) = I$, where I is an identity matrix. The covariate $x_i$ is never known for the individuals that have not been captured. Therefore, $Y_{ij}$ is conditional on the captured individuals (n) (i.e., $T_i \ge 1$) with the corresponding observed individual covariates, similar to Huggins (1989) and Zhang (2012). The probability that the ith individual is captured on the jth occasion ($p_{ij}$), given that the ith individual is observed at least once, is $p_{ij} = P(Y_{ij} = 1 \mid T_i \ge 1) = p_i(\beta)/\pi_i(\beta)$. Let $\mu_{ij} = E(Y_{ij} \mid T_i \ge 1) = p_{ij}$, and let $D_i$ be the matrix of derivatives $\partial\mu_i/\partial\beta$, where $\mu_i = (\mu_{i1}, \mu_{i2}, \ldots, \mu_{im})$, hence $D_i = A_i X_i$. The variance $v_{ij}$ of $Y_{ij}$ given $T_i \ge 1$ is $v_{ij} = p_{ij}(1 - p_{ij})$. Considering $V_i = \mathrm{diag}(v_{ij})$, an estimator of β can be obtained by solving the following generalized estimating equations:

\[ \sum_{i=1}^{n} D_i^{\top} V_i^{-1} (Y_i - \mu_i) = 0. \tag{2} \]
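To make the role of the working correlation concrete, the following minimal sketch builds the covariance matrix of one capture history from its occasion-wise variances and a working correlation matrix. It assumes the usual Liang and Zeger construction, in which each covariance is the product of the two standard deviations and the working correlation. The exchangeable structure and the values p = 0.3 and α = 0.2 are illustrative choices, not settings used in this paper.

```python
import math

def exchangeable_corr(m, alpha):
    """m x m working correlation: 1 on the diagonal, alpha elsewhere.
    alpha = 0 gives the independence structure (identity matrix)."""
    return [[1.0 if j == k else alpha for k in range(m)] for j in range(m)]

def working_covariance(variances, corr):
    """Covariance with entries sd_j * corr[j][k] * sd_k."""
    m = len(variances)
    sd = [math.sqrt(v) for v in variances]
    return [[sd[j] * corr[j][k] * sd[k] for k in range(m)] for j in range(m)]

# Illustrative values only: three occasions, capture probability 0.3.
m = 3
p = 0.3
variances = [p * (1.0 - p)] * m  # binary variance p(1 - p) on each occasion

V_indep = working_covariance(variances, exchangeable_corr(m, 0.0))
V_exch = working_covariance(variances, exchangeable_corr(m, 0.2))

for row in V_exch:
    print([round(value, 3) for value in row])
```

Under the independence structure the matrix is diagonal, so each occasion enters the estimating equations separately. Any other structure adds off-diagonal terms that tie the occasions of the same individual together.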

If covariate xi (i = 1,2,…,n) is available for captured individuals, then the model becomes pi(β) = h(Xiβ). This model is not equivalent to any of those discussed in Otis et al. (1978), rather this model is a restricted version of their model Mh (Huggins 1991). If pi(β) = h(Xiβ), then following Zhang (2012), estimating equations (2) can be simplified to

display math(3)

For a given $\hat{\beta}$, $\hat{N} = \sum_{i=1}^{n} \pi_i(\hat{\beta})^{-1}$, and an estimate of the variance of $\hat{N}$ is given by inline image where inline image represents an estimate of the conditional information matrix for β and inline image is the vector inline image. If the individual capture probability does not depend on time, previous capture history, or any covariate, then model (1) simplifies to pi(β) = h(β0) = p0, which is a reparameterization of model M0 of Otis et al. (1978) (see Huggins 1991; Hwang and Huggins 2005). This model assumes that all individuals have equal capture probabilities. Then, the estimating equations for β0 simplify to

display math(4)

Let $\hat{\beta}_0$ be the resulting estimator of $\beta_0$; then $\hat{N} = n/\pi(\hat{\beta}_0)$, where $\pi(\hat{\beta}_0) = 1 - \{1 - h(\hat{\beta}_0)\}^{m}$.

Methods based on a partial likelihood

The full likelihood of all model parameters is proportional to

display math(5)

As the number of total individuals, N, is unknown and the covariates are not known for individuals that are never captured, this likelihood cannot be directly evaluated. The conditional likelihood (Huggins 1989) is the first product component, and it can be formulated as a GLM (Huggins and Hwang 2011) for the positive Binomial distribution (Patil 1962). It may be rewritten as

display math(6)

When the full likelihood is partitioned into a product of conditional densities, a partial likelihood (Cox 1975) may arise by considering only some of the product terms; it involves only the parameters of interest, isolating the nuisance parameters. Therefore, the partial likelihood, PL(β), is the first product of equation (6), which is the likelihood of the number of recaptures after the first capture (Stoklosa et al. 2011). For a given $t_i$, $(T_i - 1) \mid t_i \sim \mathrm{Bin}(m - t_i, p_i(\beta))$, which is used to estimate the parameters β.
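The binomial structure of the recaptures after first capture can be illustrated in the simplest case, where every individual shares one capture probability p. There the binomial likelihood is maximized in closed form: total recaptures divided by the total number of occasions remaining after each first capture. The sketch below is an illustration under that equal-probability assumption only. The capture histories are hypothetical toy data, not the chipmunk data, and the function names are invented for this example.

```python
def first_capture_and_total(history):
    """Return (t_i, T_i): occasion of first capture (1-based) and total captures.
    The history must contain at least one capture."""
    t = history.index(1) + 1
    return t, sum(history)

def partial_likelihood_p(histories):
    """Maximizer of the product of Bin(m - t_i, p) likelihoods for T_i - 1
    when all individuals share the same capture probability p."""
    m = len(histories[0])
    recaptures = 0
    trials = 0
    for history in histories:
        t, total = first_capture_and_total(history)
        recaptures += total - 1  # captures after the first one
        trials += m - t          # occasions left after the first capture
    return recaptures / trials

# Hypothetical capture histories over m = 4 occasions (toy data).
histories = [
    [1, 0, 1, 0],
    [0, 1, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
]
m = len(histories[0])
p_hat = partial_likelihood_p(histories)
pi_hat = 1.0 - (1.0 - p_hat) ** m   # probability of at least one capture
n_hat = len(histories) / pi_hat     # Horvitz-Thompson estimate
print(p_hat, n_hat)
```

An animal first caught on the last occasion contributes no trials, so it carries no information about p in this likelihood, although it still counts toward n in the population estimate.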

To utilize a simple GLMM with a random effect, we suppose that $p_i(\beta) = h(X_i\beta + \sigma_b z_i)$, where $z_i$ is a realization of the standard normal random variable $Z \sim N(0,1)$, with $\sigma_b > 0$. The use of random effects reflects the belief that there is heterogeneity that cannot be explained by covariates. The partial likelihood can be considered as the joint distribution of the response and the random effects. To estimate β and $\sigma_b$, the marginal likelihood of the response is obtained by integrating out the random effects. The integration can be approximated by penalized quasi-likelihood (Breslow and Clayton 1993), which enables parameter estimation via an iterative procedure.

The variance of $\hat{N}$ for a smoothing parameter λ may be estimated according to Stoklosa et al. (2011) using the following formula, inline image, where η(β) is a vector with $\eta_i(\beta) = \pi_i(\beta)^{-2}\, m\, p_i(\beta)\{1 - \pi_i(\beta)\}$, and all quantities are evaluated at $\hat{\beta}$. The smoothing parameter λ, which is part of the quasi-likelihood procedure, controls the degree of roughness of the estimated functions. To obtain an optimal value for λ, we used the generalized cross-validation (GCV) technique (Wood 2006).


We applied the techniques discussed in the previous section to a data set of least chipmunks (Eutamias minimus) made available by V. Reid (1975). The data set has been previously analyzed and discussed by Otis et al. (1978) and Wang et al. (2007). V. Reid laid out a 9 × 11 livetrapping grid with traps spaced 50 feet (15.2 m) apart. The study was conducted in an area dominated by sagebrush and snowberry in Colorado, USA. The numbers of animals caught on the six occasions (n1 to n6) were 7, 15, 16, 24, 19, 7, so that ∑nk = 88. These 88 captures involved n = 45 distinct animals, and the covariate sex (male or female) was recorded for each captured individual; there were 22 males and 23 females. The recorded capture frequencies (f1 to f6) were 21, 12, 7, 3, 2, 0. The average capture frequencies for males and females were 1.86 and 2.04, respectively. Our estimation results are summarized in Table 1. The inclusion of the covariate sex does not improve our estimates of population size, which are very similar across models, except when the random effect is considered in the GLMM, which is based on partial likelihood estimation. This may indicate that there is unmodeled individual heterogeneity in capture probabilities that is not being accounted for by the other models (GLM and GEE). The population estimate, in this case, is approximately 74 individuals with an SE of 12. Both values are quite high when compared to the values obtained with the other estimation strategies. Although GLMM accounts for heterogeneity due to unobserved individual characteristics, it may also be overestimating population size at the expense of a greater loss in precision, possibly due to the increase in the number of model parameters that are estimated. In contrast, the quasi-likelihood GEE methodology provided lower SE when compared to results from the Bayesian approach of Wang et al. (2007) for the same data set. The latter authors estimated a population size of 50 with an SE of 3.14.
The GEE estimation results also agree with Otis et al. (1978), but our model jointly takes into account heterogeneity in capture probabilities and correlation among capture occasions.

Table 1. Comparison of parameter estimates (SE in parentheses) for least chipmunk data after fitting models with and without a covariate (sex).

| Model no. | logit{$p_i(\beta)$} | $\hat{N}$ |
|---|---|---|
| Intercept-only models | | |
| 1. PL GLM | −0.82 (0.18) | 50.72 (3.33) |
| 2. QL GEE | −0.73 (0.13) | 49.66 (2.27) |
| 3. PL GLMM | −0.85 (0.26) + 0.00 $z_i$ (0.73) | 50.73 (3.35) |
| Linear covariate models | | |
| 4. PL GLM | −0.81 (0.25) − 0.03 sex (0.37) | 50.73 (3.35) |
| 5. QL GEE | −0.84 (0.18) − 0.21 sex (0.26) | 52.40 (2.94) |
| 6. PL GLMM | −0.83 (0.34) − 0.14 sex (0.49) + 1.59 $z_i$ (0.00) | 74.16 (12.06) |

$z_i$ is a realization of a standard normal random variable. Numbers in this table are rounded to two decimal places; therefore, 0.00 does not mean zero. QL, quasi-likelihood; PL, partial likelihood; GLM, generalized linear models; GEE, generalized estimating equations; GLMM, generalized linear mixed models.

Simulation Study

A simulation study was conducted in order to evaluate the performance of the estimators. The effect of heterogeneity among observed individuals was modeled using two covariates, sex (male = 1 and female = 0), and weight. Two levels of population sizes N = 100 and 500 and two levels of capture occasions m = 6 and 10 were considered. For each individual, we assigned sex with probability 0.5 from a Bernoulli distribution and weight from a normal distribution with mean 15 and variance 4. These values are based on the previous data analysis. Individual capture probabilities were modeled with a logistic regression, so that

\[ \mathrm{logit}(p_i) = \beta_0 + \beta_1 \times \mathrm{sex}_i + \beta_2 \times \mathrm{weight}_i, \tag{7} \]

where β0 is the constant term, and β1 and β2 represent the sex and weight effects, respectively. A positive β1 implies that the sex taking value 1 is more catchable, and a positive β2 means that catchability increases with weight. We considered three different simulation scenarios for capture probabilities: (a) high capture probabilities (β0 = −3.5); (b) medium capture probabilities (β0 = −4.0); and (c) low capture probabilities (β0 = −4.5). Their averages are presented in Table 2. In addition, a Gaussian random effect with mean 0 and σb = 0.1 was included as an unobserved covariate to ensure the existence of heterogeneity due to unobserved individual characteristics. For each simulation scenario, the GLM, GEE, and GLMM approaches were used to analyze the data and to assess estimator performance. The simulation study was carried out with 1000 Monte Carlo replicates.
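One replicate of such a simulation can be sketched as follows. The sketch follows the design described above: sex from a Bernoulli(0.5) distribution, weight from a normal distribution with mean 15 and variance 4, a logistic capture model, and a Gaussian random effect with σb = 0.1. The values of β1 and β2 are placeholders chosen for illustration, because the values used in the study are not available in this text. The function name and the seed are also assumptions of the example.

```python
import math
import random

def simulate_histories(N, m, beta0, beta1, beta2, sigma_b, rng):
    """Simulate capture histories for N animals over m occasions.

    Returns a list of (sex, weight, history) tuples, one per animal,
    including animals that are never captured.
    """
    population = []
    for _ in range(N):
        sex = 1 if rng.random() < 0.5 else 0   # Bernoulli(0.5)
        weight = rng.gauss(15.0, 2.0)          # variance 4, so sd = 2
        z = rng.gauss(0.0, 1.0)                # unobserved random effect
        eta = beta0 + beta1 * sex + beta2 * weight + sigma_b * z
        p = 1.0 / (1.0 + math.exp(-eta))       # same p on every occasion
        history = [1 if rng.random() < p else 0 for _ in range(m)]
        population.append((sex, weight, history))
    return population

rng = random.Random(2014)  # arbitrary seed for reproducibility
# beta0 = -3.5 is scenario (a); beta1 and beta2 are hypothetical placeholders.
population = simulate_histories(N=100, m=6, beta0=-3.5, beta1=0.5,
                                beta2=0.2, sigma_b=0.1, rng=rng)

# Only animals caught at least once are observed, with their covariates.
captured = [animal for animal in population if sum(animal[2]) > 0]
n = len(captured)
print(n)
```

In a full study this step would be repeated for each replicate, with each estimator fitted to the captured animals only.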

Table 2. Simulated capture probability scenarios for the capture probability model, logit(pi) = β0+β1 × sex + β2 × weight. inline image represents average capture probability when weight = 15 and πi represents the average probability of an individual being captured at least once during the study.
Simulation scenariosEffects of covariates inline image π i
m = 6m = 10
β 0 β 1 β 2 MaleFemaleMaleFemaleMaleFemale
(a) High−
(b) Medium−
(c) Low−

To evaluate estimator performance, we present the SE, the percentage relative bias (PRB), the root mean square error (RMSE), the coefficient of variation (CV), and the confidence interval coverage (%) (COV) for the estimates of population size. The simulation results for six capture occasions are given in Table 3. We noticed that all estimation procedures for scenario (a) perform well. There was little bias, low SE, and a low coefficient of variation for $\hat{N}$. In this scenario, confidence interval coverage for all estimators is very good (93–96%), considering a nominal level of 95%. As in our example, the exception is the GLMM, which tends to overestimate population size. Overestimation is particularly severe when capture probabilities are low; see, for instance, the results of scenarios (b) and (c). Confidence interval coverage for GLMM is also poor (77–90%) in these scenarios. For all scenarios, the GEE approach performs well when estimating population size. This approach also consistently provides lower SE and lower RMSE when compared to the GLM and GLMM estimators, although the differences are minimal for GEE-GLM comparisons. Therefore, our simulation results indicate that the general performance of estimators obtained from GEE is better than that of GLM and GLMM. The GEE approach may overcome the effect of random effects due to its ability to account for the correlation structure among capture occasions. The simulation results for 10 capture occasions are presented in Table 4. The performance of estimators for 10 capture occasions is better than for six capture occasions, yielding lower CV, absolute value of PRB, and RMSE, but higher COV. This is generally true because the probability of being captured at least once is higher for 10 capture occasions than for six capture occasions. We also conducted simulations for two other levels of N (50 and 200) when m = 6 and 10. These results are similar to the ones presented here.
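The summary measures can be computed from the Monte Carlo average and SE of the population estimates, as in the short sketch below. The definitions used are assumptions: PRB as 100 times the bias divided by N, CV as 100 times the SE divided by the average estimate, and RMSE as the square root of the squared SE plus the squared bias. They were chosen because they reproduce the tabulated values to two decimals for the first row of Table 3 (PL GLM, N = 100, scenario (a)), which has an average estimate of 100.63 and an SE of 3.77.

```python
import math

def performance_summary(ave_n_hat, se_n_hat, true_n):
    """Return (PRB, CV, RMSE) for a set of simulated population estimates.

    ave_n_hat -- Monte Carlo average of the estimates
    se_n_hat  -- Monte Carlo standard error of the estimates
    true_n    -- population size used in the simulation
    """
    bias = ave_n_hat - true_n
    prb = 100.0 * bias / true_n                 # percentage relative bias
    cv = 100.0 * se_n_hat / ave_n_hat           # percentage coefficient of variation
    rmse = math.sqrt(se_n_hat ** 2 + bias ** 2) # root mean square error
    return prb, cv, rmse

# First row of Table 3: PL GLM, N = 100, scenario (a).
prb, cv, rmse = performance_summary(100.63, 3.77, 100)
print(round(prb, 2), round(cv, 2), round(rmse, 2))  # 0.63 3.75 3.82
```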

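Table 3. Simulation results (1000 repetitions) considering m = 6 trapping occasions.

| Scenario | Method | N | $\bar{n}$ | AVE($\hat{N}$) | SE($\hat{N}$) | PRB | CV | RMSE | COV |
|---|---|---|---|---|---|---|---|---|---|
| (a) High | PL GLM | 100 | 92 | 100.63 | 3.77 | 0.63 | 3.75 | 3.82 | 94.5 |
| (a) High | QL GEE | 100 | 92 | 100.66 | 2.90 | 0.66 | 2.88 | 2.97 | 95.8 |
| (a) High | PL GLMM | 100 | 92 | 101.81 | 4.30 | 1.81 | 4.22 | 4.67 | 95.9 |
| (a) High | PL GLM | 500 | 460 | 500.65 | 7.97 | 0.13 | 1.59 | 8.00 | 93.2 |
| (a) High | QL GEE | 500 | 460 | 500.87 | 6. | | | | |
| (a) High | PL GLMM | 500 | 460 | 506.56 | 9.07 | 1.31 | 1.79 | 11.20 | 93.1 |
| (b) Medium | PL GLM | 100 | 84 | 101.56 | 7.16 | 1.56 | 7.05 | 7.33 | 94.3 |
| (b) Medium | QL GEE | 100 | 84 | 101.51 | 4.82 | 1.51 | 4.75 | 5.05 | 95.2 |
| (b) Medium | PL GLMM | 100 | 84 | 106.58 | 9.06 | 6.58 | 8.50 | 11.20 | 89.1 |
| (b) Medium | PL GLM | 500 | 421 | 501.74 | 14.89 | 0.35 | 2.97 | 15.00 | 94.6 |
| (b) Medium | QL GEE | 500 | 421 | 501.92 | 10.31 | 0.38 | 2.05 | 10.50 | 95.2 |
| (b) Medium | PL GLMM | 500 | 421 | 526.33 | 18.90 | 5.27 | 3.59 | 32.40 | 83.3 |
| (c) Low | PL GLM | 100 | 69 | 104.61 | 14.01 | 4.61 | 13.40 | 14.80 | 95.7 |
| (c) Low | QL GEE | 100 | 69 | 103.53 | 7.48 | 3.53 | 7.22 | 8.27 | 94.6 |
| (c) Low | PL GLMM | 100 | 69 | 131.07 | 21.14 | 31.07 | 16.10 | 37.60 | 77.2 |
| (c) Low | PL GLM | 500 | 356 | 504.24 | 26.68 | 0.85 | 5.29 | 27.00 | 95.0 |
| (c) Low | QL GEE | 500 | 356 | 503.86 | 15.45 | 0.77 | 3.07 | 15.90 | 94.5 |
| (c) Low | PL GLMM | 500 | 356 | 576.72 | 37.06 | 15.34 | 6.43 | 85.20 | 77.4 |

Averages of the numbers of captured individuals, $\bar{n}$; the estimates of population size, AVE($\hat{N}$); SE of the estimated population size, SE($\hat{N}$); percentage relative bias, PRB $= 100 \times \{E(\hat{N}) - N\}/N$, where $E(\hat{N})$ is estimated by AVE($\hat{N}$); root mean square error, RMSE $= \sqrt{\mathrm{SE}(\hat{N})^{2} + \{E(\hat{N}) - N\}^{2}}$; percentage coefficient of variation, CV $= 100 \times \mathrm{SE}(\hat{N})/E(\hat{N})$; and confidence interval coverage (%), COV. QL, quasi-likelihood; PL, partial likelihood; GLM, generalized linear models; GEE, generalized estimating equations; GLMM, generalized linear mixed models.
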
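Table 4. Simulation results (1000 repetitions) considering m = 10 trapping occasions.

| Scenario | Method | N | $\bar{n}$ | AVE($\hat{N}$) | SE($\hat{N}$) | PRB | CV | RMSE | COV |
|---|---|---|---|---|---|---|---|---|---|
| (a) High | PL GLM | 100 | 98 | 100.11 | 1.43 | 0.11 | 1.43 | 1.43 | 94.3 |
| (a) High | QL GEE | 100 | 98 | 100.14 | 1.36 | 0.14 | 1.35 | 1.36 | 96.3 |
| (a) High | PL GLMM | 100 | 98 | 100.15 | 1.45 | 0.15 | 1.44 | 1.45 | 94.6 |
| (a) High | PL GLM | 500 | 492 | 500. | | | | | |
| (a) High | QL GEE | 500 | 492 | 500. | | | | | |
| (a) High | PL GLMM | 500 | 492 | 500. | | | | | |
| (b) Medium | PL GLM | 100 | 95 | 100.47 | 3.14 | 0.47 | 3.12 | 3.17 | 95.2 |
| (b) Medium | QL GEE | 100 | 95 | 100.42 | 2.98 | 0.42 | 2.97 | 3.01 | 96.5 |
| (b) Medium | PL GLMM | 100 | 95 | 100.92 | 3.32 | 0.92 | 3.29 | 3.45 | 93.4 |
| (b) Medium | PL GLM | 500 | 473 | 500.76 | 6.71 | 0.15 | 1.34 | 6.75 | 94.6 |
| (b) Medium | QL GEE | 500 | 473 | 500.66 | 6.35 | 0.13 | 1.27 | 6.38 | 96.1 |
| (b) Medium | PL GLMM | 500 | 473 | 502.03 | 7.20 | 0.41 | 1.43 | 7.48 | 94.1 |
| (c) Low | PL GLM | 100 | 86 | 101. | | | | | |
| (c) Low | QL GEE | 100 | 86 | 101.31 | 5.97 | 1.31 | 5.89 | 6.11 | 94.2 |
| (c) Low | PL GLMM | 100 | 86 | 104.71 | 7.35 | 4.71 | 7.02 | 8.73 | 88.6 |
| (c) Low | PL GLM | 500 | 431 | 500.98 | 13.04 | 0.20 | 2.60 | 13.08 | 95.0 |
| (c) Low | QL GEE | 500 | 431 | 500.65 | 12.57 | 0.13 | 2.51 | 12.58 | 95.4 |
| (c) Low | PL GLMM | 500 | 431 | 512.15 | 15.21 | 2.43 | 2.97 | 19.46 | 88.7 |

Averages of the numbers of captured individuals, $\bar{n}$; the estimates of population size, AVE($\hat{N}$); SE of the estimated population size, SE($\hat{N}$); percentage relative bias, PRB $= 100 \times \{E(\hat{N}) - N\}/N$, where $E(\hat{N})$ is estimated by AVE($\hat{N}$); root mean square error, RMSE $= \sqrt{\mathrm{SE}(\hat{N})^{2} + \{E(\hat{N}) - N\}^{2}}$; percentage coefficient of variation, CV $= 100 \times \mathrm{SE}(\hat{N})/E(\hat{N})$; and confidence interval coverage (%), COV. QL, quasi-likelihood; PL, partial likelihood; GLM, generalized linear models; GEE, generalized estimating equations; GLMM, generalized linear mixed models.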


Individual heterogeneity and time dependence are fundamentally important in real-life applications of capture–recapture studies. The main purpose of this study was to compare estimates of population size and their SE using statistical techniques such as quasi-likelihood for GEE and partial likelihood for GLM and GLMM. We also present a GEE approach that permits capture–recapture data analysis using individual covariates and that accounts for heterogeneity in capture probabilities and for correlation among capture occasions. Evaluating the pattern of time dependency is important in several regards: (i) it may help characterize the relationship between the capture probability and covariates and (ii) it is needed to estimate the population parameters accurately in capture–recapture studies. A natural question that arises is “what happens if one ignores the time dependency and uses the traditional regression methodology assuming independence among capture occasions?” From a statistical point of view, there are at least two consequences of ignoring time dependency: incorrect assessment of the regression estimates and inefficient estimation of regression coefficients. Therefore, estimated capture probabilities may be incorrect and, consequently, population size may not be accurately estimated if time dependency is ignored. The quasi-likelihood GEE approach seems to perform better than the GLM and GLMM approaches because the SE of the estimated population size is consistently lower. The estimators perform well when average capture probabilities are high, but it is hard to obtain reliable estimates from the GLMM approach for low capture probabilities. However, other existing methods in capture–recapture studies allowing for heterogeneity have similar problems (Nichols and Pollock 1983; Nichols 1986).
For cases where only a small proportion of individuals are captured, the GEE approach provides better RMSE and is robust to violation of the assumption of independence among capture occasions. This approach also provides means of exploring factors thought to be responsible for differences in capture probability among individuals. Hence, it is important to account for correlation structure among capture occasions when estimating animal population parameters in capture–recapture studies. Future work could focus on expansion of the simulations to assess the performance of estimators based on GEE, GLMM, and Bayesian methods for capture–recapture studies. Extensions of this work to model Mth may also be possible after imposing some parameter constraints. The GEE approach accounts for individual heterogeneity in capture probability as a function of covariates and correlation among capture occasions. It would be interesting if one can modify our proposed approach to additionally account for individual heterogeneity that cannot be explained by covariates. Researchers may also extend this approach for open population models to estimate unknown animal abundance in capture–recapture studies.


This research was funded by EMMA in the framework of the EU Erasmus Mundus Action 2 and Fundação Nacional para a Ciência e Tecnologia (FCT), Portugal under the project PEst-OE/MAT/UI0117/2011. The authors are very grateful to the associate editor, an anonymous referee, and Dr. William Link for their careful reading of the manuscript and several suggestions that considerably improved the presentation.

Conflict of Interest

None declared.