Estimation of capture probabilities using generalized estimating equations and mixed effects approaches

Modeling individual heterogeneity in capture probabilities has been one of the most challenging tasks in capture–recapture studies. Heterogeneity in capture probabilities can be modeled as a function of individual covariates, but correlation structure among capture occasions should be taking into account. A proposed generalized estimating equations (GEE) and generalized linear mixed modeling (GLMM) approaches can be used to estimate capture probabilities and population size for capture–recapture closed population models. An example is used for an illustrative application and for comparison with currently used methodology. A simulation study is also conducted to show the performance of the estimation procedures. Our simulation results show that the proposed quasi-likelihood based on GEE approach provides lower SE than partial likelihood based on either generalized linear models (GLM) or GLMM approaches for estimating population size in a closed capture–recapture experiment. Estimator performance is good if a large proportion of individuals are captured. For cases where only a small proportion of individuals are captured, the estimates become unstable, but the GEE approach outperforms the other methods.


Introduction
Many estimation methods have been developed for the analysis of closed population capture-recapture data. For comprehensive material on the subject see, for instance, Otis et al. (1978), Seber (2002), Williams et al. (2002) and Amstrup et al. (2005). The most general capturerecapture closed population model, considered by Otis et al. (1978) was denoted by M tbh where (h) is used to denote inherent individual heterogeneity, (t) time effect, and (b) behavioral response to capture. In this work, we are interested in estimating the population size and SE of a submodel of the type M h , where individual heterogeneity can be modeled as a function of covariates. Development of capture-recapture models dealing with individual heterogeneity in capture probabilities has been one of the most challenging tasks. Failure to account for such heterogeneity has long been known to cause substantial bias in population estimates (Otis et al. 1978;Lee and Chao 1994;Hwang and Huggins 2005). Moreover, Link (2003) showed that without strong assumptions on the underlying distribution, estimates of population size under model M h are fundamentally nonidentifiable.
The use of covariates (or auxiliary variables), if available, has been proposed as an alternative way to partially cope with the problem of heterogeneous capture probabilities (Pollock et al. 1984;Huggins 1989;Alho 1990). The idea is to model capture probabilities as a function of individual (i.e., age, sex, and weight) and environmental (i.e., temperature, rainfall, and location) covariates, using a generalized linear modeling (GLM) approach, such as logistic regression. The method of Huggins (1989Huggins ( , 1991, based on a conditional likelihood to estimate population size, has become very popular, but it assumes independence among capture occasions . In the analysis of capture-recapture data, Hwang and Huggins (2005) and Zhang (2012) examined the effect of heterogeneity on the estimation of population size by solving estimating equations, but these authors also assumed independence of capture occasions. Capture-recapture data are collected on the same individuals across successive capture occasions. One may view capture-recapture data as binary longitudinal or repeated measurements data (Huggins and Yip 2001). These repeated observations are often correlated over time. This dependency or correlation structure may be induced by incorporating individual heterogeneity. Failure to account for this dependency may provide biased estimates. Hwang and Huggins (2007) also state that the assumption of independence among capture occasions is often violated in practice, but the authors still rely on the assumption. Some dependencies among capture occasions can be dealt with through the modeling of behaviorally effects, such as trap happy and trap shy effects, which are treated as special cases in the capture-recapture literature (Yang and Chao 2005;Pradel and Sanz-Aguilar 2012). One alternative approach is to use a generalized estimating equations (GEE) to account for a working correlation structure among capture occasions (Liang and Zeger 1986) and use observed individual characteristics to model heterogeneity in capture probabilities. A mixed effects modeling approach may also be used to model heterogeneity of individual observed and unobserved characteristics in capture-recapture experiments motivating the use of generalized linear mixed models (GLMM) (Pinheiro and Bates 2000). Some authors have previously introduced the use of GLMM (logit models with normal random effects) (e.g., Coull and Agresti 1999;Stoklosa et al. 2011). An advantage of using GLMM for the estimation of capture probabilities is to accommodate not only the heterogeneity attributed to individual characteristics, but also the heterogeneity that cannot be explained by the observed individual characteristics.
Bayesian methods have also become popular in capturerecapture studies. An extensive Bayesian literature on capture-recapture closed population models includes Castledine (1981), Smith (1991), George and Robert (1992), Madigan and York (1997), Basu and Ebrahimi (2001), Ghosh and Norris (2005), King and Brooks (2008), and Ghosh (2009, 2011). Bayesian statistical modeling requires the development of the likelihood function of the observed data, given a set of parameters, as well as the joint prior distribution of all model parameters. Bayesian methods allow for estimation of the unobserved random effects as well, but the performance of their estimates often depends on the chosen prior distributions. Often, the method of selecting prior distributions is subjective (Lee et al. 2003). A possible advantage of GEE over randomeffects models and Bayesian methods relates to the ability of GEE to allow specific correlation structures to be assumed between capture occasions.
Here, we propose a GEE approach for estimating capture probabilities and population size in capture-recapture closed population studies. We also compare the results of population size estimates and their SE, when using the two estimation methodologies (i.e., GEE and GLMM). For illustrative purposes, we analyze a real data set that has already been discussed in the literature. Conditional arguments are used to obtain a Horvitz-Thompson-like estimator for estimating population size. A simulation study is also conducted to compare the performance of the estimation procedures. In the next section, we describe the notation and models that are used to estimate capture probabilities and population size.

Notation and Models
Consider a population consisting of N animals in a capture-recapture experiment over m capture occasions, j = 1,2,. . .,m. Let Y ij be a binary outcome, equaling 1 if the ith animal is being caught on the jth capture occasion and 0 otherwise.
be a random vector with the capture history of individual i. Let T i ¼ P m j¼1 Y ij be the number of times the ith animal has been caught in the course of the trapping closed population study. Let t i be the time the ith individual is first captured. Heterogeneity in captured probabilities is often explained by observed individual covariate x i , such as age, sex, weight. For simplicity, we consider x i a single covariate, but the model can be easily generalized for x i to be considered a vector of covariates. Let the probability that the ith animal is captured on any trapping occasion j, be where is the vector of parameters associated with the covariates, and h(u) = (1+exp (Àu)) À1 is the logistic function. This is an M h model where variation in capture probabilities among individuals is explained by the covariate x i . The probability of not capturing the ith individual on the jth occasion is (1Àp i (b)), and the variance of Y ij is p i (b)(1Àp i (b)) (Liang and Zeger 1986). Then, T i $ Bin(m,p i (b)) and of distinct individuals captured at least in one occasion be indexed by i = 1,2,. . .,n and uncaptured individuals would be indexed by i = n + 1,. . .,N without loss of generality. To estimate the population size, once an estimate of b is obtained (b), the Horvitz-Thompson estimator N ¼ P n i¼1 1=ðp i ðbÞÞ may be used as in Huggins (1989).
Generalized estimating equations approach is known as the working correlation structure among Y i1 ,Y i2 ,. . .,Y im to describe the average dependency of individuals being captured from occasion to occasion. A GEE approach permits several types of working correlation structure R i (a) (for details, see Diggle et al. 1994). For the description that follows, and for simplicity, we consider an independence working correlation structure, R i (a) = I where I is an identity matrix. The covariate x i is never known for the individuals that have not been captured. Therefore, Y ij is conditional on the captured individuals (n) (i.e., T i ≥ 1) with the corresponding observed individual covariates similar to Huggins (1989) and Zhang (2012). The probability that the ith individual is captured on the jth occasion (p ij ) given that the ith individual is observed at least once is, , an estimator of b can be obtained by solving the following generalized estimating equations: If covariate x i (i = 1,2,. . .,n) is available for captured individuals, then the model becomes This model is not equivalent to any of those discussed in Otis et al. (1978), rather this model is a restricted version of their model M h (Huggins 1991) For a givenb, thenp i ðbÞ ¼ 1 À ð1 À p i ðbÞÞ m and an estimate of the variance ofN is given by DðbÞ 0 CðbÞ À1 DðbÞ where CðbÞ represents an estimate of the conditional information matrix for b and DðbÞ is the vector P n i¼1p i ðbÞ À2 @p i ðbÞ=@b. If the individual capture probability does not depend on time, previous capture history, or any covariate, then the model (1) Otis et al. (1978) (see Huggins 1991Hwang and Huggins 2005). This model assumes all the individuals have equal capture probabilities. Then, the estimating equations for b 0 is simplified to Letb 0 be the resulting estimator of b 0 then

Methods based on a partial likelihood
The full likelihood of all model parameters is proportional to As the number of total individuals, N, is unknown and the covariates are not known for individuals that are never captured, this likelihood cannot be directly evaluated. The conditional likelihood (Huggins 1989) is the first product component, and it can be formulated as a GLM  for the positive Binomial distribution (Patil 1962). It may be rewritten as When the full likelihood is partitioned into a product of conditional densities, then a partial likelihood (Cox 1975) may arise considering some of the product terms, but it involves only the parameters of interest, isolating the nuisance parameters. Therefore, the partial likelihood, PL(b), is the first product of the equation (6), which is the likelihood of the number of recaptures after the first capture (Stoklosa et al. 2011). For a given t i , (T i À 1)| t i $ Bin(mÀt i ,p i (b)), which is used to estimate the parameters b.
To utilize a simple GLMM with a random effect, we suppose that p i (b) = h(X i b + r b z i ) where z i is a realization of the standard normal random variable Z i $ N ð0; 1Þ, with r b >0. The use of random effects reflects the belief that there is heterogeneity that cannot be explained by covariates. The partial likelihood can be considered as the joint distribution of the response and the random effects. To estimate b and r b , the marginal likelihood of the response is obtained by integrating out the random effects. The integration can be approximated by penalized quasi-likelihood (Breslow and Clayton 1993), which enables parameter estimation via an iterative procedure.
The variance ofN for a smoothing parameter k may be estimated according to Stoklosa et al. (2011) using the following formula, d , and all quantities are evaluated atb. The smoothing parameter k, which is part of the quasi-likelihood procedure, controls the degree of roughness of the estimated functions. To obtain an optimal value for k, we used generalized cross-validation (GCV) technique (Wood 2006).

Application
We applied the techniques discussed in the previous Section to a data set of least chipmunks (Eutamias minimus) made available by V. Reid (1975). The data set has been previously analyzed and discussed by Otis et al. (1978) and Wang et al. (2007). V. Reid laid out a 9 9 11 livetrapping grid with traps spaced 50 feet (15.2 m) apart. The study was conducted in an area dominated by sagebrush and snowberry in Colorado, USA. The numbers of animals caught for six occasions (n 1 to n 6 ) were 7, 15, 16, 24, 19, 7, and ∑n k = 88. Of these 88 captures, n = 45 distinct animals were captured, and the covariate sex (male or female) was collected for each captured individual; there were 22 males and 23 females. The recorded capture frequencies (f 1 to f 6 ) were 21, 12, 7, 3, 2, 0. The average capture frequencies for male and female were 1.86 and 2.04, respectively. Our estimation results are summarized in Table 1. The inclusion of the covariate sex does not improve our estimates of population size which are very similar, except when the random effect is considered in the GLMM, which is based on partial likelihood estimation. This may indicated that there is unmodeled individual heterogeneity in capture probabilities that is not being accounted for with the other models (GLM and GEE). The population estimate, in this case, is approximately 74 individuals with a SE of 12. Both values are quite high when compared to the values obtained with the other estimation strategies. Although, GLMM accounts for heterogeneity due to unobserved individual characteristics, it may also be overestimating population size at the expenses of greater loss in precision, possibly due to the increase in the number of model parameters that are estimated. In contrast, quasi-likelihood GEE methodology provided lower SE, when compared to results from the Bayesian approach of Wang et al. (2007) for the same data set. The latter authors estimated population size of 50 with a SE of 3.14. The GEE estimation results also agree with Otis et al. (1978), but our model jointly takes into account heterogeneity in capture probabilities and correlation among capture occasions.

Simulation Study
A simulation study was conducted in order to evaluate the performance of the estimators. The effect of heterogeneity among observed individuals was modeled using two covariates, sex (male = 1 and female = 0), and weight. Two levels of population sizes N = 100 and 500 and two levels of capture occasions m = 6 and 10 were considered. For each individual, we assigned sex with probability 0.5 from a Bernoulli distribution and weight from a normal distribution with mean 15 and variance 4. These values are based on the previous data analysis. Individual capture probabilities were modeled with a logistic regression, so that where b 0 is the constant term, b 1 and b 2 represent the sex and weight effects, respectively. A positive b 1 implies that the sex taking value 1 is more catchable, and a positive b 2 means that the catchability increases with weight. We considered three different simulation scenarios for capture probabilities: (a) high capture probabilities (b 0 = À3.5); (b) medium capture probabilities (b 0 = À4.0); (c) low capture probabilities (b 0 = À4.5); and their averaged are presented in Table 2. In addition, a Gaussian random effect with mean 0 and r b = 0.1 was included as an unobserved covariate to ensure the existence of heterogeneity due to unobserved individual characteristics. For each simulation scenario, GLM, GEE, and GLMM approaches were used for data analyses and to assess estimators performances. The simulation study was carried out with 1000 Monte Carlo replicates.
To evaluate estimators' performance, we present the SE, the relative bias (PRB), the root mean square error (RMSE), the coefficient of variation (CV), and confidence interval coverage (%) (COV) for the estimates of population size. The simulation results for six capture occasions are given in Table 3. We noticed that all estimation procedures for scenario (a) perform well. There was little bias, low SE, low coefficient of variation forN. In this scenario, confidence interval coverage for all estimators is very good (93-96%), considering a nominal level of 95%. As in our example, the exception is the GLMM that tends to overestimate population size. Overestimation is particularly severe when capture probabilities are low, see for instance, results of scenarios (b) and (c). Confidence interval coverage for GLMM is also poor (77-90%) in these scenarios. For all scenarios, the GEE approach performs well when estimating population size. This approach also consistently provides lower SE and lower RMSE when compared to GLM and GLMM estimators, although the differences are mini- Table 2. Simulated capture probability scenarios for the capture probability model, logit(p i ) = b 0 +b 1 9 sex + b 2 9 weight.
p represents average capture probability when weight = 15 and p i represents the average probability of an individual being captured at least once during the study.

Simulation scenarios
Effects of covariates p mal for GEE-GLM comparisons. Therefore, our simulation results indicate that the general performance of estimators obtained from GEE is better than GLM and GLMM. The GEE approach may overcome the effect of random effects due to its ability accounting for the correlation structure among capture occasions. The simulation results for 10 capture occasions are presented in Table 4. The performance of estimators for 10 capture occasions is better than for six capture occasions yielding lower CV, absolute value of PRB, RMSE, but higher COV. This is generally true because the average capture probability is higher for 10 capture occasions than for six capture occasions. We also conducted simulations for two other levels of N (50 and 200) when m = 6 and 10. These results are similar to the ones presented here.

Discussion
Individual heterogeneity and time dependence are fundamentally important in real-life applications of capturerecapture studies. The main purpose of this study was to compare estimates of population size and their SE using statistical techniques such as, quasi-likelihood for GEE and partial likelihood for GLM and GLMM. We also present a GEE approach that permits capture-recapture data analysis using individual covariates that accounts for heterogeneity in capture probabilities and for correlation among capture occasions. Evaluating the pattern of time dependency is important in several regards: (i) it may help characterize the relationship between the capture probability and covariates and (ii) it is also important to estimate the population parameters accurately in the capture-recapture studies. A natural question that arise is "what happens if one ignores the time dependency and uses the traditional regression methodology assuming independence among capture occasions?" From a statistical point of view, there are at least two consequences of ignoring time dependency: incorrect assessment of the regression estimates and inefficient estimation of regression coefficients. Therefore, estimated capture probabilities may be incorrect and consequently population size may not be accurately estimated if time dependency is ignored. The quasi-likelihood GEE approach seems to perform better than GLM and GLMM approaches because the SE of the estimated population size are consistently lower. The estimators perform well when average capture probabilities are high, but it is hard to obtain reliable estimates of GLMM approach for low capture probabilities. However, other existing methods in capture-recapture studies allowing for heterogeneity have similar problems (Nichols and Pollock 1983;Nichols  Averages of the numbers of captured individuals, ( n); the estimates of population size, AVE(N); SE of the estimated population size, SE(N); percentage relative bias, PRB ¼ 100 Â ðEðNÞ À NÞ Ä N, where EðNÞ is estimated by AVE ðNÞ; root mean square error, RMSE ¼ ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi VarðNÞ þ Bias 2 q ; percentage coefficient of variation, CV ¼ 100 Â SEðNÞ Ä EðNÞ and confidence interval coverage (%), COV. QL, quasi-likelihood; PL, partial likelihood; GLM, generalized linear models; GEE, generalized estimating equations; GLMM, generalized linear models.
1986). For cases where only a small proportion of individuals are captured, the GEE approach provides better RMSE and is robust to violation of the assumption of independence among capture occasions. This approach also provides means of exploring factors thought to be responsible for differences in capture probability among individuals. Hence, it is important to account for correlation structure among capture occasions when estimating animal population parameters in capture-recapture studies. Future work could focus on expansion of the simulations to assess the performance of estimators based on GEE, GLMM, and Bayesian methods for capture-recapture studies. Extensions of this work to model M th may also be possible after imposing some parameter constraints. The GEE approach accounts for individual heterogeneity in capture probability as a function of covariates and correlation among capture occasions. It would be interesting if one can modify our proposed approach to additionally account for individual heterogeneity that cannot be explained by covariates. Researchers may also extend this approach for open population models to estimate unknown animal abundance in capture-recapture studies.