Multistate capture–recapture analysis under imperfect state observation: an application to disease models

*Correspondence author. E-mail: paul.conn@noaa.gov

Summary

  1. Multistate capture–recapture models are frequently used to estimate the survival and state transition parameters needed to parameterize stage-structured population models, tools that are important for conservation and management. Typically, such models assume that all encountered individuals can be assigned to a particular state without error or ambiguity, a requirement which is difficult to meet in practice. Model extensions to relax this assumption would increase the richness of ecological data sets available for estimating life-history and stage-transition parameters with multistate models.
  2. One relatively common analytical approach when confronted with ambiguity in state determination is to censor all encounters where the state of an animal cannot be ascertained. Here, we present an alternative approach, which uses a hidden Markov (or multievent) modelling framework that can incorporate data from encounters of unknown state. Using simulation, we show that our approach leads to estimators of state-specific survival and transition probabilities that are more precise, and sometimes considerably so, than methods based on censoring.
  3. We demonstrate our approach using field data from a study of the dynamics of conjunctivitis in the house finch Carpodacus mexicanus Müller. A fundamental challenge in modelling disease dynamics involves the estimation of the rates of entry and exit from one or more disease states, which can be complicated when disease state is uncertain. We show that incorporating data from unknown states made substantial improvements to parameter precision.
  4. Synthesis and applications. Missing or incomplete records are an unfortunate but common feature of many ecological field studies, often diminishing the quality and quantity of data. Our approach of treating state as a hidden Markov process allows such records to be used, increasing the precision of survival and state transition parameters in multistate mark–recapture studies. Our approach is more general than other approaches in the literature, and does not require specialized sampling designs or ancillary information to inform state assignment. We suggest that ecologists consider using this modelling approach instead of censoring records whenever state information is missing.

Introduction

Ecologists frequently use age- or stage-structured models to project populations into the future and to investigate the consequences of management actions. When the detection probability of animals is imperfect, multistate capture–mark–reencounter (MSMR) studies (Arnason 1972; Hestbeck, Nichols & Malecki 1991; Brownie et al. 1993; Schwarz, Schweigert & Arnason 1993) provide a natural framework for estimating the state-transition and apparent survival parameters needed to parameterize stage-structured population models (Nichols et al. 1992).

One of the most basic data requirements for MSMR models is that the true state of each sampled individual is known. If only a small percentage of observations do not conform to this requirement, they may be censored without causing bias or diminishing precision to any great degree. On the other hand, when state is difficult to determine, the number of records that must be deleted may be quite high, resulting in imprecise parameter estimates. For instance, age or size class, sex, breeding status, or disease state may be difficult to determine in the field, particularly when individuals are observed at a distance.

Imperfect state assignment can arise from at least two sources that are not mutually exclusive (Fig. 1). In some cases, all encountered individuals are assigned a state, but the state assignment process is subject to error. This is generally referred to as ‘misclassification error’. When the state of a captured animal is misclassified, transition probabilities and all other parameters may be biased; for example, differences in survival between states could be underestimated. Like other biases, those due to misclassification could propagate to projections of population change from matrix population models (Caswell 2001). Lebreton & Pradel (2002) and Kendall (2008) have outlined the problem of misclassified states. In addition to the potential for biased parameter estimates, they pointed out that, without additional information, parameter redundancy problems would arise, such that some key parameters would be inestimable. Fujiwara & Caswell (2002) modelled misclassification and adjusted for it by incorporating fixed misclassification probabilities derived outside the capture–mark–recapture modelling process. Runge, Hines & Nichols (2007) developed an approach for estimating species-specific parameters when species may be misclassified and a subsample of marked animals is used to estimate classification probabilities. Similarly, Royle & Link (2005, 2006) used multinomial mixtures to account for misclassification in generalized site occupancy models; in this case, knowledge of pertinent biology helped them identify which mode of a multimodal solution best represented truth without the need for auxiliary data on (mis)classification rates.

Figure 1.

A contrast of misclassification and partial observation in a two-state system. With pure misclassification (A), all individuals are assigned a state, but observations are subject to misclassification. In a partially observed system (B), states are determined definitively for some fraction of observations, while those that cannot be determined are recorded as unknown (‘Obs U’).

Alternatively, individual state may be only ‘partially observable’ for a fraction of the individuals encountered at a given sampling period. In such cases, the individual is encountered, but state is not observed (Fig. 1). We distinguish ‘partial observability’ from truly ‘unobservable states’, a term that strictly applies only to individuals that are not encountered, either because they are not available for encounter (encounter probability equals 0 for individuals in a given state) or because of imperfect detection of available individuals (encounter probability < 1 for individuals in a given state). Nichols et al. (2004) developed an approach for estimating sex-specific survival when sex is not always observed. However, their approach relied in part on strong determinism of an unobservable state that is fixed over the lifetime of the individual (e.g. sex).

In some cases, the line between misclassification and partial observation is blurred even further. For instance, Kendall, Hines & Nichols (2003) and Kendall et al. (2004) developed models based on the robust design which permitted partial observability of a dynamic state (breeding status). Their approach relied on extended periods where state was assumed to be static, and on certain state assignments being unambiguous. Under these conditions, they were able to treat the partial observation process under a misclassification framework where only one state could be misclassified.

Here, we propose a new multistate capture–recapture model capable of handling partially observable states such that a hidden or partially observable Markov process determines state dynamics. In particular, we model both the detection process and the process of obtaining data on state conditional on being detected. Our approach differs from previously proposed models for unknown state (e.g. sex class; cf. Nichols et al. 2004) in that (i) the state of individuals is allowed to change over time, and (ii) it does not impose a directional misclassification error by assigning all encountered individuals to a state (e.g. breeding state; cf. Kendall et al. 2004). In addition, the approach we describe may be more general, since it might apply to cases where strong determinism of at least one state is not practical, given the limits of the available data (contra Kendall et al. 2004). The hidden Markov model (hereafter, HMM) that we propose falls under the general class of ‘Multievent’ models described by Pradel (2005). The multievent framework is quite general; a number of multistate mark–recapture models may be fit under the general ‘umbrella’ of multievent models (Pradel 2005). However, some realizations of multievent models may result in multimodal solutions (Pradel 2008; Pradel et al. 2008), with the vast majority of possible models remaining unexplored. Thus, it is important to demonstrate that a particular multievent realization actually provides a reasonable solution to the problem at hand.

After introducing the model, we conduct a small numerical experiment to quantify expected gains in relative efficiency for several plausible biological scenarios. That is, what increases in precision can be expected when our method is applied rather than simply deleting all observations for which state is unknown? We focus on precision in this case because both approaches are unbiased, at least for large sample sizes. Next, we illustrate application of our method by analysing disease data from a house finch Carpodacus mexicanus Müller population, where true disease state for some encountered individuals is completely ambiguous. In this study, state transitions are governed by disease dynamics, and interest focuses on estimating survival, infection, and recovery probabilities associated with presence or absence of conjunctivitis induced by Mycoplasma gallisepticum (MG). Finally, we discuss implications of our work for studies of stage-structured populations and disease ecology in particular.

Model structure

We assume that the investigator marks animals at discrete encounter occasions (these may be natural markings so that an animal need not be physically captured, provided that the markings do not change over time). At each occasion, the investigator also searches for animals that were previously marked. Each time an animal is encountered, the investigator may or may not be able to determine its state (state being disease, stage class, etc.). If state can be determined, it is determined without error; otherwise, the animal is recorded as belonging to an unknown state (denoted subsequently by a ‘U’). Other than partial state observability, we make all the assumptions common to multistate mark–recapture analyses (cf. Williams, Nichols & Conroy 2002).
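The sampling process described above can be sketched as a small simulation. This is a minimal illustration and not the authors' code: it assumes two live states ‘A’ and ‘B’, time-constant parameters, and the parameter names of Fig. 2 and Table 1 (pi, S, psi, p, delta). The function name and all numerical values are hypothetical.

```python
import random


def simulate_history(n_occasions, pi, S, psi, p, delta, rng):
    """Simulate one encounter history for an animal first marked at occasion 1.

    Assumptions (illustrative only): two live states, time-constant parameters,
    and state, when observed, is recorded without error.
    """
    states = list(pi)
    # Initial state of a newly marked animal.
    state = rng.choices(states, weights=[pi[s] for s in states])[0]
    # The animal is encountered at marking; only its state may go unrecorded.
    history = [state if rng.random() < delta[state] else "U"]
    alive = True
    for _ in range(n_occasions - 1):
        if alive and rng.random() < S[state]:
            # Survive the interval, then move according to the transition row.
            state = rng.choices(states, weights=[psi[state][t] for t in states])[0]
        else:
            alive = False
        if alive and rng.random() < p[state]:
            # Encountered: state is recorded with probability delta, else 'U'.
            history.append(state if rng.random() < delta[state] else "U")
        else:
            history.append("0")
    return "".join(history)


# Hypothetical parameter values, chosen only for illustration.
pi = {"A": 0.6, "B": 0.4}
S = {"A": 0.9, "B": 0.7}
psi = {"A": {"A": 0.8, "B": 0.2}, "B": {"A": 0.3, "B": 0.7}}
p = {"A": 0.5, "B": 0.6}
delta = {"A": 0.9, "B": 0.4}

rng = random.Random(2024)
sample = [simulate_history(3, pi, S, psi, p, delta, rng) for _ in range(5)]
print(sample)
```

Each simulated string uses the encounter codes ‘A’, ‘B’, ‘U’ and ‘0’; censoring would correspond to recoding every ‘U’ after the first occasion as ‘0’ before analysis.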

We develop model structure in a manner analogous to Nichols et al. (2004), but allow states to be dynamic (such that an individual's true state may change over time). For simplicity of presentation, a multinomial tree diagram of population and sampling processes is presented instead of a complete likelihood (Fig. 2), where parameters and statistics of the model are defined in Table 1. To illustrate calculation of the probabilities of a few encounter histories, consider the case of a 3-year capture–recapture experiment with three underlying states, ‘A’, ‘B’, and ‘†’ († denoting dead). The following encounter types may occur at a given sampling occasion: ‘A’ (observed in state A); ‘B’ (observed in state B); ‘U’ (observed, state unknown); ‘0’ (not encountered). For example, the encounter history ‘UBB’ denotes an individual that was encountered on the first occasion in an unknown state, and was encountered on the second and third occasions in state B. There are two possibilities: the animal could have been in state B throughout the course of the experiment, or could have been in state A at time 1, with transition from state A to state B occurring between the first and second sampling occasions. Thus, conditioning on being encountered in an unknown state at time 1, the probability of said history is:

Figure 2.

A multinomial tree diagram describing the probability structure for multistate observations when state is not always observed. Solid boxes marked A, B, and † indicate possible states (alive in state A; alive in state B; dead), while dashed boxes represent possible observations following initial release. Here, possible observations include A (encountered in state A), B (encountered in state B), U (encountered in an unknown state), and 0 (not encountered). The probability for observing a particular encounter history is obtained by summing the probability of all possible paths leading to a given encounter history. The probability of a given path can be obtained by multiplying the probabilities appearing alongside its component arrows. These probabilities consist of functions of π, the initial state probabilities; S, apparent survival probabilities; ψ, state transition probabilities; p, detection probabilities; and δ, the parameters describing the partial observation process. More rigorous definitions are provided in Table 1.

Table 1.  Definitions of parameters used in the multistate mark–recapture model incorporating unknown states
Parameter    Definition
$\pi_i^{s}$    Probability that an animal originally encountered at time i is in state s
$S_i^{s}$    Probability that an individual in state s at time i survives to time i + 1 and does not permanently emigrate from the study area
$\psi_i^{ab}$    Probability that an individual in state a at time i will be in state b at time i + 1 given that it survives to i + 1
$p_i^{s}$    Probability that an individual in state s at time i is encountered at time i
$\delta_i^{s}$    Probability that the state of an animal is observed given that it is in state s at time i and encountered at time i
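[Equation not recovered from the source: probability of encounter history ‘UBB’, conditional on an encounter of unknown state at time 1.]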
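For orientation, the ‘UBB’ probability can be sketched from the path-product rule of Fig. 2 and the definitions in Table 1. This is a reconstruction under stated assumptions, not the authors' printed equation. It assumes the superscript and subscript convention $\pi_i^{s}$, $S_i^{s}$, $\psi_i^{ab}$, $p_i^{s}$, $\delta_i^{s}$, and that an unknown-state record for an animal in state s has probability $1-\delta_i^{s}$. The two terms in brackets correspond to the true-state paths A→B→B and B→B→B.

```latex
% Joint probability of the history, given first encounter at time 1
\Pr(\mathrm{UBB} \mid \text{first encountered at time 1})
  = \left[ \pi_1^{A}\,(1-\delta_1^{A})\,S_1^{A}\,\psi_1^{AB}
         + \pi_1^{B}\,(1-\delta_1^{B})\,S_1^{B}\,\psi_1^{BB} \right]
    p_2^{B}\,\delta_2^{B}\,S_2^{B}\,\psi_2^{BB}\,p_3^{B}\,\delta_3^{B}

% Conditioning additionally on the unknown-state record at time 1
\Pr(\mathrm{UBB} \mid \text{`U' at time 1})
  = \frac{\Pr(\mathrm{UBB} \mid \text{first encountered at time 1})}
         {\pi_1^{A}\,(1-\delta_1^{A}) + \pi_1^{B}\,(1-\delta_1^{B})}
```

The two forms differ only by a normalizing constant that depends on $\pi$ and $\delta$ at time 1.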

As another example, consider the encounter history ‘A0U.’ Here, the true state vector could have been AAA, ABA, AAB, or ABB. In this case, the probability of said history has four components:

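[Equation not recovered from the source: probability of encounter history ‘A0U’ as the sum of four path probabilities.]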
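A hedged reconstruction of the four components, again built only from the path-product rule of Fig. 2 and the Table 1 definitions (the notation $\pi_i^{s}$, $S_i^{s}$, $\psi_i^{ab}$, $p_i^{s}$, $\delta_i^{s}$ is assumed, and the expression conditions on release in known state A at time 1), is:

```latex
\Pr(\mathrm{A0U} \mid \text{released in A at time 1}) =
    S_1^{A}\psi_1^{AA}\,(1-p_2^{A})\,S_2^{A}\psi_2^{AA}\,p_3^{A}(1-\delta_3^{A})   % path AAA
  + S_1^{A}\psi_1^{AB}\,(1-p_2^{B})\,S_2^{B}\psi_2^{BA}\,p_3^{A}(1-\delta_3^{A})   % path ABA
  + S_1^{A}\psi_1^{AA}\,(1-p_2^{A})\,S_2^{A}\psi_2^{AB}\,p_3^{B}(1-\delta_3^{B})   % path AAB
  + S_1^{A}\psi_1^{AB}\,(1-p_2^{B})\,S_2^{B}\psi_2^{BB}\,p_3^{B}(1-\delta_3^{B})   % path ABB
```

Each term multiplies survival and transition over both intervals, non-detection at time 2, and detection without state determination at time 3.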

Due to the large number of possible sample paths when animals are not encountered, the need for an algorithm to calculate encounter history probabilities should be apparent; a more rigorous and general development of the likelihood using matrix notation is presented in Supporting Material, Appendix S1. Estimation may then proceed via maximum likelihood as implemented in program E-SURGE (Choquet 2007; Choquet, Rouan & Pradel 2008).
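One standard algorithm for this calculation is the forward recursion for hidden Markov models, which sums over all state paths without listing them. The sketch below is illustrative and is not the matrix formulation of Appendix S1 or the E-SURGE implementation. It assumes two live states, time-constant parameters, conditioning on first encounter at occasion 1, and hypothetical parameter values. A brute-force path enumeration is included only to check the recursion.

```python
from itertools import product

STATES = ("A", "B")  # live states; death ("D") is tracked separately


def obs_prob(obs, s, p, delta, first=False):
    """Pr(observation | alive in state s); detection is certain at first encounter."""
    detect = 1.0 if first else p[s]
    if obs == "0":
        return 1.0 - detect
    if obs == "U":
        return detect * (1.0 - delta[s])
    return detect * delta[s] if obs == s else 0.0


def history_prob(history, pi, S, psi, p, delta):
    """Forward recursion: probability of an encounter history given first encounter."""
    alpha = {s: pi[s] * obs_prob(history[0], s, p, delta, first=True) for s in STATES}
    dead = 0.0  # probability mass of paths ending in the absorbing dead state
    for obs in history[1:]:
        newly_dead = sum(alpha[a] * (1.0 - S[a]) for a in STATES)
        # A dead animal can only generate the observation '0'.
        dead = dead + newly_dead if obs == "0" else 0.0
        alpha = {
            b: sum(alpha[a] * S[a] * psi[a][b] for a in STATES)
            * obs_prob(obs, b, p, delta)
            for b in STATES
        }
    return sum(alpha.values()) + dead


def brute_force_prob(history, pi, S, psi, p, delta):
    """Same probability by explicit enumeration of every state path (check only)."""
    total = 0.0
    for s0 in STATES:
        for rest in product(STATES + ("D",), repeat=len(history) - 1):
            path = (s0,) + rest
            prob = pi[s0] * obs_prob(history[0], s0, p, delta, first=True)
            for t in range(1, len(path)):
                prev, cur, obs = path[t - 1], path[t], history[t]
                if prev == "D":
                    step = 1.0 if cur == "D" else 0.0
                elif cur == "D":
                    step = 1.0 - S[prev]
                else:
                    step = S[prev] * psi[prev][cur]
                if cur == "D":
                    emit = 1.0 if obs == "0" else 0.0
                else:
                    emit = obs_prob(obs, cur, p, delta)
                prob *= step * emit
            total += prob
    return total


# Hypothetical parameter values, for illustration only.
pi = {"A": 0.6, "B": 0.4}
S = {"A": 0.9, "B": 0.7}
psi = {"A": {"A": 0.8, "B": 0.2}, "B": {"A": 0.3, "B": 0.7}}
p = {"A": 0.5, "B": 0.6}
delta = {"A": 0.9, "B": 0.4}

for h in ("UBB", "A0U"):
    print(h, history_prob(h, pi, S, psi, p, delta), brute_force_prob(h, pi, S, psi, p, delta))
```

The recursion scales linearly with the number of occasions, whereas the number of paths in the enumeration grows exponentially. A log-likelihood for a data set would sum the logarithm of `history_prob` over individuals.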

Anticipated efficiency gains

To illustrate anticipated gains in parameter precision by incorporating individuals of unknown state into a multistate capture–recapture model, we generated expected value data for a number of hypothetical sampling scenarios. Our approach was to fit a number of models to each data set, which varied both by the level of state dependency assumed in model parameters, and by the type of analysis. We then compared the precision of model parameters for analyses incorporating the unknown state to traditional multistate models in which observations of unknown state were censored (Supporting Material, Table S1).

The major factors influencing parameter efficiency (expressed as the ratio of relative parameter precision; Supporting Material, Table S1) appeared to be estimation model complexity and δ, the underlying probability of observing the state of an animal given that it is encountered. For survival, relative efficiency of the traditional multistate model (sensu Faustino et al. 2004; Senar & Conroy 2004) in relation to the model proposed in this study was 0·2–0·5 for the case where δ = 0·5, and 0·6–0·75 for the case where δ = 0·8. For the most part, efficiency increased as estimation model complexity increased. Relative efficiency was higher for state transition probabilities, but still exhibited the same trends with regard to δ. However, the relative efficiency of state transition estimators increased as estimation model complexity decreased (Supporting Material, Table S1). For both parameter types, precision was always better when data from animals of unknown state were included in the analysis, a result that is likely to be quite general (see e.g. Barker & Kavalieris 2001).

To illustrate application of our model to a real-life problem, we consider the case of disease dynamics among house finches encountered near Ithaca, New York.

Example application

We were initially motivated to pursue the issue of state uncertainty by involvement with a study by Faustino et al. (2004), who used multistate capture–recapture analysis to explore disease dynamics of Mycoplasma gallisepticum (hereafter, MG) conjunctivitis in house finches. The MG pathogen causes moderate to severe eye swelling, sometimes so severe that birds are virtually blind (see Dhondt et al. 2005 for a general review). Researchers were interested in whether presence or absence of clinical signs of the pathogen MG influenced survival, and also wanted to quantify infection and recovery rates. A mark–recapture and mark–resighting programme was instituted for a period of 3 years, in which time birds were captured via mist nets and marked with individually identifiable colour bands (see Faustino et al. 2004, for further details). Faustino et al. (2004) employed multistate mark–recapture models where the state of a bird at a specific time period corresponded to presence (state I) or absence (state N) of the bacterium.

However, their analysis was complicated by the fact that field biologists were not always able to determine the state of encountered birds. Field biologists could almost always observe clinical signs (visible conjunctivitis) of recaptured birds, but ascertaining the disease status of resighted birds was more difficult since determining the presence of the pathogen was only possible when a bird's eyes were clearly visible. They used two approaches to deal with this logistical difficulty. In the first, they treated unknown state birds as members of a separate state (state U). In the second, they censored all ‘unknown state’ observations. The first approach is deficient because it results in biased estimates of infection and recovery rates (although estimates of survival and encounter probability will typically remain unbiased). Essentially, adding a dummy state to the model means that transitions to/from that state have to be taken away from ‘true’ transitions. The situation is further complicated whenever $\delta^{A} \neq \delta^{B}$, that is, when the probability of determining state differs between states. The second approach is deficient in the sense that one is throwing away data, which almost invariably leads to decreased precision of parameter estimates (Faustino et al. 2004).

Some clarification may be needed to differentiate the first of these approaches from the HMM developed earlier. In the first of these approaches, birds in an unknown state are given their own survival and transition probabilities. In effect, these birds are treated as a member of a separate group of birds with different dynamics from infected and non-infected birds. In contrast, the HMM we advocate in this study only incorporates process dynamics for two states: infected and non-infected birds. Birds of unknown state are assumed to be a member of one of the two groups; we are just not sure which one they belong to. Focus is instead on describing how these encounters arise according to an underlying probabilistic framework.

In order to examine how much precision one loses in this study by discarding data on unknown state animals, we now analyse data from two different time periods at the Ithaca, New York, study site. The first consists of marking and encounter records from October to December 2000; the second is from September to December 2002. We selected these periods because in the first case, there are quite a few encounters of individuals of unknown state (382 not diseased, 210 unknown, 10 infected). However, the number of encounters of infected birds is quite low. In contrast, the 2002 data do not have many encounters of birds of unknown state but the prevalence of infection appears to be much higher (1374 not diseased, 178 diseased, 67 unknown).

For each data set, we conducted two separate analyses. In the first case, we used the HMM to account for encounters with individuals of ‘unknown’ state, while in the second, we censored these observations and utilized a traditional multistate mark–recapture modelling framework. Program E-SURGE (Choquet 2007; Choquet et al. 2008) was used to conduct all analyses using the former approach (Supporting Material, Appendix S2), while program M-SURGE (Choquet et al. 2004) was used in the latter case. Both programs use the same numerical procedure to calculate the standard error of parameter estimates.

We grouped males and females for simplicity and used the results of Faustino et al. (2004) to help guide selection of appropriate models for each data set. In particular, we ran models where survival, capture, and state transition probabilities were dependent upon disease state. We then compared the relative parsimony of multistate models that included additive and interactive models of disease and time on all parameters, as well as models without time effects, using Akaike Information Criterion corrected for small sample size (AICc; Burnham & Anderson 2002). Following this model selection exercise, we used the structure of the highest ranked traditional multistate model as a starting point for HMM modelling, setting π and δ parameters to be constant over time. However, we considered cases where δ was or was not dependent upon disease status, using parameter estimates and standard errors from the highest-ranked AICc model to compare precision of the two methods of analysis.
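The AICc comparison can be illustrated with the standard small-sample formula, AICc = −2 ln L + 2K + 2K(K + 1)/(n − K − 1). The sketch below is not the E-SURGE or M-SURGE output: the model names, log-likelihoods, parameter counts and the sample size n are hypothetical. How n is defined for capture–recapture data is also left as an assumption.

```python
def aicc(log_lik, k, n):
    """Small-sample corrected AIC for maximized log-likelihood log_lik, k parameters, sample size n."""
    if n - k - 1 <= 0:
        raise ValueError("AICc is undefined unless n > k + 1")
    return -2.0 * log_lik + 2.0 * k + (2.0 * k * (k + 1)) / (n - k - 1)


def rank_models(models, n):
    """Return (name, AICc, delta-AICc) tuples sorted from best to worst."""
    scored = [(name, aicc(ll, k, n)) for name, (ll, k) in models.items()]
    best = min(score for _, score in scored)
    return sorted(((name, score, score - best) for name, score in scored), key=lambda r: r[1])


# Hypothetical candidate set: name -> (maximized log-likelihood, number of parameters).
candidates = {
    "S(state) p(state+time) psi(state)": (-512.4, 14),
    "S(state*time) p(state*time) psi(state)": (-505.9, 30),
    "S(.) p(time) psi(state)": (-520.1, 12),
}

for name, score, diff in rank_models(candidates, n=600):
    print(f"{name}: AICc = {score:.2f}, delta = {diff:.2f}")
```

The model with the smallest AICc is ranked highest; differences from that minimum are what a model selection table would typically report.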

For the 2000 data set, the highest-ranked AICc model for the censored data set included an additive submodel for the effects of disease and time on capture probability; survival and state transition probabilities depended only on disease status. Under the traditional multistate model, estimated weekly apparent survival for non-infected (state = N) and infected (state = I) birds was 0·910 (SE 0·058) and 1·000 (SE N/A), respectively, with infection probability $\hat{\psi}^{NI}$ = [value not recovered] (SE 0·111) and recovery probability $\hat{\psi}^{IN}$ = [value not recovered] (SE 0·095). Survival for infected birds was estimated on the boundary and thus its standard error could not be computed properly; however, a 95% profile likelihood confidence interval was calculated as (0·56, 1·0). When data from unknown state encounters were also modelled via the HMM, precision improved dramatically. In this case, estimated weekly apparent survival for non-infected birds was 0·843 (SE 0·019). Survival for infected birds was again estimated as 1·000 (SE N/A); a 95% profile likelihood interval was estimated as (0·946, 1·000). Transition probabilities were estimated as $\hat{\psi}^{NI}$ = 0·232 (SE 0·0314) and $\hat{\psi}^{IN}$ = 0·466 (SE 0·052). The model selection criterion (AICc) favoured the model in which δ was allowed to vary as a function of disease state. Parameter estimates for δ indicated that non-infected birds were positively identified ($\hat{\delta}^{N}$ = 1·000, SE N/A), while most infected encounters were classified as ‘unknown’ ($\hat{\delta}^{I}$ = 0·045, SE 0·014).
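The size of the precision gain for the 2000 data can be read off the standard errors reported above. The short calculation below uses only those reported values. The dictionary keys are hypothetical labels, the assignment of the two transition standard errors to infection and recovery follows the order in which the text reports them, and summarizing precision as a ratio of standard errors is an assumption that may differ from the efficiency measure used in Table S1.

```python
# Standard errors reported in the text for the 2000 data set.
se_censored = {"S_N": 0.058, "psi_NI": 0.111, "psi_IN": 0.095}  # traditional multistate model
se_hmm = {"S_N": 0.019, "psi_NI": 0.0314, "psi_IN": 0.052}      # HMM using unknown-state encounters

# Ratio < 1 means the HMM standard error is smaller than the censored-data one.
se_ratio = {name: se_hmm[name] / se_censored[name] for name in se_censored}

# Widths of the reported 95% profile likelihood intervals for infected-bird survival.
width_censored = 1.0 - 0.56
width_hmm = 1.000 - 0.946

for name, ratio in se_ratio.items():
    print(f"{name}: SE ratio (HMM / censored) = {ratio:.3f}")
print(f"S_I interval width: censored = {width_censored:.3f}, HMM = {width_hmm:.3f}")
```

On these figures the HMM standard errors are roughly one-third to one-half of those from the censored analysis.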

For the 2002 data set, model selection favoured a model with submodels for survival and capture probability that were additive with respect to disease status and time, but in which state transition probabilities were time invariant. In this case, improvements to precision bordered on negligible (Supporting Material, Table S2). The probability of positively identifying a non-infected bird was estimated to be 0·954 (SE 0·006), while the probability of positively identifying an infected bird was estimated to be 0·998 (SE 0·012).

Precision contrasts for this study underscored results from numerical efficiency experiments. When unknown state encounters made up a notable proportion of total observations, precision increased substantially when the HMM was used to account for unknown state encounters. This was evident from the 2000 data set, where 210 of 602 encounters (about 35%) were of unknown state and precision of survival and transition rates was much improved relative to the traditional multistate approach employed by Faustino et al. (2004). However, estimator precision was similar for the two approaches in the 2002 data set, where unknown state encounters accounted for only 67 of a total of 1619 encounters (about 4%).

Discussion

Multistate mark–reencounter models are important tools for investigating and parameterizing dynamical models of animal and plant populations, and are in wide use in applied ecology. While initial interest in MSMR models focused on model structure and parameter identifiability, there has been increasing consideration of these models as an omnibus framework for estimating a large number of key parameters (Nichols & Kendall 1995; Lebreton, Almeras & Pradel 1999), including the transitions to and from states which may not be completely observable (Pradel 2008). In this study, we have outlined a method for incorporating unknown states into such analyses when state is partially observed. Multistate models are typically data-hungry, and we have shown that a substantial decrease in precision may result if encounters of individuals whose state cannot be determined are censored prior to analysis. This is particularly the case when the number of unknown state encounters makes up a large percentage of the observations. Censoring data when state cannot be ascertained is also a viable solution leading to unbiased estimators of parameters of interest. The issue then is whether to sacrifice some precision in favour of using a simpler model with fewer assumptions (e.g. by censoring data), or whether to employ a more complicated model to be able to utilize all available data (but perhaps at the expense of introducing bias if assumptions are not met). In data-poor situations, we believe the latter will often be preferable, although consequences of assumption violations certainly deserve more investigation.

In this study, we have concentrated on the case where animals are sampled instantaneously at regular intervals. While we believe our formulation to be less restrictive, we note that more information can be obtained when two or more samples are obtained consecutively and the system is assumed to be static between consecutive observations. This approach to sampling is usually termed the ‘robust design,’ and enhances the investigator's ability to make inferences about partial observability and misclassification (e.g. Kendall et al. 2003, 2004; Kendall 2008). In this manner, if the observations ‘A’ and ‘U’ are obtained on the same animal close enough together to preclude a state transition from ‘A’ to ‘B’, we are able to infer that the ‘U’ observation really corresponded to state A, and are thus able to infer more about the partial observation process. Similarly, if we observe the states ‘A’ and ‘B’ in a similar duration, this gives us information about misclassification, a problem we have not considered in this paper. In our numerical study, we found that parameters can be identified when the correct state can be identified for at least some fraction of the encounters (except for the pathological case where transition probabilities all equal 0·5; Supporting Material, Table S1). A further consideration when extending the HMM to include misclassification in addition to partial observation is whether parameters can be identified, in this case without auxiliary information on classification rates. For static states (e.g. sex for most organisms), Pradel et al. (2008) showed that introducing misclassification can result in multimodal solutions but suggested that one can often choose which one is more plausible biologically (see also, Royle & Link 2006). There is clearly a need for further exploration of parameter identification issues in multievent models, but this will probably be a function of the ecological problem under investigation. 
Although the Catchpole–Morgan–Freeman method (e.g. Catchpole, Morgan & Freeman 1998) may be useful for diagnosing parameter redundancy, it cannot detect when there are a finite number of multimodal solutions (Pradel et al. 2008).

Previous mark–recapture studies of avian disease have provided strong evidence that detectability of infected and non-infected individuals varies over time, space, and disease status (Senar & Conroy 2004; Jennelle et al. 2007), indicating that the common approach of fitting dynamical disease models to time series of counts may lead to erroneous inferences. Use of MSMR appears a robust solution to the problem of accounting for imperfect state-specific detection probabilities in the estimation of key disease parameters (sensu Faustino et al. 2004; Senar & Conroy 2004), parameters which are often the focus of field experiments (e.g. Caley & Ramsey 2001; Caley & Hone 2004). However, it is important to recognize that there are several key assumptions which might reduce the utility of the MSMR approach in some cases. Many of these assumptions are quite general, and are not unique to the study of disease dynamics. First, MSMR requires a fixed number of discrete states; in the context of our finch disease example, individuals were classified as infected or not infected. While this dichotomy makes the problem tractable, it is important to recognize that the underlying state space for disease dynamics, and probably many other biological processes, is likely to be continuous (or nearly so) in most situations, with susceptibility and mortality likely varying by phenotype, disease history, and severity of infection. In this respect, classifying animals based on presence or absence of a pathogen may be overly simplistic. Further, we have assumed that the state of an individual is determined accurately if information on state is obtained at all. In some situations, we expect that some level of misclassification may exist. In this case, sensitivity and specificity of the state assignment procedure would also need to be estimated and incorporated into model structure (sensu Fujiwara & Caswell 2002; Runge et al. 2007). 
This type of misclassification is increasingly considered in clinical studies of disease. In addition, as in most multistate mark–recapture applications, state transitions were assumed to occur immediately prior to sampling periods in order to avoid having to model competing risks associated with survival and state transition. One could contemplate a more realistic continuous time formulation, perhaps with constant (or piecewise constant) hazard rates where times to events are exponentially distributed. Other developments would also be needed to model the rate of infection when it depends on the number of infected individuals, and when sojourn times depend on how long an individual has already spent in a given state. Although nontrivial, we are cautiously optimistic that stochastic compartment models fit using Bayesian approaches will be useful in this regard (cf. Gibson & Renshaw 1998; O’Neill & Roberts 1999; Höhle, Jørgensen & O’Neill 2005).
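As a rough illustration of what a constant-hazard formulation implies, consider a continuous-time Markov chain on the states (N, I, †). This is a sketch under assumptions and not a model fitted or proposed in detail here: $h^{N}$ and $h^{I}$ denote hypothetical state-specific mortality hazards, $\lambda$ an infection hazard, $\gamma$ a recovery hazard, and $\Delta$ the interval between sampling occasions.

```latex
% Generator matrix for states ordered (N, I, dead); rows sum to zero
Q = \begin{pmatrix}
      -(\lambda + h^{N}) & \lambda            & h^{N} \\
      \gamma             & -(\gamma + h^{I})  & h^{I} \\
      0                  & 0                  & 0
    \end{pmatrix},
\qquad
P(\Delta) = \exp(Q\,\Delta)

% Special case with no state transitions (\lambda = \gamma = 0):
S^{s} = \exp(-h^{s}\,\Delta), \qquad s \in \{N, I\}
```

Under this formulation mortality and state change act as competing risks throughout the interval, so the entries of $P(\Delta)$ correspond only approximately to the products $S^{a}\psi^{ab}$ of the discrete-time model.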

Stage-structured population models are important tools for managing and conserving natural populations. Such models are frequently used to explore population viability, to examine the importance of individual life stages to perturbations, and to project population dynamics into the future when examining different management strategies. Such models are also important in investigating the drivers behind population dynamics, as in the case of the house finch conjunctivitis disease system. Here, we have shown how to incorporate data from unknown states when estimating the parameters of these models with MSMR data. In particular, applied ecologists can increase the quantity of data available for estimating state transition parameters by incorporating hidden Markov models into the estimation process, thus obtaining more precise inferences and predictions about stage-structured populations.

Acknowledgements

We thank C. Jennelle for compiling house finch encounter histories and B. Kendall, R. Pradel, A. Royle and two anonymous reviewers for useful comments on previous versions of this manuscript. The authors acknowledge funding from National Science Foundation grant EF-0622705 under the NSF–NIH Ecology of Infectious Diseases Program.
