Disease‐structured N‐mixture models: A practical guide to model disease dynamics using count data

Abstract Obtaining inferences on disease dynamics (e.g., host population size, pathogen prevalence, transmission rate, host survival probability) typically requires marking and tracking individuals over time. While multistate mark–recapture models can produce high‐quality inference, these techniques are difficult to employ at large spatial and long temporal scales or in small remnant host populations decimated by virulent pathogens, where low recapture rates may preclude the use of mark–recapture techniques. Recently developed N‐mixture models offer a statistical framework for estimating wildlife disease dynamics from count data. N‐mixture models are a type of state‐space model in which observation error is attributed to failing to detect some individuals when they are present (i.e., false negatives). The analysis approach uses repeated surveys of sites over a period of population closure to estimate detection probability. We review the challenges of modeling disease dynamics and describe how N‐mixture models can be used to estimate common metrics, including pathogen prevalence, transmission, and recovery rates while accounting for imperfect host and pathogen detection. We also offer a perspective on future research directions at the intersection of quantitative and disease ecology, including the estimation of false positives in pathogen presence, spatially explicit disease‐structured N‐mixture models, and the integration of other data types with count data to inform disease dynamics. Managers rely on accurate and precise estimates of disease dynamics to develop strategies to mitigate pathogen impacts on host populations. At a time when pathogens pose one of the greatest threats to biodiversity, statistical methods that lead to robust inferences on host populations are critically needed for rapid, rather than incremental, assessments of the impacts of emerging infectious diseases.

t sampling occasions in one of s discrete disease states. Individual model parameters can be modeled as linear functions of one or more covariates of interest using the appropriate link functions and design matrices. Construction of a multistate mark-recapture model analogous to a classic compartment disease model (e.g., SIR model) proceeds naturally, where transitions between disease states are determined by the joint probability of surviving and moving between disease states (see Fig. 2 main text).
The classic multistate mark-recapture model accounts for state-speci c differences in host detection probability and assumes that states are assigned without error, ignoring imperfect pathogen detection probability and partial observations (i.e., animal seen alive but state unknown). To address the latter issue, Conn & Cooch 2009 developed a multistate markrecapture model using a hidden Markov modeling framework to incorporate data from encounters of unknown disease state. This approach allows ecologists to include records where state information is missing, instead of censoring data.
One of the biggest differences between marked and unmarked data models is the scale of data collection and inference. For example, marked models take data on individuals, and inference is made on the individual; whereas the data collection and inference scales depend on the type of unmarked data model. For instance, occupancy models take data and make inference on the 'site' level (e.g., habitat patch); but the generalized N-mixture model takes data on at the site level and make inference for individuals within a site.

S2.2: Multistate mark-recapture model
Multistate mark-recapture models have long been used in disease ecology to describe disease dynamics. Here, we present a multistate Jolly-Seber model to estimate host survival, entry, and disease state transition probabilities for a host population across several seasons while accounting for imperfect host detection. We used the multistate formulation of the Jolly-Seber model, which includes both state and observation processes (Kéry & Schaub 2012). The state process describes the true disease state of an individual where state s is described as not entered, uninfected, infected, and dead, and assumes that an individual moves between these 4 disease states over a nite number of sampling occasions t. For any given individual, the successive disease state is described by a discrete rst-order hidden Markov model, where the probability of an individual transitioning to or remaining in a disease state from time t-1 to t only depends on the current state at time t-1. To estimate true abundance, we use a data augmentation method. We augment the observed data set y with a large number of all-zero capture-histories, resulting in a larger data set of xed dimension Q, where Q is much greater than N, the true population size. By augmenting the observed data set y, we account for individuals never observed during surveys but that were likely present.

S2.3: Parameters of interest
List of state variables, observed data, and parameters of interest along with their de nitions.

Quantity
Type We denote the true but, unknown, disease state of individual i at sampling occasion t as \ (z_{i,t}\), where \(z_{i,t} = 1\ldots4\) and represents the states "not entered", "uninfected", "infected", or "dead", respectively. To estimate monthly survival \((\phi)\), recruitment \ ((\gamma)\), and transition \((\beta)\) rates, we used the transition matrix \(\psi\), where the cells represent the probabilities of moving from a row state to a column state from time t-1 to time t for host i.
The "not entered" category consists of individuals that are not part of the population yet, where the parameters \(\gamma_{2}\) and \(\gamma_{3}\) are the state-speci c entry probabilities for uninfected and infected hosts from t-1 to t, i.e., the probability that a host in state s enters the population at time t.
The \(\gamma\) parameters are the probabilities that an available individual in M (the data augmentation) enters the population at occasion t. Importantly, \(\gamma\) refers to available individuals, that is, to those in M that have not yet entered. The entry process is thus a removal process; over time, fewer and fewer individuals will be in the state "not yet entered" and thus available to enter the population. Thus, \(\gamma\) is the removal entry probability. Depending on the type of Jolly-Seber model parameterization used (i.e., Restricted Dynamic Occupancy Model, Superpopulation parameterization, Multistate parameterization), there are different ways to calculate the "entry probability" and the number of individuals entering each season.
The parameter \(\phi_{2}\) and \(\phi_{3}\) are the state-speci c survival probability for uninfected and infected hosts from t-1 to t, while the parameters \(\beta_{UI}\) and \ (\beta_{IU}\) are the infection and recovery probabilities, respectively. In other words, if host i survives from t-1 to t, it can become infected if they were uninfected at t-1, or recover from infection if there were infected at time t-1, with probabilities \(\beta_{UI}\) and \(\beta_{IU}\), respectively. Each parameter must be between 0 and 1, and each row of the matrix must sum to 1. For survival and transition this is easy. However, for the removal entry probability, \ (\gamma\), this is less straightforward. One solution is to choose a Dirichlet prior for \ ([\gamma_{1}, \gamma_{2}, \gamma_{3}]\) to ensure that these parameters sum to 1.
Because there are more than two possible true and observed states, we use the categorical distribution to model the transition from one state to another for host i each sampling occassion, t: \[z_{i,t} \mid z_{i,t-1} \sim \textrm{categorical}(\psi_{z_{i,t-1},1:4}),\] where \(\psi\) is the transition matrix de ned above. Note that the argument of the categorical distribution is a vector of length 4; that is, it is the row of the matrix \(\psi\) corresponding to the state s of host i in time step t-1.
To estimate host recapture probabilities, we use a second transition matrix \(\pi\), which determines the probabilities of the possible observation outcomes (columns) for the true state of each captured individual (rows). We do not assume partial observations or state misclassi cations and the observed states are "seen uninfected", "seen infected", and "not seen".
We use the categorical to model the observed state \(y_{i,t}\) as a function of the true state: \[y_{i,t} \mid z_{i,t} \sim \textrm{categorical}(\pi_{z_{i,t},1:3}),\] where \(\pi\) is the observation matrix we just de ned. Similarly as before, the argument of the categorical distribution is a vector of length 3; that is, it is the row of the matrix \(\pi\) corresponding to the true state of individual i in time step t.

S3: Model assumptions
Abstract-In this section, we aim to list the assumptions that need to be met for a variety of different mark-recapture, site-occupancy, and N-mixture models. Violations of these assumptions can lead to parameter bias, misleading inference and conclusions.
S3.1: Unmarked data assumptions Site-occupancy model assumptions (MacKenzie et al. 2002) and Multistate occupancy model assumptions ; also see Bailey et al. 2014): 1. The occupancy status of a site does not change over the course of sampling (i.e., closed population) 2. The probability of occupancy is the same across all sites or any heterogeneity is modeled using covariates 3. The probability of detection is the same across all sites or any heterogeneity is modeled using covariates 4. Detection histories at each site are independent 5. Sites are a random sample of the population Dynamics site-occupancy model assumptions (Royle & Kéry 2007) 1. The occupancy status of a site does not change over the course of sampling (i.e., closed population) 2. The occupancy status of a site may change between seasons, i.e., populations are "open" to extinction and colonization 3. The probability of occupancy, extinction, and colonization is the same across all sites or any heterogeneity is modeled using covariates 4. The probability of detection is the same across all sites or any heterogeneity is modeled using covariates 5. Detection histories at each site are independent 6. Sites are a random sample of the population N-mixture model assumptions (Royle,  Generalized N-mixture model assumptions (Dail & Madsen  If host i is infected (\(z_{i}\) = 1), we model the pathogen detection probability \(p_{1,i,j}\) as a function of true infection intensity of the host i (\(N_{i}\) modeled above): to account for variability in the detection process dependent on host infection intensity.
Finally, to account for the measurement error caused by diagnostic testing in pathogen infection intensities, we model the pathogen infection intensity \(x_{i,j,k}\) during detection run k (e.g., qPCR) from sample j (e.g., swab or tissue sample) collected from host i is lognormally distributed: \[x_{i,j,k} \sim \textrm{lognormal}(\textrm{log}(m_{i,j}+0.001), \sigma^2_{3}),\] where \(m_{i,j}\) is the average infection intensity for sample j from host i and \(\sigma^2_{3}\) represents the logged measurement error from the diagnostic test. transformation to the probability scale will correspond to high probabilities around the extremes, i.e., 0 and 1. We use hyperpriors to calculate standard deviation when necessary using a gamma distribution (shape = 0.001, scale = 0.001). Although we have speci ed priors with little information, it is important to evaluate the effects of prior choice by varying priors within speci c systems. We computed three chains for each random variable with diffuse initial values. We assess convergence by visually inspecting traceplots and using the Rhat diagnostics of Gelman (Brooks & Gelman 1998).

S4.3: Simulating data
Our rst step in simulating data is to de ne the survey conditions and parameter estimates. If you are trying to determine how many samples or diagnostic tests your study requires, make your best guess at parameter values and examine how the precision and accuracy of recovered parameter estimates varies with the number of samples collected and analyzed. Lastly, we will build in imperfect detection of the diagnostic method and measurement error of infection intensity caused by diagnostic test.

S6.1: Table
List of state variables, observed data, and parameters of interest along with their de nitions.

S7: Steps for designing a study and modeling a system
Abstract-This section serves as a practical step-by-step guide for designing a project using unmarked data models outlined in this appendix. We outline the following six steps for any study: (1) De ne the study objective, study design, and key model terms, (2) Determine sampling error to consider, (3) Construct a conceptual model addressesing the ecological & sampling processes, (4) Write out the likelihood for the conceptual model, (5) Fit the model to simulated data to isolate mistakes and con rm that model parameters are identi able, and (6) Collect/compiling the data to analyze and make inference. Following these steps will enable researchers to use unmarked data models for improving inference in the study of disease dynamics. We intend this section to be used simultaneously with the rest of the appendix outlining code, simulating data, and generating models.
Step 1: De ne the study objective, study design, and key model terms Upon de ning the study objective(s), de ne key model terms such as: parameters to estimate (see Table 1 in main text), type of data to collect (i.e., marked or unmarked), sample units (i.e., site de nition), the time period over which observations are made and whether populations are assumed closed (i.e., no birth, death, immigration, or emigration) or open (i.e., processes of birth, death, immigration, or emigration), units for replicate surveys (i.e., temporal surveys, spatial surveys, a group of swab samples, multiple qPCR runs, or eDNA samples), and the criterion for 'detection' (i.e., evidence of species presence, such as feces or tracks, vs. direct species sighting). Study objectives, logistic or nancial constraints, and other study-speci c limitations will determine the de nition of these design and model parameters. As you work through these de nitions and the sampling design, consider the nested structure of your biological system of interest and what role that plays in data collection and analysis (see Fig. 1 in main text). Some of these same principles apply to working with datasets already collected Step 2: Determine sampling error to consider Hosts are regularly missed during survey events because of variations in observer experience, animal behavior, environmental conditions, and random noise (see Fig. 1  Step 3: Construct a conceptual model addressing the ecological & sampling processes When constructing a conceptual model of the ecological process for a single season model, consider covariates that in uence parameter values, such as site-speci c variables that affect host occupancy or abundance. For dynamic or open population models, specify the structure of the host population and the ecological processes, such as birth, death, emigration, immigration, and disease state transitions. By accommodating disease structure in a population, parameters such as recovery and infection probabilities are estimable. These graphical representations (see Fig. 2 in main text) are created using study-system knowledge of the processes affecting host populations and provide the conceptual basis for building mathematical models.
Consider how the observed host states (e.g., seen uninfected, seen infected, not seen) map onto the true host states (e.g., uninfected, infected, dead) when conceptualizing the sampling process. A transition matrix, where rows represent true states and columns refer to observed states, is typically used in multistate mark-recapture models to organize this information. A second question to consider is: "are all states completely observed?" For example, consider the case where an individual is encountered alive, but disease state is not determined (Conn & Cooch 2009;Zipkin et al. 2014b). Instead of censoring the data, the observation model can accommodate those data.
Step 4: Use the conceptual model to write the likelihood Translating the conceptual model to likelihood statements is done for each data type (if multiple are used) or for each process (i.e., ecological vs. observations). The likelihood functions describe the probability of an observed outcome (i.e, the data) conditional on paritcular parameter values. For example, count data can be analyzed using a linear model, which spe cies a gaussian likelihood to estimate abundance and covariate estimates. Typically, the likelihood statements are made using either discrete (e.g., Poisson, negative binomial, etc.) or continuous (e.g., gaussian, gamma, beta, etc.) statistical distributions.
We recommend the following helpful references for learning to translate conceptual models to Step 5: Fit the model to simulated data to isolate mistakes and con rm parameters are identi able After building the likelihood model, we recommend simulating data to con rm that the parameter estimates used to generate the data can be recovered. This ensures that the model was constructed correctly and helps gauge whether the proposed number of sites and replicate surveys are suf cient to adequately estimate parameters of interest as well as providing a power analysis to determine if the strength of covariate signals (e.g., Kéry & Schaub 2012;Kéry & Royle 2016). Adding covariates to the model may increase the amount of data necessary to estimate parameters (Table 1 of main text). Adjust the sampling design as necessary, by either increasing the number of sites sampled or replicate surveys per site. At this phase, it is also important to ensure that the model addresses the study objective and that suf cient data are collected to discriminate among hypotheses.
Step 6: Collect/compiling the data to analyze and make inference Data collection may be challenging because of uncontrollable circumstances, such as low host recapture rates, preventing the analysis of data as planned. Under this circumstance, it is powerful to have a well-planned study that is amenable to other types of data analyses, such as using unmarked data models.
By using unmarked data models, individual or species detection probabilities have been accomodated, making inference robust and reliable. For example, if the ecological and detection process likelihoods were not teased apart, then covariates that would be signi cant in the detection process may be signi cant in the ecological model, leading to erroneous conclusions (Kéry & Schaub 2012).
To quantify parameter effects, we calculate the differences between parameters of interest at each MCMC iteration following Ruiz-Gutiérrez et al. (2010). We compute the proportion of iterations where one parameter is greater than the other. We considered effects with small posterior medians or with a large degree of parameter overlap to be either unimportant to the process being modeled, or to have been estimated too imprecisely to draw conclusive inference.

S8: Common interfaces and programs to run models
Abstract-It can be dif cult to nd the balance between model exiblity, run time, and the learning curve to modeling programs. We provide a list of popular programs used to analyze mark-recapture and unmarked (e.g., detection/non-detection or count) data.