## 1. Introduction

### 1.1. Motivating problem

In natural populations, animals are likely to be infected by a variety of pathogens, either simultaneously or successively. Interactions between these pathogens, which can be synergistic or antagonistic, can affect infection biology (e.g. the intensity of one or both infections) or host susceptibility to infection, or may affect the host's morbidity and/or mortality. However, the biological processes that are involved are often too complex to allow clear-cut predictions regarding the outcome of such interactions. To explore potential interactions, a longitudinal study was undertaken by recording the sequences of infection events for different parasites in four spatially distinct populations of field voles (*Microtus agrestis*). The data are records of six pathogens: three species of *Bartonella* bacteria (*B. taylorii*, *B. grahamii* and *B. doshiae*), cowpox virus, the bacterium *Anaplasma phagocytophilum* and the protozoan *Babesia microti*. Aside from their intrinsic interest as a community of pathogens, *Bartonella*, *Anaplasma*, *Babesia* and cowpox virus infections may also be zoonotic: capable of being transmitted from animals to humans and causing disease.

As in most capture–mark–recapture studies, a different set of voles was caught at each session, leading to incomplete profiles for all subjects. The data set therefore contains many missing observations; for example, a profile for a given vole and a given disease from the first to last observation times for that vole might be NPxxPNxP, where x, N and P respectively indicate a missing observation, a negative response and a positive response. Inference on incomplete data in longitudinal and capture–recapture studies is a major problem; for examples see Daniels and Hogan (2008) and Pradel (2005). Previous analyses of our and related data sets (see Telfer *et al*. (2010) and Begon *et al*. (2009)) have examined all pairs of observations for a given vole that occurred exactly 1 lunar month apart and for which the first of the two observations was an N. The influence of each covariate on the probability of contracting a disease is then ascertained through logistic regression. In this paper we offer a more realistic model and a more powerful analysis methodology for investigating the effects of previous infections for each disease on the other diseases. We use a hidden Markov model (HMM) for each disease (Section 'Hidden Markov models and notation') and perform inference via a Gibbs sampler; this allows us to use all of the data set and to infer covariate effects on a given disease, even when these covariates are the (potentially missing or hidden) states of the other five diseases.
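To illustrate how much of a profile the pairwise approach can discard, the following minimal sketch extracts the usable pairs from the example profile above. It assumes, purely for illustration, that each character of the profile corresponds to one lunar month; the function name `usable_pairs` is hypothetical and is not taken from the earlier analyses.

```python
def usable_pairs(profile):
    """Return (start_index, pair) for adjacent observations whose first is 'N'.

    Assumption: `profile` holds one character per lunar month,
    'N' (negative), 'P' (positive) or 'x' (missing).
    """
    pairs = []
    for t in range(len(profile) - 1):
        first, second = profile[t], profile[t + 1]
        # Keep the pair only if it starts with N and the next month is observed.
        if first == "N" and second in ("N", "P"):
            pairs.append((t, first + second))
    return pairs


profile = "NPxxPNxP"
pairs = usable_pairs(profile)
print(pairs)
```

Under this assumption only the first two observations of NPxxPNxP form a usable pair; the remaining three observed values contribute nothing to the pairwise analysis.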

### 1.2. Data

We analyse data collected between March 2005 and March 2007 from field voles in Kielder Forest, which is a man-made forest on the English–Scottish border. The voles were trapped at four grassy clear-cut sites within the forest, with each site at least 3.5 km from the nearest neighbouring site. Individuals were trapped within a 0.3 ha live trapping grid comprising 100 traps set at 5 m intervals, with trapping taking place every 28 days from March to November, and every 56 days from November to March. Begon *et al*. (2009) have provided further details of the study area and the trapping design.

Captured voles were marked with a unique identifying passive transponder tag so that they could be recognized in later captures. At each capture, a 20–30-*μ*l blood sample was taken for pathogen diagnostic tests. Polymerase chain reaction assays were used to test directly for evidence of infection with *Anaplasma phagocytophilum*, *Babesia microti* and the three *Bartonella* species (see Courtney *et al*. (2004), Bown *et al*. (2008) and Telfer *et al*. (2008)). Antibodies to cowpox virus were detected by immunofluorescence assay (see Chantrey *et al*. (1999)). A brief description of the observed and derived variables is given in Table 1.

Table 1. Description of the observed and derived variables

Variable | Description |
---|---|
Tag | Unique number that identifies each vole |
Site | Identifier for the capture site (four-level factor) |
Sex | Male or female |
Lm | Capture time point in whole lunar months (1–27; integer) |
Weight | Weight in grams rounded to the nearest 0.5 g |
Sin | sin(2π Lm/13) |
Cos | cos(2π Lm/13) |
Tay | *B. taylorii*, N (negative) or P (positive) |
Grah | *B. grahamii*, N (negative) or P (positive) |
Dosh | *B. doshiae*, N (negative) or P (positive) |
Cow | Cowpox, N (negative) or P (positive) |
Ana | *Anaplasma*, N (negative) or P (positive) |
Bab | *Babesia*, N (negative) or P (positive) |
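The derived variables Sin and Cos encode the time of year with a period of 13 lunar months (13 × 28 = 364 days). As a minimal sketch of how they could be computed from Lm, with the function name `seasonal_covariates` being our own illustrative choice:

```python
import math


def seasonal_covariates(lm, period=13):
    """Return (Sin, Cos) for a capture time `lm` in whole lunar months."""
    angle = 2 * math.pi * lm / period
    return math.sin(angle), math.cos(angle)


# Captures 13 lunar months apart fall at the same point of the seasonal cycle.
print(seasonal_covariates(1))
print(seasonal_covariates(14))
```

Using the pair (Sin, Cos) rather than Lm itself allows a regression to represent a smooth annual cycle with arbitrary phase by using two coefficients.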

After some processing (which is described in detail in Xifara (2012)) our data set contains 4344 captures of 1841 voles. Only voles that have been caught at least twice are directly informative about transition probabilities (see Section 'Hidden Markov models and notation'), although voles that have been captured only once still contribute to inference for the initial distribution of each hidden Markov chain (see Section 3.1.1).

The data set contains a substantial fraction of missing data: almost half of the voles are not captured at every lunar month between the first and last times that they were observed. Thus, even for many of the voles that were observed at least twice, not all of the covariates are available, either because the vole was not caught in a given lunar month, or sometimes because the vole was caught but a given variable was not ascertained. Table 2 shows the frequency of missing values derived from the first cause. The number of additional missing values, where it was not possible to ascertain the status of a particular disease, despite the vole being captured, is given in Table 3, which also shows the frequency of positive (P) and negative (N) records for each disease.

Table 2. Frequencies, by lunar months from first to last capture, for the following numbers of values missing (column headings)

Lunar months from first to last capture | 0 | 1 | 2 | 3 | ≥ 4 |
---|---|---|---|---|---|
0 | 832 | — | — | — | — |
1 | 275 | — | — | — | — |
2 | 132 | 74 | — | — | — |
3 | 75 | 55 | 15 | — | — |
4 | 30 | 49 | 33 | 9 | — |
5 | 21 | 24 | 34 | 5 | 3 |
6 | 7 | 7 | 33 | 15 | 2 |
7 | 1 | 4 | 25 | 16 | 9 |
>7 | 0 | 2 | 27 | 12 | 15 |

Table 3. Number of additional missing values and of negative (N) and positive (P) records for each disease

Disease | Number of additional missing values | Number of N | Number of P |
---|---|---|---|
*B. doshiae* | 46 | 3583 | 715 |
*B. grahamii* | 44 | 3468 | 832 |
*B. taylorii* | 32 | 3139 | 1173 |
*Babesia* | 0 | 2354 | 1990 |
Cowpox | 85 | 1408 | 2851 |
*Anaplasma* | 6 | 4107 | 231 |

### 1.3. Statistical challenges

We aim to investigate potential interactions between the six pathogens of the study. In particular, for each disease *d*, we wish to evaluate the way in which the presence or absence of each of the other diseases (and perhaps further information such as whether or not any infection is in its first month) affects the probability that a vole contracts *d*. Additionally, where applicable, we are interested in how other diseases affect the probability of recovery from *d*.

We could model each disease as a two-state discrete time Markov chain, where state 1 corresponds to no disease and state 2 to presence of disease; however, this two-state model imposes a very specific structure. For example, the length of any infection is geometrically distributed; however, it might be that the probability of remaining infected when a disease is in its first month (an acute phase) is different from that in subsequent months (a chronic phase). It has also been found (e.g. Telfer *et al*. (2010)) that acute and chronic phases of a disease can have different effects on the probability that a vole contracts another disease. A two-state semi-Markov model (see, for example, Guédon (2003)) could account for the first effect, at the expense of extra complexity, but not the second. To represent adequately both the dynamics and the influence of each disease with minimal extra complexity, the dynamics of all except one of the diseases are therefore modelled in this analysis as a Markov chain with more than two states. Section 'Hidden Markov models and notation' details the model for each disease.
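The restriction that is imposed by the two-state chain can be seen by comparing infection length distributions. The sketch below is illustrative only: it assumes a hypothetical three-state chain in which the acute phase lasts exactly one month and is followed either by recovery or by a chronic phase, which need not match the models that are detailed later. The function names and probabilities are our own.

```python
def duration_pmf_two_state(stay, max_months):
    """P(infection lasts k months), k = 1..max_months: geometric."""
    return [(1 - stay) * stay ** (k - 1) for k in range(1, max_months + 1)]


def duration_pmf_acute_chronic(stay_acute, stay_chronic, max_months):
    """Same quantity when the first (acute) month has its own stay probability.

    Assumption: acute -> chronic with probability stay_acute, and
    chronic -> chronic with probability stay_chronic.
    """
    pmf = [1 - stay_acute]  # recovery straight after the acute month
    for k in range(2, max_months + 1):
        pmf.append(stay_acute * stay_chronic ** (k - 2) * (1 - stay_chronic))
    return pmf


# Hypothetical values: infections that survive the acute month tend to persist.
print(duration_pmf_two_state(0.6, 5))
print(duration_pmf_acute_chronic(0.3, 0.9, 5))
```

When the two stay probabilities coincide the second distribution reduces to the geometric distribution; otherwise it has a shape that no two-state chain can reproduce.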

Only knowledge of the presence or absence of the disease is available to us. In general, this equates to knowledge of a subset of the state space in which the true state must lie, but not to the exact state of the chain. For example, for all except one disease, states 2 and 3 both correspond to the presence of the disease. In disease modelling, HMMs arise when the Markov model for disease progression has several stages, or states, but these are not directly observed (e.g. Guihenneuc-Jouyaux *et al*. (2000) and Chadeau-Hyam *et al*. (2010)). Often the relationship between the state of the Markov chain and the observation is stochastic, although in our case no stochasticity is involved, but the state of the Markov chain is nonetheless hidden. Furthermore, observations are only available to us when the vole has been captured. The forward–backward algorithm (see Section 'The forward–backward algorithm') can be applied to any discrete time HMM with a finite state space and addresses both of these issues.
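As a minimal sketch of how a forward recursion copes with both hidden states and missing captures, each observation can be treated as the set of states with which it is compatible. The three-state chain (0 for no disease, 1 and 2 for two phases of infection), the initial distribution and the transition matrix below are hypothetical values chosen for illustration, and only the forward pass (the likelihood) is shown; the backward pass is omitted.

```python
# States compatible with each observation; 'x' (not captured) allows any state.
ALLOWED = {"N": {0}, "P": {1, 2}, "x": {0, 1, 2}}


def forward_likelihood(profile, init, trans):
    """Likelihood of an N/P/x profile under a Markov chain with hidden phases."""
    n = len(init)
    # Deterministic observation: keep only the states that agree with it.
    alpha = [init[j] if j in ALLOWED[profile[0]] else 0.0 for j in range(n)]
    for obs in profile[1:]:
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(n)) if j in ALLOWED[obs] else 0.0
            for j in range(n)
        ]
    return sum(alpha)


init = [0.7, 0.1, 0.2]          # hypothetical initial distribution
trans = [
    [0.8, 0.2, 0.0],            # no disease -> no disease or phase 1
    [0.3, 0.0, 0.7],            # phase 1 -> recovery or phase 2
    [0.1, 0.0, 0.9],            # phase 2 -> recovery or phase 2
]
print(forward_likelihood("NPxxPNxP", init, trans))
```

A missing observation simply leaves the propagated probabilities unfiltered, so no observation needs to be discarded.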

We consider *D*=6 diseases, and hence six interacting (or coupled) HMMs. It is possible to consider the coupled Markov chains for each disease together as a single Markov chain on an extended state space. In this case the likelihood function is straightforward to evaluate by using the forward–backward algorithm (see for example Zucchini and MacDonald (2009)) and a Bayesian analysis can then be performed by using Markov chain Monte Carlo (MCMC) methods. In our particular scenario the state spaces have size 4, 4, 4, 3, 3 and 2, which would lead to an extended state space of size $4^3 \times 3^2 \times 2 = 1152$. Since the forward–backward algorithm applied to an HMM with *n* states takes $O(n^2)$ operations, a naive implementation of the algorithm applied to the extended state space would have a complexity of $O(1152^2)$ compared with a complexity of $O(3 \times 4^2 + 2 \times 3^2 + 2^2) = O(70)$ for six coupled chains; equivalently, 100 000 iterations of an algorithm which deals with each chain separately would take approximately the same central processing unit time as five or six iterations of the single-chain algorithm. In our specific scenario, but certainly not in generality, some of the transition probabilities in each individual chain are 0, and (in our scenario) only 32 768 elements of the extended transition matrix would be non-zero. The use of sparse matrix routines could therefore reduce the efficiency ratio to approximately 468. Such a reduction in computational efficiency would only be justified if the fraction of missing data were very close to 1 so that the mixing of our Gibbs sampler would be extremely slow.
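The operation counts behind this comparison can be checked with a few lines of arithmetic. The sketch assumes that the cost of the forward–backward algorithm on a chain with *n* states is proportional to *n*² and ignores all constants; the variable names are ours.

```python
from math import prod

sizes = [4, 4, 4, 3, 3, 2]                  # state space size for each disease

extended = prod(sizes)                      # size of the extended state space
cost_joint = extended ** 2                  # dense single-chain cost
cost_separate = sum(n ** 2 for n in sizes)  # six chains handled one at a time

ratio = cost_joint / cost_separate
equivalent_iterations = 100_000 / ratio     # joint iterations per 100 000 separate
sparse_ratio = 32_768 / cost_separate       # using only the non-zero elements

print(extended, cost_joint, cost_separate)
print(round(ratio), round(equivalent_iterations, 2), round(sparse_ratio))
```

The dense ratio is roughly 19 000, which is the source of the "five or six iterations" comparison, and the sparse ratio rounds to 468.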

Pradel (2005) analysed capture–recapture data by using an HMM, and incorporation of covariate information within this framework via an appropriate link function is straightforward (see Lachish *et al*. (2011) and Zucchini and MacDonald (2009) (section 8.5.2)). However, the methodology does not allow the use of multiple HMMs nor, therefore, can it use the state of each HMM as a covariate for the other HMMs. We require six HMMs (one for each disease) and we wish to use covariate information such as the time of year and weight of the vole. Furthermore we wish the covariate set for each disease to include the states of the HMMs for the other diseases. For each disease *d*, we shall represent the probability of each possible state change through a logistic regression. However, some of the covariates, the states of the other HMMs, are unknown. Our solution is a Gibbs sampler which employs the forward–backward algorithm and adaptive random-walk Metropolis steps to sample from the true posterior distribution of all of the HMMs and the covariate parameters jointly.
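To make the role of the other diseases as covariates concrete, the following sketch evaluates a logistic regression for the probability of one state change, given the fixed covariates and indicators for the current (possibly imputed) presence of the other five diseases. The covariate layout, the function name `infection_probability` and all coefficient values are hypothetical and are not the parameterization that is used in the paper.

```python
import math


def logistic(eta):
    return 1.0 / (1.0 + math.exp(-eta))


def infection_probability(beta, lm, weight, other_infected):
    """Probability of a state change from a linear predictor.

    Assumed covariates: intercept, Sin, Cos, Weight and one 0/1 indicator
    per other disease (in a sampler these would be the current hidden states).
    """
    angle = 2 * math.pi * lm / 13
    x = [1.0, math.sin(angle), math.cos(angle), weight]
    x += [float(s) for s in other_infected]
    if len(beta) != len(x):
        raise ValueError("beta and covariate vector differ in length")
    return logistic(sum(b * xi for b, xi in zip(beta, x)))


# Hypothetical coefficients: the first other disease raises the probability.
beta = [-2.0, 0.3, -0.1, 0.02, 0.5, 0.0, 0.0, 0.0, -0.4]
print(infection_probability(beta, lm=5, weight=22.5, other_infected=[1, 0, 0, 0, 0]))
print(infection_probability(beta, lm=5, weight=22.5, other_infected=[0, 0, 0, 0, 0]))
```

Because the indicators for the other diseases may be hidden or missing, such a probability can only be evaluated conditionally on imputed values of those states, which is what motivates sampling the hidden states and the regression parameters jointly.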

### 1.4. Outline

The remainder of this paper is organized as follows. Section 'Modelling the hidden and missing data' describes the model which was used for each disease, gives its likelihood function and outlines the imputation of missing weight values and the other fixed covariate values. The MCMC algorithm is described in Section 'Bayesian approach' and we present our results, including the sensitivity study, in Section 'Analysis and results'. The paper concludes with a discussion.

The data that are analysed in the paper can be obtained from