Modelling, Bayesian inference, and model assessment for nosocomial pathogens using whole‐genome‐sequence data

Whole‐genome sequencing of pathogens in outbreaks of infectious disease provides the potential to reconstruct transmission pathways and enhance the information contained in conventional epidemiological data. In recent years, there have been numerous new methods and models developed to exploit such high‐resolution genetic data. However, corresponding methods for model assessment have been largely overlooked. In this article, we develop both new modelling methods and new model assessment methods, specifically by building on the work of Worby et al. Although the methods are generic in nature, we focus specifically on nosocomial pathogens and analyze a dataset collected during an outbreak of MRSA in a hospital setting.


Epidemiological parameter updates
The importation probability $p$ can be updated according to its full conditional distribution. We assume that $p \sim \mathrm{Beta}(\alpha_p, \beta_p)$ a priori, where $\mathrm{Beta}(a, b)$ denotes a Beta distribution with probability density function $f(x) \propto x^{a-1}(1-x)^{b-1}$, $0 \le x \le 1$. It follows from (5) that the full conditional distribution of $p$ can be sampled directly. In the MRSA analysis we set $\alpha_p = \beta_p = 1$.
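As a rough illustration of this kind of direct update, the sketch below draws $p$ from a Beta full conditional. It rests on an assumption that is not taken from the text above: that each patient is colonised on admission independently with probability $p$, so that the Beta prior is conjugate. The function name and its arguments are hypothetical.

```python
import random


def sample_importation_probability(n_imported, n_patients,
                                   alpha_p=1.0, beta_p=1.0, rng=random):
    """Draw p from Beta(alpha_p + n_imported, beta_p + n_patients - n_imported).

    Assumption (not from the source): admissions are independent Bernoulli(p)
    importation events, which makes the Beta prior conjugate.
    """
    if not 0 <= n_imported <= n_patients:
        raise ValueError("n_imported must lie between 0 and n_patients")
    return rng.betavariate(alpha_p + n_imported,
                           beta_p + n_patients - n_imported)
```

With $\alpha_p = \beta_p = 1$ the prior is uniform on $[0, 1]$, so under this assumption the draw is governed by the current importation count alone.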
For the MRSA analysis we set $\alpha_z = \beta_z = 1$.
The infection parameter $\beta$ is assigned an improper prior distribution on $(0, \infty)$. Its full conditional distribution is not of standard form, so it is updated using a Gaussian random walk Metropolis-Hastings step.
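A minimal sketch of a Gaussian random walk Metropolis-Hastings step for a positive parameter is given below. The function `log_target` is a hypothetical placeholder for the log of the unnormalised full conditional density (log-likelihood plus log-prior), and `step_sd` is a tuning parameter; neither is specified in the text.

```python
import math
import random


def random_walk_mh_step(current, log_target, step_sd, rng=random):
    """One Gaussian random walk Metropolis-Hastings update on (0, inf).

    log_target: hypothetical callable returning the log of the unnormalised
    target density at a positive value.
    """
    proposal = current + rng.gauss(0.0, step_sd)
    if proposal <= 0.0:
        # The target has no mass outside (0, inf), so the proposal is rejected.
        return current
    log_accept = log_target(proposal) - log_target(current)
    # The Gaussian proposal is symmetric, so no proposal ratio is needed.
    if log_accept >= 0.0 or rng.random() < math.exp(log_accept):
        return proposal
    return current
```

Rejecting non-positive proposals outright keeps the proposal symmetric, which is why the acceptance probability involves only the ratio of target densities.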

Genetic parameter updates
We assume that the direct-transmission parameter $\theta$ has a $\Gamma(\nu, \lambda)$ prior distribution, i.e. with probability density function $f(x) \propto x^{\nu-1} \exp(-\lambda x)$ for $x > 0$. It follows from (2) and (1) that, for both the error dependence and chain dependence models, the full conditional distribution of $\theta$ is of standard form, with $K_1$ denoting the set of pairs of sequences with transmission distance 1, and thus $\theta$ can be updated directly from its full conditional distribution. In a similar fashion, if $\theta_G \sim \Gamma(\nu_G, \lambda_G)$ and $\theta_I \sim \Gamma(\nu_I, \lambda_I)$ a priori, and $K_\infty$ and $K_0$ are defined in the obvious manner, we obtain analogous full conditional distributions for $\theta_G$ and $\theta_I$. For the MRSA analysis, all three genetic parameters were assigned independent $\Gamma(1, 10^{-6})$ prior distributions.
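To illustrate how a Gamma prior can lead to a direct update, the sketch below assumes, purely for illustration, that the genetic distances for pairs with transmission distance 1 are independent Poisson($\theta$) counts. That distributional form is an assumption and is not taken from equations (1) and (2). Under it, the full conditional is $\Gamma(\nu + \sum \psi, \lambda + n)$, where $n$ is the number of such pairs. The function name is hypothetical.

```python
import random


def sample_theta(distances_k1, nu=1.0, lam=1e-6, rng=random):
    """Draw theta from Gamma(nu + sum of distances, rate = lam + number of pairs).

    Assumption (not from the source): distances for pairs at transmission
    distance 1 are independent Poisson(theta) counts.
    """
    shape = nu + sum(distances_k1)
    rate = lam + len(distances_k1)
    # random.gammavariate takes a scale parameter, i.e. 1 / rate.
    return rng.gammavariate(shape, 1.0 / rate)
```

The default hyperparameters mirror the $\Gamma(1, 10^{-6})$ prior used in the MRSA analysis.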
Finally, the parameter $\gamma$ in the error dependence models is assigned an improper prior distribution on $(0, \infty)$. Its full conditional distribution is not of standard form, so it is updated using a Gaussian random walk Metropolis-Hastings step.

Latent variable updates
Recall that the latent (i.e. unobserved) variables are the colonisation times $t^c$, the colonised-on-admission indicator functions $\phi$, the sources of colonisation $s$ and the unobserved genetic distances $\psi^u$. We now describe four update steps which together will update these quantities, where one update step is chosen with equal probability during each iteration of the MCMC algorithm. We define the following quantities.
• $n_{\mathrm{sus}}$ is the number of patients that never have a positive test;
• $n_{\mathrm{add}}$ is the number of patients that never have a positive test and that are colonised;
• $n_{\mathrm{add0}}$ is the number of patients that never have a positive test, are colonised and are not the source of colonisation for any other patient.

Add a colonisation event
In this move we select uniformly at random a currently uncolonised patient, $i$, and propose that they become colonised. If there are no uncolonised patients to choose from then no move is made. The number of uncolonised patients to choose from equals $n_{\mathrm{sus}} - n_{\mathrm{add}}$. With probability $w$ the chosen patient is proposed to be colonised before admission to the ward, so they are an importation. With probability $1 - w$ the patient is proposed to be colonised by another colonised patient on the ward. In this case we draw a day of colonisation, $t^{c*}_i$, from $\{t^a_i, t^a_i + 1, \dots, t^d_i\}$. We select a source of colonisation uniformly at random from the set of colonised patients on this day. If there are no available patients to be a source on this day, the move is not made. If the move is possible, whether we are inferring an importation or a colonisation on the ward, we then draw a set of proposed genetic distances, $\psi^*_{\sigma(i),j}$, from patient $i$'s sequence, $\sigma(i)$, to every other sequence from every colonised patient (both those observed positive in the data, either with or without sequenced isolates, and those currently added by the algorithm). The distances are drawn from the relevant probability distributions depending on the choice of genetic distance model, and the transmission distances $k(\sigma(i), j)$.
The proposal ratios for this move follow from the definition of the 'remove a colonisation event' step below. The proposal ratio for adding an importation depends on the probability of proposing the genetic distances associated with the proposed new importation. The proposal ratio for adding an acquisition (i.e. a patient colonised on the ward) is obtained in the same way.
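The selection part of the 'add a colonisation event' move can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the `Patient` record, the rule for who counts as colonised on a given day, and the uniform draw of the colonisation day are all assumptions, and the proposal of genetic distances is omitted.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class Patient:  # hypothetical record, not from the source
    admission_day: int
    discharge_day: int
    ever_positive: bool = False
    colonised: bool = False
    colonisation_day: Optional[int] = None  # None: colonised on admission
    source: Optional[int] = None


def colonised_on_day(patients, day):
    """Indices of patients on the ward and colonised by `day` (assumed rule)."""
    found = []
    for idx, pt in enumerate(patients):
        on_ward = pt.admission_day <= day <= pt.discharge_day
        if not (pt.colonised and on_ward):
            continue
        if pt.colonisation_day is None or pt.colonisation_day <= day:
            found.append(idx)
    return found


def propose_add_colonisation(patients, w, rng=random):
    """Return a proposal dict, or None when no move can be made."""
    candidates = [i for i, pt in enumerate(patients)
                  if not pt.ever_positive and not pt.colonised]
    if not candidates:
        return None  # no uncolonised patients to choose from
    i = rng.choice(candidates)
    if rng.random() < w:
        return {"patient": i, "importation": True, "day": None, "source": None}
    pt = patients[i]
    # Assumption: the day is drawn uniformly between admission and discharge.
    day = rng.randint(pt.admission_day, pt.discharge_day)
    sources = colonised_on_day(patients, day)
    if not sources:
        return None  # no available source on this day
    return {"patient": i, "importation": False, "day": day,
            "source": rng.choice(sources)}
```

A full implementation would additionally draw the proposed genetic distances and evaluate the acceptance probability; both depend on the chosen genetic distance model.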

Remove a colonisation event
In this step we select uniformly at random one of the $n_{\mathrm{add0}}$ currently added colonised patients who are not the source of any other colonisations and propose that they be no longer colonised. This move cannot be made if no such individuals exist. If we propose to remove the colonisation time from patient $i$ who is an importation then the proposal ratio depends on the probability of the genetic distances associated with patient $i$. A corresponding proposal ratio applies when we propose to remove the colonisation time for a patient $i$ who was colonised on the ward.

Move a colonisation time
In this move we pick a patient $i$ uniformly at random from the set of currently colonised patients and move their colonisation time. As when we added a colonisation time, we propose that the patient was positive on admission with probability $w$. With probability $1 - w$ the patient acquired the pathogen whilst on the ward, so we sample a colonisation time uniformly at random from the set $\{t^a_i, t^a_i + 1, \dots, f_i\}$, where $f_i$ is the latest day on which patient $i$ could have been colonised. This is the minimum of (i) the day before the patient's first positive swab result, if any; (ii) the day before the colonisation of the first patient to be colonised on the ward by $i$, if any; (iii) the day of discharge $t^d_i$. We then propose a source uniformly at random for patient $i$ from the set of colonised patients on the chosen day of colonisation. If there are no such colonised patients no move is made.
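The definition of $f_i$ can be written as a short helper. The sketch below uses hypothetical function and argument names, with `None` standing for "no such event".

```python
def latest_colonisation_day(discharge_day, first_positive_day=None,
                            first_onward_colonisation_day=None):
    """Latest day f_i on which a patient could have been colonised.

    Minimum of: the day before the first positive swab (if any), the day
    before the patient's first onward colonisation on the ward (if any),
    and the day of discharge.
    """
    bounds = [discharge_day]
    if first_positive_day is not None:
        bounds.append(first_positive_day - 1)
    if first_onward_colonisation_day is not None:
        bounds.append(first_onward_colonisation_day - 1)
    return min(bounds)


def candidate_colonisation_days(admission_day, f_i):
    """Days {t_a, t_a + 1, ..., f_i}; empty if f_i precedes admission."""
    return list(range(admission_day, f_i + 1))
```

For example, a patient admitted on day 5 and discharged on day 20, with a first positive swab on day 12 and a first onward colonisation on day 9, has $f_i = 8$ and candidate days 5 to 8.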
The proposal ratio is specified separately for each of three cases: (a) patient $i$ is colonised on the ward on day $t^c_i$ and we propose that they be colonised on the ward on day $t^{c*}_i$; (b) patient $i$ is colonised on the ward on day $t^c_i$ and we propose that they be colonised on admission; (c) patient $i$ is colonised on admission and we propose that they be colonised on the ward on day $t^{c*}_i$. Finally, if patient $i$ is colonised on admission and we propose that they remain colonised on admission then nothing has changed and the update step is complete.

Change genetic distances
In this move we attempt to update genetic distances other than those found in the data set, i.e. distances which are imputed in the algorithm. To do so we pick a patient $i$ uniformly at random from all those with one or more imputed sequences, and then pick one of their sequences, $L(i)$ say, uniformly at random. We then propose a new set of genetic distances $\psi^{u*}_{L(i),j}$ according to the underlying genetic distance model. The proposal ratio is determined by the probabilities of the current and proposed distances under that model.

Additional updates to improve mixing
Block updates for genetic parameters and distances
All the genetic parameter updates described above depend on the current values of the genetic distances $\psi_{i,j}$, and in particular on those which have been imputed rather than being part of the data. This correlation can cause mixing problems, and we found it was beneficial for some models to update $\theta$ and $\theta_G$ together with some genetic distances, as follows. First, proposed new values $\theta^*$ and $\theta^*_G$ are obtained using Gaussian proposals centered on the current values. Second, we pick a patient $i$ and a sequence $L(i)$ in the manner described in the 'change genetic distances' move above. Third, we propose new genetic distances using the underlying genetic distance model, but with the proposed parameters $\theta^*$ and $\theta^*_G$. The proposal ratio is determined by the probability of the current distances under the current parameters and that of the proposed distances under the proposed parameters.

Single updates for genetic distances
Rather than proposing to update all of the genetic distances associated with patient $i$, which can lead to low acceptance probabilities if the move is too large, an alternative is to pick one of $i$'s distances uniformly at random and then, with equal probability, add or subtract a quantity drawn from a pre-specified distribution, such as a Poisson distribution. The proposal ratio for this move equals 1.
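A possible form of this single-distance proposal is sketched below. The dictionary layout, the function names and the Poisson mean `step_mean` are hypothetical. The handling of negative proposed distances is not described in the text; here they are simply returned, on the assumption that the caller rejects them.

```python
import math
import random


def sample_poisson(mean, rng=random):
    """Poisson draw by Knuth's multiplication method (adequate for small means)."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def propose_single_distance(distances, step_mean=1.0, rng=random):
    """Perturb one imputed distance by plus or minus a Poisson-distributed step.

    distances: dict mapping the other sequence's label to the imputed distance
    (hypothetical layout). Returns (chosen label, proposed dict). A negative
    proposed value is assumed to be rejected by the caller.
    """
    key = rng.choice(list(distances))
    step = sample_poisson(step_mean, rng)
    sign = 1 if rng.random() < 0.5 else -1
    proposal = dict(distances)  # leave the current state untouched
    proposal[key] = distances[key] + sign * step
    return key, proposal
```

Because a step of size $s$ is added or subtracted with equal probability, the forward and reverse proposals are equally likely, which is consistent with a proposal ratio of 1.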

Swap a patient and their source
The motivation behind this move is to provide a small change to the overall transmission forest, since large-scale moves can have low acceptance probabilities. A patient $j$ is selected uniformly at random from those who are colonised on the ward. Their source, patient $i$, is identified. If $i$ colonises another patient before $j$ then no move is made. If $t^c_i < t^a_j$ no move is made. Otherwise we propose $t^{c*}_i = t^c_j$, $t^{c*}_j = t^c_i$, and $s^*_i = j$. The proposal ratio for this move equals 1.
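The checks and the proposed state for the swap move might be coded as below. The dictionaries `col_day`, `source` and `admission_day` are hypothetical containers keyed by patient label, and every patient involved is assumed to have an integer colonisation day. The text does not state the new source of patient $j$; the sketch assumes, as a labelled guess, that $j$ inherits the former source of $i$.

```python
def propose_swap_with_source(j, col_day, source, admission_day):
    """Propose swapping patient j with their source i; None if no move is made.

    col_day, source, admission_day: hypothetical dicts keyed by patient label.
    """
    i = source[j]
    # No move if i colonised another patient before j.
    for k, s in source.items():
        if s == i and k != j and col_day[k] < col_day[j]:
            return None
    # No move if i was colonised before j was admitted.
    if col_day[i] < admission_day[j]:
        return None
    new_day = dict(col_day)
    new_src = dict(source)
    new_day[i], new_day[j] = col_day[j], col_day[i]
    new_src[i] = j
    # Assumption (not stated in the source): j takes over i's former source.
    new_src[j] = source.get(i)
    return new_day, new_src
```

Other patients colonised by $i$ keep $i$ as their source, which remains valid because the first check guarantees that they were colonised no earlier than $j$.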

Change a source without changing colonisation time
Another small move is as follows. We select a patient $i$ uniformly at random from those colonised on the ward. We then select a patient $j$ uniformly at random from those colonised on day $t^c_i$, and propose $s^*_i = j$. The proposal ratio for this move equals 1.