Rethinking thresholds for serological evidence of influenza virus infection

Introduction For pathogens such as influenza that cause many subclinical cases, serologic data can be used to estimate attack rates and the severity of an epidemic in near real time. Current methods for analysing serologic data tend to rely on a simple threshold or on a comparison of titres between pre- and post-epidemic samples, which may not accurately reflect actual infection rates.
Methods We propose a method for quantifying infection rates using paired sera and bivariate probit models to evaluate the accuracy of thresholds currently used for influenza epidemics with low and high existing herd immunity levels, and a subsequent non-influenza period. Pre- and post-epidemic sera were taken from a cohort of adults in Singapore (n=838). Bivariate probit models with latent titre levels were fit to the joint distribution of haemagglutination-inhibition assay-determined antibody titres using Markov chain Monte Carlo simulation.
Results Estimated attack rates were 15% (95% credible interval: 12%-19%) for the first H1N1 pandemic wave. For a large outbreak due to a new strain, a threshold of 1:20 and a twofold rise (if paired sera are available) would result in a more accurate estimate of incidence.
Conclusion The approach presented here offers the basis for a reconsideration of methods used to assess diagnostic tests, both by reconsidering the thresholds used and by analysing serological data with a novel statistical model.


| INTRODUCTION
Estimates of infection rates are crucial to decision-making and communication during an epidemic, for long-term public health planning, and to assess past responses. Without an accurate gauge of the size and severity of an epidemic, it is challenging to prioritize interventions and services to mitigate impact. 1,2 In the 2009 H1N1 pandemic, limited data inflated predictions of severity in the early stages, leading in turn to what in hindsight was overreaction in many quarters. 3 Lessons learned from the 2009 epidemic and its aftermath can be applied to influenza epidemics of strains both old and new.
Serological assessments can play a key role in assessing influenza outbreaks because they allow diagnosis of subclinical or misdiagnosed cases, and as a result, they provide the basis for estimates of an epidemic's impact, including, for example, estimates of hospitalization and mortality rates. 2 Common assays, such as haemagglutination inhibition (HAI), typically bracket the antibody level to an interval between two dilutions, for instance positive at 1:20 but negative at 1:40. Two study designs are frequently used: cross-sectional, in which a positive measurement beyond a specific threshold (which for influenza is usually set to 1:40 or 1:32) is taken to indicate recent infection, 4 and longitudinal, in which a rise of four or more times in the highest positive reading between successive serum collections (often called a "fourfold rise," although the interval censoring means the rise could actually be more modest) is typically assumed to reflect infection during that time interval. Cross-sectional designs can utilize residual samples of serology collected for other purposes and are logistically much simpler to implement than longitudinal serum collections. They can allow, in principle, real-time estimates of attack rates, 2 but have documented weaknesses, such as potentially leading to negative estimates of attack rates. 5 While single cross sections may be accurate for a novel pathogen, repeated sampling is necessary if there is existing immunity in the population, or to track the changes in incidence needed to support real-time estimates of severity. 2 Traditionally, an HAI titre of 1:40 or more was chosen to indicate infection in cross-sectional studies, and in influenza vaccination studies it was found to be associated with a reduction in attack rates that varies between 20% and 80% depending on age group and setting. 6-9 This choice is somewhat arbitrary.
On the other hand, a fourfold rise is the currently used threshold in longitudinal studies and has been found to have sensitivity of ~80% relative to a basket of other diagnostics. 10 Neither justification relates to overall diagnostic accuracy per se, however, which is the primary goal of analyses to determine attack rates, for surveillance or for intervention trials.
More robust analysis is complicated by several factors: (i) titres are interval-censored, with intervals that are too broad to justifiably ignore; (ii) longitudinal studies require accounting for repeated measurements; (iii) titre distributions are typically too skewed to assume normality.
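To make the interval-censoring point concrete, the following minimal sketch computes the range of true fold rises that is compatible with a pair of observed titres. It assumes, as in the bracketing described earlier, that an observed titre t means the true titre lies in [t, 2t); the function name is hypothetical and not part of the study's code.

```python
def true_fold_rise_bounds(pre_titre: float, post_titre: float) -> tuple[float, float]:
    """Open-interval bounds on the true fold rise given two observed titres.

    Assumption for illustration: an observed titre t (reciprocal dilution)
    means positive at t but negative at 2t, so the true titre is in [t, 2t).
    """
    # Smallest ratio: true post-titre at its lower edge, true pre-titre near its upper edge.
    lowest = post_titre / (2 * pre_titre)
    # Largest ratio: true post-titre near its upper edge, true pre-titre at its lower edge.
    highest = (2 * post_titre) / pre_titre
    return lowest, highest


# An observed "fourfold rise" from 1:10 to 1:40.
low, high = true_fold_rise_bounds(10, 40)
print(low, high)  # 2.0 8.0
```

Under this assumption, an observed fourfold rise is compatible with any true rise strictly between twofold and eightfold, which is why the censoring intervals are too broad to ignore.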
This study proposes a new statistical approach to estimate attack rates for paired sera. In this approach, multinomial ordinal probit models account for censoring and non-normality by invoking a latent "titre propensity" and nonlinear threshold variables, while the titre propensity is made bivariate to account for within-individual correlations in time. In addition to estimating attack rates directly, we use the model to assess the sensitivity, specificity and overall accuracy of alternative versions of traditional thresholds, using three scenarios: (i) a new strain of influenza against which there is little pre-existing immunity, (ii) an outbreak of seasonal influenza and (iii) a period with little influenza activity, using data from a community cohort established in equatorial Singapore.

| Data
Repeated serological samples were drawn from a cohort of adults (aged 21-75) participating in the Multi-Ethnic Cohort (MEC) of the Singapore Consortium of Cohort Studies, a long-term prospective cohort study, as described in detail elsewhere. 1,11,12 Blood samples were collected at six different time points in 2009 and 2010, and this study uses sera collected at four of those points: (i) baseline samples collected before July 2009, thus predating unlinked community transmission of the pandemic, (ii) a sample around October 2009, after the first but before the second wave, (iii) a sample in July 2010, which followed the subsequent two epidemic waves of influenza A(H1N1)pdm09 and (iv) a sample in September 2010, 10-12 weeks after the July 2010 sample. By the end of the study period, a total of 757, 690, 624 and 556 samples were obtained at the four time points, that is before July 2009, around October 2009, in July 2010 and in September 2010, respectively. The remaining two blood samples, which were collected in the middle of period 1 and period 2, were not included in the analysis because they fell in the middle of outbreaks, which makes attack rates involving them hard to interpret. A total of 38 subjects reported being vaccinated against influenza A(H1N1)pdm09 during the study period and were thus excluded from the analysis. This study focuses on three time windows: period 1 (spanning samples 1-2), which bracketed the first wave and in which the population had low initial immunity levels; period 2 (samples 2-3), which bracketed the second and third waves and in which the population had high initial immunity levels; and period 3 (samples 3-4), which had little influenza activity (Figure 1A).
A total of 758 participants who provided at least one blood sample at time points 1 and 2 (690 provided at both), 691 who provided at least one at time points 2 and 3 (544 provided at both) and 610 who provided at least one at time points 3 and 4 (498 provided at both) were included in the analysis. Participants gave informed consent, and the study was approved by the National University of Singapore Institutional Review Board.
All blood samples were titrated in twofold dilutions of phosphate-buffered saline from 1:10 to 1:2560 and analysed to determine the antibody titre, which is the reciprocal of the highest dilution of serum where haemagglutination was inhibited. 11 Titre values below the limit of detection were coded as <1:10, and a change in titre values from <1:10 to 1:10 was considered to be a twofold rise. Laboratory methods are detailed elsewhere. 11
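The titre coding described above can be sketched as follows. This is an illustrative encoding only, assuming each titre is stored as a string such as "1:40" or "<1:10"; the function names and the step coding are hypothetical and are not taken from the study.

```python
# Twofold dilution series from 1:10 to 1:2560 (reciprocals).
DILUTIONS = [10 * 2**k for k in range(9)]  # 10, 20, ..., 2560
BELOW_LOD = "<1:10"  # label for values below the limit of detection


def titre_to_step(titre: str) -> int:
    """Code a titre as a dilution step: '<1:10' -> 0, '1:10' -> 1, ..., '1:2560' -> 9."""
    if titre == BELOW_LOD:
        return 0
    reciprocal = int(titre.split(":")[1])
    if reciprocal not in DILUTIONS:
        raise ValueError(f"unexpected titre: {titre!r}")
    return DILUTIONS.index(reciprocal) + 1


def fold_rise(pre: str, post: str) -> float:
    """Observed fold change between two titres; each dilution step is a factor of two."""
    return 2.0 ** (titre_to_step(post) - titre_to_step(pre))


print(fold_rise("<1:10", "1:10"))  # 2.0, the twofold rise described in the text
print(fold_rise("1:10", "1:40"))   # 4.0, a conventional "fourfold rise"
```

Coding titres as integer steps keeps the below-detection category on the same scale as the measured dilutions, so a change from <1:10 to 1:10 counts as one step.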

| Statistical model
We developed a longitudinal multinomial ordinal probit model for HAI titres that incorporated latent variables for infection status and antibody propensity for each individual at both sample points (the slight modifications needed for those with a single observation are described later). A schematic diagram appears in Figure 1B. The latent variable $z_{it}$ is then mapped to an observed titre $T_{it}=k$ if $\tau_{k-1} \le z_{it} < \tau_k$, for $t=1,2$. Note that the transformation from the latent variable's space to the observed titre allows the distribution to be distorted away from a normal distribution to reflect the empirical shape of the titre distribution. This is described in more detail in the Supporting Information.
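The threshold mapping from the latent variable to the observed titre category can be illustrated with a short sketch. The threshold values below are made up for illustration (the estimated thresholds are not reproduced here), and the function name is hypothetical.

```python
from bisect import bisect_right


def latent_to_category(z: float, thresholds: list[float]) -> int:
    """Return the category k with tau_{k-1} <= z < tau_k.

    `thresholds` holds the interior cut points tau_1 < ... < tau_{K-1};
    tau_0 = -infinity and tau_K = +infinity are implicit.
    """
    # bisect_right counts the thresholds that are <= z, which is k - 1.
    return bisect_right(thresholds, z) + 1


# Hypothetical, unevenly spaced thresholds: unequal spacing is what lets the
# observed titre distribution depart from a normal shape.
taus = [0.0, 0.8, 1.5, 2.1]
print([latent_to_category(z, taus) for z in (-0.3, 0.0, 1.0, 5.0)])  # [1, 2, 3, 5]
```

Because a value equal to a threshold falls into the higher category, the sketch matches the half-open intervals in the definition.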
As a consequence, the likelihood contribution from individual i given his or her infection status follows from the two-dimensional cumulative distribution function of a bivariate normal distribution.
Unconditional on infection status, the likelihood is instead a weighted average with weights p and 1−p for infected and uninfected distributions, respectively. For computational efficiency, we count the number of individuals with each combination of titres at the two time points and refer these to a multinomial distribution with probabilities determined by the foregoing description.
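As a sketch of how the pieces above fit together (not the authors' exact specification, which is given in the Supporting Information), let $F_0$ and $F_1$ denote the bivariate normal cumulative distribution functions of $(z_{i1}, z_{i2})$ for uninfected and infected individuals, and let $n_{jk}$ be the number of individuals with titre category $j$ at time point 1 and $k$ at time point 2. The symbols $F_s$, $P_s$ and $n_{jk}$ are introduced here for illustration only.

$$
P_s(j,k) = F_s(\tau_j,\tau_k) - F_s(\tau_{j-1},\tau_k) - F_s(\tau_j,\tau_{k-1}) + F_s(\tau_{j-1},\tau_{k-1}), \qquad s \in \{0,1\}
$$

$$
\Pr(T_{i1}=j,\, T_{i2}=k) = p\,P_1(j,k) + (1-p)\,P_0(j,k)
$$

$$
\log L = \sum_{j,k} n_{jk} \log \Pr(T_{i1}=j,\, T_{i2}=k) + \text{constant}
$$

The first line is the standard rectangle probability for a bivariate distribution, the second is the mixture over infection status with weights $p$ and $1-p$, and the third is the multinomial log-likelihood for the table of counts.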
For individuals with observations at one time point only, the likelihood follows from the appropriate marginal distribution: either $z_{i1} \sim N(a, w+\sigma^2)$ for time point 1, or a weighted average of $z_{i2} \sim N(a+b, w+\nu+\sigma^2)$ and $z_{i2} \sim N(a+b+c, w+\nu+\sigma^2)$ for time point 2. Again, we calculate these probabilities for each possible titre and refer them to a multinomial to obtain the likelihood function.
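The marginal probabilities for individuals observed only at time point 2 can be sketched with the standard library alone. All parameter values below are hypothetical and chosen only to make the example run; the function names are likewise illustrative, and the weights p and 1-p follow the mixture described earlier.

```python
from statistics import NormalDist


def category_probs(mean: float, variance: float, thresholds: list[float]) -> list[float]:
    """Probability of each titre category for a normal latent variable."""
    dist = NormalDist(mu=mean, sigma=variance ** 0.5)
    # CDF at tau_0 = -inf is 0 and at tau_K = +inf is 1.
    cdf = [0.0] + [dist.cdf(t) for t in thresholds] + [1.0]
    return [upper - lower for lower, upper in zip(cdf, cdf[1:])]


def time2_marginal(a, b, c, w, nu, sigma2, p, thresholds):
    """Mixture of uninfected N(a+b, w+nu+sigma2) and infected N(a+b+c, w+nu+sigma2)."""
    variance = w + nu + sigma2
    uninfected = category_probs(a + b, variance, thresholds)
    infected = category_probs(a + b + c, variance, thresholds)
    return [(1 - p) * u + p * v for u, v in zip(uninfected, infected)]


# Hypothetical parameter values, for illustration only.
probs = time2_marginal(a=0.0, b=-0.1, c=2.0, w=0.5, nu=0.2, sigma2=0.3,
                       p=0.15, thresholds=[0.0, 0.8, 1.5, 2.1])
print([round(q, 3) for q in probs])
```

These category probabilities would then be paired with the observed counts in a multinomial likelihood, as in the text.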

| Sensitivity analysis
An important assumption in the above model is that the risk of infection is taken to be independent of baseline titres, which allows the infection probability parameter to represent average risk without knowing the titre distribution. An alternative formulation, in which the probability is individualized to have a linear relationship to the titre category on the logit scale, that is $\mathrm{logit}(p_i)=\alpha+\beta T_{i1}$, where $T_{i1}$ is the observed titre for individual $i$ at time point 1, was also developed and used to assess the sensitivity of our findings to this assumption. The sensitivity model is much more computationally intensive because it requires the individual-level serological data as the input.
Conventional terminology typically refers to the maximum tested titre at which there is a positive reaction. This corresponds to a bracket that interval censors the "true" titre that would be found if more dilutions were tested. $Y_{it}$ is a coded version of the data; $z_{it}$ is a latent variable that, together with the thresholds $\tau_k$, conceptually determines the censored titre observed.
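For the sensitivity model, the individualized infection probability is the inverse logit of a linear function of the baseline titre category. The sketch below uses made-up coefficient values; the function name and the numbers are assumptions for illustration, not estimates from the study.

```python
import math


def infection_probability(alpha: float, beta: float, titre_category: int) -> float:
    """Inverse logit of alpha + beta * T_i1, giving p_i in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-(alpha + beta * titre_category)))


# Hypothetical coefficients: a negative beta means risk falls as baseline titre rises.
for category in range(4):
    print(category, round(infection_probability(-1.0, -0.5, category), 3))
```

With beta set to zero, the formula reduces to a single common probability, which corresponds to the independence assumption of the main model.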

| Diagnostic accuracy of existing thresholds
Sensitivity, specificity, positive and negative predictive values (PPVs and NPVs, respectively) and overall accuracy were calculated to assess the performance of diagnostic tests for various titre thresholds, the latter two based on the prevalence estimated from the MEC data. We derived sensitivity/specificity by calculating the probability of a positive/negative test in the presence/absence of infection. To assess accuracy of different thresholds in future, plausible epidemics, PPVs, NPVs and accuracies were calculated directly from the sensitivity and specificity as the hypothetical true prevalence was increased from 0 to 1. Sensitivity, specificity, PPV and NPV were all calculated from simulations from the developed model instead of using an objective measure, such as PCR-confirmed infections.
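The calculation of predictive values and accuracy from sensitivity, specificity and a hypothetical true prevalence follows standard definitions and can be sketched as below. The input values in the example are invented for illustration and are not results from the study; the function name is hypothetical.

```python
def predictive_values(sensitivity: float, specificity: float, prevalence: float):
    """Return (PPV, NPV, accuracy) for a test applied at a given true prevalence."""
    true_pos = sensitivity * prevalence
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)
    false_pos = (1 - specificity) * (1 - prevalence)
    # PPV or NPV is undefined when nobody tests positive or negative, respectively.
    ppv = true_pos / (true_pos + false_pos) if true_pos + false_pos > 0 else float("nan")
    npv = true_neg / (true_neg + false_neg) if true_neg + false_neg > 0 else float("nan")
    accuracy = true_pos + true_neg
    return ppv, npv, accuracy


# Sweep a hypothetical prevalence from 0 to 1, as described in the text.
for prevalence in (0.0, 0.1, 0.3, 0.5, 1.0):
    print(prevalence, predictive_values(0.8, 0.9, prevalence))
```

The sweep shows how the same sensitivity and specificity yield very different PPVs and NPVs as the assumed attack rate changes.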
Bias between the modelled prevalence and the hypothetical true prevalence was also calculated to compare the accuracy of the currently used thresholds (1:40 or a fourfold rise) with the model we developed.
During the first wave (period 1), the majority of participants (79%) had no change in titre scores between the two time points, while the interval-censored titres rose for 17% of participants and fell for the remaining 5% (Figure 2A). During the subsequent two epidemic waves (period 2), 62% of participants had the same titre scores at the two time points, while 17% had higher and 21% had lower titre scores at the post-season sampling (Figure 2B). During the non-influenza period that followed the epidemic waves (period 3), 88% of participants had the same titre scores at the two time points, while 4% had higher and 8% had lower titre scores at the post-season sampling (Figure 2C). In the absence of infection, 88%, 75% and 78% of participants had unchanged titre scores during the first pandemic wave, the subsequent two epidemic waves and the non-influenza period, respectively (Figure 2G-I). Most of the participants who were infected had pre-season titres of less than 1:10 (86% for period 1, 73% for period 2, 74% for period 3) (Figure 2J-L).

| Attack rates over three periods
Our primary estimand is the attack rate from pre-season to post-season.

| Threshold positive predictive values and sensitivity
The estimated PPVs for cross-sectional titres, that is the proportion of people with a titre of that level who were infected, were substantial even for low thresholds (Figure 3; Table S1); see also Tables S3 and S4.

| Threshold accuracy
For straightforward prevalence estimates to be unbiased, the proportion of participants testing positive should equal the proportion infected, but for imperfect tests, this depends on the balance between sensitivity and specificity as well as the true prevalence. As depicted in Figure 4A, for cross-sectional studies in an influenza pandemic first wave, based on our estimated sensitivity and specificity, the most accurate threshold would be the standard 1:40 only for small pandemics with an attack rate between 4% and 12%; for larger pandemics, this threshold leads to estimates that are biased downwards. This bias could be reduced using a lower threshold of 1:20 (for attack rates of 12%-31%) or of 1:10 for a pandemic infecting a third or more of the population, as predicted by many simulation models 4 and in line with the 1918 pandemic. 17 Similarly, use of a fourfold rise as evidence of infection minimizes bias for small pandemics (of 2%-14% attack rates) but underestimates the impact of larger pandemic first waves, while a twofold rise would be more accurate for pandemics infecting more than 14% of the study population (Figure 5A). These findings are similar when using the estimates from Singapore's subsequent two epidemic waves after the first pandemic wave (Figure 5B,C).
... 13,22 or risk factor studies, 23 their imperfect sensitivity and specificity mean they may not adequately describe overall infection rates at a population level. As a result, estimates of severity may be similarly, and potentially substantially, biased.
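The dependence of bias on sensitivity, specificity and true prevalence can be made explicit with a short sketch. The proportion testing positive is sensitivity times prevalence plus the false-positive rate times the uninfected fraction, so a threshold is unbiased only at the one prevalence where false positives and false negatives cancel. The sensitivity and specificity used below are invented for illustration and are not the study's estimates; the function names are hypothetical.

```python
def apparent_prevalence(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Expected proportion testing positive at a given true prevalence."""
    return sensitivity * prevalence + (1 - specificity) * (1 - prevalence)


def bias(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Apparent minus true prevalence; negative values mean underestimation."""
    return apparent_prevalence(sensitivity, specificity, prevalence) - prevalence


def unbiased_prevalence(sensitivity: float, specificity: float) -> float:
    """True prevalence at which false positives exactly offset false negatives."""
    false_neg_rate = 1 - sensitivity
    false_pos_rate = 1 - specificity
    if false_neg_rate + false_pos_rate == 0:
        raise ValueError("a perfect test is unbiased at every prevalence")
    return false_pos_rate / (false_pos_rate + false_neg_rate)


# Hypothetical threshold with 80% sensitivity and 95% specificity.
print(round(unbiased_prevalence(0.80, 0.95), 3))
for p in (0.1, 0.2, 0.4):
    print(p, round(bias(0.80, 0.95, p), 3))
```

In this invented example the threshold overestimates attack rates below the balance point and underestimates them above it, which is the pattern that motivates lowering the threshold for larger outbreaks.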

| DISCUSSION
We argue that there are three approaches to remedying this situation. One would be to use a statistical method that explicitly accounts for the structure of serological data, in particular their censored nature, the distinct forms of statistical error (assay error and between-individual variability) and the differential response over time of those infected and uninfected.

| Limitations
As the data analysed in this study were taken from a cohort study, the sample was not randomly selected and may not be representative of Singapore's population. This is a limitation common to most serological studies we have seen for influenza, with few exceptions (such as a large study in China 25 ). The method we developed in this study depended upon the availability of serial serum sampling that begins prior to an outbreak, which can be costly and logistically complex, but which accounts for baseline antibodies present due to cross-reactivity from different strains of a virus. 1 The main analysis assumed the infection risk to be independent of baseline titre levels, although this assumption is known to be untenable. 26 Results from the sensitivity analysis that accounted for infection risk being influenced by initial titre were similar to those in the main analysis, but the alternative method was more computationally intensive. The model developed in this analysis only implicitly accounts for waning of antibody titres over time, via a systematic reduction (or in principle increase) in mean latent titres between the time points. Cross-reactivity is another potential limitation when analysing serologic tests to estimate the prevalence of seasonal influenza. For example, in the 2009 pandemic, there was a high level of pre-existing seropositivity in older age groups due to cross-reactivity, because the virus subtype had been endemic in the population prior to the 1957 influenza A(H2N2) pandemic. 2 The approach further requires that the post-pandemic sera are collected sufficiently soon after the end of epidemic activity that titres have not decayed too substantially over time. 15 There might be measurement errors in the titration of antibodies against A(H1N1)pdm09 infections; however, these cannot be assessed without duplicated assays at the same time point.
Thresholds suggested from the current analysis in response to the anticipated size of the outbreak are specific to the context of the influenza A(H1N1)pdm09 outbreak, and further work is needed to demonstrate their validity for other epidemic scenarios.
Despite these limitations, these results challenge the predominant threshold of a 1:40 HAI titre or a fourfold rise in HAI titres and, in turn, the accuracy of many prior estimates of H1N1 attack rates. Precise estimates are important to public health planning and risk mitigation, and the standard paradigm should therefore be reconsidered.