Non-annual seasonality of influenza-like illness in a tropical urban setting

In temperate countries, influenza and other viral respiratory diseases often have distinct seasonal peaks occurring during colder, wintertime months. However, little is known about the dynamics of influenza and viral respiratory disease dynamics in the tropics, despite high morbidity and a clear epidemiological link between tropical and temperate countries. In temperate countries, the dynamics of influenza and other respiratory diseases are often analyzed using syndromic surveillance data describing influenza-like illness (ILI) as ILI is highly correlated with virological surveillance for influenza. To obtain a detailed picture of respiratory disease incidence patterns in a large tropical city, we established an mHealth study in community outpatient clinics in Ho Chi Minh City, Vietnam (11N latitude). From August 2009 through December 2015, clinics reported daily case numbers of ILI using standard mobile-phone SMS messaging. A subset of these clinics performed molecular diagnostics for influenza A and B viruses. Unlike the annual patterns seen in temperate countries, ILI activity in Ho Chi Minh City exhibited strong non-annual periodicity and was not correlated with PCR-confirmed influenza. The dominant periodicity in the data was approximately 200 days. This was confirmed by a time series decomposition, a step-wise regression analysis on annual and non-annual covariates, and a forecasting exercise showing that forecasting was 30% to 40% more accurate when a 200-day non-annual cycle was included in the forecast. This suggests, for the first-time, that a non-annual cycle may be an essential driver of ILI dynamics in the tropics. This raises new questions about the seasonality and drivers of respiratory disease transmission in tropical countries.


Introduction
In temperate countries, influenza virus is one of the most studied disease systems, exhibiting a predictable wintertime transmission season and a robust relationship between syndromic and molecular surveillance. Little is known about the epidemiology of influenza virus in the sub-tropics and tropics despite a renewed research interest in tropical influenza over the past decade resulting from increased availability of influenza surveillance and sequence data [1][2][3][4][5] . To date, research on tropical influenza has concentrated on whether influenza epidemics exhibit annual seasonality [6][7][8][9][10][11][12][13][14] and whether influenza viruses show patterns of year-round persistence [15][16][17][18][19] . A third question that has received less attention is whether syndromic influenza-like illness (ILI) surveillance has the same peaks and troughs as molecular surveillance for influenza virus in these regions. In temperate countries, public health agencies are able to rely on ILI reporting to signal the onset of the influenza season [20][21][22] , but it is not known if ILI and influenza correlate in tropical countries 23,24 .
The majority of epidemiological studies looking at influenza and/or respiratory disease in the tropics have two major drawbacks. The first one is ignoring absolute case counts and reporting only the percentage of samples (nose/throat swabs) that test positive for influenza 12,14,[24][25][26][27] . Ignoring case counts makes it impossible to determine if samples are being taken during a potential influenza season or outside of it. The second one is underpowering the analysis by using a short time series or monthly data or both 23,24,[26][27][28][29][30][31][32] . Monthly data are normally too coarse to infer the presence of an annual transmission season or other periodic trends (if these exist) unless the time series is quite long. In fact, this is one of the reasons for disagreement in the current literature as some studies on respiratory disease in the tropics claim support for an annual transmission season 7,12,14,[26][27][28][33][34][35] while others show mixed or no evidence 8,13,31,[36][37][38][39][40] . Among these, some of the more weakly supported results are being used in public health policy to advocate for particular vaccination timings based on incorrectly identified seasonal signals 12,35 .
Understanding the dynamics of tropical influenza -especially the presence or absence of seasonalitymay allow the forecasting methods successfully deployed in temperate countries 41,42 to be used for tropical influenza. Current forecasting methods rely on mechanistic Susceptible-Infected-Recovered (SIR) models and known/inferred climate associations to accurately predict increases in influenza virus infections. In the tropics, it is not known whether influenza dynamics obey classic SIR models, whether they are characterized by low-level persistence, or a combination of the two. It is also not known which climate-influenza associations are expected to be present in tropical countries despite some recent advances on this topic 11,43 . If the intrinsic epidemiological dynamics and the presence/absence of climate associations can be understood, forecasting of influenza epidemics in the tropics may be possible. Thus far, the only attempt at influenza forecasting for the tropics or subtropics reported that the majority of forecast attempts (different leads, different methods) had accuracies below 50% when predicting the timing of an influenza peak 44 .
In addition, an accurate description of the basic epidemiology of tropical influenza is critical for inferring the likely routes of viral seeding from the tropics to temperate zones and vice versa 4,45  and then aggregated these into a single z-score time series. Several internal validations were done to ensure that the data followed certain expected behaviors for multi-site syndromic reporting and that arbitrary or random reports were not being sent during the course of the study (see Materials and Methods). In particular, note that individual clinic time series correlated with each other, and replacing a single clinic with a white noise signal of equal variance reduced the correlation between that clinic and the aggregate ILI trend ( Figure S2). Influenza-like illness trends in Ho Chi Minh City ( Figure 1) suggest that there are typically multiple ILI peaks per year, as has been observed in other tropical and sub-tropical regions 11,17,44 . Visually, no seasonal or annual cycle appears in these data.
In a subset of the clinics, molecular confirmations on naso-pharyngeal samples (n = 2,217) were taken from May 2012 to December 2015. Compared to other tropical settings, these clinics had a rate of influenza positivity (21.5% positivity for influenza A and 9.7% positivity for influenza B) in the high range of previously published studies 14,23,28,36,37,49 . We compared the confirmed influenza cases to the ILI data and found that there was no correlation between the two time series (Figure 2; Pearson correlation coefficient: −0.02, p-value: 0.86) and that this did not differ for influenza A and B individually (both p-values > 0.15). The time series showed periods of high ILI activity with a low level of influenza confirmation, likely representing epidemic waves of other respiratory viruses, as well as periods that were high influenza and low ILI, suggesting that an influenza epidemic may not contribute as much to the overall ILI trend as it does in temperate regions.
We identified a dominant periodicity in the data using an auto-correlation function and standard time series decomposition (see Materials and Methods). The auto-correlation function (ACF) identified 206 days (ACF = 0.262; p-value < 10 -15 ) whereas the discrete Fourier transform identified 199 days as the time series' dominant periodic signal (ACF = 0.244 for a lag of 199 days; p-value < 10 -15 ); see Figure 3. This non-annual signal is almost twice as strong as the annual cycle, with the 365-day lag exhibiting an auto-correlation value of 0.153 (p-value = 0.014); note that the large number of data points results in statistical significance for nearly all ACF values. A dominant non-annual signal is an unusual feature in disease incidence data. We verified that this result was not an artifact of our data renormalization and detrending methods by applying these same methods to temperate zone ILI data and showing that ILI time series in Europe and North America show their strongest periodic signals at 365 days, with no evidence of periodic signals shorter than one year ( Figure S3).
To determine the relative influence of annual and non-annual signals on the ILI trend, we performed a stepwise regression of the ILI trend onto both annual climatic variables and the system's intrinsic non-annual cycle.
Lagged variables, interactions, and non-linear transformations of the climate variables were included; the nonannual cycle was constructed as a step-function with periodicity 206 days (see Materials and Methods). The stepwise regression indicated that the terms with explanatory power were the daily temperature, relative humidity (RH and √RH), the interaction term between RH and temperature, lagged climate terms (see eq. 2 in Materials and Methods), and the non-annual cycle. Clearly, the association between the non-annual effect and the ILI trend is statistically significant (Table 2) Several robustness tests were performed. Figure S6 shows that forecasting using a 202-day intrinsic nonannual cycle in combination with bootstrapped climate data gives the most accurate forecasts, and that a 211-day cycle was optimal when forecasting ILI trends using real weather data. These results are robust to whether mean or median prediction error is used as an evaluation criterion ( Figure S7). Using a simpler regression model with no lags and no non-linear climate terms, a 201-day cycle gave the lowest prediction errors ( Figures S8 and S9).
All analyses provided support for the existence of a non-annual cycle with periodicity of approximately 200 days.
Our decomposition, stepwise regression, and prediction analyses provide strong evidence that an intrinsic non-annual cycle of around 200 days exists for respiratory disease transmission in Ho Chi Minh City. This cycle is either unique to the dynamics of respiratory infections in tropical climates, or it is a natural part of respiratory disease epidemiology in all regions but not detectable in temperate countries as a result of being overwhelmed by the strong winter seasonality of respiratory disease transmission. An ILI indicator, showing whether ILI percentages are above or below the mean trend, is updated daily and publically available (www.ili.vn) providing a real-time surveillance system for patients and clinical providers.

Discussion
Our study demonstrates the value of community epidemiology studies for describing fine-scale dynamics of influenza-like illness in tropical settings where respiratory disease dynamics are non-annual and difficult to predict.
We were able to show that a network of community clinics can generate a high-quality syndromic time series that can be used to understand local transmission patterns of respiratory disease, and that such a network can generate a significantly larger data set (~6,000 data points per year) than traditional surveillance systems that report weekly or monthly measures of incidence. This volume of data increases statistical power to detect ILI associations, and in our study, the presence of non-annual forcing in the system. The present study does not achieve the data volume seen in 'big data' study designs 21,50-52 which can have tens of millions of observations per year, but the specificity of our data signal is higher than in the aforementioned studies as each data point in our study corresponds to a patient, seen by a physician, determined to have met or not met the clinical criteria for influenza-like illness.
The major quality control challenge we encountered was accounting for long-term trends in ILI (we had a downward trend in our data). In a multi-site time series, detrending must be done carefully, and changes in a site's reporting patterns must be investigated individually. From discussions with the reporting physicians in our study, the putative causes of the decreasing trend in ILI were likely to have been (i) a more than doubling of patient visit costs that would have reduced the likelihood of reporting a minor respiratory illness, (ii) increased clinical specialization at some sites, or (iii) more conservative interpretation of ILI guidelines after molecular diagnostics were introduced in May 2012. In addition, during 2011 and 2012 a few large clinics were enrolled in the study, and some of these had higher patient volumes but lower ILI percentages. All of these features of communitybased syndromic reporting systems need to be considered for both study design and surveillance purposes.
Detrending with a 12-month moving average appears to be the simplest way to detrend and preserve any potential annual structure in the data.
The lack of correlation between influenza trends and ILI trends suggests that the transmission dynamics of respiratory disease differ between tropical and temperate zones, consistent with the past decade's literature on this topic 10,11,13,17,43,46 . Given the observed pattern of multiple ILI peaks in our data, some of which are influenza epidemics and some of which are not, the natural hypothesis explaining this pattern is that multiple respiratory pathogens co-circulate and cause asynchronous epidemics. It is unknown if in such a system multiple respiratory pathogens should circulate independently or not. The putative mechanism that would create dependence or interference among waves of different co-circulating respiratory viruses would be post-infection raised antibody or cytokine concentrations 53-55 generated by one viral epidemic preventing an epidemic of a different virus from taking off immediately thereafter. Epidemiological interference among respiratory viruses has been observed in long-term time series in temperate 56,57 and tropical 58 regions, but the strength and duration of this effect is not well understood. In our community study, additional molecular confirmations for a range of respiratory pathogens are now underway to further describe this phenomenon.
The second major question that arises from the basic correlational analysis between ILI and influenza is why high influenza periods should be observed when ILI is low. To the best of our knowledge, this pattern has not been observed in other surveillance systems, as a wave of influenza infections is normally sufficient to generate a substantial uptick in the ILI signal. The likely explanation for a high-influenza low-ILI period is a larger than expected prevalence of other respiratory viruses among the reported ILI cases; this is possible as the community clinics in our study are almost exclusively outpatient and likely to see many mild cases of respiratory disease. If influenza infection represents only a small fraction of respiratory disease among these outpatients, a wave of influenza alone would not generate an ILI peak. In general, community-based studies of respiratory disease should aim to characterize the contribution of all respiratory viruses to the ILI trend to determine if it is a particular pathogen's dominance or synchrony among certain pathogens that generates an ILI peak.
The major finding in our study is that the dominant periodicity observed in our ILI time series is nonannual. This is the first report of a non-annual disease cycle in temperate or tropical respiratory disease data. The existence of an intrinsic non-annual cycle in the dynamics is supported by traditional time series decomposition, by a regression of the time series onto both annual and non-annual covariates, and by an analysis of the system's predictability showing that accurate forecasts of ILI trends are highly dependent on the system's non-annual cycle of ~200 days. The presence of non-annual periodicity is consistent with a mechanism of short-term non-specific immunity conferred by one respiratory virus that affords near-term protection ( Although a complete forecasting evaluation will require a separate analysis, we can already detect one clear limitation of ILI forecasting methods: that they must be based on future weather predictions which, in our analysis, were bootstrapped from past weather data. Nevertheless, this proved to be a small obstacle in our analysis as, for Ho Chi Minh City, the bootstrapped climate variables yielded accurate predictions of averages for temperature and relative humidity ( Figure S5). In other words, it is more likely that higher levels of ILI during a particular period are affected by the average climate behavior during that period, and not by any particular days that have extremes in temperature or relative humidity. This contrasts with the climate mechanisms proposed in temperate zones where it is postulated that the onset of abnormally low absolute humidity is closely associated with the onset of the influenza season 41 . The larger question on climate effects and influenza -why AH, RH, and temperature appear to have different transmission effects in temperate and tropical regions 11,43,59 -remains to be answered. Much work remains to be done before respiratory disease outbreaks in the tropics can be forecast accurately; our hope is that the non-annual signal identified in this study will help in this endeavor.
A second limitation in the current study design is the lack of age information. We experimented with several different reporting methods (email, log books) for this study, but only the log-book method was able to capture age information consistently. Unfortunately, this method was adopted by a minority of the clinics in our study, and it was not compatible with real-time reporting. The age distribution of ILI cases represents a critical data gap in our study and in other mHealth studies that aim at real-time reporting, as the age distribution could tell us whether the major disease burden skews towards childhood respiratory diseases or general respiratory diseases like influenza. As tropical countries have younger age distributions than temperate countries, this difference may have a profound epidemiological effect on differences in ILI dynamics between temperate and tropical zones, as well as the proportion of ILI cases that are caused by influenza versus other respiratory viruses.
The public health value of our mHealth reporting system is that ILI results can be fed back in real time to participating physicians and the community of health professionals in Ho Chi Minh City. Real-time ILI trends from our study are publicly available and updated daily. The two key questions raised by our study are (i) to what extent the transmission of non-influenza respiratory viruses in the tropics is a potential driver of complex multipathogen transmission system, and (ii) whether it is useful to attempt the timing of influenza vaccination in an epidemiological scenario where influenza epidemics occur irregularly. We aim to investigate the first of these questions by introducing more respiratory virus diagnostics into our study. The second question can be evaluated with a mathematical model of influenza epidemiology, but will necessitate a longer influenza time series and a better understanding of the key drivers of influenza virus dynamics in tropical settings. with log books and email is also available. SMS messages are automatically passed to a text-parsing and datacleaning system that was set up and is still actively managed by the Oxford University Clinical Research Unit (OUCRU) in HCMC. Every day, ILI reports are manually approved by a qualified project team member at OUCRU; on approval they are automatically entered into a mySQL database that holds all data points for the study.

Influenza-like illness data. In August 2009, a participatory epidemiology study was established in Ho Chi
A small number of clinics (about 8%) did not use SMS reporting (by their request) and instead emailed ILI numbers to the project team or wrote them down in a daily logbook provided by OUCRU. As part of the data processing pipeline, reports by email or logbook were regularly merged into the main mySQL database.
Community engagement meetings were run for the first several years of the study to distribute and explain the study protocol, and a basic leaflet outlining the goals of the study and the reporting methodology was distributed to interested physicians. All documents were translated into Vietnamese, and annual reports and ILI trends were fed back to the clinics on a regular basis. A total of 63 clinics were enrolled in the initial study period The series of daily climate data were smoothed with a 15-day moving average before being used in our analyses.

Time series detrending and standardization.
A total of 28 regularly reporting clinics (those who reported at least 200 reports from 2010-2015 and reported positive ILI numbers at least half of the time) were included in the time series analysis. A 29th clinic that met these inclusion criteria was removed for quality control reasons. The ILI data of 2009 were not used in the analysis due to the small number of reporting clinics during the first five months of the study. Each clinic's time series was converted to a z-score scale by computing the z-score of each ILI percentage inside a 12-month moving window (centered at the calculated data point), thus removing long-term trends in the data; we verified that window sizes of 6, 9, 15, and 18 months did not have any qualitative effects on the overall ILI trends. The daily z-scores were averaged across clinics and smoothed using a 15-day window to construct the ILI z-score time series that we used in our subsequent analysis (see Figure S1 for effects of different smoothing windows).
The time series was validated by verifying that is was not white noise (p-value < 10 -15 , Box-Ljung test) and by showing that the majority of individual clinics had a higher correlation to the aggregate time series than would be expected if reporting were random ( Figure S2).
Statistical analysis and forecasting. Periodicity and frequency decomposition in the smoothed ILI trend were assessed with a standard auto-correlation function (ACF) and a Discrete Fourier Transform (DFT). The ILI zscore time series was regressed (linear link function) onto linear and non-linear variants of the climate variables (T, RH, AH, √T, √RH, √AH, T 2 , RH 2 , and AH 2 ) to determine which non-linear effects were present, as there is some evidence of non-linear effects of climate on ILI 61 . In addition, a time-dependent fixed effect α j mimicking the dominant periodicity identified by the ACF (here, 206 days) was included on the right-hand side of the regression equation. Twenty-one α j were allowed for in the model, meaning that periodicity in the system is modeled with a piecewise constant function taking 21 different values during a full period of 206 days. This is equivalent to having 21 fixed-effect terms in a regression, each multiplied by an indicator variable describing whether that data point belongs to that period, ensuring that only one fixed-effect term is added at a time. The piecewise constant function has an advantage over the sinusoidal approach traditionally used in epidemiological analyses because the stepwise nature of the α j allows the periodicity in the system to take any shape determined by the data and does not require that the forcing function to be sinusoidal or continuous.
The non-annual cycle, T, √RH, and RH were the explanatory terms according to the Akaike Information Criterion (AIC) using the stepwise regression approach in R (step() function). The ILI z-scores were then regressed onto the non-annual cycle, T, √RH, and RH, and lagged versions of these climate variables, extending back five weeks in the past. The same stepwise regression approach (step() function in R) using the AIC was used to remove regression terms that did not add explanatory power. The selected regression equation is To determine if the regression approach offers any predictability in the system, we inferred the regression coefficients and the time-dependent fixed effects using the first three years of data from January 1st 2010 to December 31st 2012, and we compared the predicted and real ILI trends for 2013-2015. The median prediction error was defined simply as the median of all of the absolute differences between the predicted z-score time series and the real z-score time series. We varied the size of the training set to determine how many years of data would be needed to achieve robustness in predictability ( Figure S4).

Bootstrapping climate data.
To test the robustness of this prediction to changes in the annual climate cycle and

Acknowledgements
We are very grateful to all the participating clinicians in this study, to the study nurses Nguyen Thi Kim Cuong,     Table S2. The period length of each DFT can be calculated by dividing 2,191 (the number of days in the time series) by the corresponding number of cycles (the frequency of the DFT). Frequencies whose power is lower than 6.93 (i.e. periodic functions whose correlation with the z-score time series is lower than their correlation with a constant signal) are shown in gray. The DFT reaches its highest power at 11 cycles, corresponding to a cycle length of 199 days.