A Bayesian framework for modeling COVID-19 case numbers through longitudinal monitoring of SARS-CoV-2 RNA in wastewater

Wastewater-based surveillance has become an important tool for research groups and public health agencies investigating and monitoring the COVID-19 pandemic and other public health emergencies including other pathogens and drug abuse. While there is an emerging body of evidence exploring the possibility of predicting COVID-19 infections from wastewater signals, there remain significant challenges for statistical modeling. Longitudinal observations of viral copies in municipal wastewater can be influenced by noisy datasets and missing values with irregular and sparse samplings. We propose an integrative Bayesian framework to predict daily positive cases from weekly wastewater observations with missing values via functional data analysis techniques. In a unified procedure, the proposed analysis models severe acute respiratory syndrome coronavirus-2 RNA wastewater signals as a realization of a smooth process with error and combines the smooth process with COVID-19 cases to evaluate the prediction of positive cases. We demonstrate that the proposed framework can achieve these objectives with high predictive accuracies through simulated and observed real data.


November 8, 2023
A Wastewater RT-qPCR analysis RT-qPCR was used to quantify SARS-CoV-2-RNA in wastewater.United States Center for Disease Control (US-CDC) primers and probes were used to amplify two regions of the nucleocapsid gene (i.e., N1 and N2).Amplification reactions for SARS-CoV-2 detection were performed in triplicate following the N1 and N2 assays previously described (Acosta et al., 2021).Samples were considered positive for N1 or N2 gene targets if the threshold quantification cycle (Cq) was lower than 40 cycles (Acosta et al., 2021).Total mass flux of SARS-CoV-2 RNA concentration measured as copies per day for the city of Calgary was calculated as previously described by Acosta et al. (2022), in which the daily flow rate of each WWTP was multiplied to copies/mL of SARS-CoV-2 RNA measured in wastewater to generate the number of copies per day for each WWTP (Acosta et al., 2022).For this reason, the RNA concentration (copies per day) across three WWTPs in Calgary can be aggregated into one data point per sampling day to represent city-wide SARS-CoV-2 burden.

B Simulation results
Additional numerical analysis of the simulations are shown in Table 1 and 2

C Model diagnostics analysis of Calgary data
As shown in Figure S6, the residuals of the proposed regression with P = 1 shows strong evidence of non-linear trends, and the homogeneity of residuals has been significantly improved when the second-order effect is included in the proposed regression.The smooth lines in Figure S6 are fitted to the scatter plots of residuals using local polynomial regression (Cleveland and Grosse, 1991).
We also check the performance of predictions using the posterior predictive checking method described in Section 6.3 of Gelman et al. (1995).The method checks the proportion of MCMC samples of prediction values on the test data that are larger than the actual value: P ( Ŷt > Y t ), which can be seen as the tail-area probability of the actual value in    the distribution of predicted values.Therefore, this proportion is also referred to as the Bayesian p-value.A Bayesian p-value that is too large or too small (e.g., beyond the 99% confidence bounds) indicates that the distribution of predictions does not cover the true value very well.In Supplementary Information Figures S8 and S9, the Bayesian p-values are calculated for the predicted positive cases and positivity rates in the prediction period.For most of the predictions, their Bayesian p-values are between the lower bound and upper bound, indicating that the actual values are not in the tail-areas of prediction distributions and the prediction performance of the proposed framework is reasonably strong.As shown in Supplementary Information Figure S10, the MCMC samples of WBS signal function X(t) are compared with actual observations of N1 and N2 copies in wastewater of Calgary.The sampled X(t) functions have relatively small variation and reflect the overall dynamics of WBS observations except for the observations in early July 2021 when there was a sudden drop in WBS observations and in positive cases.Due to the smoothing of trajectories in the FPCA process, the sampled X(t) functions cannot drop as much as the actual observations do, which explains why the only time when the actual values fall outside the prediction bands in Figures 4 and 5 of the main text happens in early July of 2021.The distributions of posterior estimates of the regression coefficients β 1 , β 2 and β c 1 are shown in Supplementary Information Figure S11.The mean of β 1 's posterior estimates is 0.47, and the mean of β 2 's posterior estimates is 0.03.Also, the credible interval (CI) of β 2 's samples dose not cover zero, which further proves the significance of the effects of squared WBS signals on modeling case numbers.

D Predicting intensive care unit (ICU) admissions in the city of Calgary
As an extension of our Calgary study in the main text, we implement daily predictions for the number of ICU admissions with a positive COVID-19 test (referred to as the ICU admissions later) in the city of Calgary.The offset term used here is the number of hospitalized patients with a positive COVID-19 test (referred to as the hospitalized cases later) in Calgry.Note that the hospitalized cases include all admitted patients with a positive (and mandatory) COVID-19 test, but COVID-19 may not be the primary reason of hospitalization.Following the Calgary study on the number of positive cases, the predictive model is built on WBS and clinical data from June, 2020 to March, 2021, and the prediction period is from April, 2021 to January, 2022.The proposed framwork is fitted with a Poisson-distributed outcome and the first and second power of WBS signals (i.e., P = 2).
Figure S12 shows the predicted number of ICU admissions calculated from the MCMC samples of unknown parameters as specified in Section 2.4 of the main text.Compared to the actual number of reported daily cases, the predicted numbers can mostly track the actual trend of ICU admissions with reasonable scale of variation.However, the associations between ICU admissions and city WBS signals may not be as strong as those between positive cases and city WBS signals because all infected residents can shed viruses.The proposed framework can predict positive cases more accurately as shown in Section 3.2 of the main text than predicting ICU admissions.To improve predictions, a possible solution is to sample wastewater from the hospital community sewer system instead of city wastewater plants.The results for posterior predictive checking for the model predicting positivity rates, which is also known as the Bayesian predictive p-value.The value is calculated as the proportion of predicted values that are greater than the actual value, and a value that is close to 0 or 1 indicates poor prediction.The total tail-area probability (beyond the red lines) is 1%.
. Additional simulation results are shown in FigureS1, S2, S3, S4, and S5.See Section 3 of the main text for their descriptions.

Figure S1 :
Figure S1: The prediction results of the proposed framework with Poisson-distributed outcome in Simulation 3. (A): The simulated number of cases (black line) and the mean of the MCMC sampled curves (red line).The minimum and maximum of the MCMC sampled curves are shown in red envelop, with the 10 th and the 90 th percentiles indicated by the darker red inner envelope.(B): The MCMC sampled X(t) function and the simulated (logtransformed and centered) estimated number of viral genomes based on N1 and N2 gene targets.The light red curves are the MCMC sampled curves, and the dark red curve is the true X(t) function.

Figure S2 :
Figure S2: The prediction results of the proposed framework with Poisson-distributed outcome in Simulation 4. (A): The simulated number of cases (black line) and the mean of the MCMC sampled curves (red line).The minimum and maximum of the MCMC sampled curves are shown in red envelop, with the 10 th and the 90 th percentiles indicated by the darker red inner envelope.(B): The MCMC sampled X(t) function and the simulated (logtransformed and centered) estimated number of viral genomes based on N1 and N2 gene targets.The light red curves are the MCMC sampled curves, and the dark red curve is the true X(t) function.

Figure S3 :
Figure S3: The prediction results of the proposed framework with Poisson-distributed outcome in Simulation 5. (A): The simulated number of cases (black line) and the mean of the MCMC sampled curves (red line).The minimum and maximum of the MCMC sampled curves are shown in red envelop, with the 10 th and the 90 th percentiles indicated by the darker red inner envelope.(B): The MCMC sampled X(t) function and the simulated (logtransformed and centered) estimated number of viral genomes based on N1 and N2 gene targets.The light red curves are the MCMC sampled curves, and the dark red curve is the true X(t) function.

Figure S4 :
Figure S4: The prediction results of the proposed framework with Negative Binomialdistributed outcome in Simulation 6. (A): The simulated number of cases (black line) and the mean of the MCMC sampled curves (red line).The minimum and maximum of the MCMC sampled curves are shown in red envelop, with the 10 th and the 90 th percentiles indicated by the darker red inner envelope.(B): The MCMC sampled X(t) function and the simulated (log-transformed and centered) estimated number of viral genomes based on N1 and N2 gene targets.The light red curves are the MCMC sampled curves, and the dark red curve is the true X(t) function.

Figure S5 :Figure S6 :
Figure S5: The distributions of posterior estimates of the regression coefficients β 1 and β 2 in Simulations 1 and 2. The red line shows the real values of regression coefficients.The y-axis shows the number of posterior samples in the corresponding interval.

Figure S7 :Figure S8 :
Figure S7: Gelman-Rubin-Brooks plot of the training model in Calgary.The plot assess the convergence of MCMC chains.A PSRF value close to 1 indicates good convergence of random parameter.

Figure S10 :Figure S11 :
Figure S10: The MCMC sampled X(t) function and the observed (log-transformed and centered) numbers of copies of N1 and N2 viruses in Calgary.The sampled X(t) functions are generated from the proposed integrative framework.

Wave− 3 Figure S12 :
Figure S12: The posterior mean of the MCMC sampled curves (red line) and actual number of cases (black line).The minimum and maximum of the MCMC sampled curves are shown in the lighter red envelope, and the 5 th and the 95 th percentile of the sampled curves are shown in darker red envelope.

Figure S13 :
Figure S13: The log-transformed numbers of N1 copies (first row) and N2 copies (second row) in all Calgary wastewater plants before and after normalization.

Table 1 :
Numerical analysis of simulation results.For estimated WBS function, the average MSE is calculated, and the standard errors of the averages are included in the parenthesis.The results are obtain from a Poisson-distributed outcome.

Table 2 :
Additional comparisons of different methods on simulated data.For the proposed framework fitted with the Poisson-distributed outcome and P = 1 and P = 2, the average MSPE are calculated, and the standard errors of the averages are included in the parenthesis.For other frequentist methods, the MSPE is calculated by comparing predicted case numbers with true case numbers.In Simulation 5, the MSPE is calculated by comparing predicted positivity rates with actual positivity rates.