A multivariate statistical approach to predict COVID‐19 count data with epidemiological interpretation and uncertainty quantification

For the analysis of COVID‐19 pandemic data, we propose Bayesian multinomial and Dirichlet‐multinomial autoregressive models for time‐series of counts of patients in mutually exclusive and exhaustive observational categories, defined according to the severity of the patient status and the required treatment. Categories include hospitalized in regular wards (H) and in intensive care units (ICU), together with deceased (D) and recovered (R). These models explicitly formulate assumptions on the transition probabilities between these categories across time, thanks to a flexible formulation based on parameters that a priori follow normal distributions, possibly truncated to incorporate specific hypotheses having an epidemiological interpretation. The posterior distribution of model parameters and the transition matrices are estimated by a Markov chain Monte Carlo algorithm that also provides predictions and allows us to compute the reproduction number Rt. All estimates and predictions are endowed with an accuracy measure obtained thanks to the Bayesian approach. We present results concerning data collected during the first wave of the pandemic in Italy and Lombardy and study the effect of nonpharmaceutical interventions. Suitable discrepancy measures defined to check and compare models show that the Dirichlet‐multinomial model has an adequate fit and provides good predictive performance in particular for H and ICU patients.

For the analysis of COVID-19 pandemic data, we propose Bayesian multinomial and Dirichlet-multinomial autoregressive models for time-series of counts of patients in mutually exclusive and exhaustive observational categories, defined according to the severity of the patient status and the required treatment. Categories include hospitalized in regular wards (H) and in intensive care units (ICU), together with deceased (D) and recovered (R). These models explicitly formulate assumptions on the transition probabilities between these categories across time, thanks to a flexible formulation based on parameters that a priori follow normal distributions, possibly truncated to incorporate specific hypotheses having an epidemiological interpretation. The posterior distribution of model parameters and the transition matrices are estimated by a Markov chain Monte Carlo algorithm that also provides predictions and allows us to compute the reproduction number R t . All estimates and predictions are endowed with an accuracy measure obtained thanks to the Bayesian approach.
We present results concerning data collected during the first wave of the pandemic in Italy and Lombardy and study the effect of nonpharmaceutical interventions. Suitable discrepancy measures defined to check and compare models show that the Dirichlet-multinomial model has an adequate fit and provides good predictive performance in particular for H and ICU patients. others, Reference 1), both at the beginning of the outbreak, when scaling up health-care capacity is crucial to save lives, and in the later phase of the epidemic, to better plan the timing to return to usual capacity in each hospital ward.
Predictions should be trusted with care because the official data may present biases due to the observational nature and the delays of the collection process. For example with reference to Italy, it sometimes happens that some data collected over a period of several days are officialized in a single day, causing spikes in the time series. Furthermore, this reporting delay is not constant over time but rather emerges more prominently during the emergency phase. In this often biased observational context, an advantage of our proposal is that it automatically smooths spikes in the data that can be identified and later investigated to understand if they are due to reporting errors or other causes such as the resurgence of the pandemic, in which case a warning signal should be issued.
Besides timing related issues, as suggested by Roda et al 2 prediction is very difficult when there is a lack of reliable data sources. One of the main problems, especially in Italy, is related to the fact that swabs to identify COVID-19 virus infections have been dispensed only to those showing symptoms and having been in contact with a person who tested positive. This was the protocol at the beginning but it was changed a few times during the emergency period and these modifications also caused reporting biases. Therefore, due to nonrandomization of the tested people, lack of testing kits and, even more importantly, due to scarcity of labs accredited to process the swabs, as suggested by Roda et al 2 "the entire iceberg represents the total infected population and the tip of the iceberg above the sea surface represents the case data." This phenomenon is called hidden epidemic and caused a critical care crisis in Italy as well as in other countries. 3 A series of univariate models such as the Poisson 4 and models based on a negative binomial distribution, [5][6][7] as well as the generalized logistic growth model, 8 have been proposed for single time series of counts, especially to analyze Italian data. These models are often formulated with a temporal trend 9 through polynomials and splines. Differently from the univariate models outlined above, the model we propose explicitly considers that the count for a certain category at a certain time occasion is the sum of transitions from the same and other categories that these individuals occupied at the previous time occasion. In other words, our model directly formulates assumptions on the sequence of contingency tables of the "transition frequencies" between two consecutive time occasions. This assumption directly induces the distribution of the total counts that, differently from the unobservable transition frequencies, are directly observed and should result as column totals of the contingency tables. The advantage of our multivariate modeling framework is in terms of stability and precision, leveraging on the fact that some of the counts on the variables included in the model are inherently less prone to measurement errors. These errors may arise for many reasons such as because people in the population who were infected by the virus remain asymptomatic during the first days of the infection. We also have to consider the undercounting of deaths caused by social isolation and other factors as detailed in Reference 10, as well as due to delays in the reporting schedule.
The proposed approach may be seen as an extension of that proposed in Reference 11 for 2 × 2 contingency tables. Moreover, it is related to the SEIR (Susceptible-Exposed-Infected-Recovered) epidemiological model. 12 Indeed, we explicitly model the reduction of the number of susceptible individuals, the virus transmission rate, the transfer rate from exposed to infected, the diagnosis, and the recovery rate. We note that according to the Italian regulations during the period of time we consider, the category of susceptible also includes asymptomatic cases. This is due to the health policy measures in place in Italy during the first wave of the pandemic when only individuals with symptoms were tested. Indeed, asymptomatic and pauci-symptomatic cases are not reported among the "positive cases." We also stress that the deceased category includes both people who died because of COVID-19 as well as with COVID-19. A subsequent analysis of the mortality cards revealed that 89% of the deaths are directly attributable to COVID-19. * We can furthermore estimate the time evolution of the epidemic reproduction number together with credible bounds.
We cast our proposal in a Bayesian framework [13][14][15] because it allows us to incorporate expert prior information that, when data are lacking, helps in regularizing the likelihood function, and allows for predictions at the very beginning of the pandemic period. Priors can also be informed from data available in countries where the epidemic started earlier, like data from Hubei, where the first cases were reported on 22 January 2020, approximately 10 days earlier than in Italy. By this time, in the Hubei area, more than 5800 cases were already present. † The modeling strategy is flexible and proceeds in steps of increasing complexity. Our proposal is conceived to provide a model that can explore the available data and thus is first estimated with noninformative priors. Then, to account for epidemiological hypotheses, we introduce truncated priors enforced by imposing bounds to the admitted values of the odds of transition across categories. Finally we also account for public health non-pharmaceutical interventions (NPI) enforced to reduce the spread of the epidemic, thus causing changes to the time series of reported confirmed cases. This is achieved by introducing dummies 16 at certain time points to account for the effect of NPI. It is therefore important to retain the capacity to fit increasingly complex models. We also provide an estimate of the reproduction number R t following the method described in Reference 17, where the authors assume that the "serial interval" has a Gamma distribution with certain parameter values.
The multinomial and Dirichlet-multinomial autoregressive models may be considered as stochastic processes following a first-order Markov chain conditional on the latent disease status. 18 In our formulation these models include absorbing states, as that of deceased patients. For each category, including that of the susceptible individuals not previously ill, recovered, and deceased, we apply Bayesian inference 14 to estimate the persistence in each category, the transition probabilities between categories across time, and also the associated uncertainty of the estimates. Note that the assumption of first-order dependence is typically formulated in the literature on time-series categorical data. The motivation is that most of the relevant information to explain the counts at a given time occasion is contained in the counts at the previous occasion, and we consider this as plausible also in our context. In principle, this assumption could be relaxed by allowing for a higher-order dependence. However, this type of extension would be very complex to implement and also for this reason we choose to retain a first-order dependence.
For model estimation we adopt an MCMC algorithm 19 that simulates parameter values from their posterior distribution. The algorithm is based on a unified data augmented scheme 20 and comprises two iteratively repeated steps: the first step is based on sampling tables of transition frequencies using the technique described in References 21,22; the second draws new values of the model parameters on the basis of moves based on a Metropolis-Hastings (MH) acceptance rule. 23,24 Using a large number of iteratively generated MCMC draws, we obtain the estimated joint posterior distribution of parameters that is then summarized by marginal posterior averages and prediction intervals. To diagnose possible violations of the model assumptions and compare the performance of alternative models we use a suitable discrepancy measure and compute posterior predictive p-values. [25][26][27] The remainder of the article is organized as follows. In Section 2 we describe the proposed models. In Section 3 we illustrate the estimation of the model parameters, of the reproduction number together with predictions of various interesting quantities, and derive some discrepancy measures for model checking and comparisons. In Section 4 we show the results of the models estimated with the Italian data available during the first wave of the pandemic; we also report on the results obtained with data coming from the Lombardy region where the spread of the virus began in Italy. In Section 5 we provide some concluding remarks. In the Appendix some additional details are presented.

PROPOSED APPROACH
Data consist of counts, over T time occasions, for K disjoint and exhaustive categories that will be jointly modeled: y tk , t ∈  , k ∈ , where  = {1, … , T} and  = {1, … , K}. For each time occasion, these observed frequencies are collected in the vectors y t = (y t1 , … , y tK ) ′ , t ∈  . We assume, for simplicity, that the total population size is fixed over time, namely, ∑ k∈ y tk = N for all t. The corresponding random vectors are denoted by Y t and have elements Y tk that satisfy the same constraint on the sum over k. In the application referred to official data provided at the national level on the COVID-19 pandemic, individuals are classified in K = 6 ordered (in terms of their severity) categories: susceptible not previously ill (S), recovered (R), positive cases in quarantine (Q), hospitalized (H), intensive care (ICU), and deceased (D); for each of these categories we observe the frequency on a daily basis. The "now positive" (NP) category is obtained as the sum of individuals in the Q, H, and ICU categories.

Model assumptions
We consider the counts for the first time occasion, t = 1, as given, and, in formulating an autoregressive model for the vector Y t , we assume that every element Y tk is the column total of a contingency table having row totals equal to the elements Y t−1,k of Y t−1 for t > 1. In more detail, let X tjk represent one of the frequencies in this contingency table, a random variable corresponding to the number of individuals that at occasion t − 1 are in category j and at occasion t move to category k. In symbols we have that Y tk = ∑ j∈ X tjk , for all k, are the column totals and Y t−1,j = ∑ k∈ X tjk , for all j, are the row totals. These column and row sums are the only observable variables since they are the only publicly provided counts. This structure is clarified in Table 1, where zero values are inserted to denote that the corresponding 0 random variables are equal to zero with probability one (structural zeros). This is because state D is absorbing (zeros in the last row) and because we assume that, once infected, patients are not susceptible anymore (zeros in the first column). The unobserved random variables X tjk are collected in the vectors X tj = (X tj1 , … , X tjK ) ′ , j ∈ , t ∈  ′ , where  ′ = {2, … , T}, and are here named "transition frequencies." For instance, in our application on COVID-19 illustrated in the sequel, where we consider six categories, X t35 corresponds to the number of individuals that moved from category Q (number 3) at time t − 1 into category ICU (number 5) at occasion t.
It is natural to assume that every vector X tj , given Y t−1 , follows a multinomial distribution with size y t−1,j and specific vector of "transition probabilities" p tj = (p tj1 , … , p tjK ) ′ with elements summing to 1; in symbols, we have for t ∈  ′ and j ∈ , where j is the matrix of the regression vectors jk , k ∈  j , that are involved in the model for the probabilities in p tj as will be clarified below; see Equation (3). In particular, p tjk is the conditional probability that an individual is in category k at occasion t given that he/she was in category j at the previous time occasion. Assuming the multinomial distribution we can write It is well known that this conditional probability is related to the Poisson distribution. In particular, it is the conditional distribution of a set of independent random variables having Poisson distribution given their total. 29 The conditional expected value and the variance-covariance matrix under the multinomial distribution have the following expressions: In order to account for overdispersion, which may arise in the count data, we also consider a Dirichlet-multinomial distribution [29][30][31] for each vector X tj given Y t−1 , which is denoted by for t ∈  ′ and j ∈ , and depends on a vector of parameters tj = ( tj1 , … , tjK ) ′ . Consequently, we have that where tj+ = ∑ k∈ tjk , so that Parameters collected in j affect the parameters tjk as will be clarified below. Note that the level of overdispersion decreases as the total tj+ increases. This overdispersion may be motivated by the presence of measurement errors or unobserved heterogeneity, as the national counts are obtained by collapsing counts referred to different regions. Moreover, it is possible that, within our formulation, we omit important covariates because they are not available to us. These missing covariates may act as risk factors and influence the observed counts.
Obviously, either if we assume a multinomial or a Dirichlet-multinomial distribution, formulated in (1) or (2), respectively, the induced distribution for Y t given Y t−1 has a complex expression involving the sum of quantities like over all possible configurations of the contingency table with frequencies x t1 , … , x tK having certain column totals. In the multinomial case, the induced distribution is related to the multivariate hypergeometric that, however, may be difficult to compute in practice.

Adopted parametrizations
Under the multinomial model, in order to parametrize each probability vector p tj we introduce the subsets  j of  containing the indices of the elements of this vector that are not constrained to be equal to zero. Then, we assume the multinomial logit parametrization where the design column vectors f tjk contain the terms of a suitable polynomial of time t included in the model via the regression parameter vector jk . These parameters may be interpreted in terms of the logit of the probability of moving to category k starting from category j.
To ensure model identifiability, under the multinomial distribution we assume that jj ≡ 0, j ∈ , where 0 is a suitable dimensional column vector of zeros. In our application, in particular, we use common vectors across categories containing the elements of a second or third order polynomial, and we have f tjk = (1, t, t 2 , t 3 ) ′ for all t, j, and k. These vectors may also include covariates, such as dummies, to study the effect of epidemic containment policies 16 imposed at a certain time for mitigating the pandemic, as we show in our application. Alternatively, proper splines 32,33 may be considered. Overall, the free parameters of the multinomial model are collected in the matrix ; in particular, this matrix collects the vectors jk , The parametrization of the Dirichlet-multinomial version of the proposed model is simpler as we directly assume that on the basis of the quantities already defined above. In this case we not need to introduce identifiability constraints on the jk parameters and then we define the overall parameter matrix as that collecting the vectors jk , j ∈ , k ∈  j . The corresponding probabilities may be computed as In a Bayesian framework, we assume that a priori the regression parameters in each vector jk are independent and have a diffuse prior distribution. The initial and most natural choice is that of a multivariate Gaussian distribution, that is, where k ∈  ′ j for the multinomial version and k ∈  j for the Dirichlet-multinomial case, I is a suitable dimensional identity matrix, and 2 is the variance hyperparameter that can be fixed to be large, in case we want to assume a noninformative prior to perform exploratory analysis; for instance, we use a value equal to 100 in our application. However, in order to include certain epidemiological hypotheses in the model, such as the fact that some transitions are very rare or even impossible, and in order to increase the numerical stability of estimation and prediction, we introduce inequality constraints on convenient transformations of the parameters. These constraints can be used in a very flexible way and may be reduced to equality constraints. Let o tjk be the odds referred to category k with respect to category j at time occasion t, which are defined as o tjk = p tjk ∕p tjj . Our approach allows us to include the constraint that, for selected pairs of indices (j, k) in the set  to be appropriately chosen, for all time occasions t these odds are bounded from above and/or below although, in the application illustrated in Section 4, we only use upper limits. More precisely, we assume that where  * = {2, … , T * } and T * refers to the time until which predictions are computed. In summary, from a Bayesian perspective, we can also assume truncated priors 34 based on the multivariate Gaussian distributions in (6) under the constraint given in (7). Then, the model prior formulation amounts to specify the value of the variance 2 together with the set  and the limits a jk and b jk for the odds having indices in this set. An alternative to the approach to formulate prior distributions described above could rely on a reparametrization ensuring that inequalities in (7) are always attained, so that a prior Gaussian distributions can be assumed on the transformed parameters without introducing any truncation. However, we prefer to retain the proposed way to specify prior distributions as the above inequalities can be easily accounted for in the MCMC estimation algorithm.
Possible extensions of the above formulation could consist in differentiating the order of the polynomial for the multinomial or Dirichlet-multinomial parameters considered in (3) and (4) across the possible pairs (j, k), but we prefer the approach based on truncated priors to avoid model selections issues and because the proposed truncation is based on a clearly interpretable criterion. Moreover, in the proposed framework, it is also possible to use informative prior distributions for the parameter vectors jk . In particular, for each of these vectors we can assume a Gaussian prior distribution with mean vector and variance-covariance matrix equal to the posterior estimates obtained from data available in countries that entered earlier in the pandemic emergency phase. In this way, we can have more accurate predictions at the beginning of the pandemic, when only data referred to a few time occasions are available.

Comparison with alternative models
The main employed models to predict the expected number of infections use univariate counts assumed to follow a Poisson or a negative binomial distribution. However we jointly model all the observable counts while the full process of transitions between categories is unobservable. An approach of this type for 2 × 2 contingency tables has been proposed in Reference 11 on the basis of a Binomial distribution assumed for each row of these tables. Even this approach follows a Bayesian formulation based on Beta prior distributions assumed for suitable probability parameters and it relies on an MCMC estimation algorithm, with special attention to inference on the odds ratio as a measure of association in each contingency table.
It is worth noting that our proposal may be cast in the literature on hidden Markov (HM) model. [35][36][37] A model of this class was first introduced to monitor epidemiological surveillance data for poliomyelitis counts. 38 However the literature in this field is not rich and this model is generally estimated by considering a penalized likelihood approach where the choice of the penalty is crucial. In our context, we consider a simpler model avoiding the definition of the latent states and we propose a fully Bayesian formulation by considering an MCMC algorithm which allows us to dispose of the simulated posterior probabilities of the model parameters.
Similarly to the Poisson model, our model has an epidemiological interpretation in line with the more common SEIR models 12 and we also provide an estimate of the reproduction number R t defined as the expected number of individuals a single infected person will infect over the course of his/her infection period. The R t can be considered as the average number of secondary cases per primary case; see Reference 39 for a study on the differences between the estimation of R t in a deterministic SEIR-type model and in a stochastic model like the Poisson model. A first attempt to estimate this number for COVID-19 is provided by Shao et al 40

BAYESIAN INFERENCE
In this section we provide some details about estimation of the parameters and of the reproduction number. We also deal with methods for model checking and for model comparison across different specifications.

Parameter estimation
The model is estimated through a Metropolis sampler by implementing a fixed scan algorithm based on two steps that are iteratively repeated. In the first step, we update the contingency tables X t with elements X tjk , j, k ∈ , given the current value of the parameters and the observed margins y t−1 and y t , for t ∈  ′ . In the second step we update the model parameters jk and an MH ratio is computed for each parameter vector in order to decide if the candidate move may be accepted.
Given the complexity of sampling tables with fixed margins under the assumption that frequencies in each row of the contingency table follow a multinomial or a Dirichlet-multinomial distribution with parameters defined in (3) or (5), we use the technique of Reference 41 based on: (i) randomly selecting two rows and two columns of the current table so that a 2 × 2 subtable is identified; (ii) performing a switch that consists in adding (or subtracting) to the two cells in the main diagonal of the subtable a random integer number that is subtracted (or added) to the off-diagonal cells; (iii) provided that the table proposed on the basis of the random switch has all positive frequencies, accepting this table with probability equal to , where x tj is the vector of the frequencies in the jth row of the current table, x * tj is that of the proposed table, and j is the matrix containing all current regression vectors jk , k ∈  j . On the basis of the definition of the probabilities involved in the expression given in (1) or (2), several simplifications are possible in computing the acceptance MH ratio.
After having updated the tables, and when the multinomial formulation is adopted, for each j ∈  and k ∈  ′ j , we update the regression parameters with a random walk Metropolis step and propose a new value of jk from a normal distribution centered on the current value of this parameter vector and with a proper variance-covariance matrix. Then, provided that inequalities in are verified (7), the proposed vector, denoted by * jk is accepted with probability = min where † jk is the same matrix as j with jk substituted with * jk , and ( jk ) is the prior density of the regression parameters. For the Dirichlet-multinomial version, the updating step of the jk parameters is performed as above for each k ∈  j .
At the end of the algorithm we obtain the simulated tables X (s) t , t ∈  ′ , and the parameter vectors (s) jk drawn from the posterior distribution, for s = 1, … , S, where S is the number of MCMC iterations. At every iteration we also include estimation and prediction of the reproduction number R t , as we detail in Section 3.3.

Frequency prediction
After having updated tables and regression parameters, at each iteration of the MCMC algorithm, we also make in-sample and out-sample prediction of the frequencies y tk . In particular, the MCMC algorithm draws parameter vectors (s) jk on the basis of which it is possible to obtain the probabilities p (s) tjk computed according to Equation (3) or (5) depending on the assumed count distribution, multinomial or Dirichlet-multinomial. Consequently, a prediction of the frequency y tk at step s of the algorithm is given byŷ The same formula may be applied for predicting y tk for t = T + 1, obtainingŷ T+1,k , whereas we can apply the recursive formulaŷ for t > T + 1. These predictions are collected in vectorsŷ (s) t in a suitable way. An alternative way to perform out-sample predictions of the frequencies y tk is based on simulating, at each step, the table for occasion T + 1, denoted by X (s) T+1 for iteration s, from the assumed distribution on the basis of the current parameter values and the observed frequencies in the vector y T . By summing the columns of table X (s) T+1 we obtain the predicted frequenciesỹ (s) T+1,k collected in vectorỹ (s) T+1 . This process is performed recursively so as to obtain the simulated tables X (s) t on the basis of the frequenciesỹ (s) t−1 and the corresponding predicted frequenciesỹ (s) tk collected in vectorsỹ (s) t for t > T + 1.
At the end of the algorithm we can obtain summary statistics for the predictionsŷ (s) tk andỹ (s) tk starting from simple means across the iterations, denoted byŷ tk andỹ tk , respectively. We can also associate measures of precision that take into account the variance of the posterior parameter distribution. In particular, by computing the variance of theỹ (s) tk predictions, we appropriately measure the level of uncertainty of such predictions that are directly generated from the model.

Estimation of a time-evolving reproduction number
In order to estimate the net reproduction number R t we take inspiration from the method already applied to the Italian context 17,42 and that is rather popular in epidemiology; see Reference 43, among others. In particular, we start from the assumption that the "serial interval" for COVID-19 follows a Gamma distribution with parameters 1.87 and 0.28, so that the mean is of 6.6 days, as established in Reference 17. However, note that from the literature uncertainty emerges about the length of the serial interval, as highlighted in the meta-analysis provided in Reference 44. The assumed length of this interval may strongly affect the final estimate of R t . Still, the reference model we select for the serial interval has been already used as a standard value for Italian data, and we thus adopt it for uniformity and comparability purposes. At every iteration s of the MCMC algorithm described in Section 3.1 we predict the reproduction number at issue aŝ where r,t−1 is a weight obtained by normalizing the density of the Gamma distribution with the above parameters (1.87 and 0.28) so that ∑ t−1 r=1 r,t−1 = 1 andΔI (s) t is the number of individuals in category NP predicted by the model for day t. The latter is directly computed on the basis of the sum of suitable elements of the transition probability matrix from the first category, namely,Δ Finally, we take the overall prediction as a mean across the MCMC iterations. These means are denoted byR t . This procedure allows us to estimate the net reproduction number for the observed time occasions and out of sample. We may also obtain a measure of precision and credible intervals to be associated with the predicted R t values, also accounting for the variability of the parameter estimators.
Overall, the method we adopt to estimate the reproductive number presents elements of novelty with respect to the previous proposals that require the use of an ad hoc MCMC algorithm involving a likelihood function for the counts of NP individuals based on the Poisson distribution.

Model checking and comparison
From the MCMC algorithm we obtain the predicted frequenciesŷ (s) tk andỹ (s) tk as illustrated at the end of Section 3.2. In order to evaluate the goodness-of-fit of the model we then consider the following discrepancy measurê which is computed at every MCMC iteration s. The overall measure of fit of the model may then be obtained as the mean of these quantities across the MCMC iterations, obtainingDist. In order to calculate the corresponding posterior predictive p-value we follow the procedure illustrated in Reference 26, see also References 27 and 45. In particular, for every iteration, we also compute a version of the discrepancy measure, denoted byDist (s) , using formula (8) but with each observed frequency y tk substituted by a simulated frequency from the model with the current parameter value and based on the previously observed frequencies y t−1,j . The mean of these statistics across iterations is denoted byDist. Then the posterior predictive p-value is obtained as the proportion of times thatDist (s) is at least equal toDist (s) across all the MCMC iterations. Although this procedure has been criticized because the observed data are used twice, once for parameter estimation and once for model checking, the resulting posterior predictive p-values is still a useful check of model fit provided that it is correctly interpreted. In particular if the model has an adequate fit, then p-values close to 0.5 should be observed. 45 The above discrepancy measure may also be used to compare different models when they are used for obtaining forecasts and for this aim, we consider an out-sample version that is time specific. In particular suppose that further observations are available with respect to those used to estimate the model, that is, suppose we know y tk for t ∈  † , where  † = {T + 1, … , T † } with T † > T. Then, at every MCMC iteration we compute the discrepancy measurê which is again summarized by a mean denoted byDist t and for which we obtain an out-of-sample posterior p-value according to the same method illustrated above. In this case, using different data for parameter estimation and model checking we consider p-values larger than 0.05 as adequate.
Finally, it may also be of interest to understand with respect to which categories, among the K considered, the proposed approach presents a higher or lower performance in terms of forecasting. In this regard we use the following type of discrepancy measureD Note that in this case we directly use the predictions available at the end of the algorithm, and we compare them with those produced by sampling from the assumed distribution and denoted asỹ tk .

APPLICATION
Following the spread of COVID-19 in Europe, we consider the daily counts for K = 6 categories illustrated at the beginning of Section 2 and denoted by S, R, Q, H, ICU, and D. Results are shown with reference to the Italian data collected from 24 February until 24 April 2020 (day 61) in order to evaluate the performance of our proposal at the beginning of the pandemic. First, we compare the goodness-of-fit of the estimated models, and then we report the results of the best model TA B L E 2 along with some results of another model for comparison. Then in Section 4.2 we show additional results obtained with data collected on the same period referred to individuals who reside in the Lombardy region where the Italian wave of the epidemic started. In the Appendix we provide some additional details on the estimation algorithm, data, and codes.

Model comparison
We started our analysis with a variety of models formulated according to the proposed approach. In particular, we considered both the multinomial and the Dirichlet-multinomial autoregressive versions with polynomials of order two or three and with or without constraints on the odds, as formulated in (7). The constraints imposed on the odds for the transitions between categories (o tjk ) are displayed in Table 2, reporting the maximum values that these odds can take (b jk ). For example, in Table 2 the odds for the transition from ICU to H is 0.25, meaning that the probability to move from ICU to H can be at most one fourth of that of remaining in ICU in a given day. Overall, we considered eight models by combining two distributions (multinomial or Dirichlet-multinomial), two orders of the polynomial (two or three), and two specifications (with or without constraints). All these models included two dummy variables to account for the effect of NPI enforced on 24 February and on 8 March 8 2020, aimed at containing viral transmission by closing schools, limiting movements, and imposing social distancing. Two dummy variables have been added on days 7 and 20, corresponding to the 1st and 14th of March, considering that the effects of these NPIs can be detected approximately 1 week after their enforcement.
In order to compare the eight models we used the discrepancy measures illustrated in Section 3.4. In Table 3 we report the observed values of statisticsDist andDist and the corresponding average posterior predictive p-values. These results suggest that the multinomial model shows a lack of fit, a problem that is resolved in the model based on the Dirichlet-multinomial distribution. In particular, all the Dirichlet-multinomial autoregressive models have a much better fit with respect to the models based on the multinomial distribution: they are more capable of reproducing the amount of variation observed in the data. Among the models based on the Dirichlet-multinomial distribution, Model 7 is the best to explore the information contained in the data and has a posterior predictive p-value very close to 0.5 as expected when the model is adequate. In contrast Model 8, imposing upper limits on the odds as those illustrated in Table 2, is suitable to provide an epidemiological interpretation in line with the most common SEIR epidemiological models. 12 We forecasted the total number of reported cases according to the posterior predictive distribution over the course of 10 days after the estimation time window and compared them with the observed cases during these days. Table 4 shows the realized values of the proposed discrepancy measure defined in (9) andDist t , which is based on the simulated frequency along with the corresponding posterior p-value for each predicted day resulting from Models 7 and 8. We observe that for Model 8 the p-values are never less than 0.05. Moreover, as expected, the overall p-value decreases with the increasing number of predicted days.
In Table 5 we also report the measure provided in (10) for both models and notice that the best predicted categories are D and ICU. Category H is also very well predicted. This is an important feature of the model since ICU is the crucial count to correctly predict in order to save lives by optimal management of health-care resources. In the following, we TA B L E 3 Average realized and predicted discrepancy measures for the autoregressive multinomial and Dirichlet-multinomial models and posterior predictive p-values for Italian data provide more details on the results obtained with the selected models, starting from Model 8 that incorporates constraints and provides a more straightforward epidemiological interpretation.

4.1.2
Results of obtained from Model 8 Figure 1 shows the daily observed and predicted counts for each category with a time horizon of 10 days and the estimated 95% prediction intervals depicted in gray. The posterior means of the predicted transition frequencies referred to the 25th and the 26th of April, stored in the transition matrix, are reported in Table 6. The corresponding 95% upper and lower predicted limits are reported in Table 7. The configuration of the predicted frequencies in Table 6 is recovered from the fixed marginal frequency of total counts. We observe that some transitions are absent and that the highest frequency is predicted for the transition from S to Q (2219 individuals), and the second highest transition is predicted from H to Q (757 individuals). Some patients are predicted to transit from Q to H (516) and from H to ICU (73).
For 26 April compared with 25 April, it is estimated that 149 deaths are expected among ICU patients, with a credibility interval from 123 to 282. These estimates imply an average length of stay in ICU ranging from 10 to 22 days. Another interesting observation is that on the same day of 26 April, 73 hospitalized patients (0.33%) are predicted to require intensive care with a credibility interval of 25 to 137 patients (0.11%, 0.62%). On the other hand, 197 hospitalized patients are predicted to die on the same day.
The estimated posterior means and the 95% predicted interval for the increase in totals for H and ICU from 26th to 29th April are reported in Table 8. These are of particular interest, since during the first period of the pandemic, there was a significant daily increase in the demand for hospital beds and ICU, especially from the general population at risk of being affected by the virus.
In order to show the temporal dynamics of the estimated daily reproduction number R t , Figure 2 depicts the estimated averages and 95% credible bounds obtained using all the data of the 61 days along with the predicted values for 10 days. From this figure we observe that, on average, this value increases over time during the early phase of the epidemic before containment measures became effective, and it only begins to decrease on the 11th day, corresponding to the 5th of March.
Trend and values resemble those provided by the Italian National Institute of Health. 42

Some results obtained from Model 7
In the following, we show some results obtained from Model 7 for Italy since, as stated above, this model is reasonable, especially for exploratory data purposes, as no constraints are imposed on the odds for the transitions across categories. The posterior means of the predicted transition frequencies referred to 25th and 26th April stored in the transition matrix are reported in Table 9. Comparing this table with Table 6 we note some differences, for example, the fact that 1243 people are predicted to transit from S to R; however, this transition is rather implausible. This confirms that, as stated above, the proposed constraints are suitable to comply with the epidemiological features of the pandemic. The estimated posterior mean and the 95% predicted interval for the increase in totals for H and ICU from the 26th to the 29th April are reported in Table 10. Comparing this table with Table 8 we notice that the main difference is observed for the predicted frequencies for category H, and, as expected, these intervals are wider.

Lombardy data
We show additional results obtained when the proposed models are estimated with data referred to the Lombardy region. The realized values of the discrepancy measures of the eight models estimated with the available data are reported in Table 11. We note that, as for the Italian data, the Dirichlet-multinomial autoregressive models are more suitable to explain the variability observed in the data with respect to the models based on the multinomial distribution. The posterior predictive p-value closer to 0.5 is the one calculated for Model 8. Table 12 shows the forecasted total number of reported cases according to the posterior predicted distribution and the realized values of the proposed discrepancy measures defined in (9) andDist t , along with the out-of-fit posterior p-value for each day. We observe that the p-values obtained with the Dirichlet-multinomial models are higher than those in Table 4 referred to the Italian data, thus showing a better predictive power, probably because the observations collected within the region are more homogeneous than those collected over the entire nation.   Table 13 shows that the number of H and ICU patients is predicted with minimum error. This confirms that our proposal is particularly appropriate to predict mortality risk as well as progression to severe disease.

4.2.2
Results obtained from Model 8 for Lombardy Figure 3 shows the daily observed and predicted counts with a time horizon of 10 days along with the estimated 95% prediction intervals depicted in gray. The posterior means of the predicted transition frequencies between categories referred to the first predicted day are shown in Table 14. The estimated 95% prediction upper and lower bounds are reported in Table 15. The estimated posterior mean and the 95% predicted interval for the increase in totals for H and ICU are reported in Table 16. Figure 4 displays the dynamics of R t as obtained from the estimated model. We found that, on average, this value increases over time during the early phase of the pandemic. Its decrease begins on the 11th day, corresponding to the 5th of March, a few days after the issuance of the NPIs.

4.2.3
Some results obtained from Model 7 for Lombardy In the following we show some results obtained adopting Model 7 for the Lombardy region. The posterior means of the predicted transition frequencies referred to the 25th and the 26th of April are reported in Table 17. In this model no constraints are imposed on the odds for the transitions across categories.
The estimated posterior mean and the 95% predicted interval for the increase in totals for H and ICU are reported in Table 18. We notice that the length of the predicted intervals is higher than that estimated with Model 8, and it increases with the number of predicted days.

DISCUSSION
We propose a novel Bayesian approach based on multinomial and Dirichlet-multinomial distributions for time-series counts which can be useful to understand the diffusion of the coronavirus pandemic and to forecast, with good accuracy, for some days ahead, the expected number of people in the following categories: susceptible not previously ill, recovered, positive cases in quarantine, hospitalized, intensive care, and deceased. However, the models are formulated in a general way, and may be adapted to a different number of categories according to data availability. Moreover, they transcend the COVID-19 context since they can be suitable for many other situations where the assumption on the sequence of contingency tables of the "transition frequencies" between two consecutive time occasions is appropriate. For example, they could be used for the analysis of the transitions between categories of malignant tumors as in the tumor, node, metastasis classification when it is conducted on aggregated data or for the analysis of the transitions between levels of severity of other diseases. The problem of low data quality has been illustrated by Wynants et al. 46 We recognize that Italian official data may suffer from high variability and may lack representativeness over the population. This might cause an underestimation of the posterior and predictive uncertainties even if the proposed Dirichlet-multinomial model is also meant to mitigate this issue. We remark that in the present work, we use data provided officially by the national authorities that do not account for the counts of asymptomatic cases. This subcategory of individuals whose infection and recovery both go silent, could not be included in the model, and this constitutes a limitation of the study.
In particular, the proposed prediction models based on a Dirichlet-multinomial distribution can be used to support medical decisions, especially the management of intensive care units, and to plan increase in critical care bed capacity during the emergency. As stated by Remuzzi and Remuzzi,47 the prediction is very important to plan new facilities all over the countries and regions. The early identification of needs is a crucial aspect both for policy makers, and for physicians. Once epidemiological hypotheses are introduced, our model can be interpreted in the same spirit of more standard SEIR models and it can be used to estimate the daily reproduction number. With respect to SEIR models, it is more exploratory because it requires fewer assumptions and hypotheses. Moreover, we do not preclude transitions among the observed categories but we only place minimum requirements in the odds that are knowledge domain driven, which results in more stable estimates.
We are also aware that the pandemic may be seen as many local epidemics that are dependent on each other. Still, we believe that this does not create problems in interpreting our results since we model the joint time series without explicitly dealing with the interactions.
A possible extension of the proposed model would be to consider a set of nodes that correspond to different regions, and a network would describe the flow of people (either infected or susceptible) between nodes. A model having a multinomial or a Dirichlet-multinomial distribution could be fitted at each node, and interactions between nodes could be explicitly modeled. However, interesting comparisons can be made even within our proposal since the model can be estimated with data at the regional level and then a comparison of transition rates among categories across regions can be performed and the time spent in each category can be compared across regions. This analysis may be useful also to better understand the dynamics of the deceased, which are remarkably different among the Italian regions. We are confident that our proposal may help better plan active public health interventions in the feature and avoid the development of critical illness for patients.

DATA AVAILABILITY STATEMENT
We use publicly available data and in the Appendix we provide the link to the data source.