Estimating Absolute and Relative Case Fatality Ratios from Infectious Disease Surveillance Data
Abstract
Knowing which populations are most at risk for severe outcomes from an emerging infectious disease is crucial in deciding the optimal allocation of resources during an outbreak response. The case fatality ratio (CFR) is the fraction of cases that die after contracting a disease. The relative CFR is the factor by which the case fatality in one group is greater or less than that in a second group. Incomplete reporting of the number of infected individuals, both recovered and dead, can lead to biased estimates of the CFR. We define conditions under which the CFR and the relative CFR are identifiable. Furthermore, we propose an estimator for the relative CFR that controls for time‐varying reporting rates. We generalize our methods to account for elapsed time between infection and death. To demonstrate the new methodology, we use data from the 1918 influenza pandemic to estimate relative CFRs between counties in Maryland. A simulation study evaluates the performance of the methods in outbreak scenarios. An R software package makes the methods and data presented here freely available. Our work highlights the limitations and challenges associated with estimating absolute and relative CFRs in practice. However, in certain situations, the methods presented here can help identify vulnerable subpopulations early in an outbreak of an emerging pathogen such as pandemic influenza.
1. Introduction and Background
The case fatality ratio (CFR), a measure of the virulence of an infectious disease, is the fraction of cases that die after contracting the disease. A relative CFR is defined as the CFR of one group divided by that of a reference group. However, incomplete reporting of the number of infected individuals, both recovered and dead, makes it difficult to accurately estimate the CFR.
In public health response, both the absolute and the relative CFRs have important roles to play. The absolute CFR has an obvious role—it provides a measure of the severity of the disease. For setting public health priorities, an order‐of‐magnitude estimate of the absolute CFR may be adequate, and sophisticated methods may not be needed. Once the severity of the disease has been established, the relative CFR takes on primary importance as it becomes the guiding principle for targeting interventions to those populations that are most at risk of severe outcomes. In the 2009 pandemic of influenza A (H1N1) we have a prime example. Early on, limited supplies of antivirals were the only option for prophylaxis and treatment of the disease, and public health agencies needed to decide how these should best be deployed. Later, as limited vaccine became available, officials had to prioritize subpopulations for vaccination. In many cases, targeting decisions came down to the relative severity of disease in different populations, i.e., the relative CFR. Hence, accurately estimating this quantity early in an epidemic in a changing surveillance environment is of great importance.
Because an accurate and precise estimate of the relative CFR depends on complete observation of the number of cases and deaths in an outbreak, reliable data can often be difficult to find or collect. In many settings, nonfatal cases will go undetected either because of mild symptoms or insufficient public health surveillance infrastructure. Fatalities may be misreported as well, perhaps due to poor surveillance or incorrect attribution to another cause. Furthermore, depending on the survival times of cases, many deaths will be unreported simply because they have not yet occurred.
Due to variation in the host and the disease, the CFR may vary across a population. For example, one group of individuals may be more or less likely to succumb to a disease than others because of previous exposure or an underlying condition.
Although estimation of the CFR has been the focus of several papers in recent years, none of them address directly the challenges posed by an incompletely observed epidemic. Using data collected during the 2002–2003 severe acute respiratory syndrome (SARS) outbreak, Ghani et al. (2005) developed methods to address the challenge of estimating the CFR in real time when the survival distribution of the disease is not known and when not all infected cases have died. Also motivated by the SARS epidemic, Jewell et al. (2007) developed nonparametric methods to estimate the CFR using a competing‐risks framework from survival analysis. However, because the SARS outbreak was assumed to be fully observed—an assumption verified by subsequent serological analysis (Leung et al., 2004)—neither of these papers addresses issues of incomplete case ascertainment or changes in case reporting rates.
In the 2009 H1N1 influenza pandemic, a few attempts were made to estimate the CFR. Garske et al. (2009) gave a concise summary of the issues surrounding accurate estimation of the CFR. They proposed an estimator that adjusts for the survival distribution of influenza but not for changes in reporting. Nishiura et al. (2009) developed methods to estimate the CFR in the middle of an epidemic. This method adjusts a naïve estimator (which divides the total number of observed deaths by the number of observed cases) based on information about the (assumed known) survival distribution of the particular disease. These two papers proposed analogous estimators, similar to what we propose as the E‐step of the expectation maximization (EM) algorithm for the lag‐adjusted estimator (see Section 3.3 below). However, neither of these recent works investigated the effects of changes in reporting rates on their estimate of the CFR. Other papers have taken more direct approaches to estimate the CFR, using surrogate measures for the number of individuals who become infected and who die (Presanis et al., 2009) or simply by making ad hoc adjustments to the observed case counts to account for underreporting (de Silva et al., 2009; Donaldson et al., 2009; Wilson and Baker, 2009).
Our goal with this article is to define situations where we can identify the relative CFR and to develop methods to estimate it. We define a typology of reporting rates for fatal and nonfatal cases that can be used to classify surveillance systems. We use this typology to determine conditions under which the relative CFR is identifiable. When it is, we utilize log‐linear regression to estimate the relative CFR—a method analogous to standard relative risk estimation (Frome and Checkoway, 1985).
Section 2 presents the notation and surveillance system typology. Section 3 introduces methods to estimate the relative CFR that can control for time‐varying reporting and disease‐delayed mortality. Section 4 presents results from a simulation study. Section 5 demonstrates the methods with an analysis of data from the 1918 influenza pandemic.
2. The Structure of Case Fatality Data
2.1 Observed Data and Notation
During an outbreak, health organizations periodically report the number of incident (or cumulative) cases and deaths. These counts may only represent a fraction of the actual cases and actual deaths. Suppose we have T time periods, and a covariate (e.g., age) with J levels. We define Ntj to be the total number of reported cases with symptom onset at time t for covariate level j. We define Dtj as the number of reported deaths with symptom onset at time t for covariate level j. (For now, we assume that the onset time of all deaths is known and that survival time is always shorter than the reporting interval, i.e., deaths are reported in the same interval in which the cases fall sick. We relax this assumption in our discussion of varying survival times in Section 3.3.) Also, Stj is the reported number of recovered cases with onset at time t for covariate level j. Therefore Ntj = Dtj + Stj. However, underlying the reported data are recovered cases and deaths that go unreported. Let N*tj be the total number of cases, both reported and unreported, at time t for covariate level j.
The CFR, ptj, is the probability of death, conditional on being a case. We assume the following reporting rates for time t and group j: φtj is the probability a recovered case is reported and ψtj is the probability a dead case is reported. Table 1 illustrates the probabilistic structure of the observed and unobserved data.
| | Recovered | Died | Total |
|---|---|---|---|
| Reported | πtj1 = (1 − ptj)φtj | πtj2 = ptjψtj | (1 − ptj)φtj + ptjψtj |
| Not reported | πtj3 = (1 − ptj)(1 − φtj) | πtj4 = ptj(1 − ψtj) | (1 − ptj)(1 − φtj) + ptj(1 − ψtj) |
| Total | (1 − ptj) | ptj | 1 |
We wish to clarify our use of the term reported. If a recovered case is reported then by that we mean it is counted as a case and is included in the denominator but not the numerator of the CFR. On the other hand, if a dead case is reported, by that we mean it is counted as both a case and a death and accordingly is included in both the numerator and the denominator of the CFR. By allowing different reporting probabilities for dead cases and recovered cases, we are making the event of being reported depend on the vital status of the case. We suggest in Section 6 alternative set‐ups that may also be useful to consider.
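The structure of Table 1 can be made concrete with a short numerical sketch. The function names and the values p = 0.02, φ = 0.10 and ψ = 1.00 are illustrative assumptions, not quantities taken from the analyses in this article. The sketch computes the four cell probabilities for a single (t, j) pair and the fraction of reported cases that are deaths, which equals the true CFR only when φ = ψ.

```python
def cell_probabilities(p, phi, psi):
    """Four cell probabilities of Table 1 for one (t, j) pair."""
    return {
        "reported_recovered": (1 - p) * phi,          # pi_tj1
        "reported_died": p * psi,                     # pi_tj2
        "unreported_recovered": (1 - p) * (1 - phi),  # pi_tj3
        "unreported_died": p * (1 - psi),             # pi_tj4
    }


def observed_death_fraction(p, phi, psi):
    """Probability that a reported case is a death: pi2 / (pi1 + pi2)."""
    cells = cell_probabilities(p, phi, psi)
    reported = cells["reported_recovered"] + cells["reported_died"]
    return cells["reported_died"] / reported


# Illustrative values only: 2% CFR, 10% of recovered cases reported,
# every death reported.
cells = cell_probabilities(p=0.02, phi=0.10, psi=1.00)
total = sum(cells.values())
apparent_cfr = observed_death_fraction(p=0.02, phi=0.10, psi=1.00)
equal_rates_cfr = observed_death_fraction(p=0.02, phi=0.50, psi=0.50)
```

With these illustrative inputs the apparent CFR among reported cases is 0.02/0.118, several times the true value of 0.02, whereas equal reporting rates return the true CFR.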
2.2 A Typology of Reporting Rate Scenarios
We assume that φtj and ψtj vary independently across groups and times and that ptj = pj, i.e., the group‐specific CFR stays constant over time. We return to our central question: what conditions or assumptions allow us to estimate an absolute or relative CFR?
Observed relative CFRs and reporting rates for cases and deaths are mathematically intertwined. For example, say we calculate an observed relative CFR for group A over group B at two different times. An increase in this observed relative CFR over time could be due to any number of factors: an increase in death reporting in group A, an increase in case reporting in group B, a decrease in case reporting in group A, a decrease in death reporting in group B, an actual change in the relative CFR, or some combination of all these factors. Without some assumptions on how the case reporting rates vary over group and time, the relative CFR itself is unidentifiable.
We developed a typology of reporting rate scenarios that classifies surveillance systems by how reporting rates for recovered and dead cases vary across time and covariate group.
In a general model, we make no assumptions about how the reporting rates vary by t and j. In this scenario, both the absolute and relative CFRs are unidentifiable. In a covariate‐independent model, the reporting rates do vary across time but, at a given t, are constant across covariate j. For example, reporting rates may improve over time as an outbreak develops and public health surveillance teams are mobilized. However, at any given time, the detection rates may be identical for, say, men and women. In a constant proportion model there is a constant factor which relates the reporting rates of cases to that of deaths. This type of model might be appropriate in nonoutbreak or endemic disease contexts. Finally, in a complete fatality reporting model, we assume that the reporting rate for recovered cases does not vary by covariate group and that the reporting rate for fatal cases is nearly 100%. This may be a feasible model to assume in an enclosed population or when there is good reason to believe that a surveillance system is identifying all deaths.
Table 2 displays this classification that can be applied to disease surveillance systems and Figure 1 depicts the relationships between the different models. In the three models with assumptions about reporting rates, the relative CFR is identifiable. The absolute CFR for group j will only be identifiable if the reporting rates for deaths and recovered cases are identical for all t.
| Model type | Reporting rate assumption: cases | Reporting rate assumption: fatalities | Relative CFR identifiable |
|---|---|---|---|
| General | φtj = φtj | ψtj = ψtj | No |
| Covariate independent | φtj = φt | ψtj = ψt | Yes |
| Constant proportion | φtj = φtj | ψtj = k·φtj | Yes |
| Complete fatality reporting | φtj = φt | ψtj = 1 | Yes |

Figure 1. Venn diagram showing which models are subsets of the others.
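As an informal illustration of the typology, the sketch below labels a hypothetical set of reporting rates with every model whose assumptions it satisfies. The function `classify_reporting` and its inputs are assumptions made for this example. In practice the reporting rates are unknown, so the classification rests on knowledge of the surveillance system rather than on a computation like this one.

```python
def classify_reporting(phi, psi, tol=1e-9):
    """Label reporting rates phi[t][j] (recovered) and psi[t][j] (dead).

    Returns every model of the typology whose assumptions hold, because
    the models overlap. All phi values are assumed to be positive.
    """
    def same_across_groups(rates):
        return all(abs(x - row[0]) <= tol for row in rates for x in row)

    labels = []
    phi_independent = same_across_groups(phi)
    psi_independent = same_across_groups(psi)
    psi_complete = all(abs(x - 1.0) <= tol for row in psi for x in row)

    if phi_independent and psi_complete:
        labels.append("complete fatality reporting")
    if phi_independent and psi_independent:
        labels.append("covariate independent")

    # Constant proportion: psi_tj = k * phi_tj for a single constant k.
    ratios = [s / f for prow, srow in zip(phi, psi) for f, s in zip(prow, srow)]
    if all(abs(r - ratios[0]) <= tol for r in ratios):
        labels.append("constant proportion")

    return labels or ["general"]


# Hypothetical rates for T = 2 periods and J = 2 groups.
complete = classify_reporting([[0.1, 0.1], [0.9, 0.9]], [[1.0, 1.0], [1.0, 1.0]])
proportional = classify_reporting([[0.1, 0.2], [0.3, 0.4]], [[0.2, 0.4], [0.6, 0.8]])
unrestricted = classify_reporting([[0.1, 0.2], [0.3, 0.4]], [[0.5, 0.5], [0.9, 0.2]])
```

A list is returned rather than a single label because a system with complete fatality reporting also satisfies the covariate‐independent assumptions.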
3. Estimation of the Absolute and Relative Case Fatality Ratios
3.1. The Naïve Estimator
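The naïve estimator for the absolute CFR is based on the observed case counts. We define the naïve estimator for group j as the total number of observed deaths divided by the total number of observed cases,
$$\hat p_j^{\,\text{naive}} = \frac{\sum_{t=1}^{T} D_{tj}}{\sum_{t=1}^{T} N_{tj}}.$$
The cell probabilities in Table 1 imply a multinomial model for the reported counts,
$$(S_{tj},\ D_{tj},\ N^*_{tj} - N_{tj}) \sim \text{Multinomial}\left(N^*_{tj};\ \pi_{tj1},\ \pi_{tj2},\ \pi_{tj3} + \pi_{tj4}\right), \tag{1}$$
where $N^*_{tj}$ is the total number of cases, reported and unreported.
3.2. Adjusting for Time‐Varying Reporting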
In the context of an outbreak, where surveillance and public awareness of a disease change over time, a major weakness of the naïve estimator is that it does not adjust for time‐varying rates of case reporting. In this section, we develop a model which, under certain conditions, provides accurate estimates of the relative CFR.
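The weakness of the naïve estimator can be seen in a small expected‐value calculation. The sketch below is an illustration under assumed inputs, not the simulation design of Section 4: two groups with staggered outbreaks, true CFRs of 0.03 and 0.01 (a true relative CFR of 1/3), complete death reporting, and a case reporting rate that is either constant or drops partway through the outbreak. The function name and all numbers are hypothetical.

```python
def naive_relative_cfr(true_cases, p, phi, psi):
    """Naive relative CFR (group 2 over group 1) using expected counts.

    true_cases[t][j]: all cases (reported or not) with onset at time t
    p[j]: true CFR of group j; phi[t], psi[t]: reporting rates at time t
    for recovered and dead cases (covariate-independent reporting).
    """
    periods = range(len(phi))
    cfr = []
    for j in range(2):
        deaths = sum(true_cases[t][j] * p[j] * psi[t] for t in periods)
        cases = sum(
            true_cases[t][j] * ((1 - p[j]) * phi[t] + p[j] * psi[t])
            for t in periods
        )
        cfr.append(deaths / cases)
    return cfr[1] / cfr[0]


# Hypothetical staggered outbreaks: group 1 early, group 2 late.
true_cases = [[1000, 0], [1000, 0], [0, 1000], [0, 1000]]
p = [0.03, 0.01]  # true relative CFR = 1/3

constant = naive_relative_cfr(true_cases, p, phi=[0.5] * 4, psi=[1.0] * 4)
stepped = naive_relative_cfr(
    true_cases, p, phi=[0.9, 0.9, 0.1, 0.1], psi=[1.0] * 4
)
```

With constant reporting the naïve ratio is roughly 0.34, close to the true 1/3. When case reporting falls from 90% to 10% between the two outbreaks the ratio is about 2.8, so the naïve estimator reverses the direction of the comparison.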
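We begin by introducing a conditional binomial approach that is derived from the multinomial model in equation 1. By conditioning on $N_{tj}$, we can model $D_{tj}$ directly as realizations from a binomial distribution. At each (t, j) pair, we say that $D_{tj}$ follows a binomial distribution of size $N_{tj}$ and success probability
$$\frac{\pi_{tj2}}{\pi_{tj1} + \pi_{tj2}} = \frac{p_j\psi_{tj}}{(1 - p_j)\phi_{tj} + p_j\psi_{tj}}.$$
However, the sum in the denominator of the probability of death will make this conditional binomial model hard to fit in practice because the likelihood contains nonlinear functions of the parameters.

$N^*_{tj}$ is well approximated by $N_{tj}/\phi_{tj}$ (see derivation in Web Appendix B) and the multinomial framework from equation 1 gives us that $D_{tj} \sim \text{Binomial}(N^*_{tj},\ p_j\psi_{tj})$. Because death is assumed to be a rare event, we can use the Poisson approximation to formulate the model as
$$D_{tj} \sim \text{Poisson}(\lambda_{tj}), \tag{2}$$
$$\log \lambda_{tj} = \log N_{tj} + \log p_j + \log\frac{\psi_{tj}}{\phi_{tj}}. \tag{3}$$

Constant proportion of reporting rates

Under the constant proportion model, $\psi_{tj} = k\phi_{tj}$ and equation 3 becomes $\log \lambda_{tj} = \log N_{tj} + \log p_j + \log k$. We can reparameterize this model as
$$\log \lambda_{tj} = \log N_{tj} + \beta_0 + \gamma_j, \tag{4}$$
where $\gamma_1 = 0$ and, for $j \geq 2$, $\gamma_j = \log(p_j/p_1)$ is the log relative CFR comparing group j and group 1. Also, $\beta_0 = \log(p_1 k)$.

Covariate‐independent reporting

Under covariate‐independent reporting, $\phi_{tj} = \phi_t$ and $\psi_{tj} = \psi_t$, so equation 3 becomes $\log \lambda_{tj} = \log N_{tj} + \log p_j + \log(\psi_t/\phi_t)$, which can be reparameterized as
$$\log \lambda_{tj} = \log N_{tj} + \beta_0 + \alpha_t + \gamma_j, \tag{5}$$
where $\gamma_j = \log(p_j/p_1)$, $\beta_0 = \log(p_1\psi_1/\phi_1)$ and $\alpha_t = \log\{(\psi_t/\phi_t)/(\psi_1/\phi_1)\}$, so that $\alpha_1 = 0$.

Complete fatality reporting

Under complete fatality reporting, $\phi_{tj} = \phi_t$ and $\psi_{tj} = 1$, so equation 3 becomes $\log \lambda_{tj} = \log N_{tj} + \log p_j - \log \phi_t$, which can be reparameterized as
$$\log \lambda_{tj} = \log N_{tj} + \beta_0 + \alpha_t + \gamma_j, \tag{6}$$
where $\gamma_j = \log(p_j/p_1)$, $\beta_0 = \log(p_1/\phi_1)$ and $\alpha_t = \log(\phi_1/\phi_t)$.

Models 4, 5, and 6 can be fit with standard software by declaring $\log N_{tj}$ as an offset. The relative CFR will be estimated by $\exp(\hat\gamma_j)$. We will call this quantity the reporting‐rate‐adjusted estimator of the relative CFR.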
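To make the fitting step concrete, the following is a minimal sketch of a Poisson log‐linear fit with log Ntj as an offset, with an intercept, time effects and group effects in the linear predictor. It uses plain Newton–Raphson iterations in standard‐library Python so that it is self‐contained. It is not the implementation in the R package that accompanies this article, and the function names and the toy data are assumptions made for the example. In practice any standard GLM routine with an offset does the same job.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_poisson_offset(cases, deaths, n_iter=50, tol=1e-10):
    """Fit log E[D_tj] = log N_tj + beta0 + alpha_t + gamma_j.

    cases[t][j] and deaths[t][j] are reported counts. Returns
    [beta0, alpha_2..alpha_T, gamma_2..gamma_J]; period 1 and group 1
    are the reference levels.
    """
    T, J = len(cases), len(cases[0])
    rows = []
    for t in range(T):
        for j in range(J):
            x = ([1.0] + [float(t == s) for s in range(1, T)]
                 + [float(j == g) for g in range(1, J)])
            rows.append((x, cases[t][j], deaths[t][j]))
    k = T + J - 1
    # Start from the pooled death fraction with no time or group effects.
    pooled = sum(r[2] for r in rows) / sum(r[1] for r in rows)
    beta = [math.log(pooled)] + [0.0] * (k - 1)
    for _ in range(n_iter):
        score = [0.0] * k
        info = [[0.0] * k for _ in range(k)]
        for x, n, d in rows:
            mu = n * math.exp(sum(b * v for b, v in zip(beta, x)))
            for a in range(k):
                score[a] += x[a] * (d - mu)
                for c in range(k):
                    info[a][c] += x[a] * x[c] * mu
        step = solve(info, score)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return beta


# Toy data: noise-free expected deaths from a known model with relative CFR 1/3.
alpha = (0.0, 0.4, -0.3)
gamma = (0.0, math.log(1 / 3))
cases = [[500, 400], [2000, 1500], [800, 900]]
deaths = [[cases[t][j] * 0.03 * math.exp(alpha[t] + gamma[j]) for j in range(2)]
          for t in range(3)]
beta_hat = fit_poisson_offset(cases, deaths)
relative_cfr = math.exp(beta_hat[-1])
```

Because the toy deaths are set equal to their expected values, the fit recovers the relative CFR of 1/3 up to numerical error. With real counts the estimate carries sampling variability, and standard errors would come from the inverse of the information matrix.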
3.3 Adjusting for Survival Time
If reporting rates change over time and deaths are observed only at the time of death and not at the time of infection, the methods outlined in the previous sections may yield biased estimates of the relative CFR. These methods can be extended to account for incomplete observation of deaths due to the variable survival times from the time of infection. Real‐time application of this method is dependent on death occurring in response to an acute infection, as with a disease such as influenza, where most deaths would be expected to occur within a fixed number, L, weeks of infection. With diseases such as HIV/AIDS, where the survival time can be years or decades, this model may be hard to apply in real‐time settings.

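Let $\tilde D_{tj}$ denote the number of reported deaths in group j whose symptom onset was at time t, and let $D_{tj}$ now denote the deaths in group j that are reported at time t. We assume a model for $\tilde D_{tj}$ that is similar to the earlier models for $D_{tj}$:
$$\tilde D_{tj} \sim \text{Poisson}(\lambda_{tj}), \qquad \log \lambda_{tj} = \log N_{tj} + \beta_0 + \alpha_t + \gamma_j. \tag{7}$$
However, the $\tilde D_{tj}$ are not explicitly observed during an outbreak. We have developed a method to reconstruct an estimate of $\tilde D_{tj}$ from the observed data, which allows us to then fit the model in equation 7. To reconstruct $\tilde D_{tj}$ we need to estimate how many of the dead cases observed at times t + 1, …, t + L would be expected to have symptom onset date of time t. Assuming a known survival distribution for the disease, our method employs the EM algorithm (Dempster, Laird, and Rubin, 1977) to generate estimates of the relative CFR.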
We propose the following EM algorithm, assuming a covariate independent model:
1. Fix $\alpha^{(0)}$, a vector of starting values for the αt, for t = 2, …, T.
2. Fix iter = 0 and choose a tolerance δ to determine convergence.
3. Fix iter = iter + 1.
4. (E‐step) Find the expected reported deaths with symptom onset at time t, using equation 8, for each covariate group j, conditioning on $\eta$ (the vector of survival probabilities, assumed known) and $\alpha^{(iter-1)}$.
5. (M‐step) Fit the model from equation 7 using the E‐step results as the outcome.
6. Store $\alpha^{(iter)}$ as the fitted coefficients from the model.
7. Repeat steps 3–6 until the parameters from the covariate‐independent model converge, i.e., until each component of $|\alpha^{(iter)} - \alpha^{(iter-1)}|$ is less than the tolerance δ.
8. Use the supplemented EM algorithm (Meng and Rubin, 1991) to calculate the standard errors for parameter estimates.
(8)
In equation 8, the index set is the set of all (t, i) pairs that contribute to the observed death count on day t + i, and D·j and N·j are the vectors of observed values of Dtj and Ntj for t = 1, …, T. The M‐step may be run using standard GLM software. The full EM routine may be implemented using the EMforCFR() function in the coarseDataTools package for R (see Web Appendix F), freely available from CRAN, the Comprehensive R Archive Network (Reich, 2010). We refer to this estimate of the relative CFR as the lag‐adjusted estimator.
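The E‐step can be sketched as a reallocation of deaths from the time they are observed back to plausible onset times. The code below is one plausible form of such a step, written for a single covariate group, and should be read as an assumption for illustration. It is not a transcription of equation 8 or of EMforCFR(). Each death observed at time s is spread over the onset times t = s − i in proportion to Ntj·exp(αt)·ηi, where ηi is the assumed probability of dying i time units after onset. The sketch ignores deaths that have not yet occurred by the end of the series.

```python
import math


def e_step(obs_deaths, cases, alpha, eta):
    """Expected deaths by onset time for one covariate group (a sketch).

    obs_deaths[s]: deaths observed at time s
    cases[t]: reported cases with onset at time t
    alpha[t]: current time effects (alpha[0] = 0 for the reference period)
    eta[i]: assumed probability that a death occurs i time units after onset
    """
    def weight(t, i):
        # Proportional to the expected number of deaths with onset t and lag i.
        return cases[t] * math.exp(alpha[t]) * eta[i]

    n_periods = len(obs_deaths)
    expected = [0.0] * n_periods
    for s in range(n_periods):
        # All (onset time, lag) pairs that can produce a death at time s.
        pairs = [(s - i, i) for i in range(len(eta)) if s - i >= 0]
        denom = sum(weight(t, i) for t, i in pairs)
        if denom == 0:
            continue  # no onset time can explain deaths at time s
        for t, i in pairs:
            expected[t] += obs_deaths[s] * weight(t, i) / denom
    return expected


# Hypothetical inputs: deaths occur one or two periods after onset.
reallocated = e_step(
    obs_deaths=[0, 4, 10, 6],
    cases=[100, 300, 200, 50],
    alpha=[0.0, 0.2, -0.1, 0.0],
    eta=[0.0, 0.5, 0.5],
)
# With no delay between onset and death, the observed counts are unchanged.
no_lag = e_step([0, 4, 10, 6], [100, 300, 200, 50], [0.0] * 4, eta=[1.0])
```

The reallocation preserves the total number of observed deaths. Its output would serve as the outcome in the M‐step fit of equation 7, after which the time effects are updated and the step is repeated.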
4. A Simulation Study
To test the performance of our proposed estimators in a wide range of circumstances, we conducted several simulation studies. In the central scenario, a population experiences an epidemic made up of staggered outbreaks in two distinct subgroups of the population. We compared the performance of the naïve, reporting‐rate‐adjusted and lag‐adjusted methods in estimating the relative CFR between the two groups, which was fixed at 1/3. Details of the data‐generation model can be found in Web Appendix D.
Our study followed three lines of inquiry. First, we compared the performance of the naïve and reporting‐rate‐adjusted estimators in scenarios where the reporting rates followed simple step functions and where the date of onset was assumed known for all deaths. Second, we compared the performance of the naïve, reporting‐rate‐adjusted and lag‐adjusted estimators when reporting rates changed over time and when the time of onset for deaths was unobserved but the survival distribution was assumed to be known. Third, we tested the sensitivity of our model to the assumption that the pj are small. Results from these simulations are summarized below. Complete methodological details and results are provided in Web Appendix D.
4.1 Naïve versus Reporting‐Rate‐Adjusted Estimators
We considered scenarios where the reporting rates ψt and φt follow a step function with a changepoint halfway through the outbreak. For a fixed set of reporting rates, ψt and φt, we simulated and analyzed 500 outbreak datasets. For each dataset we estimated the absolute and relative CFRs using the naïve estimator and the relative CFR using the reporting‐rate‐adjusted estimator while assuming the correct reporting model.
The results, given in Table 3, show that while the reporting‐rate‐adjusted estimator consistently performed well, with minimal bias across all models, the naïve estimator’s performance was more erratic. In some cases, notably the constant proportion models, the naïve estimator was close to unbiased. (These empirical results are supported by the theoretical asymptotic results in Web Appendix A and more detailed simulations in Web Appendix D.) However, in most scenarios, the naïve estimator missed the target by a wide margin. In a few cases, it reversed the direction and on average estimated the relative CFR to be over 1.
Table 3. The standard errors of the estimates are given in parentheses. If the outbreak were fully observed, i.e., when both reporting rates are 100%, then the absolute CFR would be estimated as 2.00 and the relative CFR would be estimated as 0.33. The first five rows of the table represent data coming from complete fatality reporting systems; the next three rows are from covariate independent models and the final two rows are from constant proportion models.

| φa (%) | φb (%) | ψa (%) | ψb (%) | Avg. observed deaths | Avg. observed cases | Naïve CFR | Naïve relative CFR | RR‐adj relative CFR |
|---|---|---|---|---|---|---|---|---|
| 90 | 10 | 100 | 100 | 2749 | 739,226 | 3.7 (0.07) | 1.02 (0.044) | 0.34 (0.02) |
| 10 | 90 | 100 | 100 | 2740 | 635,985 | 4.3 (0.08) | 0.09 (0.004) | 0.34 (0.02) |
| 50 | 50 | 100 | 100 | 2742 | 687,602 | 4.0 (0.08) | 0.33 (0.015) | 0.33 (0.02) |
| 70 | 1 | 100 | 100 | 2739 | 533,522 | 5.1 (0.09) | 1.37 (0.058) | 0.36 (0.02) |
| 1 | 1 | 100 | 100 | 2745 | 16,439 | 167.0 (3.14) | 0.39 (0.016) | 0.39 (0.02) |
| 90 | 10 | 30 | 100 | 1365 | 737,833 | 1.8 (0.05) | 2.30 (0.117) | 0.34 (0.02) |
| 10 | 90 | 30 | 100 | 1366 | 634,612 | 2.2 (0.06) | 0.19 (0.010) | 0.33 (0.03) |
| 70 | 1 | 30 | 100 | 1365 | 532,141 | 2.6 (0.07) | 3.10 (0.171) | 0.38 (0.03) |
| 10 | 25 | 40 | 100 | 1561 | 231,592 | 6.7 (0.17) | 0.34 (0.018) | 0.34 (0.03) |
| 5 | 30 | 10 | 60 | 660 | 224,203 | 2.9 (0.12) | 0.33 (0.027) | 0.34 (0.04) |
4.2 Comparison of all Estimators, Accounting for Survival Time
We ran a simulation to examine the performance of our estimators when the symptom onset for deaths is unknown but the survival distribution, dependent on a parameter
, is assumed to be known. Details of the data‐generation algorithm are provided in Web Appendix D.
Under each of 10 different discrete survival distributions (scenarios A through J in Table 4) 1000 datasets were generated. The naïve, reporting‐rate‐adjusted and lag‐adjusted estimators were calculated for each dataset. The lag‐adjusted estimator was computed three times assuming different discrete survival distributions. First, we computed the estimate using the survival distribution used to generate the data (denoted by truth in Table 4). Second, we computed the estimate assuming a symmetric survival distribution with a mean of three time units and maximum possible survival of 4 (denoted by short in Table 4). The time‐unit‐specific probabilities of death for times 2 through 4 were (0.3, 0.4, 0.3). Lastly, we computed the estimate assuming a symmetric survival distribution, with mean survival of 8 days and maximum survival of 11 days (denoted by long in Table 4). The time‐unit‐specific probabilities of death for times 5 through 11 were (0.1, 0.15, 0.15, 0.2, 0.15, 0.15, 0.1).
| Scenario | True η | MSE × 100: Naïve | MSE × 100: RR‐adj | MSE × 100: lag‐adjusted (truth) | MSE × 100: lag‐adjusted (short) | MSE × 100: lag‐adjusted (long) | Average estimate: Naïve | Average estimate: RR‐adj | Average estimate: lag‐adjusted (truth) | Average estimate: lag‐adjusted (short) | Average estimate: lag‐adjusted (long) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| A | (0.5, 0, 0, 0, 0.5) | 47.85 | 7.38 | 0.12 | 0.12 | 1385.1 | 1.02 | 0.06 | 0.34 | 0.33 | 4.04 |
| B | (0.2, 0.2, 0.2, 0.2, 0.2) | 52.83 | 7.40 | 0.12 | 0.11 | 1383.6 | 1.06 | 0.06 | 0.34 | 0.34 | 4.04 |
| C | (0.2, 0.6, 0.2, 0, 0) | 66.62 | 5.52 | 0.08 | 9.70 | 2544.8 | 1.14 | 0.10 | 0.34 | 0.64 | 5.37 |
| D | (0.5, 0.2, 0.1, 0.1, 0.1) | 62.77 | 5.73 | 0.10 | 7.15 | 2430.6 | 1.12 | 0.09 | 0.34 | 0.59 | 5.25 |
| E | (0.1, 0.1, 0.1, 0.2, 0.5) | 40.40 | 8.44 | 0.10 | 1.96 | 687.1 | 0.96 | 0.04 | 0.34 | 0.19 | 2.95 |
| F | Uniform on 1,…,15 | 45.82 | 5.96 | 0.36 | 4.59 | 0.67 | 1.01 | 0.09 | 0.37 | 0.11 | 0.39 |
| G | Uniform on 1,…,10 | 38.34 | 8.19 | 0.23 | 5.03 | 122.5 | 0.95 | 0.05 | 0.36 | 0.11 | 1.43 |
| H | Discrete Weibull | 50.60 | 7.11 | 0.30 | 5.16 | 15.7 | 1.04 | 0.07 | 0.36 | 0.11 | 0.72 |
| I | Reverse discrete Weibull | 40.36 | 5.62 | 0.39 | 5.23 | 1.16 | 0.96 | 0.10 | 0.37 | 0.10 | 0.23 |
| J | 0.08 on 1–10, 0.2 on 15 | 55.64 | 6.24 | 0.28 | 4.26 | 7.13 | 1.08 | 0.08 | 0.36 | 0.13 | 0.59 |
The 10 discrete survival distributions were chosen to represent a range of maximum survival times, different degrees of skewness, and varying levels of heterogeneity. A discretized version of a Weibull distribution (shape = 1.5, scale = 8) was truncated at 15 units and used for scenarios H and I. The survival probabilities for the 15 units are (0.05, 0.08, 0.09, 0.1, 0.1, 0.1, 0.09, 0.08, 0.07, 0.06, 0.05, 0.04, 0.04, 0.03, 0.02). This distribution has a mean of 6.75. Scenario H used these probabilities as the probabilities of death at time 1 through 15. Scenario I used the reversed vector of probabilities. Other distributions are described in Table 4.
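The stated properties of the truncated discrete Weibull distribution are easy to verify. The short check below uses only the 15 probabilities quoted above; the variable and function names are chosen for this example. It confirms that the probabilities sum to one and that the mean is 6.75, and it computes the mean of the reversed vector used in scenario I.

```python
# Probabilities of death at times 1 through 15, as listed for scenario H.
weibull_probs = [0.05, 0.08, 0.09, 0.1, 0.1, 0.1, 0.09, 0.08,
                 0.07, 0.06, 0.05, 0.04, 0.04, 0.03, 0.02]


def mean_survival(probs):
    """Mean time to death when probs[i] is the probability of death at time i + 1."""
    return sum(time * prob for time, prob in enumerate(probs, start=1))


scenario_h = weibull_probs
scenario_i = list(reversed(weibull_probs))  # reversed vector, scenario I

total_probability = sum(scenario_h)
mean_h = mean_survival(scenario_h)
mean_i = mean_survival(scenario_i)
```

Reversing a distribution on 1, …, 15 maps each time t to 16 − t, so the mean for scenario I is 16 − 6.75 = 9.25, noticeably longer than for scenario H.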
Table 4 shows the simulation results and compares the estimators’ MSE and bias. Comparing MSEs of the five estimators across the different scenarios, we see that the adjustment for survival time provides a large gain in efficiency when the assumed survival distribution has roughly the same mean as the truth. This holds for distributions with short survival times (scenarios A–E) as well as those with long possible survival times (scenarios F–J). In the 10 scenarios we studied, the MSEs for the lag‐adjusted estimates that assumed a known survival distribution with the same mean as the truth were an order of magnitude smaller than those for the reporting‐rate‐adjusted estimator and two orders of magnitude smaller than the naïve method. When the incorrect survival distribution was assumed, the lag‐adjusted estimator’s performance varied. When the true survival distribution was short but we assumed a long distribution, we overestimated the relative CFR. When the true survival distribution was long but we assumed a short distribution, we underestimated the true relative CFR. However, in these cases the lag‐adjusted estimator was less biased on average and had a lower MSE than the reporting‐rate‐adjusted estimator. These results suggest that the central tendency of the survival distribution is more important to specify correctly than the spread. A more detailed analysis of these patterns could provide additional insight.
4.3 Sensitivity of Estimation to Large Case Fatality Ratio
As discussed earlier, approximating N*tj by Ntj/φtj provides an important link between a model that is unidentified and one where the relative CFR is estimable. This approximation, as discussed in Web Appendix B, relies on the pj being small. We examined the degree to which this approximation can impact the reliability of results.
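A short calculation from the cell probabilities in Table 1 indicates why the size of pj matters. It is a sketch in expectation, assuming the approximation in question replaces the total case count $N^*_{tj}$ by $N_{tj}/\phi_{tj}$; it is not the derivation given in Web Appendix B.

```latex
\begin{aligned}
\mathrm{E}[N_{tj}] &= N^*_{tj}\,\{(1 - p_j)\phi_{tj} + p_j\psi_{tj}\}
  = N^*_{tj}\,\phi_{tj}\left\{1 + p_j\left(\frac{\psi_{tj}}{\phi_{tj}} - 1\right)\right\}, \\
\frac{\mathrm{E}[N_{tj}]}{\phi_{tj}} &= N^*_{tj}\left\{1 + p_j\left(\frac{\psi_{tj}}{\phi_{tj}} - 1\right)\right\}.
\end{aligned}
```

The relative error of the approximation in expectation is therefore $p_j(\psi_{tj}/\phi_{tj} - 1)$. It vanishes when the two reporting rates are equal and grows with both the CFR and the ratio of death reporting to case reporting.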
We used the same data‐generating process as in the previous simulations to generate simulated datasets with case and death counts for two subgroups of the population. The relative CFR between the groups was held fixed at
, and the larger of the two CFRs was allowed to vary from
to 1. Deaths were assumed to be completely reported and case reporting followed the step‐function pattern described earlier. For a given pair of CFRs and case reporting step function, 500 datasets were simulated and the naïve and reporting‐rate adjusted estimators calculated.
Figure 2 shows the sensitivity of the estimates of the relative CFR to the true CFR for a particular case reporting step function. We found that until the larger of the two CFRs reached the rough threshold of
, the reporting‐rate‐adjusted estimator remained within 10% of the true value of the relative CFR. The naïve estimator showed large bias for all values of the group 1 CFR. Further simulations (not shown) uphold these conclusions for other case reporting rate patterns and confirm that the bias depends largely on the magnitude of the maximum CFR.

This graph compares the percent average bias for the naïve estimator (dashed line) and the reporting‐rate‐adjusted estimator (solid line) for different magnitudes of the true CFRs. The x‐axis is indexed by the larger of the two group‐specific CFRs. For all simulations, the relative CFR was fixed at
. The lines trace out the average of 500 estimates from simulated datasets. The shaded regions demarcate the 5th and 95th percentiles of the 500 point estimates at each CFR magnitude.
5. Data Analysis: 1918 Influenza Case Fatality in Maryland
We analyzed data from the 1918 influenza pandemic from counties in the state of Maryland, USA. The Annual Report of the State Board of Health of Maryland for the year ending December 31st, 1918 provides counts of influenza cases and deaths for the final 4 months of 1918 (Maryland State Board of Health, 1922). Influenza was made a reportable disease in September of 1918 and there are virtually no records of cases or deaths prior to this time.
Case and death counts were crosstabulated for the months of September through December, 1918 and for a subset of Maryland counties. These data are presented in Web Table 1. Inclusion criteria for counties were established to create a subset of counties for which the assumption of covariate independence might reasonably hold. Covariate independence implies that at every month t, the rates of case reporting are the same across all counties and the rates of reporting of deaths are the same across all the counties. The following inclusion criteria were chosen to control for county‐level factors that we believed could impact reporting rates. The county must (1) have a hospital, (2) have population density less than 100 individuals per square mile, and (3) not contain military bases or installations. Eight counties met these criteria: Carroll, Cecil, Dorchester, Frederick, Somerset, Talbot, Washington, and Wicomico counties. Data on these county‐level characteristics were obtained from the state Board of Health annual report, U.S. Census data, and an article on hospitals in the United States (Anonymous, 1921; Maryland State Board of Health, 1922; Department of Commerce, USA, 1924). Because the time period of reporting (1 month) is generally thought to be greater than most survival times for influenza, we did not additionally adjust for prolonged survival after symptom onset. Reporting‐rate‐adjusted relative CFRs were estimated by fitting the covariate‐independent model from equation 5.
It has been shown that socioeconomic status of geographical regions was associated with mortality from influenza in 1918 (Murray et al., 2007). We sorted the eight counties by percentage of the population that was not native‐born white (range: 4.8–36.6%), a proxy for socioeconomic status. We chose Somerset, the county with the lowest percentage of native‐born white population, as the reference county for the data analysis.
The estimated relative CFRs and accompanying 95% confidence intervals are shown in Figure 3A. In Figure 3B, the percent nonwhite population for each county is plotted against the adjusted estimate of the relative CFR with a linear regression line drawn through the data. We find that the counties with a higher proportion of white population have on average lower CFRs than counties with higher proportions of minorities. Three of eight counties had percent native‐born white population above 90% (Carroll, Frederick, and Washington counties) and these were the only three counties where we observed a significant difference in the reporting‐rate‐adjusted CFR when compared with Somerset county, with the lowest percent white population.

Figure 3. This graphic summarizes the results of the data analysis presented in Section 5. Panel A shows the estimated relative CFRs for the seven counties with respect to Somerset, the reference county. The vertical tick marks indicate the point estimates for each county and the horizontal lines indicate 95% confidence intervals for each county. The vertical tick marks have been scaled to represent the total number of cases observed in each county. Panel B plots the estimated relative CFRs against the percent of the county’s population whose race is not native‐born white. The linear regression line is drawn through the points to illustrate an observed association between these two variables. The Pearson correlation coefficient for these two variables is 0.76.
We postulate several possible explanations for this observed pattern. First, the existence of county‐level variation in the CFR may be due in part to socioeconomic status. Second, differential case or death reporting in the counties may violate the assumption of covariate independence and lead to biased estimates of the relative CFR. For example, if the case reporting rate was higher in counties with higher white population, this may increase the denominator for estimating the CFR when compared with the other counties, leading to a reduction in the estimate of the relative CFR. Third, the variation in the estimated relative CFR appears to depend on geographical location as well. Carroll, Frederick, and Washington counties are all counties in central Maryland, near to Baltimore County and Baltimore City. Dorchester, Somerset, Talbot, and Wicomico counties are all located on the eastern shore of Maryland, a peninsula bordered by the Chesapeake Bay and the Atlantic Ocean.
6. Discussion
The CFR can play a large role in establishing the public health threat of a given disease. Accurate estimates of the relative CFR can help determine the optimal allocation of resources for surveillance, prevention, and treatment of disease. However, outbreak settings often generate data that are incomplete, where both recovered and fatal cases go unreported. In these situations, it is important to understand the assumptions necessary to identify an absolute or relative CFR, and to what extent those assumptions are realistic. We have shown that the absolute CFR is only identifiable when reporting rates for cases and deaths are equal to each other at every observation timepoint—a very unlikely scenario in practice. Furthermore, we have shown that only with fairly stringent assumptions about the way that reporting varies over time can a relative CFR be identified. However, when it is identifiable, our reporting‐rate‐adjusted and lag‐adjusted estimators can provide unbiased or nearly unbiased estimates of the relative CFR while the naïve estimator is virtually always biased, often severely so.
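The identifiability claim for the absolute CFR can be illustrated with a simplified calculation. The notation is introduced only for this sketch and may differ from that used in the earlier sections: p is the true CFR, N is the true number of cases at a given timepoint, and phi and psi are the assumed probabilities that a fatal case and a recovered case, respectively, are reported.

```latex
\begin{align*}
E[\text{reported deaths}] &= N p \phi, \\
E[\text{reported recoveries}] &= N (1 - p) \psi, \\
\text{naive CFR} &\approx \frac{N p \phi}{N p \phi + N (1 - p) \psi}
                  = \frac{p \phi}{p \phi + (1 - p) \psi}.
\end{align*}
```

For 0 < p < 1 and positive reporting probabilities, this ratio equals p exactly when phi = psi; otherwise it depends on the unknown reporting probabilities. When p is small, the ratio is approximately p phi / psi, so in this simplified setting the ratio of two groups' naive CFRs recovers the ratio of their true CFRs only if phi / psi is the same in both groups.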
Our work identifies and defines several important structural aspects of case fatality data. First, a 2 × 2 table (see Table 1) elucidates the structure of the observed and unobserved data. Second, our typology of reporting rates provides a simple classification scheme for disease surveillance systems. The new methods proposed in this article require case and death counts that are crosstabulated by units of time (i.e., weeks or months) and a categorical covariate such as gender. By providing such crosstabulations, surveillance reporting systems could make more data available for estimating the virulence of an infectious disease.
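As a rough sketch of the crosstabulated input described above, the following builds case and death counts by week and by a categorical covariate from a line list. The record layout, the values, and the function name are hypothetical and are not the format used by any particular surveillance system or software package.

```python
from collections import Counter

# Hypothetical line list of reported cases: (week, group, died).
records = [
    (1, "F", False),
    (1, "F", True),
    (1, "M", False),
    (2, "F", False),
    (2, "M", True),
    (2, "M", False),
]

def crosstabulate(records):
    """Count reported cases and deaths for each (week, group) cell."""
    cases = Counter()
    deaths = Counter()
    for week, group, died in records:
        cases[(week, group)] += 1
        if died:
            deaths[(week, group)] += 1
    return cases, deaths

cases, deaths = crosstabulate(records)

# One row per (week, group) cell; cells with no deaths show a count of 0.
for week, group in sorted(cases):
    print(week, group, cases[(week, group)], deaths[(week, group)])
```

Each printed row corresponds to one cell of the time-by-covariate table, which is the level of aggregation the proposed estimators require.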
Although our model for a surveillance system reporting framework is general and applicable to a wide range of settings, other setups may be worthy of consideration. For example, a framework could define reporting rates without conditioning on the outcome status.
There are several limitations to this work. Some of our methods rely on the assumption that the reporting rates are independent of the covariate of interest. This is likely not true for a covariate such as age, as surveillance may more easily target school‐age children than adults. But it may hold for other covariates such as gender or geographical location. Another limitation is that delays in reporting not due to survival times are not accounted for in this model. If these delays are uniform across all cases, then this may not introduce bias. However, if the delays differ across subgroups, this may impact the performance of the proposed estimators. Also, when using the lag‐adjusted estimator, we assume a survival distribution for the disease in question. Our simulation results (see Section 4.2 and Web Appendix D) suggest that knowing the exact distribution is not vital as long as the center of mass of the assumed distribution is roughly centered on the truth (Table 4), although a misspecified distribution may increase the estimated variance of the estimates (Web Table 4). However, knowing the center of mass of a survival distribution may be difficult with an emerging pathogen. An in‐depth analysis could provide more detailed information about the performance of these methods.
We also rely on the assumption that the true CFR is small in each of the population subgroups. This assumption enables us to make a key algebraic simplification. The impact of the assumption may be small for diseases such as influenza, whose CFR is thought to be on the order of 1 in 1000 (see Section 4.3 and Figure 2). However, an avenue for further research could be to find ways to adapt this method to work for diseases with larger CFRs. We also assume that the CFR does not change over time. This may not be the case with an emerging pathogen, as disease treatment and management may improve as clinical and epidemiological understanding of the disease evolves (Yip et al., 2005).
Finally, our method for calculating the lag‐adjusted estimator is not computationally simple to implement. However, the code, an example dataset, and a vignette are available in the coarseDataTools package (see Web Appendix F).
Developing the framework to include additional covariates would be a useful extension to this work. For example, methods to estimate and compare the relative CFRs between men and women across several different countries could be useful. This could be achieved by inclusion of more covariates directly in the GLM framework or by fitting a multilevel model. Either way, the ability to test whether there is evidence that the CFR is different between two locations (while controlling for a second covariate) could be a helpful addition to this model. Further, allowing for a survival distribution that varies across levels of a covariate would be a useful addition to this framework. Finally, extending these methods to incorporate available evidence on reporting rates would be valuable.
Because disease outbreaks are often only partially observed, estimating absolute and relative CFRs remains a challenging problem for epidemiologists and public health officials. The methods developed in this article contribute a new set of tools for obtaining accurate estimates of the relative CFR in some scenarios. Such estimates could inform a timely and targeted response to an infectious disease outbreak.
7. Supplementary Materials
Web Appendices, Tables, and Figures referenced in Sections 3, 4, 5, and 6 are available under the Paper Information link at the Biometrics website http://www.biometrics.tibs.org/.
Acknowledgements
NGR and RB were supported by the National Center for the Study of Preparedness and Catastrophic Event Response (PACER), which is funded by the U.S. Department of Homeland Security (N00014‐06‐1‐0991). NGR and DATC were supported by the National Institute of General Medical Sciences (Award R01GM090204). JL and DATC were supported by grants from the NIH Fogarty Institute (1 R01 TW 0008246‐01) and the Bill and Melinda Gates Foundation (705580‐3). DATC holds a Career Award at the Scientific Interface from the Burroughs Wellcome Fund.
References

