C.-Y.H. and G.J.C. conceived and designed the study. C.-Y.H. and Y.X. input the data and undertook statistical analysis of the data. C.-Y.H. and G.J.C. drafted the article. G.J.C. and J.N.C. reviewed and revised the article. All authors have read and approved the final version of the article.
Assessing the utility of cancer-registry–processed cause of death in calculating cancer-specific survival
Article first published online: 13 FEB 2013
Copyright © 2013 American Cancer Society
Volume 119, Issue 10, pages 1900–1907, 15 May 2013
How to Cite
Hu, C.-Y., Xing, Y., Cormier, J. N. and Chang, G. J. (2013), Assessing the utility of cancer-registry–processed cause of death in calculating cancer-specific survival. Cancer, 119: 1900–1907. doi: 10.1002/cncr.27968
- Issue published online: 6 MAY 2013
- Manuscript Accepted: 26 DEC 2012
- Manuscript Revised: 6 DEC 2012
- Manuscript Received: 19 OCT 2012
- We are grateful to the National Cancer Institute for providing the Surveillance, Epidemiology, and End Results Program
- cause of death;
- End Results;
- cancer-specific survival;
- relative survival;
Cancer registries use algorithms to process cause of death (COD) data from death certificates, but uncertainties remain regarding the accuracy and utility of those data in calculating cancer-specific survival (CSS). Because it is impractical to reconfirm the COD through meticulous review of the primary medical records, the observed cancer deaths could be compared with the number of attributed deaths, as estimated by using a relative survival (RS) approach, to determine utility in CSS estimation.
Six major cancer types were evaluated using Surveillance, Epidemiology, and End Results (SEER) data (1988-1999 cohort). The COD utility was quantified by using the observed-to-expected ratio (O/E ratio) approach, which was calculated as the SEER-documented observed number of cancer-specific deaths divided by the number of expected deaths attributed to the malignancies as estimated using an RS approach. Favorable utility would have an O/E ratio close to 1.
In total, 338,445 patients were identified, and their O/E ratios were 0.97, 0.98, 0.90, 1.07, 1.02, and 0.92 for breast, colorectal, lung, melanoma, prostate, and pancreatic cancer, respectively. O/E ratios varied slightly with patients' age, race, and tumor stage, but not by sex. CSS for patients with lung cancer appeared to be overestimated considerably. Patients with multiple cancer diagnoses had poor O/E ratios compared with those who had only 1 cancer.
The utility of COD in calculating CSS depended variously on the risk of cancer-related mortality and nontumor factors. However, the impact of this variation on CSS generally was small. The current results indicated that the COD assigned by cancer registries has acceptable validity, and CSS is considered an acceptable surrogate for RS in most circumstances. Cancer 2013. © 2013 American Cancer Society.
The Surveillance, Epidemiology, and End Results (SEER) Program1 of the National Cancer Institute (NCI) provides information on cancer incidence and survival statistics that are largely applied in daily medical practice. SEER also reports cause of death (COD), as determined by cancer registries, using predefined algorithms to process COD from death certificates to identify a single, disease-specific, underlying COD. However, questions have long been raised concerning the reliability of the COD assignment and its accuracy for estimating cancer-specific survival (CSS).2-6
When COD reliability is uncertain, relative survival (RS)7 is commonly used as an alternative. Although both CSS and RS are classical net survival measurements used to quantify the excess mortality attributable to the disease, each has advantages and disadvantages. CSS is defined as the proportion of patients alive with a specific disease, whereas deaths from causes other than the disease of interest are censored or uncounted in this measurement. RS is defined as the ratio of observed survival to the expected survival in a comparable cohort of the general population.8 The primary advantage of using RS is that no COD information is required, thereby bypassing the COD inaccuracy issue and difficulties in outcome definition. However, the RS approach requires detailed life tables for comparable populations that are not always directly available for research applications. Moreover, methods for RS analysis are less well recognized by clinical researchers and are not readily available in common statistical packages. Therefore, CSS may be a more practical and preferred measure of cancer survival statistics for a data set in which COD information is available.
One way to ensure the reliability of COD is to reconfirm the COD through meticulous review of the primary medical records,9 which is impractical for a large data set like SEER. Previous studies have indicated that the RS approach can be used to obtain the expected number of malignancy-attributable deaths.10 Then, the number of malignancy-attributable deaths can be compared with the SEER-documented, observed number of cancer-specific deaths to acquire the concordance of these 2 estimates.11 In the best-case scenario, perfect concordance indicates equivalence of not only the number of cancer-related deaths but also net survival estimates.
The objective of the current study was to evaluate the utility of the COD in estimating CSS and the concordance between RS and CSS. We hypothesized that concordance between RS and CSS would vary according to cancer site, cancer stage, and patient characteristics. Because cancer mortality statistics and survival outcomes derived from SEER are widely applied in medical practice (eg, to inform treatment decisions and prognosis), it is important to ensure that CSS has acceptable concordance with RS so that CSS can be used to readily obtain net survival instead of RS, which requires a detailed life table for a matched general population.
MATERIALS AND METHODS
Data Source and Case Identification
The SEER data (April 2010 release) we used contains 13 population-based cancer registries that, together, collect data for all malignancies diagnosed.1 SEER routinely collects information on patient demographics, primary tumor site, tumor morphology, disease stage at diagnosis, first course of treatment (radiotherapy and surgery), vital status, and COD using the combined methods of passive and active follow-up.
Patients who were eligible for the current study included those with microscopically confirmed adenocarcinoma of the breast (International Classification of Diseases for Oncology, third edition [ICD-O-3] codes C50.0-C50.9 with histology codes 8050, 8140-8147, 8160-8162, 8180-8221, 8250-8507, 8514, 8520-8551, 8560, 8570-8574, 8576, and 8940-8941); adenocarcinoma of the colorectum (ICD-O-3 codes C18.0-C18.9 with histology codes 8140, 8210-8211, 8220-8221, 8260-8263, 8470, 8480-8481, and 8490); adenocarcinoma, bronchioloalveolar carcinoma, large cell carcinoma, squamous cell carcinoma, and other nonsmall cell carcinoma of the lung (ICD-O-3 codes C34.0-C34.9 with histology codes 8140, 8251, 8255, 8260, 8310, 8323, 8480, 8481, 8570, 8250, 8252, 8253, 8012, 8031, 8052, 8070-8074, 8010, 8020, 8022, 8032, 8033, 8046, 8050, 8490, 8550, and 8560); superficial, nodular, acral, and other malignant melanoma of the skin (ICD-O-3 codes C44.0-C63.2 with histology codes 8720-8721, 8723, 8730, 8740-8745, and 8770-8772); adenocarcinoma of the prostate (ICD-O-3 code C61.9 with histology codes 8010 and 8140-8570); or adenocarcinoma of the pancreas (ICD-O-3 codes C25.0-C25.9 with histology codes 8050, 8140-8147, 8160-8162, 8180-8221, 8250-8507, 8514, 8520-8551, 8560, 8570-8574, 8576, and 8940-8941). These cancer sites and histologies were chosen because they represented common malignancies with both low and high underlying cancer-specific mortality. Our study cohort included patients who were diagnosed from January 1988 through December 1999 to secure all individuals who had at least 8 years of follow-up through December 2007. The selected 8-year follow-up was based on our preliminary findings, as illustrated in Figure 1. We noted that the relative survival curve was deemed flat beyond 8 years after diagnosis. The plateau of the curve indicates that the excess mortality of malignancy is minimized. 
We can then estimate the total number of deaths attributed to the malignancy for the entire 8-year period and compare this number with the number of cancer-specific deaths as documented in the SEER data set. The observed-to-expected ratio (O/E ratio) still can be calculated using a different defined time to plateau. For example, the O/E ratio for colorectal cancer was 0.979, 0.977, and 0.971 using the seventh, sixth, and fifth years, respectively, as the defined plateau for calculating the O/E ratio.
We used SEER tumor size, extent of disease, and number of regional lymph nodes to restage patients according to the sixth edition of the American Joint Committee on Cancer (AJCC) Cancer Staging Manual,12 except for breast cancer, for which the available elements only permitted fifth edition staging. Common exclusion criteria for SEER-based research were applied: patients were excluded if their age was unknown, if they were aged ≤18 years or >90 years, or if the cancer-reporting source was autopsy, nursing home, hospice, or death certificate.
The SEER*Stat program (version 6.22; SEER Program, National Cancer Institute, Bethesda, Md) was used to obtain both CSS and RS through December 2007. CSS was calculated using the SEER COD recode variable to define the set of individuals who had died of the cancer (SEER COD recode 26,000 for breast cancer; 21,040-21,050 for colorectal cancer; 22,030 for lung cancer; 25,010 for melanoma of the skin; 28,010 for prostate cancer; and 21,100 for pancreatic cancer); patients were censored if the death occurred from other causes or if the patient was alive at the time of last follow-up. The Ederer I method was used to calculate the RS as the ratio of the observed (overall) survival in the study cohort to the expected survival of the general US population matched on the basis of age, sex, race, and single calendar year.
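As an illustration only, the two net survival measures described above can be sketched in Python on a small hypothetical cohort. The sketch assumes a Kaplan-Meier estimator (the estimator applied within SEER*Stat is not stated here), uses invented follow-up data, and replaces the matched life-table calculation with a single hypothetical expected survival value (`expected_5y`). The function name `kaplan_meier` is likewise illustrative.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate at each distinct event time.

    times  -- follow-up time in years for each patient
    events -- True if the event of interest occurred at that time,
              False if the patient was censored
    """
    surv, curve = 1.0, []
    for t in sorted({t for t, e in zip(times, events) if e}):
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if e and u == t)
        surv *= 1 - deaths / at_risk
        curve.append((t, surv))
    return curve


# Hypothetical toy cohort: (years of follow-up, status at last contact)
cohort = [(1, "cancer"), (2, "other"), (3, "cancer"),
          (4, "alive"), (5, "other"), (5, "alive")]
times = [t for t, _ in cohort]

# CSS: only cancer deaths are events; other deaths and survivors are censored.
css_curve = kaplan_meier(times, [s == "cancer" for _, s in cohort])

# Observed (overall) survival: a death from any cause is an event.
overall_curve = kaplan_meier(times, [s != "alive" for _, s in cohort])

# RS: observed survival divided by expected survival in the matched
# general population (hypothetical value, not taken from a life table).
expected_5y = 0.90
css_5y = css_curve[-1][1]
rs_5y = overall_curve[-1][1] / expected_5y

print(f"5-year CSS = {css_5y:.3f}, 5-year RS = {rs_5y:.3f}")
```

In this toy cohort the two measures differ widely because half of the deaths are from other causes and the cohort is tiny; the point is only to show where COD information enters CSS and where the life table enters RS.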
By using the 1988 to 1999 cohort, the 8-year cumulative number of expected deaths attributed to specific malignancies was calculated by subtracting the 8-year cumulative number of expected deaths in the matched general population as estimated by the RS approach from the 8-year cumulative observed (overall) deaths as documented within SEER. The 8-year observed number of cancer-specific deaths as documented within the SEER COD recode variable was then divided by this number to yield the O/E ratio. For example, a ratio <1.0 indicates that SEER documented a lower than expected number of cancer-specific deaths, which would result in an overestimated CSS; and a ratio >1.0 indicates that SEER documented a higher than expected number of cancer-specific deaths, which would result in an underestimated CSS. The chance variation of the O/E ratio was determined based on the Z-score using the normal approximation to the binomial:
$$Z = \frac{\hat{p} - p_0}{\sqrt{p_0(1 - p_0)/n}}$$

where $\hat{p}$ denotes the observed cancer mortality rate (the proportion of patients dying from the specific cancer during the 8-year follow-up), $p_0$ denotes the expected cancer mortality rate, and $n$ denotes the total number of patients with the specific cancer. Because of the large sample size, the Z-score was reported throughout to provide more information for assessing significance. For example, compared with a Z-score of 4, a Z-score of 30 (P < .01 for both) may indicate a larger sample size, a larger difference in the O/E ratio, or a higher mortality rate for the specific cancer. Given the large sample size of the current study, the underlying assumption,13 namely, $np_0(1 - p_0) \geq 5$, was validated. We also evaluated the O/E ratio by categories of age, sex, race, and tumor stage; these variables were selected for their biologic importance and wide application in SEER-based research. Because of the extremely low mortality (5-year RS, >99%) among patients with stage I breast cancer and those with stage I through III prostate cancer, the impact of COD inaccuracy on CSS was trivial; thus, the O/E ratios were not determined. Finally, the commonly reported 5-year RS was compared with the 5-year CSS to assess the impact of the O/E ratio on these 2 net survival measures.
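The O/E ratio and its Z-score can be sketched as follows. This is a minimal illustration, assuming the standard one-sample binomial test with a normal approximation; the function name `oe_ratio_and_z` is hypothetical, and the inputs are the colorectal cancer counts reported in this article (39,973 documented cancer-specific deaths, 60,052 observed overall deaths, 19,298 expected overall deaths, and 95,647 patients).

```python
from math import sqrt


def oe_ratio_and_z(observed_cancer_deaths, observed_all_deaths,
                   expected_all_deaths, n_patients):
    """Return (deaths attributed to the malignancy, O/E ratio, Z-score)."""
    # Expected deaths attributed to the malignancy: observed overall deaths
    # minus the deaths expected in the matched general population.
    attributable = observed_all_deaths - expected_all_deaths
    oe = observed_cancer_deaths / attributable

    # One-sample binomial test, normal approximation.
    p_hat = observed_cancer_deaths / n_patients   # observed cancer mortality
    p0 = attributable / n_patients                # expected cancer mortality
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n_patients)
    return attributable, oe, z


# Colorectal cancer counts as reported in this article.
attributable, oe, z = oe_ratio_and_z(39_973, 60_052, 19_298, 95_647)
print(attributable, round(oe, 2), round(abs(z), 1))
```

The sign of `z` follows the direction of the O/E ratio (negative when the ratio is below 1), so the magnitude is printed for comparison with the reported Z-scores. With these inputs the magnitude is close to, but not exactly, the reported value of 5.09, which may reflect rounding or a slightly different denominator.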
In total, 338,445 patients were eligible for the analysis, including 77,266 with breast cancer, 95,647 with colorectal cancer, 101,444 with lung cancer, 29,380 with melanoma, 18,417 with prostate cancer, and 16,291 with pancreatic cancer. Baseline patient and tumor characteristics are provided in Table 1. In brief, the majority of patients were aged ≥50 years at diagnosis, although, for patients with melanoma, the median age was 53 years. Race was most commonly white, followed by black and other races. Lung and pancreatic cancers were more likely diagnosed as advanced disease rather than early stage disease.
|Characteristic||Common Sites of Malignancy With Both Low and High Cancer Mortality: Percentage of Patients|
|Cancer site||Breast||Colorectal||Lung||Melanoma||Prostate||Pancreas|
|Overall no. of patients||77,266||95,647||101,444||29,380||18,417||16,291|
|Age at diagnosis, y|
The 8-year cumulative observed number of cancer-specific deaths documented within SEER, the 8-year cumulative expected number of deaths attributed to the malignancy estimated using the RS approach, and the corresponding O/E ratios with Z-scores for the 6 cancer sites are provided in Table 2. Taking colorectal cancer as an example, the 8-year cumulative observed number of overall deaths that SEER documented was 60,052 (Table 2, column B), whereas the 8-year cumulative expected number of overall deaths estimated by US life tables was 19,298 (Table 2, column C). The resulting difference of 40,754 (Table 2, column D), theoretically, was attributable to colorectal cancer. According to the SEER COD recode variable, 39,973 deaths (Table 2, column A) were actually documented as colorectal cancer deaths (SEER COD recode 21,040 and 21,050), yielding a favorable O/E ratio of 0.98 (Z-score, 5.09; P < .001). Similarly, the O/E ratios for breast cancer, lung cancer, melanoma, prostate cancer, and pancreatic cancer were 0.97, 0.90, 1.07, 1.02, and 0.92, respectively (Z-scores, >3.29 for all; P < .001).
|Cancer Site||No. of Patients||Column A: Eight-Year Cumulative Observed No. of Cancer-Specific Deaths Documented Within SEER||Column B: Eight-Year Cumulative Observed No. of SEER-Documented Overall Deaths||Column C: Eight-Year Cumulative Expected No. of Overall Deaths Estimated by US Life Tables||Column D: Eight-Year Cumulative Expected No. of Deaths Attributed to the Malignancy (Column B − Column C)||O/E Ratio (Column A ÷ Column D)||Z-Score|
|Breast (stage II-IV)|| ||23,216||33,133||9409||23,724||0.97||3.95|
|Prostate (stage IV)|| ||10,092||19,198||9400||9798||1.02||4.29|
The O/E ratios according to patient and tumor characteristics are detailed in Table 3. Because of the large sample size, the confidence interval for O/E ratios was very tight (P < .05 for all; data not shown). Our analyses indicated that there was little variation in the age-stratified O/E ratios. We noted that elderly patients generally had higher O/E ratios than younger patients, indicating that deaths among older patients were more likely to be coded as cancer-specific in SEER; thus, their CSS was expected to be underestimated. In general, differences in the O/E ratio according to sex were small, with the exception of breast cancer, in which the estimated O/E ratio was 0.87 (Z-score, 2.33; P < .05) for men and 0.97 (Z-score, 3.76; P < .001) for women, likely reflecting a COD ascertainment error given the rarity of breast cancer among men. The effect of race on the O/E ratio was examined, but the variation by race was small. However, white race appeared to be associated with an overall favorable O/E ratio closer to 1.0 than black race for all but prostate cancer.
|Characteristic||Breast O/E Ratio||Breast Z-Score||Colorectal O/E Ratio||Colorectal Z-Score||Lung O/E Ratio||Lung Z-Score||Melanoma O/E Ratio||Melanoma Z-Score||Prostate O/E Ratio||Prostate Z-Score||Pancreas O/E Ratio||Pancreas Z-Score|
|AJCC 6th edition tumor stage|
|No. of primaries|
|First and only||0.97||3.95||0.98||5.09||0.90||70.1||1.07||5.12||1.02||4.29||0.92||39.1|
|Multiple (excluding first and only)||0.83||16.5||0.84||23.2||0.81||49.1||0.83||6.83||0.67||17.9||0.86||21.8|
For early stage cancers with a favorable prognosis at baseline, the O/E ratios were more likely to be >1.0, such as stage I colorectal cancer (O/E ratio, 1.33; Z-score, 10.9) and stage I melanoma (O/E ratio, 2.12; Z-score, 25.7). These findings indicate that the number of cancer-specific deaths documented in SEER was 1.33 times and 2.12 times the expected number, respectively; thus, an estimated CSS lower than RS would be expected. In contrast, for cancers with a generally poor prognosis (lung and pancreatic cancers) or for advanced-stage cancers (eg, breast and colorectal cancers, melanoma), SEER tended to under-code the number of cancer-specific deaths (O/E ratio, <1.0); thus, an estimated CSS higher than RS would be expected. The O/E ratios also were examined for patients who had more than 1 cancer diagnosis. Not surprisingly, these patients consistently had the O/E ratios that deviated most from 1.0 across all 6 studied cancers. For example, patients who had other cancer diagnoses in addition to colorectal cancer had an O/E ratio of 0.84 (Z-score, 23.2; P < .001).
Finally, the 5-year CSS and 5-year RS rates are compared in Table 4. Because of the large sample size, the reported survival estimates were associated with very tight confidence intervals; thus, these data are not provided. Taking colorectal cancer as an example again, the results indicated that the 5-year CSS rate (58.1%) was slightly higher than the 5-year RS rate (57.6%). These results correspond to the finding that an O/E ratio <1.0 (0.98 for colorectal cancer) would result in an estimated CSS that is higher than RS. Despite the aforementioned unfavorable O/E ratios for stage I colorectal cancer and melanoma, the differences between 5-year CSS and 5-year RS were small (1.9% and 1.5%, respectively). In contrast, for stage IA lung cancer, the difference (5.4%) was relatively large. For cancers with high mortality like pancreatic cancer, RS and CSS were concordant, and the absolute difference was approximately 1%.
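|Characteristic||Breast 5-Year RS, %||Breast 5-Year CSS, %||Colorectal 5-Year RS, %||Colorectal 5-Year CSS, %||Lung 5-Year RS, %||Lung 5-Year CSS, %||Melanoma 5-Year RS, %||Melanoma 5-Year CSS, %||Prostate 5-Year RS, %||Prostate 5-Year CSS, %||Pancreas 5-Year RS, %||Pancreas 5-Year CSS, %|
|AJCC 6th edition tumor stage|
|No. of primaries|
|First and only||74.2||74.7||57.6||58.1||13.3||16.1||88.4||88.0||49.3||48.2||2.8||3.7|
|Multiple (excluding first and only)||78.4||81.5||68.8||72.2||32.3||39.2||90.0||91.3||59.7||69.4||7.4||11.6|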
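The stated relationship between the O/E ratio and the direction of the CSS-RS difference can be checked against the "First and only" rows reported above. The following sketch assumes that each pair of survival values is ordered RS then CSS and that the sites are ordered breast, colorectal, lung, melanoma, prostate, and pancreas; both orderings are inferred from the values quoted in the text rather than stated in the table itself.

```python
sites = ["breast", "colorectal", "lung", "melanoma", "prostate", "pancreas"]

# "First and only" rows: O/E ratios and 5-year RS/CSS (percent).
oe_ratio = [0.97, 0.98, 0.90, 1.07, 1.02, 0.92]
rs_5y = [74.2, 57.6, 13.3, 88.4, 49.3, 2.8]    # assumed first value of each pair
css_5y = [74.7, 58.1, 16.1, 88.0, 48.2, 3.7]   # assumed second value of each pair

consistent = []
for site, oe, rs, css in zip(sites, oe_ratio, rs_5y, css_5y):
    diff = round(css - rs, 1)
    # O/E < 1: cancer deaths are under-coded, so CSS should exceed RS.
    # O/E > 1: cancer deaths are over-coded, so CSS should fall below RS.
    agrees = (oe < 1 and diff > 0) or (oe > 1 and diff < 0)
    consistent.append(agrees)
    print(f"{site:<11} O/E={oe:.2f}  CSS-RS={diff:+.1f}  consistent={agrees}")
```

Under these assumptions, the sign of the CSS-RS difference matches the direction implied by the O/E ratio for all 6 sites.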
In the absence of a meticulous review of primary medical records, the COD as assigned by cancer registries has long been questioned for its utility in measuring a valid CSS. In the current study of 6 common malignancies characterized by a broad range of baseline cancer-associated mortality, we quantified the COD utility by using the O/E ratio overall and according to categories of patient and tumor factors, and we assessed the O/E ratio in relation to agreement between CSS and RS. Our study provides investigators a better understanding of the direction and extent of bias in CSS estimation using the SEER-provided COD. For example, the calculated CSS for pancreatic cancer using SEER is expected to be slightly overestimated compared with RS, although, in general, the difference is small.
In this study, we based our analysis on the methodology from a prior work by Weinstock and Reynes. They reported that, for patients diagnosed with melanoma, the CODs generally are accurately certified, with 4237 expected melanoma deaths based on the RS approach and 3946 documented deaths according to COD coding, representing 93% concordance.10 This methodology is less recognized but is efficient in quantitatively assessing the utility of COD, and it may be applicable not only to the SEER database but also to other large cancer and noncancer databases in which a meticulous review of medical records is not possible.
In our analysis, we were able to further elucidate factors that influenced the utility of the COD coding in estimating CSS. We noted that, although the O/E ratio for early stage colorectal cancer was apparently poor (1.33), the resulting impact on the difference between RS and CSS was trivial (95.0% vs 93.1%, respectively) because of the low underlying mortality in early stage disease. Such observation remained true for stage I melanoma, in which the O/E ratio was poor (2.12), but the agreement between RS and CSS was acceptable (98.1% vs 96.6%, respectively). Our findings suggest that CSS is relatively free from the O/E difference for cancers with a favorable prognosis. In addition, we noted that the O/E ratio was considerably discordant among patients with lung cancer. For example, the O/E ratio for stage IA lung cancer was 0.79, and the resulting difference between RS and CSS was relatively large (5.4%). This finding is primarily because the cohort of patients at risk for lung cancer has additional tobacco exposure-related, noncancer comorbidities, such as cardiopulmonary disease; therefore, the general population may not represent an appropriate reference population for determining expected survival.14 Consequently, CSS may be a more accurate measure of net survival than that estimated using the RS approach, which fails to account for such cancer-associated comorbidities.
We noted that the O/E ratio was poor for patients who had multiple cancer diagnoses compared with patients who had only 1 cancer diagnosis, emphasizing the need to exclude patients who have multiple primaries from survival outcome research using SEER. This difference was observed consistently for all 6 cancer sites and may be true for other cancer sites that were not included in the current investigation. The finding may be related to difficulties in determining to which cancer the death was directly attributable.
The utility of the COD was subject to variations based on the patient age at diagnosis. Specifically, we observed a trend toward a slight overestimation of the number of documented cancer-specific deaths (O/E ratio >1.0) for elderly patients compared with younger patients. This is particularly noteworthy: The COD in elderly patients is often subject to speculation because of the competing effects of comorbidity-associated mortality (eg, cardiac events or other noncancer deaths). In sum, our findings suggest that the O/E ratio is associated with patient age, although that impact generally was small.
An alternative method for assessing the utility of the COD examines the survival rates at a point in time during which surviving patients can be considered to have been cured of their malignancies. At such a time point, the CSS and RS in surviving patients approach those of the general population. This point in time can be assessed graphically as the time when the RS or conditional survival reaches a plateau.10, 15
The use of RS as the reference for comparison assumes that RS is an unbiased outcome measure. However, an important limitation of the RS approach is the potential for noncomparability of the expected survival between the cancer groups and the matched general population.14 For example, as mentioned above, for malignancies associated with certain risk factors, such as smoking and lung cancer, which are highly correlated with overall health, the life table for the general population may not provide an accurate estimate of the expected survival for the lung cancer cohort. In situations like this, an assessment of net survival by CSS should be advocated unless a plausible adjustment for the effect of smoking on the life table could be performed.
Currently, 3 methods are used to calculate the cumulative expected survival for RS: the Ederer I, Ederer II, and Hakulinen methods.16 The default Ederer I method was used in the current study, because this method has long been supported (since the late 1990s) in the initial release of the software package SEER*Stat and has been used by the Data Analysis and Interpretation Branch of the NCI in their Cancer Statistics Review (CSR). The Hakulinen method was supported later (in the early 2000 release of SEER*Stat). Both the Ederer I and Hakulinen methods assume that matched individuals are considered to be at risk for the entire duration of follow-up. Hakulinen also adjusts for potential follow-up times; however, in general, RS estimates from the 2 methods are very similar.16 In early 2011, the Ederer II method was added as the default method for relative survival estimation in SEER*Stat. This method has been revived because matched individuals are considered to be at risk only until the patient is censored or dies, thereby mitigating the potential that RS tends to increase in the long term when using the Ederer I and Hakulinen methods.16 Preliminary analyses using the same cohort illustrated in Figure 1 revealed that 5-year RS calculated using the Ederer II method (63.4%) was slightly lower than that calculated using the other 2 methods (63.6%); however, in general, the results were very similar for the same cohort.
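The difference between the Ederer I and Ederer II expected survival calculations can be illustrated with a simplified sketch. It assumes annual intervals, ignores within-interval censoring adjustments, and uses a hypothetical 4-patient cohort with invented annual expected survival probabilities; the function names `ederer_1` and `ederer_2` are illustrative, and the Hakulinen method is not shown.

```python
from math import prod


def ederer_1(annual_expected, n_years):
    """Ederer I: every matched individual stays at risk for all n_years.

    Average each patient's cumulative expected survival over the full cohort.
    """
    cumulative = [prod(p[:n_years]) for p in annual_expected]
    return sum(cumulative) / len(cumulative)


def ederer_2(annual_expected, followup_years, n_years):
    """Ederer II: matched individuals are at risk only while the patient is.

    Average the interval-specific expected survival over patients still under
    observation at the start of each year, then multiply across years.
    """
    result = 1.0
    for year in range(n_years):
        at_risk = [p[year] for p, f in zip(annual_expected, followup_years)
                   if f > year]
        if not at_risk:
            break
        result *= sum(at_risk) / len(at_risk)
    return result


# Hypothetical cohort: 2 younger patients (annual expected survival 0.99)
# followed for 5 years, and 2 older patients (0.90) who leave observation early.
annual_expected = [[0.99] * 5, [0.99] * 5, [0.90] * 5, [0.90] * 5]
followup_years = [5, 5, 2, 1]

e1 = ederer_1(annual_expected, 5)
e2 = ederer_2(annual_expected, followup_years, 5)
print(f"Ederer I expected = {e1:.4f}, Ederer II expected = {e2:.4f}")
```

In this toy cohort the Ederer II expected survival is higher because the older patients stop contributing once they leave observation; for the same observed survival, a higher expected survival yields a lower RS, which is the direction reported above for the Ederer II method.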
In conclusion, commonly used net survival measurements are subject to various strengths and limitations. None of them is perfect, but we observed that, at least for patients in whom the cancer diagnosis was their only malignancy, CSS, as estimated based on the COD processed by cancer registries, generally was concordant with RS. The NCI SEER*Stat program provides a convenient and intuitive mechanism to generate RS; however, that program cannot be used for regression modeling (eg, Cox regression) and related model diagnostics, which is a significant limitation. Our results justify the use of CSS in survival outcomes research, and the analyses can be readily performed by commonly used statistical programs.
This work was supported by National Institutes of Health/National Cancer Institute grants K07-CA133187 (G.J.C.) and CA016672 (The University of Texas MD Anderson Cancer Center's support grant).
CONFLICT OF INTEREST DISCLOSURES
The authors made no disclosures.
- 1. Surveillance, Epidemiology, and End Results (SEER) Program. SEER Program research data (1973-2007). Released April 2010, based on the November 2009 submission. Bethesda, MD: National Cancer Institute, DCCPS, Surveillance Research Program, Cancer Statistics Branch; 2010.
- 7. The relative survival rate: a statistical methodology. Natl Cancer Inst Monogr. 1961; 6: 101-121.
- 8. A stochastic study of the life table and its applications. II. Sample variance of the observed expectation of life and other biometric functions. Hum Biol. 1960; 32: 221-238.
- 9. Annotated bibliography of cause-of-death validation studies. Vital Health Stat 2. 1982; 89: 1-42.
- 12. Greene FL, Page DL, Fleming ID, et al. eds. AJCC Cancer Staging Manual, 6th ed. New York: Springer-Verlag; 2002.
- 13. Fundamentals of Biostatistics, 6th ed. Pacific Grove, CA: Duxbury Press; 2005.
- 16. Estimating relative survival for cancer patients from the SEER Program using expected rates based on Ederer I versus Ederer II method. Technical Report 2011-01. Bethesda, MD: Surveillance Research Program, National Cancer Institute; 2011. Available at: http://surveillance.cancer.gov/reports/. [Accessed December 1, 2012.]