The current study was undertaken to evaluate the spatiotemporal projection models applied by the American Cancer Society to predict the number of new cancer cases.
Adaptations of a model that has been used since 2007 were evaluated. Modeling is conducted in 3 steps. In step I, ecologic predictors of spatiotemporal variation are used to estimate age-specific incidence counts for every county in the country, providing an estimate even in those areas that are missing data for specific years. Step II adjusts the step I estimates for reporting delays. In step III, the delay-adjusted predictions are projected 4 years ahead to the current calendar year. Adaptations of the original model include updating covariates and evaluating alternative projection methods. Residual analysis and evaluation of 5 temporal projection methods were conducted.
The differences between the spatiotemporal model-estimated case counts and the observed case counts for 2007 were < 1%. After delays in reporting of cases were considered, the difference was 2.5% for women and 3.3% for men. Residual analysis indicated no significant pattern that suggested the need for additional covariates. The vector autoregressive model was identified as the best temporal projection method.
The current spatiotemporal prediction model is adequate to provide reasonable estimates of case counts. To project the estimated case counts ahead 4 years, the vector autoregressive model is recommended as the temporal projection method because it produced the estimates closest to the observed case counts. Cancer 2012;. © 2012 American Cancer Society.
The number of cancer cases diagnosed in the current calendar year in the United States overall and in each state is not known because the most recent year for which incidence data are available lags 4 years behind, owing to the time required for data collection, compilation, and dissemination.1 Furthermore, high-quality incidence data have not yet been achieved in all states, and the total number of cases for the most recent 1 to 3 data years is incomplete because of delays in reporting.2 For more than half a century,3, 4 the American Cancer Society has published the estimated number of new cancer cases in the current year in the United States overall and in each state to provide broad perspectives on the contemporary cancer burden. These estimates are widely cited in the scientific literature.
The methods used to project the estimated cancer cases ahead of time have evolved over the years as population-based cancer registries have expanded, from a single registry (Connecticut) in the 1950s to nearly national coverage in the 2000s, and as new statistical techniques have been developed. Before 1995, spatial prediction using incidence-to-mortality ratios from the National Cancer Institute's Surveillance, Epidemiology, and End Results (SEER) Program5 was used to project counts to the rest of the nation, which were then projected ahead using a simple linear projection model. From 1995 to 2006, a time series quadratic autoregressive model was used.6 Beginning in 2007, a spatiotemporal regression model with ecologic covariates was used to provide case counts for every state that were then projected ahead 4 years using the joinpoint methodology (segmented linear projection).7 In this article, we first examined the goodness of fit of the spatiotemporal prediction model. We then compared the accuracy of 5 temporal methods for projecting the number of new cancer cases to the current year nationally and at the state or registry level. The remainder of the article is organized as follows. First, the cancer incidence data and covariates used in the spatiotemporal prediction model and in the validation of the temporal projection methods are introduced. Next, the overall modeling process of the current method is described, and 5 temporal projection methods are evaluated for future use. The “Results” section presents the findings and the “Discussion” section provides insights into the performance of each method.
The process used to estimate the numbers of new cancer cases expected in the current calendar year comprises 3 steps.
In step I, a hierarchical Poisson mixed effects model7 is applied to observed data from high-quality cancer registries, as certified by the North American Association of Central Cancer Registries (NAACCR),8 to provide estimates of annual case counts over the available time period for every US county. This step provides an estimated case count even for those states with missing data for a particular year and smooths the observed case counts over time through the modeling process. It can fill in any “holes” in a state's time series before the state became a certified high-quality registry, or fill in “holes” in the map for a year in which some states did not report their data.
In step II, the predicted case counts from step I are summed to the state level and then inflated to account for expected delays in case reporting.9 These delay-adjustment factors range from negligible for some cancers to 15% for leukemia in the most recent reporting year (ie, the final case count is expected to be 15% higher than the count first reported, once all cases have been identified).
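The inflation for reporting delay amounts to multiplying each summed count by its delay-adjustment factor. The following is a minimal Python sketch under stated assumptions: only the 15% leukemia figure comes from the text, whereas the function name, the second cancer site, and all counts are hypothetical.

```python
def delay_adjust(reported_count, delay_factor):
    """Inflate a reported case count by a multiplicative delay-adjustment factor.

    A factor of 1.15 means the final count is expected to be 15% higher
    than the count first reported.
    """
    if delay_factor < 1.0:
        raise ValueError("a delay-adjustment factor should be at least 1")
    return reported_count * delay_factor


# Hypothetical state-level counts for the most recent reporting year.
reported = {"leukemia": 1000, "other site": 5000}
# Hypothetical factors: 15% for leukemia (from the text), negligible otherwise.
factors = {"leukemia": 1.15, "other site": 1.00}

adjusted = {site: delay_adjust(n, factors[site]) for site, n in reported.items()}
```

In practice the factors would vary by sex and cancer site, so the dictionary keys would need to carry those strata as well.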
In step III, the delay-adjusted predicted case counts from step II are projected ahead 4 years to the upcoming calendar year. For this validation test, we projected ahead to 2008, the latest year for which observed data are available.
Because of the complexity of this process, we validated the spatiotemporal prediction and temporal projection steps separately. A residual analysis was performed on the results of step I to determine whether additional covariates or interaction terms were needed. The temporal projection was validated by projecting delay-adjusted observed case counts 4 years ahead, comparing alternative methods by several fit statistics. The temporal projection methods we evaluated are the Nordpred (NP) method, the joinpoint method, the state-space (SS) model, the Bayesian state-space (BSS) method, and the vector autoregressive (VAR) model. The remainder of this section will provide a brief description of the data and methods used for spatiotemporal predictions and for temporal projections. More detailed information can be found in the technical report.10
In the spatiotemporal prediction validation, we used an updated version of the Cancer in North America (CINA) Deluxe incidence data from NAACCR11 that were used in the study by Pickle et al.7 In that article, data were available from 1995 through 2003 and included 40 states, the District of Columbia, and the Detroit metropolitan area, covering 86% of the US population. The updated data set contains data from 1995 through 2007, and includes 46 states and the District of Columbia, covering 95% of the US population.
The covariates for the spatiotemporal model were constructed from various sources. The only information available on the individual cases was their age, gender, race, county of residence, cancer site, and year of diagnosis. Approximately 30 other ecologic covariates10 were available at the county level, including measures of income, education, housing, racial distribution, foreign birth, language isolation, urban/rural status, land area, and Census division (region)12; availability of physicians and hospitals13; health insurance coverage and rates of cigarette smoking, obesity, vigorous activity, and cancer screening14; and rates of mortality due to the same type of cancer.15 Of these, measures of the rates of foreign-born individuals, language isolation, land area, obesity, and vigorous activity have to our knowledge not been included in the previous analyses. Lifestyle and medical facility covariates were updated with more recent values. Approximately 50% of the initial covariates were selected through a principal component analysis to avoid collinearity.
Because we are interested in the validity of the predicted case counts, we calculated error rates as the difference between the predicted number of cases and the reported observed number of cases (delay-adjusted for 2007) divided by the population size and stratified by state, sex, and cancer site. Population-weighted linear regressions were run on the error rates for all cancers combined, for cancers of the prostate, breast, cervix, lung, and colon, and for leukemia, for males and females separately. Only fixed main effects were included. A separate population-weighted log-linear regression was run on county-level relative residuals, defined as the difference between the model-predicted and reported number of cases divided by the reported number plus a small constant to avoid division by 0. Only main effects and their interactions with race were included.
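The two residual definitions used in this validation can be written out directly. The Python sketch below is illustrative only: the function names and all numbers are hypothetical, and because the text does not state the size of the "small constant", the value 0.5 used in the example is an assumption.

```python
def error_rate(predicted, observed, population):
    """Error rate: (predicted - observed) cases per person in the stratum."""
    return (predicted - observed) / population


def relative_residual(predicted, reported, small_constant):
    """Relative residual: (predicted - reported) / (reported + small constant).

    The constant keeps the denominator positive when no cases were reported.
    """
    return (predicted - reported) / (reported + small_constant)


# Hypothetical state stratum: 480 predicted vs 500 observed cases, 1,000,000 people.
rate = error_rate(480, 500, 1_000_000)

# Hypothetical county with zero reported cases; 0.5 is an assumed constant.
resid_zero = relative_residual(2, 0, 0.5)
```

A negative error rate indicates underestimation by the model, and the second call shows that a county with no reported cases still yields a finite residual.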
Not enough incidence data are yet available from every registry to test a projection 4 years ahead for the entire country. US cancer registries reporting data to NAACCR participate in SEER, the Centers for Disease Control and Prevention's (CDC) National Program of Cancer Registries16 (NPCR), or both. The CINA Deluxe incidence data set includes registry data beginning in 1995. The latest CINA Deluxe data set now includes incidence data for 1995 through 2008, a 14-year span. Because one of the methods to test (NP) requires time to be specified in 5-year blocks, we extended the required observed data time span to 15 years. Thus, 19 years of observed data are required for this project: 15 years (1990-2004) for model input plus 4 years for projection ahead (to 2008). Only the older SEER registries (Atlanta, Connecticut, Detroit, Hawaii, Iowa, New Mexico, San Francisco-Oakland, Seattle-Puget Sound, and Utah) can provide this much data. In addition, the remainder of California and the state of New Jersey had available data and gave permission for their use in this project. We used the aggregate of the SEER 9 registries plus these 2 additional areas as a proxy for the United States. All case counts were stratified by gender, race (white, black, or other), age (birth-4 years, 5-9 years,…75-84 years, and ≥ 85 years), diagnosis year, and county of residence. Only malignant cancers of the breast, colon, rectum, esophagus, lung, prostate, and testis; melanoma; and non-Hodgkin lymphoma were included. This selection of cancer sites includes both very common and very rare cancers and is the same set of sites used to develop the original model.7
We evaluated 5 temporal methods including the joinpoint methodology previously used in the study by Pickle et al.7
Joinpoint regression has been applied to cancer trend analysis to summarize cancer trend changes using segmented line regression.17 Cancer incidence rates are modeled as a piecewise linear function of time. The model is fitted by the least-squares method for a given number of change points, called joinpoints, and then the number of joinpoints is estimated. The joinpoint software18 provides the point and interval estimates of the joinpoints and the slope parameters and, equivalently, the annual percentage change rates. To select the number of joinpoints, the first version of the software applied the permutation test procedure, in which the permutation distribution of the test statistic is used to estimate the P value to determine whether the data demonstrate enough evidence to add more joinpoints. Since version 1.0a was released in 1998, the joinpoint software has been updated to improve its accuracy and efficiency and to include additional features. One of the updates is the addition of 2 model selection procedures: the Bayesian information criterion (BIC) and the modified Bayesian information criterion (MBIC). The BIC was added as a faster alternative to the permutation procedure, which is computationally expensive, and the MBIC was proposed to improve the performance of the BIC. In this study, we used version 3.5.1 of the joinpoint software and implemented both the permutation procedure and the MBIC. The joinpoint program allows users to change model specifications. In our comparison, we considered several different choices of these specification values and used the notations “JP-perm-xyz” or “JP-MBIC-xyz” to indicate a joinpoint model with a permutation test or MBIC, a maximum of x joinpoints, a minimum of y observations from a joinpoint to either end of the data, and a minimum of z observations between 2 joinpoints.
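To make the segmented-regression idea concrete, the following Python sketch fits a continuous two-segment line by least squares, searching a grid of candidate joinpoints, and then extends the last segment forward. It is a simplified illustration and not the joinpoint software: the number of joinpoints is fixed at 1 rather than chosen by a permutation test or the MBIC, the function names and data are hypothetical, and the `min_end` argument only loosely mirrors the "minimum of y observations from a joinpoint to either end" setting.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    Assumes the matrix is nonsingular (true for the designs used below).
    """
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / m[i][i]
    return x


def fit_one_joinpoint(times, rates, min_end=2):
    """Fit rate = b0 + b1*t + b2*max(t - tau, 0), choosing tau by minimum SSE."""
    best = None
    for tau in times[min_end:-min_end]:
        rows = [[1.0, t, max(t - tau, 0.0)] for t in times]
        # Normal equations (X'X) beta = X'y for this candidate joinpoint.
        xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
        xty = [sum(r[i] * y for r, y in zip(rows, rates)) for i in range(3)]
        beta = solve_linear(xtx, xty)
        sse = sum((y - sum(b * v for b, v in zip(beta, r))) ** 2
                  for r, y in zip(rows, rates))
        if best is None or sse < best[0]:
            best = (sse, tau, beta)
    return best


def project(beta, tau, t):
    """Extend the fitted last segment to a future time t."""
    return beta[0] + beta[1] * t + beta[2] * max(t - tau, 0.0)


# Hypothetical noise-free series: slope +2 up to t = 5, slope -1 afterwards.
times = [float(t) for t in range(10)]
rates = [10 + 2 * t - 3 * max(t - 5, 0) for t in times]

sse, tau, beta = fit_one_joinpoint(times, rates)
projection_4_ahead = project(beta, tau, 13.0)  # last observed time is 9
```

The projection uses only the slope of the final segment (`beta[1] + beta[2]`), which is why the placement of the last joinpoint matters so much for a 4-year-ahead estimate.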
The NP model used in this analysis is essentially model B1 from the study by Moller et al.19 Model B1 is an age period cohort model,20 with incidence counts assumed to follow a Poisson distribution with linear effects of age, period, and cohort linked to the expected rate through a fifth root function. The period effect is further modeled as a linear regression on period with the regression coefficient being the drift parameter. In this analysis, predictions involved only 25% of the drift regression parameter for the first future period estimated rates and no drift for the second future period estimated rates. Age and period effects are for 5-year groupings. Five-year predictions are translated to single-year estimates using linear interpolation and to projections 4 years ahead.
The SS method has been used to project the cancer mortality counts 3 years ahead to the current calendar year since 2004.21 In the SS method, we used a local quadratic model to obtain projections of the time series of incidence counts 4 years into the future. The method proceeds as follows. First, to account for the uncertainty in measuring the observed incidence counts, the count at any point of time is assumed to be a realization of a normal distribution. This is the so-called measurement equation. Next, we modeled the average trajectory (ie, the trend in incidence counts obtained by joining the averages of these normal distributions) using a locally varying quadratic trend. Such a trend is obtained by representing the year-to-year variation of trend in the form of a so-called SS model, whereby one relates the parameters (or state) of the model at a particular point to those of the previous point through a set of transition equations. These transitions are assumed to have a stochastic component and, as before, errors in these transitions are assumed to have normal distributions. An iterative procedure called the Kalman filter can then be used to estimate the trend and project it into the future. The whole model fitting and projection are done using the R statistical software.22
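The measurement equation, transition equations, and Kalman filter described above can be sketched in a few lines. The Python example below is a deliberately simplified illustration and not the authors' R implementation: it uses a local linear trend (level and slope) instead of the local quadratic trend, and the three variances are supplied by hand as hypothetical values rather than estimated from the data.

```python
def local_linear_trend_forecast(y, steps, q_level, q_slope, r):
    """Kalman filter for a local linear trend model, then forecast ahead.

    State = (level, slope).
    Transition:  level_t = level_{t-1} + slope_{t-1} + noise (variance q_level)
                 slope_t = slope_{t-1} + noise (variance q_slope)
    Measurement: y_t = level_t + normal error (variance r)
    Returns the forecasts for 1, 2, ..., `steps` years after the last observation.
    """
    level, slope = 0.0, 0.0
    # Diffuse (very uncertain) prior; p01 is the level-slope covariance.
    p00, p01, p11 = 1e7, 0.0, 1e7
    for obs in y:
        # Update: blend the predicted level with the new observation.
        s = p00 + r                    # innovation variance
        k0, k1 = p00 / s, p01 / s      # Kalman gains for level and slope
        innovation = obs - level
        level += k0 * innovation
        slope += k1 * innovation
        p00, p01, p11 = p00 - k0 * p00, p01 - k0 * p01, p11 - k1 * p01
        # Predict: move the state one year ahead through the transition equations.
        level += slope
        p00, p01, p11 = p00 + 2 * p01 + p11 + q_level, p01 + p11, p11 + q_slope
    # After the loop, `level` is already the one-year-ahead prediction.
    return [level + h * slope for h in range(steps)]


# Hypothetical incidence counts rising by exactly 5 cases per year.
counts = [100.0 + 5.0 * t for t in range(10)]
forecasts = local_linear_trend_forecast(counts, steps=4,
                                        q_level=0.1, q_slope=0.01, r=1.0)
```

Larger transition variances make the fitted trend follow recent fluctuations more closely, whereas smaller values yield a smoother trend.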
As an alternative method of projection of time series, we explored the use of dynamic generalized linear models fitted in the Bayesian paradigm. We will refer to this overall method as the BSS method. The method proceeds as follows. Because the observed values are annual incidence counts, we assume that for any year, the observed count is a realization of a random variable, taken for our purposes to have the Poisson distribution. The parameters of the Poisson distributions are assumed to be unknown and vary smoothly from year to year according to a random process. In particular, we assume that the logarithm of the parameter (henceforth called the state) in any year is a normal perturbation around the state of the previous year, with the amount of variation in this perturbation constant from year to year. All this is used to put together the likelihood function of the data, which contains all the information on the unknown model parameters from the data. Combined with prior information concerning the unknown parameters in the form of distribution of the initial state (at time t = 0) and the variance of the year-to-year transition of the states, the likelihood is used to generate the posterior distribution of the parameters. Although the entire posterior distribution is unavailable in closed form, the univariate conditionals are easy to sample from. The generated predictions are estimates of the posterior mean of the unknown quantities of interest.
Spectral analysis is a complementary tool for analyzing time series. Fourier transforms, wavelet transforms, and some other spectral analysis methods have been developed to analyze stationary time series data. These techniques are widely used in electrical engineering, physics, and many other fields. Recently, the Hilbert-Huang transform (HHT) based on empirical mode decomposition (EMD) was developed for nonlinear and nonstationary processes.23 The advantage of HHT-EMD is that it does not require a set of prespecified functions. Instead, it uses a set of adaptive intrinsic mode functions (IMFs) derived from the time series data itself. To project the incidence counts, we applied the EMD to decompose the data into several IMFs and then applied a multivariate time series technique to these IMFs for projections. Among the many multivariate time series models, the VAR model is a standard instrument for forecasting. It is a multivariate version of the autoregressive model. Although univariate models may be useful for describing short-term correlation, multivariate models may provide a better description of the underlying structure of the time series and a better forecast. The R software package is used for obtaining the forecast counts.22
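The forecasting part of this approach can be illustrated with a first-order VAR fitted by least squares. The Python sketch below rests on several assumptions that are not taken from the text: it skips the EMD step entirely and treats two hypothetical component series as given, it omits the intercept, and it fixes the lag order at 1. All function names are illustrative.

```python
def fit_var1(series_a, series_b):
    """Least-squares fit of y_t = A y_{t-1} for two series (no intercept)."""
    pairs = list(zip(series_a, series_b))
    sxx = [[0.0, 0.0], [0.0, 0.0]]  # sum of y_{t-1} y_{t-1}^T
    sxy = [[0.0, 0.0], [0.0, 0.0]]  # sum of y_t y_{t-1}^T
    for prev, curr in zip(pairs[:-1], pairs[1:]):
        for i in range(2):
            for j in range(2):
                sxx[i][j] += prev[i] * prev[j]
                sxy[i][j] += curr[i] * prev[j]
    det = sxx[0][0] * sxx[1][1] - sxx[0][1] * sxx[1][0]
    inv = [[sxx[1][1] / det, -sxx[0][1] / det],
           [-sxx[1][0] / det, sxx[0][0] / det]]
    # A = S_xy * S_xx^{-1}
    return [[sum(sxy[i][k] * inv[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def forecast_var1(a_matrix, last, steps):
    """Iterate y_{t+1} = A y_t starting from the last observed pair."""
    out, state = [], list(last)
    for _ in range(steps):
        state = [a_matrix[0][0] * state[0] + a_matrix[0][1] * state[1],
                 a_matrix[1][0] * state[0] + a_matrix[1][1] * state[1]]
        out.append(state)
    return out


# Hypothetical noise-free component series generated from a known matrix.
true_a = [[0.5, 0.1], [0.2, 0.3]]
history = [[1.0, 2.0]]
for _ in range(11):
    history.extend(forecast_var1(true_a, history[-1], 1))

estimated_a = fit_var1([h[0] for h in history], [h[1] for h in history])
forecasts = forecast_var1(estimated_a, history[-1], 4)  # 4 years ahead
```

Because each component's forecast depends on the lagged values of both components, the model can carry information across components that separate univariate autoregressions would ignore.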
To measure the accuracy of the predicted counts, the prediction error was defined as the difference between the predicted incidence counts and the delay-adjusted observed incidence counts in 2008. Relative deviation was defined as the ratio of the prediction error to the observed counts plus 0.5. The 0.5 was added to the observed counts to avoid division by 0. To compare the temporal projection methods, 6 statistics were computed. The Average Absolute Relative Deviation (AARD) is the average of the absolute values of the relative deviations across all cancer sites and/or geographic areas in the data. AARD is interpreted as the average percentage deviation from the true value. Because the deviations are relative, this measure takes into account differences in the size of the observed incidence counts when assessing the extent to which the estimates deviate from the observed values. Smaller values of AARD indicate estimates that are closer to the true values across different cancer sites and geographic areas.
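The AARD definition translates directly into code. In this Python sketch the function names and the two predicted/observed pairs are hypothetical; the 0.5 in the denominator follows the definition given in the text.

```python
def relative_deviation(predicted, observed):
    """Prediction error divided by (observed count + 0.5)."""
    return (predicted - observed) / (observed + 0.5)


def aard(predicted_counts, observed_counts):
    """Average of the absolute relative deviations across all strata."""
    devs = [abs(relative_deviation(p, o))
            for p, o in zip(predicted_counts, observed_counts)]
    return sum(devs) / len(devs)


# Hypothetical strata: one overestimate (+10.5 cases), one underestimate (-4.5).
predicted = [110.0, 45.0]
observed = [99.5, 49.5]
score = aard(predicted, observed)  # mean of |0.105| and |-0.09|
```

Taking absolute values before averaging prevents overestimates and underestimates from canceling each other.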
Other measures of discrepancy were also computed to compare the projection methods. The maximum absolute relative deviation measures the largest deviation from the observed values. The mean relative sums of squares deviation is similar to AARD, except that the deviations are squared, resulting in higher weights being applied to larger deviations in the average. The root mean square error (RMSE) is an estimate of the variability of the estimates about the true value. The normalized RMSE is the RMSE expressed as a fraction of the mean. The average rank of the relative sums of squares is the average rank of deviations among the methods. These measures produce results similar to AARD; therefore, in the current study, we used AARD as the default measure for comparison.
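Four of these additional measures can be sketched in Python as follows. The sketch rests on assumptions beyond the text: relative deviations reuse the denominator of observed count plus 0.5, the normalized RMSE is divided by the mean of the observed counts, and the rank-based measure is omitted because it requires results from several methods. The function name and data are hypothetical.

```python
import math


def discrepancy_measures(predicted, observed):
    """Compute several discrepancy measures for one projection method."""
    errors = [p - o for p, o in zip(predicted, observed)]
    rel = [e / (o + 0.5) for e, o in zip(errors, observed)]
    n = len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    return {
        "max_abs_rel_dev": max(abs(d) for d in rel),
        "mean_rel_sq_dev": sum(d * d for d in rel) / n,
        "rmse": rmse,
        # Assumption: normalize by the mean observed count.
        "nrmse": rmse / (sum(observed) / n),
    }


# Same hypothetical strata as a simple two-stratum comparison.
measures = discrepancy_measures([110.0, 45.0], [99.5, 49.5])
```

Unlike the relative measures, the RMSE is on the scale of case counts, so it is dominated by the strata with the largest counts.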
Initial comparisons of the totals of the previously published predicted state case counts24 for 2007 with the observed data (released in 2010) demonstrate close agreement (Table 1). However, the predicted counts were adjusted for an expected delay in reporting whereas the 2007 observed counts were not. For comparability, the observed counts were then adjusted for expected reporting delay by multiplying them by the sex- and cancer site-specific delay adjustment factor.2, 9 For most cancer sites, this factor is constant across race and age. For the few sites for which this factor is age-dependent, the age ≥ 65 years delay factor was used (consistent with the median age of diagnosis for most cancers) because age is not provided on the state data file. As shown in Table 2, the published case counts underestimated the reported (and delay-adjusted) counts for all cancer sites combined in 2007 by 2.5% for women and 3.3% for men. A similar comparison stratified by cancer site indicated that breast and prostate cancer case counts were greatly underestimated (by 27,042 and 12,874 cases, respectively) and colon cancer was slightly overestimated (by 6313 cases), whereas the other 45 cancer sites demonstrated close agreement between observed and predicted counts (Fig. 1). Proportionally, several of the rare cancer sites were predicted less accurately, which is not surprising given the greater variability of small numbers. For the more common sites, the numbers of cancers of the breast, cervix, and liver were found to be underestimated by > 10% (Fig. 2). Examination of differences by state demonstrates that case counts are more likely to be underestimated in southern states than elsewhere.
|Males and Femalesa||Males||Females|
|Model projected no.||1,376,801||714,935||661,865|
|Observed no., delay-adjusted||1,416,451||738,588||677,863|
Between 27% and 59% of the total variation is explained by the regression models of 2007 state error rates. The significance of each covariate varies by cancer site, but measures of language isolation or foreign birth, poverty, and cancer screening have consistently significant effects on the error rates. Influence diagnostics (Cook distance) have shown that Hawaii is very influential on the results for prostate and all cancers among males (a very high percentage of Asian Pacific Islanders), and the District of Columbia is very influential with regard to the results for male lung cancer (very high percentage of black individuals, percentage urban residence, and densities of medical physicians). There were no clear spatial patterns in the distribution of the error rates.
As a more specific validation of the spatiotemporal prediction model, we repeated the regression analysis of relative residuals for 2003 data, updated in the 2007 CINA file, stratified by county, gender, race, and cancer site. The use of 2003 data removed the need to project ahead to 2007 and to adjust for delay, because case ascertainment was nearly complete by that time. The larger errors are in county/race/gender strata with < 10,000 people. Despite the inclusion of several new covariates and interactions with race, the percentage variance explained is small: 7% for males, all cancers; 6% for females, all cancers; 6% for prostate cancer; and 5% for breast cancer.
Because there was little consistency with regard to the significance of the covariates across the types of cancer at either the state or county level, because the more detailed models of county residuals could not explain > 7% of their total variation, and because there was minimal spatial trend in the residuals, we conclude that the spatiotemporal model with the original set of covariates still appears to provide reasonable estimates of the state-level case counts across the time span of the observed data.
One question regarding the prediction process is whether the delay adjustment factors calculated from SEER 9 data are applicable for all US cancer registries. In Table 3, we note that the number of breast cancer cases reported for 2003 on the 2003 file (released in 2006) was increased by 3.8% in SEER registries and 5.1% in NPCR registries after 4 more years of data collection (2007 data released in 2010).
|Female Breast Cancer, 2003||No. of Cases in 2007 File||No. of Cases in 2003 File||Ratio of 2007 to 2003 File|
We summarize the comparison of the temporal projection methods using AARD at the US level (Table 4) and the state/registry level (Table 5). As discussed earlier, for the purposes of this analysis, we considered the aggregate of the SEER 9 registries plus the rest of California and the state of New Jersey as a proxy for “the nation,” and the individual SEER 9 registries plus the rest of California and New Jersey as a proxy for 10 “states” (the San Francisco-Oakland registry and the remainder of California were combined as 1 state). At the US level (Table 4), across the 15 cancer sites included in the current study, all the temporal projection methods produced estimates whose differences from the observed incidence counts for 2008 were < 9%. The BSS method produced estimates that were slightly closer to the observed counts than those of the VAR model and the other methods. We then grouped the cancer sites according to the delay-adjusted observed incidence counts in 2008 so that group “< 1000” indicates a rare cancer and group “> 20,000” indicates a very common cancer. In general, the temporal projection methods perform better for common cancers than for rare cancers. For rare cancers, BSS is the best method, but for the most common cancers, VAR produces estimates that are closest to the observed incidence counts. In Table 5, AARD was calculated by summing the relative differences stratified by registry as well as the delay-adjusted observed incidence counts in 2008. Stratified this way, all the temporal projection methods produced estimates whose differences from the observed incidence counts for 2008 were < 12%. The VAR method produced the estimates that were closest to the observed incidence counts. The grouping was done according to the registry and cancer site combination, so that group “<400” includes rare cancers in a registry with a small population and group “401-1000” includes moderately rare cancers in a small registry or rare cancers in a medium-sized registry, and so on.
Table 5 shows that the VAR model is the best temporal projection method at the state/registry level in that it outperforms the other methods when all cancer sites and registries are included in the comparison, and also in 3 of the 5 groups. Table 6 summarizes the number of times each method was the best temporal projection model. VAR was judged the best method overall because it ranked first in 5 of the 10 groups across the national and the state/registry-level comparisons.
|Level||JP-Perm-234||JP-Perm-233||JP-Perm-244||JP-MBIC-233||JP-MBIC-234||JP-MBIC-244||NP||SS||BSS||VAR|
|Group||JP-Perm-234||JP-Perm-233||JP-Perm-244||JP-MBIC-233||JP-MBIC-234||JP-MBIC-244||NP||SS||BSS||VAR|
|All cancer sites||0.054||0.053||0.048||0.052||0.052||0.052||0.084||0.065||0.040||0.044|
|Group||JP-Perm-234||JP-Perm-233||JP-Perm-244||JP-MBIC-233||JP-MBIC-234||JP-MBIC-244||NP||SS||BSS||VAR|
|All registries, all cancer sites||0.097||0.099||0.098||0.099||0.099||0.099||0.111||0.116||0.102||0.094|
Five years have passed since the spatiotemporal method was developed. Many factors that may affect cancer incidence have changed, including, but not limited to, changes in population, screening, diagnostic technology, behavior, and other risk factors. A reevaluation of the projection methods is therefore timely.
We first validated the spatiotemporal prediction model that fills in “holes” in observed incidence case counts. Even with an updated and expanded list of covariates in the model, no substantial improvement was found and therefore we concluded the model provides reasonable estimates of state- and national-level case counts across the time span of the observed data.
One concern about the second step in the 3-step prediction process is the application of delay-adjustment factors that were derived only from SEER 9 data to all registry data. That is, the data collection systems for SEER registries and NPCR registries are different and therefore their reporting delay patterns may not be the same. Our initial comparison of changes in the number of cases after 4 additional years of data collection suggests that the delays in reporting are not the same in SEER and NPCR registries. A study is currently underway to derive more accurate delay-adjustment factors for all US registries.
Next, we performed a more thorough search for a better temporal projection method with which to project 4 years ahead for cancer incidence counts, the third step of the 3-step prediction process. Five projection methods were evaluated and compared. At the US level, on average, the BSS method produced projections that were closer to the observed counts than other projections. However, for the most common cancer sites, the VAR model outperformed the BSS method. At the state level, the VAR model produced projections that were closest to the observed counts. Although these 2 methods could be recommended (the BSS method at the US level and the VAR model at the state level), the decision was made to use a single model, the VAR model, at both the US and the state levels starting with the 2012 version of Cancer Facts & Figures and Cancer Statistics, 2012.
The VAR model is capable of capturing subtle changes in incidence trends resulting from changes in population, screening guidelines, diagnostic technology, etc. BSS performs very well in many cancer sites at both the US level and the state level. However, the BSS method presents more of a computational burden, especially when the algorithm needs to be run on a large number of cancer sites and registries. Although the method previously used (JP-perm) performs relatively well, especially for moderately rare and moderately common cancer sites, the VAR method appears to outperform it enough to justify changing the methods. NP has been used in the Nordic countries and the United Kingdom to predict cancer incidence and mortality.19, 25 The model does not perform as well for the US incidence counts, possibly due to the shorter time series of the US incidence data. The SS model is less sensitive to the fluctuations in cancer incidence counts, but may perform just as well in less “noisy” series.
Because each method has its strengths and limitations, it is difficult to find 1 method that is superior for every cancer site and every registry. It is interesting to note that for mortality, joinpoint regression was selected as the best method because it provided the most accurate projections at the national level and performed reasonably well at the state level. Similarly, VAR was selected for incidence projection because it was the best method overall at the state level and for the most common cancer sites at the national level. In general, incidence is a more volatile measure than mortality, and is sometimes affected by the rapid introduction of screening such as prostate-specific antigen testing,26 changing risk factors (eg, withdrawal of hormone replacement therapy),27 or new medical technologies.28 Changes in mortality are generally attenuated because the deaths in a particular year are a blend of cases diagnosed across many years. The VAR method is more adaptable to rapid changes in the number of new cases.
Therefore, it is important to revisit the projection methods periodically to account for changes in population, screening and diagnostics, and risk factors, as well as the development of new statistical methods, to provide the most accurate estimates of cancer cases in the current year. These estimates are widely cited in the scientific literature and are used by state and local jurisdictions to allocate scarce resources in their cancer prevention and control efforts. Actual incidence data from cancer registries for the most recent year lag 4 years behind because of the time required for data collection, compilation, and dissemination. Furthermore, these data are affected by delays in reporting and by the subnational coverage of high-quality registries. Overall, the estimates provided by the projection methods fulfill the need for contemporary estimates that are as accurate as possible.
No specific funding was disclosed.
CONFLICT OF INTEREST DISCLOSURES
Dr. Pickle's work is supported by National Institutes of Health (NIH) contract HHSN261201100094P. Dr. Ghosh's work is supported by NIH contract HHSN261201000671P. Dr. Kim's work is partially supported by NIH contract HHSN261201000509P.