All‐cause mortality versus cancer‐specific mortality as outcome in cancer screening trials: A review and modeling study

Abstract
Background: All‐cause mortality has been suggested as an end‐point in cancer screening trials in order to avoid biases in attributing the cause of death. The aim of this study was to investigate which sample size and follow‐up is needed to find a significant reduction in all‐cause mortality.
Methods: A literature review was conducted to identify previous studies that modeled the effect of screening on all‐cause mortality. Microsimulation modeling was used to simulate breast cancer, lung cancer, and colorectal cancer screening trials. Model outputs were: cancer‐specific deaths, all‐cause deaths, and life‐years gained per year of follow‐up.
Results: There were large differences between the evaluated cancers. For lung cancer, when 40 000 high‐risk people are randomized to each arm, a significant reduction in all‐cause mortality could be expected between 11 and 13 years of follow‐up. For breast cancer, a significant reduction could be found between 16 and 26 years of follow‐up for a sample size of over 300 000 women in each arm. For colorectal cancer, 600 000 persons in each arm were required to be followed for 15‐20 years. Our systematic literature review identified seven papers, which showed highly similar results to our estimates.
Conclusion: Cancer screening trials are able to demonstrate a significant reduction in all‐cause mortality due to screening, but require very large sample sizes. Depending on the cancer, 40 000‐600 000 participants per arm are needed to demonstrate a significant reduction. The reduction in all‐cause mortality can only be detected between specific years of follow‐up, more limited than the timeframe to detect a reduction in cancer‐specific mortality.

be classified under other cause deaths, instead of deaths related to the cancer. Because more people in the screen arm can experience this cause of death, this bias is in favor of screening. 1 The second is sticky diagnosis bias: because the target cancer will be diagnosed more frequently in the screened group than in the control group, deaths may be more likely to be attributed to the target cancer in the screened group. Therefore, the cancer-specific mortality will be biased against screening. 1 Third, a decrease in cancer-specific mortality should not be offset by an increase in deaths from other causes (corrected for follow-up).
All-cause mortality is not affected by these biases. However, the major drawback is that, since only a few percent of individuals in a screening trial will die from the cancer being screened for, the power of a screening trial to detect a difference in all-cause mortality is very low. Even the most common cancers account for only 3%-4% of all deaths. Thus, a 20% cancer-specific mortality reduction would translate to at most a 0.8% reduction in all-cause mortality. Therefore, to detect a significant reduction in all-cause mortality the trial would require a large sample size, estimated at up to 2.6 million participants. 5-9 Nevertheless, many reviews and commentaries have criticized screening trials for the lack of a reduction in all-cause mortality. 3,10-12
To date, the only cancer screening trial targeting a single cancer type that showed a significantly reduced all-cause mortality is the US National Lung Screening Trial (NLST). 13 In this trial 26 722 participants were randomized to low-dose computed tomography (CT) screening and 26 732 participants to chest radiography screening. After 6.5 years of follow-up the lung cancer mortality rate ratio (RR) was 0.80 (95% confidence interval (CI): 0.73-0.93) for the CT arm compared to the radiography arm, and the all-cause mortality rate ratio was 0.93 (95%CI: 0.86-0.99). 13 No other single-cancer screening trial has shown a significant difference in all-cause mortality. Even some large trials (more than 100 000 participants) that did show a reduction in cancer-specific mortality, such as the Two-county (breast cancer), 14,15 ERSPC (prostate cancer), 16 and PLCO, UKFSST, and Nottingham (colorectal cancer) trials, 17-19 failed to show a statistically significant effect on all-cause mortality. A meta-analysis of the Swedish breast cancer trials (247 010 participants) showed a nonsignificant effect on all-cause mortality (RR 0.98; 95%CI: 0.96-1.00). 14 For colorectal cancer, a meta-analysis of four flexible sigmoidoscopy trials, including 458 000 participants, found a statistically significant effect on all-cause mortality (RR 0.975; 95%CI: 0.959-0.992). 20 Recently, the Prostate Lung Colorectal Ovarian (PLCO) trial, including 154 887 participants screened for three cancers, showed a reduction in all-cause mortality (RR 0.966; 95%CI: 0.943-0.989). 21
Aside from the cause of death, the timing of evaluating the effects of screening is also important. 22 In the first years after the start of a screening trial, no substantial difference in cancer-specific or all-cause mortality can be expected. However, after a long follow-up, when almost all participants have died, no difference in all-cause mortality can be expected, while a reduction in cancer-specific mortality could still be detected.
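The dilution argument above can be made concrete with a short calculation. The sketch below is illustrative only: it uses the textbook normal-approximation sample size formula for comparing two proportions, and the 20% cumulative all-cause mortality in the control arm, the 5% two-sided significance level, and the 80% power are assumptions chosen for illustration, not values taken from this study. The function names are hypothetical.

```python
import math

def all_cause_reduction(cancer_share, cancer_specific_reduction):
    """Relative all-cause mortality reduction if only cancer deaths are affected."""
    return cancer_share * cancer_specific_reduction

def participants_per_arm(p_control, relative_reduction,
                         z_alpha=1.96, z_beta=0.8416):
    """Approximate participants per arm to detect a difference in two proportions.

    z_alpha: two-sided 5% significance; z_beta: 80% power (both assumptions).
    """
    p_screen = p_control * (1.0 - relative_reduction)
    variance = p_control * (1.0 - p_control) + p_screen * (1.0 - p_screen)
    return math.ceil((z_alpha + z_beta) ** 2 * variance
                     / (p_control - p_screen) ** 2)

# A 20% cancer-specific reduction when the cancer causes 3%-4% of all deaths.
for share in (0.03, 0.04):
    print(f"share {share:.0%}: all-cause reduction "
          f"{all_cause_reduction(share, 0.20):.1%}")

# Assumed: 20% of control-arm participants die during follow-up.
n = participants_per_arm(0.20, all_cause_reduction(0.04, 0.20))
print(f"participants per arm: {n:,}")
```

Under these assumptions the required number per arm is on the order of a million, the same order of magnitude as the published estimates cited above. A static formula like this ignores lead-time and the changing effect over follow-up, which is why the present study relies on microsimulation instead.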
The aim of this study was to assess, for lung, breast, and colorectal cancer screening: (a) the currently available evidence on the possible effect of screening on all-cause mortality; and (b) using three simulated screening trials, the sample size and follow-up period needed to find an all-cause mortality reduction due to cancer screening. The results of this study can be used to inform the debate on all-cause mortality as an endpoint of screening trials.

| Systematic review
We performed a systematic review of the Scopus and Web of Science databases to find previous modeling studies that evaluated the effect of screening programs on all-cause mortality. The query consisted of four linked baskets of keywords. The first basket was the cancer sites: breast, lung, colorectal (colon and rectal were also used separately). The second focused on synonyms for screening (including early diagnosis, early detection, and cancer prevention). The third basket focused on combinations of phrases describing outcomes, including all-cause mortality, overall mortality, all-cause death, and overall death. The fourth basket included keywords for modeling. In Scopus, keywords were limited to title/abstract for the phrases describing the cancer types and screening. In both databases two additional filters were applied: record type (Article or Review) and language (English) (Appendix 1). The records from the databases were downloaded on 20 November 2018.
The hits were checked for duplicates. All papers were screened on title and abstract by two independent researchers. On the basis of the predefined study eligibility criteria, we defined the following exclusion categories: no abstract/no author, not lung/breast/colorectal cancer, not cancer screening, not modeling, no mortality data. Disagreements between the independent researchers regarding inclusion were resolved by consensus. Two independent researchers conducted the full-text review of all included papers. The full-text review applied the following exclusion criteria: no population-level data on overall mortality or life-years gained, data not based on modeling, and data available only on life-years gained.
The included articles were subjected to duplicate data extraction, completed by two experts independently. Disagreements were resolved by consensus.

| MISCAN modeling
To evaluate the effect of screening on cancer-specific mortality, all-cause mortality, and life-years gained, we used the MIcrosimulation SCreening ANalysis (MISCAN) lung, breast, and colorectal cancer models. The natural history of cancer is modeled by a progression through preclinical stages. At each preclinical stage, a tumor may be clinically diagnosed or progress to the next preclinical stage. Screening may detect the tumor in an earlier preclinical stage, which can improve the prognosis.
The lung cancer model uses a two-stage clonal expansion model, which estimates a person's risk of lung cancer as a function of age and smoking history. The model simulates the natural history of lung cancer for four different histologies: adenocarcinoma, squamous cell carcinoma, other non-small-cell carcinoma, and small cell carcinoma. The parameters of the model are calibrated to the NLST and the PLCO trial. 13,23 A detailed description of the model can be found in ten Haaf et al 2015. 24
In the breast cancer model, the natural history of breast cancer is modeled as a progression through five preclinical stages (DCIS, T1A, T1B, T1C, and T2+). Survival after clinical diagnosis or screen detection is based on data of the Dutch nationwide screening program. Survival rates after screen detection are estimated using data from the Swedish randomized controlled trials. 14,15,25,26 Probabilities of receiving adjuvant treatment (hormonal therapy, chemotherapy, or a combination of the two) and survival rates are incorporated using data from Dutch regional comprehensive cancer centers (by age, stage, and calendar year) and from the Early Breast Cancer Trialists' Collaborative Group (EBCTCG) meta-analysis. 27 A detailed description of the model has been published before. 28
In the colorectal model, multiple adenomas can occur and can progress from small (≤5 mm), to medium (6-9 mm), to large adenomas (≥10 mm) and eventually to cancer stage I-IV. The parameters of the model were calibrated using data on the age-specific, stage-specific, and localization-specific incidence of colorectal cancer in the Netherlands (before the introduction of screening), the age-specific prevalence of adenomas as reported in autopsy studies, and the results of several screening trials. 17,29,30 The model is described in detail in van der Meulen et al. 31
Three hypothetical cancer screening trials were modeled: annual CT lung cancer screening for ages 55-80 for men and women who smoked at least 30 pack-years and who currently smoke or quit less than 15 years ago (United States Preventive Services Task Force recommendations); biennial breast cancer mammography screening for women aged 50-69; and one-time flexible sigmoidoscopy for men and women aged 55-75. The attendance rates were assumed to be 75% for lung cancer, 80% for breast cancer, and 73% for colorectal cancer. In the simulated control arms participants were not screened. We modeled populations with a uniform age distribution among the eligible screening ages at the start of each trial, because most trials are designed that way. Therefore, some of the simulated individuals will have had only one invitation to attend a screen.
The models used a cure rate to model the effect of screening: patients with a screen-detected cancer were either cured (and no longer died from the cancer) or were not cured and died at the same time they would have died without having been screened. The proportion that was cured and the baseline survival were both dependent on cancer stage and age at diagnosis. In the colorectal model, screen-detected cases were assigned the survival of clinically detected cases that were one stage earlier. This was because the stage-specific survival of screen-detected colorectal cancer cases as seen in RCTs on guaiac fecal occult blood testing was substantially more favorable than that of clinically detected colorectal cancer, even after correcting for lead-time bias. 32
The outputs of the models were the number of cancer-specific deaths, all-cause deaths, and the life-years (until all-cause death), for each year of follow-up. The simulations were performed with a sample size of 10 million people eligible for screening, to reduce stochastic variation.
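The cure-rate mechanism described above can be illustrated with a toy simulation. This is a minimal sketch and not the MISCAN models: the exponential waiting times, the 10% cancer risk, and the 20% cure probability are hypothetical, every cancer is treated as screen-detected in the screen arm, and the function names are invented for illustration. Both arms use the same random seed so that they contain identical people and differ only in who is cured.

```python
import math
import random

def simulate_arm(n, screened, seed=42, cancer_risk=0.10, cure_prob=0.20):
    """Return (death_time, died_of_cancer) per person for one toy trial arm."""
    rng = random.Random(seed)  # same seed in both arms: identical people
    people = []
    for _ in range(n):
        # Always draw all four numbers so the two arms stay in step.
        other_death = rng.expovariate(1 / 25)   # years to other-cause death
        gets_cancer = rng.random() < cancer_risk
        cancer_death = rng.expovariate(1 / 10)  # years to cancer death
        cured = rng.random() < cure_prob
        # Cured patients never die from the cancer; uncured patients die
        # at exactly the time they would have died without screening.
        if not gets_cancer or (screened and cured):
            cancer_death = math.inf
        death_time = min(other_death, cancer_death)
        people.append((death_time, cancer_death < other_death))
    return people

def deaths_by(people, year):
    """Cumulative (cancer deaths, all-cause deaths) up to a follow-up year."""
    cancer = sum(1 for t, from_cancer in people if t <= year and from_cancer)
    total = sum(1 for t, _ in people if t <= year)
    return cancer, total

control = simulate_arm(20_000, screened=False)
screen = simulate_arm(20_000, screened=True)
for year in (5, 10, 20, math.inf):  # math.inf means life-time follow-up
    c_cancer, c_all = deaths_by(control, year)
    s_cancer, s_all = deaths_by(screen, year)
    print(year, "cancer deaths prevented:", c_cancer - s_cancer,
          "all-cause deaths prevented:", c_all - s_all)
```

In this sketch the difference in all-cause deaths is never larger than the difference in cancer deaths, and it returns to zero at life-time follow-up because everyone eventually dies, while the difference in cancer deaths persists.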
For each year of follow-up, a two-sided 95% confidence interval was calculated for the rate ratio of the number of cancer-specific deaths and of all-cause deaths between the arms. When the confidence interval of the rate ratio was entirely below 1, the result was considered statistically significant. The outputs of the runs were used to estimate the expected effects for sample sizes between 2000 and 600 000 individuals in each arm (in different step sizes, as shown in Figures 2-4), scaled from the 10 million simulated.
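The scaling and significance check described above can be sketched as follows. The text does not state which interval formula was used, so the log-normal (Wald) interval below is an assumption, as is the simplification that both arms contribute equal person-time so that the ratio of death counts approximates the rate ratio. The death counts are hypothetical and are not model outputs; the function names are invented for illustration.

```python
import math

def rate_ratio_ci(deaths_screen, deaths_control, z=1.96):
    """Rate ratio with a two-sided 95% Wald interval on the log scale.

    Assumes equal person-time in both arms (a simplification).
    """
    rr = deaths_screen / deaths_control
    se = math.sqrt(1 / deaths_screen + 1 / deaths_control)
    return rr, rr * math.exp(-z * se), rr * math.exp(z * se)

def scaled_ci(deaths_screen_sim, deaths_control_sim, n_sim, n_trial):
    """Scale expected deaths from the simulated arm size to a trial arm size."""
    factor = n_trial / n_sim
    return rate_ratio_ci(deaths_screen_sim * factor,
                         deaths_control_sim * factor)

# Hypothetical cumulative all-cause deaths per 10 million simulated per arm.
N_SIM = 10_000_000
DEATHS_SCREEN, DEATHS_CONTROL = 1_960_000, 2_000_000

for n_trial in (40_000, 600_000):
    rr, lower, upper = scaled_ci(DEATHS_SCREEN, DEATHS_CONTROL, N_SIM, n_trial)
    print(f"n = {n_trial:>7,} per arm: RR {rr:.3f} "
          f"(95% CI {lower:.3f}-{upper:.3f}), significant: {upper < 1}")
```

With the same underlying rate ratio, the interval narrows as the arm size grows, so a result that is not significant for a small trial can become significant for a large one.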

| Systematic review
The search resulted in 799 hits in Scopus and 594 in Web of Science. After removing 143 duplicates, 1250 records were screened. The title/abstract screening resulted in 103 papers eligible for full-text screening. The full-text screening yielded seven papers to include for data extraction. The complete flowchart of the literature review (based on the PRISMA statement 33) is shown in Figure 1.
Out of the seven modeling papers, three investigated lung cancer screening with CT, 34-36 two mammography screening, 37,38 one FOBT testing, 39 and one mammography and sigmoidoscopy 9 (Table 1). Four papers used a simple mathematical calculation to estimate the effect of screening, one used a Markov model, one applied patient-level microsimulation, and in one case the study design was not clear. Four papers studied European populations, two the US population, and one the Australian population. Most of the included papers did not report whether the all-cause mortality reduction was significant, and the reductions were small: 1.4%-3.6% for lung cancer, 0.4%-1.8% for breast cancer, and 0.5%-1.2% for colorectal cancer (Table 1).

| Lung cancer
In the control arm there were 96 lung cancer deaths per 1000 high-risk participants after life-time follow-up (Table 2), compared to 76 in the screen arm (17% less). The maximum difference in all-cause mortality was 10 deaths after 15 years. In total, 1000 high-risk participants in the screened arm lived 195 years longer (on average 71 days per participant, or 9.8 life-years saved per lung cancer death prevented). A significant difference in lung cancer mortality could be shown after 16 years of follow-up for a sample size of 2000 high-risk people in each arm (Figure 2). With larger sample sizes, a significant difference could be found after 3 years. To show a significant effect on all-cause mortality, 11-13 years of follow-up and a minimum of 40 000 high-risk persons in each arm were needed.

| Breast cancer
There were 29 breast cancer deaths per 1000 women in the control arm after life-time follow-up (Table 2) and 22 in the screen arm (24% less). The maximum difference in all-cause mortality was four deaths after 25 years. In total, 1000 women in the screened arm lived 88 years longer (on average 32 days per woman, or 12.6 life-years saved per breast cancer death prevented). In the simulated cancer trial, 6000 women in each arm were needed to show a significant difference in breast cancer mortality after 21 years of follow-up (Figure 3). With increasing sample size, a significant difference could be shown after 3 years. A significant difference in all-cause mortality could be expected between 16 and 26 years of follow-up with a sample size of more than 300 000 women in each arm.

| Colorectal cancer
In the control arm there were 23 colorectal cancer deaths per 1000 participants after life-time follow-up (Table 2), compared to 18 in the screen arm (22% less). The maximum difference in all-cause mortality was two deaths after 10-20 years. In total, 1000 participants in the screened arm  (Figure 4). With larger sample sizes, a significant difference could be found after 4 years. To show a significant effect on all-cause mortality, 12 years of follow-up and a minimum of 600 000 people in each arm were needed. An example of the confidence intervals of the rate ratios of cancer-specific mortality and all-cause mortality is presented in Appendix 3.

| DISCUSSION
The results show that cancer screening trials are potentially able to demonstrate a significant reduction in all-cause mortality due to screening, as long as the sample sizes of the trials are very large. Depending on the type of cancer, 40 000 to 600 000 participants per arm are needed to demonstrate a significant reduction. Timing is also important: for the smallest possible sample sizes, a significant effect can only be demonstrated between 11 and 20 years of follow-up. Besides differences in the natural history of the cancers, differences in screening ages, screening intervals, and the improvement in prognosis due to screening also influence the required sample size. The model predictions were close to the predictions found in the literature review.
The differences in results between the three cancer types relate to the natural history of the cancers: the incidence level, lead-time, and survival. A lung cancer screening trial that includes high-risk individuals has the most potential to demonstrate a significant effect on all-cause mortality at reasonable sample sizes. This is because of the high incidence of the disease and the low survival rate. In addition, lung cancer is generally fast-growing and has a short lead-time, so a significant effect can already be demonstrated after a few years. In contrast, colorectal cancer grows more slowly, the lead-time is longer, and the survival is higher. Therefore, the sample size needs to be much larger and the follow-up longer.
In most cases, the required sample size exceeds the sample sizes of the trials that have been performed: breast cancer screening trials had between 20 000 and 80 000 participants, 3 lung cancer screening trials between 2400 and 54 000 participants, 13,40 and colorectal cancer screening trials between 30 000 and 180 000 participants. 11 Therefore, it is not surprising that a reduction in all-cause mortality has been found in just one lung cancer screening trial so far. It would be unrealistic to require that cancer screening trials lead to a reduction in all-cause mortality, given that their primary aim is to evaluate the potential to reduce cancer-specific mortality. However, other-cause mortality should be carefully monitored in screening trials, to assure that screening does not increase all-cause mortality. 6,41 Screening can increase all-cause mortality when the screen test can lead to complications (e.g., colonoscopy), when the treatment has complications, or when people who are screened maintain an unhealthy lifestyle due to a "health certificate effect" (e.g., smokers who continue smoking after a negative CT scan). A meta-analysis of the breast cancer screening trials showed that the all-cause death rate was not significantly reduced by screening and that screening did not induce excess mortality. 42
In the hypothetical colorectal cancer trial a difference in all-cause mortality could be found using 600 000 participants in each arm, whereas the meta-analysis of four flexible sigmoidoscopy trials of 458 000 participants already found a statistically significant effect on all-cause mortality (RR 0.975; 95%CI: 0.959-0.992). 20 This difference in required sample size may be related to characteristics of the four trials (e.g., target age, life expectancy, cancer incidence, all-cause mortality correction) that were not fully taken into account in our simulation of an average trial. Another explanation is that the meta-analysis found a significant result even though it was underpowered to detect it. Since many countries have implemented FIT screening, we also simulated a colorectal cancer screening trial using biennial FIT screening for the ages 55-75. The results are very similar to those of the simulated flexible sigmoidoscopy trial (Appendix 2).
A limitation is that we did not take a healthy screenee effect into account, which may lead to a smaller difference in all-cause deaths. Also, the breast and lung cancer models did not include death due to cancer treatment. When more cancers are detected in the first years of a screening trial, or due to overdiagnosis, more deaths due to treatment are expected, especially for lung cancer patients, who often suffer from co-morbidities.
Other cancers that could be included in this analysis are cervical and prostate cancer. The mortality of cervical cancer is probably too low in Western European countries to demonstrate a significant effect of screening on all-cause mortality. Also, there have been no trials for cervical cancer screening. In prostate cancer, the mean age of dying from the disease is high. Therefore, it is not expected that an effect on all-cause mortality can be found after the required follow-up.
Another limitation is that we used fixed attendance rates. Although we chose these attendance rates based on existing screening trials or programs, other attendance rates are possible and will influence the required sample size. All three models used a cure rate, in which the time of death from the cancer cannot be postponed by screening, which may lead to an underestimation of the cancer-specific mortality in the last years of follow-up.
For most years of follow-up in the simulated trials the difference in cancer-specific deaths between the screen arm and control arm was larger than the difference in all-cause deaths. An explanation is that some of the subjects whose cancer death is prevented will die within the same 5-year period from other causes. This probability of dying from other causes increases with increasing age.
Although there are only small differences in all-cause deaths between the arms in most follow-up years, there are large differences in the life-years gained. The model simulations showed that, depending on the cancer, 41-195 life-years per 1000 participants are gained, which is equal to 8-12 life-years gained per cancer death prevented. The natural history of the disease is important: the younger the age at diagnosis, the more life-years can be gained. However, life-years gained after life-time follow-up have never been measured in screening trials and can only be derived by modeling. In our systematic review, the majority of modeling papers that did not report all-cause mortality did report estimated life-years gained as a result of screening.
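The life-years figures can be checked with simple arithmetic on the rounded values reported in the Results for lung and breast cancer. The sketch below is only a consistency check under the assumption of a 365-day year; because the inputs are rounded, the outputs match the reported values only approximately. The function names are hypothetical.

```python
def days_per_participant(life_years_per_1000, days_per_year=365):
    """Average days of life gained per participant."""
    return life_years_per_1000 / 1000 * days_per_year

def life_years_per_death_prevented(life_years_per_1000,
                                   deaths_prevented_per_1000):
    """Life-years gained for each cancer death prevented."""
    return life_years_per_1000 / deaths_prevented_per_1000

# (life-years gained, control-arm deaths, screen-arm deaths) per 1000,
# rounded values as reported in the Results.
reported = {
    "lung": (195, 96, 76),
    "breast": (88, 29, 22),
}

for cancer, (life_years, control, screen) in reported.items():
    days = days_per_participant(life_years)
    per_death = life_years_per_death_prevented(life_years, control - screen)
    print(f"{cancer}: {days:.0f} days per participant, "
          f"{per_death:.1f} life-years per death prevented")
```

The calculation shows why life-years gained can be substantial even when the difference in all-cause deaths is small: each prevented cancer death corresponds to roughly a decade of additional life.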
A strong point of this analysis is that the models used to evaluate each cancer screening trial are all MISCAN models, which means the models have comparable structures and assumptions, although the models are calibrated to various data sources and levels of evidence. The required sample size is often calculated using existing statistical sample size formulas. 9 However, because of lead-time and overdiagnosis, screening trials are too complex to calculate the reduction in cause-specific mortality for each year of follow-up without complex models.
In conclusion, cancer screening trials are in theory able to demonstrate a significant reduction in all-cause mortality due to screening, but would require sample sizes that are larger than those of most trials that have been performed so far. Therefore, statements on all-cause mortality reductions due to screening cannot be based on present cancer screening trials. In addition, a reduction in all-cause mortality can only be demonstrated between specific years of follow-up.