Evaluating cancer screening programs using survival analysis

The main purpose of cancer screening programs is to provide early treatment to patients that are diagnosed with cancer on a screening test, thus increasing their chances of survival. To test this hypothesis directly, one should compare the survival of screen‐detected cases to the survival of their counterparts not included to the program. In this study, we develop a general notation and use it to formally define the comparison of interest. We explain why the naive comparison between screen‐detected and interval cases is biased and show that the total bias that arises in this case can be decomposed as a sum of lead time bias, length time bias, and bias due to overdetection. With respect to the estimation, we show what can be estimated using existing methods. To fill in the missing gap, we develop a new nonparametric estimator that allows us to estimate the survival of the control group, that is, the survival of cancer cases that would be screen‐detected among those not included to the program. By joining the proposed estimator with existing methods, we show that the contrast of interest can be estimated without neglecting any of the biases. Our approach is illustrated using simulations and empirical data.

CSPs have the potential to affect the survival probability of participants through various mechanisms.These mechanisms include, but are not limited to, early treatment, screening technology, and raised awareness of cancer symptoms.In this paper, we focus on the effect of early treatment and hence assume that the effect of other factors is negligible.Based on this assumption, only the survival probability of patients with cancer detected on the screening tests can be affected by the CSP.We wish to know how the survival probability of screen-detected cases differs from the survival probability of their counterparts not in the program who would receive delayed treatment prompted by the symptomatic detection of the disease.This difference (contrast) is recognized as the contrast of interest in this work.
So far, authors used natural language to define the contrasts of interest which may seem to be easy to understand but can often fail to clarify the details.Formal notation can be found in some methodological articles (e.g., Isheden & Humphreys, 2022;Lee et al., 2018;Strandberg & Humphreys, 2019) but is usually limited to the explanation of the proposed model.To our knowledge, a general notation is missing in this field of research making the literature hard to compare.Therefore, the first aim of our study is to develop a comprehensive notation that could be used to define different contrasts and other concepts that surround CSPs.
To estimate the contrast of interest, ideally a randomized trial should be performed in which screen-detected patients would be randomly divided into two groups: the experimental group would receive treatment at the time of screen diagnosis, while the control group would receive treatment later, at the onset of symptoms (Saha et al., 2021).We will refer to this experiment as the target trial (Hernán & Robins, 2016).To estimate the effectiveness of early treatment, we could compare the survival probability between the two groups.Ethical constraints prevent the realization of the target trial, hence one must rely on observational studies.Based on the data from observational studies, however, all screen-detected patients are invited for early treatment.Since the appropriate control group in this case does not exist, we can no longer directly (without any additional assumptions or modeling) estimate the contrast of interest.
Various surrogates for the control group have been proposed in the literature.Authors often use cancers diagnosed based on symptoms in the interval between the two consecutive screening tests (referred to as "interval cancers") as a control group (Fonga et al., 2014;Hamashima et al., 2015;Hofvind et al., 2016;Irvin et al., 2020;Lawrence et al., 2009;Niraula et al., 2020;O'Brien et al., 2018).Interval cancers present an attractive option for the control group because their treatment is prompted by the symptomatic detection of a disease, are equally affected by healthy volunteer bias as screendetected cancers (Duffy et al., 2008;Lawrence et al., 2009) and are easy to acquire (they emerge as a by-product of CSP).However, the comparison between screen-detected and interval cancer proves to be importantly biased.Previous authors have recognized three different biases that arise in this case, referred to as lead time bias, length time bias, and bias due to overdiagnosis (Baker et al., 2002;Cole & Morrison, 1980;Cox & Sneyd, 2013;Duffy et al., 2008;Gates, 2001;Kramer & Croswell, 2009;Walter & Stitt, 1987).Although the three biases are well-known in the literature they have not yet been precisely defined using formal notation.The second aim of our study thus is to formally define them and to explain the difference between the naive comparison and the contrast of interest.
Several approaches have been proposed with the aim to correct for some of the biases.Parametric models can be used to correct for lead time bias (Abrahamsson et al., 2020;Chen et al., 2012;Duffy et al., 2008;Lee et al., 2018;Mahnken et al., 2008;Walter & Stitt, 1987;Wu et al., 2007) and to reduce length time bias (Chen et al., 2012;Cucchetti et al., 2016;Hofvind et al., 2016;Mahnken et al., 2008).Some authors tried to reduce length time bias by using other control groups (Allgood et al., 2011;Kalager et al., 2009;Lawrence et al., 2009;Tsuruda et al., 2021;Walter & Stitt, 1987).However, to our knowledge, no existing approach can fully avoid the three biases.Therefore, the final aim of our study is to show how the proposed notation can be used to split the complex problem of estimating the contrast of interest into several simpler building blocks.With respect to the estimation, we show which parts can be estimated using existing methods (here we lean on existing parametric models that can be used to estimate lead time) and develop a new approach for the estimation of the so far missing part-the survival function of the control group.As a side effect of our approach we also show how one might estimate the three biases.The proposed method requires data of cancer cases invited to the program as well as data of cancer cases not invited to the program.Our approach is illustrated on a simulated and real data set from breast CSP.
The present paper is organized as follows.In Section 2, we explain the proposed notation and describe the assumptions made in this paper.In Section 3, we use the notation to define the contrast of interest and the three biases.The estimation procedure is explained in Section 4. In Sections 5 and 6, we illustrate our approach using simulated and real data, respectively.In Section 7, we discuss which problems have been addressed in this work and which remain.Section 8 concludes this paper.
F I G U R E 1 Four subjects, followed from age  = 50 who were screened every 2 years ( = 2; full adherence is assumed).Subject 1 was diagnosed with cancer in the interval between the two screening tests and died of cancer.Subject 2 was diagnosed with cancer on screening test and died of cancer, Subject 3 was censored before dying.Subject 4 died from other causes before being diagnosed with cancer.Subjects 1, 2, and 3 were diagnosed with cancer in the study period, their   = 1, while Subject 4 was not diagnosed with cancer in the study period, his   = 0.

The notation of potential outcomes
In this work, invitation to the program, denoted by  (0 = not invited to the program, 1 = invited to the program), is the intervention of interest.Subjects invited to the program are offered screening and receive early treatment if their cancer is detected on the screening test.The key events of interest will be cancer diagnosis and death, the distribution of the time elapsed between the times of these two events is given by the survival function.Both diagnosis and cancer death might occur at different times (ages) if subject is or is not invited to the program.To define estimands of interest, we adopt the notation of potential outcomes (Goetghebeur et al., 2020;Hernan & Robins, 2020).This notation enables us to observe each subject in two worlds: in world 0 (denoted by  = 0) we will assume that the subject is not invited to the program and in world 1 (denoted by  = 1) we will assume that the subject is invited to the program.Formally, for any outcome , let  =1 be the potential outcome that would have been observed had the subject been invited to the program, and  =0 the potential outcome that would have been observed had the subject not been invited to the program.In reality, we can only observe each subject in one world.

Observable quantities
We assume that cancer screening starts at age  and is done at equal intervals of length .In general, the population of interest includes all cancer cases as well as no cancer cases that meet eligible criteria to participate in the program.An example with this notation is presented in Figure 1.To improve clarity, we simplify the notation by assuming that subjects invited to the program fully comply with the program and continue with the screening until they are diagnosed with cancer, die from other causes, or are censored.These assumptions can be easily relaxed, and the notation as well as the proposed estimator can be adjusted accordingly to account for real-world scenarios (see Appendix A).

Events
Two events are of interest: cancer diagnosis and death.We measure age at each event.Censoring might preclude observing these events.

Observed detection mode
Subjects invited to CSPs ( = 1) can be divided into three groups based on their mode of cancer detection: • No cancer cases On the other hand, subjects not invited to the program ( = 0) can be divided into two groups: The observed detection mode is denoted by   .An example with four subjects invited to the breast CSP ( = 1) illustrates this notation in Figure 1.

Assumptions
Any statistical analysis or model, relies on certain assumptions, many of which are implicit.By clearly stating our assumptions, we ensure transparency and enable critical evaluation of our methodology.In this paper, we assume that CSP may change course of life only for screen-detected cancers as only these cases receive early treatment.As for the other cases invited to the program (interval and no cancer cases), we assume that their course of life would remain unchanged as if they were not invited to the program.In addition, we assume that early treatment may only delay death from cancer and does not cause any harm in the presymptomatic phase.These assumptions can be more formally written as follows: (  The usual assumptions of causal inference (causal consistency, positivity, and no interference; see Goetghebeur et al., 2020;Hernan & Robins, 2020) are also made in this paper.Note that based on assumptions AS1 and AS2 random variables   ,   , and   have the same value in both worlds.To avoid redundancy in such cases, we write them concisely as:

Hypothetical quantities
This section introduces notation for the quantities that can only be defined by considering information from both worlds.That means they cannot be observed in reality and are thus referred to as hypothetical quantities.As they are derived by merging both potential outcomes of an individual they are by definition equal in both worlds.

Overdetection
Since  =1 ≤  =0 (see assumptions AS3 and AS5) it follows that some cases might be diagnosed with cancer in world 1 ( = 1) but not in world 0 ( = 0), because they are lost from follow-up in between the two times.These cases are referred to as overdetected cases and are defined formally by the following indicator function equal to 1: Overdetection can occur due to: • overdiagnosis: min(  ,   ) =   ;  =1 ≤   <  =0 …the subject dies of other causes before being diagnosed with cancer in world 0 • censoring: min(  ,   ) =   ;  =1 ≤   <  =0  …the subject is censored before being diagnosed with cancer in world 0 Note that if a case is overdetected due to censoring he/she might still benefit from early treatment (this is not the case if a person is overdetected due to overdiagnosis).However, this improvement cannot be observed in practice due to censoring 1

Hypothetical detection mode
Next, we define hypothetical detection mode ( ℎ ) which allows us to differentiate between different cases by merging information from both worlds: (1) The relationship between observed and hypothetical detection modes is summarized in Table 1.
The hypothetical detection mode is in fact a time-dependent variable ( ℎ ()).Before time  the hypothetical detection mode is equal to 0  ℎ (0) = {0}) for all subjects in the study.Over time, this value changes for some subjects.To simplify the notation, we have omitted the dependence on time in this study.Note, however, that when conditioning on this variable 1 Although the term "overdetected" typically refers to cancer cases with tumors that do not progress or progress too slowly to cause symptoms or harm during a person's remaining lifetime (Brodersen et al., 2018), we have extended this definition in our study to also include cases with potentially aggressive tumors that would otherwise (in the absence of screening) remain undetected due to censoring.We acknowledge that our use of the term "overdetected" to encompass these cases may deviate somewhat from the strict definition used in the medical literature.However, we believe it is justified by the fact that these cases, like other overdetected cases, remain clinically concealed during the study period (i.e., until censoring occurs).
Based on the assumption AS3 it equals 0 for interval cases and is not defined for those without cancer.Since lead time is a hypothetical quantity it can also be defined for overdetected cases ( ℎ ∈ {2}).

Notation summary
The random variables defined in this study are presented in Table 2.
In Figure 2, we further illustrate our notation using data from Figure 1.For the purpose of this example, we will assume that outcomes in both worlds are known although in reality each subject can only be observed in one world.Using information from both worlds it is possible to determine the hypothetical detection mode ( ℎ ) for each subject.Note that age at death is delayed for Subject 2 in world 1 compared to world 0 due to the presumed benefit of early treatment provided in world 1.Other subjects do not benefit from CSPs under assumptions made in Section 2.3.

Contrasts
First, we formalize the question of interest, that is, the contrast that describes the causal effect of early treatment.As has been discussed in Section 1, we focus on screen-detected patients ( ℎ ∈ {2, 2}) as only these cases are presumed to receive early treatment.We wish to compare their survival probability in world 1 (early treatment) to world 0 (delayed treatment).The time to death can be measured either from screen diagnosis ( =1 ) or from symptomatic diagnosis ( =0 ).
In line with previous authors (Abrahamsson et al., 2020;Duffy et al., 2008;Walter & Stitt, 1987;Xu & Prorok, 1995), we have decided to measure time to death from symptomatic diagnosis-the most obvious choice, as any positive effect of The four subjects from previous example observed in world 0 and in world 1. Age at diagnosis and age at death is the same in both worlds for interval cases (Subject 1) and no cancer cases (Subject 4).Screen-detected cases (Subject 2 and 3) are diagnosed with cancer at an earlier time in world 1 compared to world 0.
early treatment can only be observed after this time.The contrast of interest, denoted by (), is expressed as the risk difference between two survival functions and can be formally written as follows: … the difference in survival probability of screen-detected cases between world 1 and world 0 (time to death is measured from diagnosis in world 0).The subgroup of patients with  ℎ ∈ {2} is excluded from () since time to death is measured from symptomatic diagnosis.Note that both survival functions are hypothetical and cannot be observed in reality.
Based on the observed data of patients invited to the program, one might compare the survival probability between interval and screen-detected cases and naively expect this to be a close approximation to ().We denote this contrast as () (naive risk difference) and write it formally as: … the difference in survival probability between screen-detected and interval cases.Note that in this case one can estimate both survival functions using the Kaplan-Meier estimator (Kaplan & Meier, 1958).The estimated difference in survival probability is denoted by n().

Biases
In general, () ≠ ().Therefore, it follows that n() is a biased estimator of ().Previous authors (Baker et al., 2002;Duffy et al., 2008;Walter & Stitt, 1987) have explained that biased estimates arise due to the combined effect of lead time bias, length time bias, and bias due to overdetection.In this paper, they are denoted by  1 (),  2 (), and  3 (), respectively.Here, we provide a formal definition of each bias: Lead time bias: …survival probability of screen-detected cases might be higher if time to death is measured from screen-diagnosis rather than from symptomatic diagnosis.
Length time bias: … interval cancers might have different baseline characteristics (e.g., tumor growth speed, breast size, tumor location) compared to screen-detected cancers.If these characteristics also affect patients' survival, the two groups might have different survival probabilities even in world 0. Bias due to overdetection: … Overdetected cases might have different survival probability compared to other screen-detected cases.The size of this bias also depends on the proportion of overdetected cases.Note that the magnitude of each bias may change over time.In total, only six survival functions are required to define the three biases and the two contrasts.For reference, see Table 3.

Bias decomposition
In this section, we show that the total bias that arises when contrast () is estimated using estimator n() can be decomposed as a sum of lead time bias, length time bias, and bias due to overdetection.First, notice that  2 () =  3 () based on the assumptions AS1, AS2, AS3, and AS4.Next, we show that the total bias can be expressed as a simple sum of  1 (),  2 (), and  3 (): In Equation ( 7), we have shown that there is no overlap between the three biases commonly mentioned in the cancer screening literature and hence the effect of each bias can be studied in isolation.We have also shown that there is no residual (unexplained) bias beyond these three; by subtracting each bias from () it is theoretically possible to calculate ().

ESTIMATORS
To present the main ideas clearly, we assume throughout this section that the allocation to  is random.If the assignment to  is not random, the population of interest for a given contrast must be chosen (e.g.,  = 1) and the control group (e.g.,  = 0) must be reweighted to match the population of interest (see Section 6.2 for an example of how this can be done in practice).

What we can estimate so far
In order to estimate quantities defined in Section 3, we need to estimate six survival functions (see Table 3).Three survival functions,  1 (),  2 (), and  3 (), are equal to survival probabilities referring to observable quantities: The equality in Equation ( 10) holds based on assumptions AS3 and AS4.The three survival functions can thus be directly estimated based on the observed data using standard statistical methods (e.g., Kaplan-Meier estimator).Next, to estimate  4 () and  5 () we first show: Which means that to estimate either survival function we first need to estimate lead time.This can be done using existing parametric methods (Abrahamsson et al., 2020;Chen et al., 2012;Duffy et al., 2008;Lee et al., 2018;Mahnken et al., 2008;Walter & Stitt, 1987;Wu et al., 2007).Once we obtain a person-specific predicted lead time it is possible to estimate  4 () and  5 () using the Kaplan-Meier estimator.Whether the obtained estimates are truly unbiased depends on the right specification of the parametric model.
One survival function remains to be estimated.To our knowledge, an estimator of  6 (𝑡) has not yet been proposed in previous literature.This is the last survival function that needs to be estimated to allow for estimation of () and  2 () (see Table 3).The aim of this paper is to fill this gap (see Section 4.2).

4.2
Filling the gap

Basic idea
We wish to estimate  6 (), the survival probability screen detected cases ( ℎ ∈ {2}) would have had they received delayed treatment as in world 0, with time to death measured from symptomatic diagnosis.The concept underpinning the proposed method is sketched in Figure 3. Here, same-colored blocks indicate which groups can in practice be observed meaning that it is possible to estimate directly (using empirical data) both their survival function and their proportion among other groups (note that Figure 3 is not to scale, and therefore does not attempt to represent the true proportions).
The observed survival function of cancer patients in  = 0 (purple block) is a weighted average of the survival probability of a hypothetical group of cases with  ℎ ∈ {1} (would be interval in world 1) and  ℎ ∈ {2} (would be screen-detected in world 1).Thus, by knowing the survival probability of cancer cases not invited to the program (which can be estimated based on the data from group  = 0) and the survival probability of the  ℎ ∈ {1} group as well as the appropriate weight (both quantities can be estimated based on the data from group  = 1 under assumptions made in Section 2.3), we can deduct our quantity of interest: the survival probability of  ℎ ∈ {2} cases with delayed treatment.
F I G U R E 3 Graphical depiction of available information (see text).The same-colored blocks represent the parts of the data that are observed as a single group.The dotted lines split the blocks into the groups according to the  ℎ (the sizes of the blocks do not attempt to represent the true proportions).

Formal derivation
To estimate  6 () = ( =0 −  =0 > | ℎ ∈ {2}, we first show (by the law of total probabilities): where  is the proportion of cancers that would be screen-detected in world 1 relative to all of the cancers detected in world 0: Next, we show that two of the three survival probabilities in Equation ( 13) equal probability of observable random variables: and hence can be estimated using standard statistical methods (e.g., Kaplan-Meier estimator).To estimate  (see Equation 14), we first show that: where ( ℎ ∈ {2, 2}| =1  = 1) = (  ∈ {screen-detected}|  = 1,  = 1) and can be thus directly estimated based on the observed data.The second probability in Equation ( 17), ( ℎ ∈ {2}| =1  ), corresponds to the proportion of overdetected cases.It can be shown that which means that the proportion of overdetected cases can be estimated nonparametrically, by comparing cancer incidence between the two groups ( = 0 and  = 1), or by estimating person-specific lead time based on parametric models.The first approach is also known as excess incidence approach (Etzioni et al., 2013;Lee & Etzioni, 2017;Ripping et al., 2017).
Once the two survival functions and the weight are estimated, the only unknown quantity left in Equation ( 13) is ( =0 −  =0 > | ℎ ∈ {2}) =  6 ().By rearranging terms, we can show that  6 () can be estimated as: The derivation of the proposed estimator of  6 () has been so far done under the assumption that subjects in  = 1 fully adhere to the program and are screened indefinitely, that is, until they are diagnosed with cancer, die from other causes, or are censored.Nevertheless, the estimator can also be derived without these assumptions.The generalized derivation, taking into account relaxed assumptions, can be found in the Appendix (see Appendix A).

SIMULATION STUDY
A simulation study was performed to illustrate the difference between the two contrasts, () and (), to show the decomposition of the bias (see Section 5.2.1), and to evaluate the proposed estimator (see Section 5.2.2).Simulations enable us to produce hypothetical data where each subject can be observed in world 0 as well as in world 1.

Data simulation process
The simulation has been designed to mimic the implementation of breast CSP.We simulated subjects as having been born uniformly from January 1, 1958 to January 1, 1959.Date of death due to other causes was simulated using Slovenian population mortality tables for women, and all subjects were censored on January 1, 2020.The CSP was assumed to have been implemented on January 1, 2008 with screening done every 2 years for all participants older than 50 years.Age at tumor onset was generated using age-specific breast cancer incidence rates obtained from Cancer Research UK (CancerResearchUK, 2023) that were subsequently shifted by 5 years toward lower ages.Other cancer-related data were generated using the biological tumor growth model (Abrahamsson et al., 2020;Isheden & Humphreys, 2019, 2022;Strandberg & Humphreys, 2019).This model consists of three submodels which, in turn, describe three continuous processes: tumor growth, time to symptomatic detection, and screening sensitivity.Following the procedure described in previous literature (Abrahamsson et al., 2020), we first simulated the inverse tumor growth rate for each subject, which is a latent factor that determines the speed of tumor growth.Based on this factor, we simulated the age at screen diagnosis (in the presence of biennial screening) and the age at symptomatic diagnosis (in the absence of screening) for each subject according to the specified models.We also calculated the tumor diameter for each subject at both points in time.
Next, we observed each subject in two worlds.In world 1 ( = 1), subjects were assumed to fully comply with the program while in world 0 ( = 0) subjects were assumed to not take any part in the program.In world 0, age at cancer diagnosis was set to age at symptomatic diagnosis and in world 1 it was calculated as the minimum of age at symptomatic and screen diagnosis.To determine the observed detection mode (  ), we combined the information from the date of death, date of censoring, age at symptomatic diagnosis, and age at screen diagnosis.The hypothetical detection mode ( ℎ ) was determined by merging information from both worlds; subjects that were diagnosed with cancer in world 1 but not in world 0 were flagged as overdetected cases (see Equation 1).
As a result of the models on which the data were simulated, significant differences emerged in the distribution of cancer growth rate and tumor size at the time of cancer diagnosis with respect to the  ℎ groups.To illustrate these differences, we simulated a large data set (default parameters were used in the simulation) to obtain estimates that closely approximate the true values.The results are displayed in Figure 4.As anticipated, interval cancers ( ℎ ∈ {1}) exhibited a more rapid growth rate on average compared to screen-detected cancers ( ℎ ∈ {2, 2}).In addition, in world 1 they were, on average, larger in size at the time of diagnosis.Note that tumor size at cancer diagnosis is shifted toward lower values in world 1 compared to world 0 for  ℎ ∈ {2} cases as these cases are detected at an earlier stage of a disease due to screening.For a more detailed discussion of the obtained simulation results, please refer to the study by Abrahamsson et al. (2020).
For the final step in our simulation, we sampled time from symptomatic detection in the absence of screening to death due to cancer from an exponential distribution.The parameter of the exponential distribution (Λ) was calculated for each subject based on the tumor growth rate and tumor size at the time of treatment, as described in Abrahamsson et al.

F I G U R E 5
The distribution of Λ values used to generate time from symptomatic detection in the absence of screening to death due to cancer for groups  ℎ ∈ {1} and  ℎ ∈ {2} under  0 and  1 for each world.The distribution of Λ is not shown for  ℎ ∈ {2} cases, as these cases are censored or have died prior to symptomatic detection that would have occurred in the absence of screening.
(2020).This resulted in each case () having a unique value of Λ  that influenced their time to death.Time to death was simulated under the null ( 0 ) and alternative ( 1 ) hypothesis.Under  0 , cancer was assumed to be treated at  =0 in both worlds, leading to the unique Λ  value for each case, which was identical in world 0 and in world 1 (the distribution of Λ can be observed in Figure 5).The distribution of Λ, however, varies between the  ℎ ∈ {1} and  ℎ ∈ {2} cases (see Figure 5) since the two groups have different distributions of tumor size and tumor growth rate (as illustrated in Figure 4).Consequently, even under  0 , the two groups have different probabilities of survival.Under  1 , the cancer was assumed to be treated at  =1 in world 1 and at  =0 in world 0. Because Λ parameter was made dependent upon tumor size at which cancer was treated screen-detected cancers had lower Λ  values in world 1 compared to world 0 under  1 as their tumor size was smaller at  =1 compared to  =0 (see Figures 4 and 5).Consequently, under  1 , the survival probabilities for screen-detected cases are higher in world 1 compared to world 0. Finally, we calculated age at death as the minimum of age at death due to other causes and age at death due to cancer and recorded censoring status for each subject.

Scenarios
To illustrate that the magnitude of each bias is strongly dependent upon the properties of the CSP and cancer site in question, we simulated data under four different scenarios: • Scenario (0) (Default scenario) • Scenario (1) (Increased sensitivity to symptoms): parameter  was changed so the cancer could be detected based on symptoms at an earlier age ( =  −5 ).• Scenario (2) (Older population): screening was done for participants older than 70 years.Date of birth was simulated uniformly from January 1, 1938 to January 1, 1939.• Scenario (3) (Fast growing cancer): the cancer grew on average at a faster rate ( 1 = 1.36).
Note that scenarios (1), (2), and (3) do not aim to replicate real-life data.Instead, the parameters are set to relatively extreme values as to demonstrate the importance of each bias.
The large data set, generated using default parameters under scenario (0), was used once again to calculate descriptive statistics for the selected observed variables (see Table 4).Note that the 5-year survival was calculated for all cancer cases with time zero set at cancer diagnosis in the respected world.Due to lead time bias and bias due to overdetection, the 5-year survival probabilities are higher in world 1 compared to world 0 even under  0 .
Various sample sizes are employed in our simulation.In Sections 5.1.1,5.1.2,and 5.2.1, a large sample size of  = 10 7 (approximately 250,000 cancers) was used to approximate the true (population) values.Conversely, in the part where the proposed estimator is evaluated (Section 5.2.2), smaller sample sizes are applied to assess its performance under a range of practical circumstances.

Bias decomposition (illustrative example)
In Figure 6a, we present the six survival functions (see Table 3) calculated under the scenario (0) using a large simulated data set.These survival functions are then used to calculate contrasts () and (), as shown on Figure 6b.In line  with the simulation setup, contrast () (black curve), which represents the true effect of the CSP, is equal to zero under  0 and is larger than zero under  1 .This implies a positive effect of early treatment.Contrast () (red curve), defined as the difference in survival between screen-detected and interval cases, is greater than zero even under  0 , illustrating that this contrast is an inadequate proxy of () since it might show a (possibly large) positive effect even when none exists.The area between () and () in Figure 6b is decomposed into the three biases (shaded regions), illustrating that the three biases fully explain the difference between () and ().The three biases vary in magnitude, and their contributions to the total bias change across time (note that the bias due to overdetection is negligible and cannot be easily observed on the graph due to its small size).Furthermore, note that the magnitude of lead time bias and bias due to overdetection change if data are generated under  1 compared to  0 since both biases are sensitive to changes in survival probabilities in world 1 (see Equations 4 and 6).Length time bias, on the other hand, remains the same in both scenarios (see Equation 5).Next, we present the results for the data generated under scenarios (1), (2), and (3) (see Figure 7).In this case, only the results generated under  0 were considered.Note that the magnitude of each bias varies across different scenarios.If subjects are more sensitive to their symptoms (Figure 7, scenario (1)), the duration of the preclinical screen-detectable phase (known also as sojourn time; Cheung et al., 2017) is shorter which in turn reduces the magnitude of the length time bias and lead time bias.If screening is offered to an older population (Figure 7, scenario (2)), more patients are overdiagnosed due to the increase in other-cause mortality, which increases the magnitude of bias due to overdetection.Note that this bias is negative in this case.Finally, if cancer is on average more fast growing (Figure 7, scenario (3)), we detect less cancers based on a screening test.The survival probability of screen-detected cancers remains almost the same because cancers that are detected on a screening test are in this case also more likely to be slow growing.Interval cancers, on the other hand, become even more fatal which increases length time bias.

F I G U R E 7
Bias decomposition for data generated under scenarios (1), (2), and (3).Shaded regions depict the decomposition of biases, with their areas corresponding to the magnitude of each bias.In instances where certain biases are negative, the shaded regions may overlap (see scenarios ( 1) and ( 2)).
F I G U R E 8 Estimated (100 repetitions) and true survival probabilities of  6 () for three different sample sizes.Estimates were obtained using the nonparametric estimator proposed in Section 4.2.

Evaluating the proposed estimator
For the second aim of the simulation study, we evaluated the proposed estimator using Monte Carlo simulations.We simulated data under scenario (0) of sample sizes of 2 * 10 5 (approximately, 2500 cancers), 4 * 10 5 (approximately, 5000 cancers), and 8 * 10 5 (approximately, 10,100 cancers).The simulation was repeated 100 times for each sample size.To mimic real data, subjects from the simulated data were randomly divided into two groups: in one group, subjects were observed as they would live in world 1; in the second group, subjects were observed as they would live in world 0. Data from alternative worlds were discarded.Based on the observed data, we estimated  6 () using the method proposed in Section 4.2.The proportion of overdetected cases was estimated nonparametrically using excess incidence approach (Equation 18).The large data set ( = 10 7 ) was once again used to estimate the true (population-based) values.The estimated survival functions were in the end compared against the true survival function to assess whether the proposed estimator is unbiased and to gain an estimation of its variance.
The results showed that the proposed estimator seems to be unbiased (see Figure 8).The estimator's standard error is the highest for the smallest sample size ( √ var( Ŝ6 ( 5)) = 0.013), and it decreases as the sample size increases twofold  5)) = 0.009) and fourfold ( √ var( Ŝ6 ( 5)) = 0.007).As time passes, more subjects are censored and the variance of the estimator increases.

EMPIRICAL EXAMPLE
In this section, we analyze data from the Slovenian breast CSP and estimate the effectiveness of early treatment provided to screen-detected cancers (()).The novel part here is represented by the estimation of the survival probability of the appropriate control group.Furthermore, we estimate the difference in survival between interval and screen-detected cases (()) and decompose the overall bias to demonstrate the importance of each source of bias.Note that the analysis serves only as an illustration of the proposed approach.The interpretation of our results is limited as the parameters of the model used to estimate lead time were estimated based on the Swedish data set (see Section 6.2.2).

Data set
The data for our empirical example were acquired from the Slovenian Cancer Registry (Zadnik et al., 2017) and from the registry of Slovenian national breast CSP (DORA, Jarm et al., 2021).The data set consists of all invasive breast cancers recorded in the Cancer Registry from 2008 to 2018 (code C50 according to ICD-10 classification) that were diagnosed in women between 50 and 70 years of age.Noninvasive breast cancers were excluded from the analysis.The end date of the study was April 5, 2022.For each subject, we acquired exact date of birth, date of cancer diagnosis, date of end of followup (date of death, date of lost to follow-up, or date of end of the study, whichever came first), and status at the end of the follow-up (alive, death due to other causes, death due to breast cancer, or lost to follow-up), and socioeconomic index (graded in five levels from most deprived to most affluent).
The breast CSP was introduced in Slovenia in 2008.Women aged 50-68 without any breast cancer history are invited to screening tests (mammography) every 2 years.Due to limited capacity and gradual rollout, only women from the central Slovenian region were invited to participate at first (in years 2008-2010).The program was gradually expanded to include all Slovenian municipalities by the year 2018.In this study, we distinguish between two groups of cancer cases: the first group consists of women that were invited to the program prior to being diagnosed with breast cancer ( = 1) and the second group consists of women diagnosed with breast cancer that had not been invited to the program ( = 0).Due to the gradual implementation of the program, we had data available for both groups throughout the whole study period.The group  = 1 comprised 2562 individuals (278 deaths), while the group  = 0 consisted of 4356 individuals (1005 deaths).A general overview of the data is presented in Figure 9.
Cancer cases invited to the program (group  = 1) were further divided into three groups based on their mode of detection: screen-detected cancers, interval cancers (less than 2.5 years from last negative screening test), and undeterminable cancers (more than 2.5 year from last negative screening test or never screened; this group is formally defined in Appendix A).We have extended the period during which cancer is defined as interval (from 2.0 to 2.5 years) as some participants may had postponed their appointments for a few months.For screen-detected cancers, we also collected data on tumor size (diameter in mm) as well as their entire screening history.Unfortunately, we could not acquire data on exact tumor size for interval cases.Descriptive statistics for subjects invited to the program are shown in Table 5.
There were two major differences between empirical and simulated data.The first difference involves the inclusion of a new group of undeterminable cases, which emerges because not all subjects participated in the program or adhered to it fully.As can be seen in Table 5, these cases have much lower probability of survival compared to interval cancers, and therefore, must be considered as a distinct group.We had to adjust our estimator of  6 () to account for this group as well (see Appendix A).The second difference was that the two groups  = 1 and  = 0 were not randomized in the real-world data, due to the uneven implementation of the program.For example, during the initial rollout, older women were prioritized, which may have resulted in a higher proportion of older women in the group invited to the program ( = 1).In addition, the two groups may differ with respect to their distribution of socioeconomic status, as more women invited to the program come from the central Slovenian region, where socioeconomic status tends to be higher.These and other potential confounders were therefore later controlled for in the analysis (see Section 6.2).

Data analysis
To estimate (), (), and the three biases, we must estimate the six survival functions defined in Table 3.As the two groups,  = 1 and  = 0, were not balanced, we had to define which population is of interest.Since the main goal of this analysis is to estimate the survival benefit provided through the Slovenian breast CSP we considered group  = 1 to be the population of interest.

6.2.1
Estimation of  1 (),  2 (), and  3 () First, based on the data from group  = 1 we estimated the survival functions  1 () to  3 ().No adjustment to account for confounding variable was needed in this case since group  = 1 was already chosen to represent our population of interest.Survival functions  1 (),  2 (), and  3 () were directly estimated using Kaplan-Meier estimator as described in Section 4.1.

6.2.2
Estimation of  4 () and  5 () Next, the biological tumor growth model (Abrahamsson et al., 2020) was used to estimate person-specific lead time which allowed for an estimation of  4 () and  5 ().Based on our data, we could not estimate the parameters of the model as we did not acquire any data on tumor size for interval cases (only tumor size data for screen-detected cases were available).Therefore, we used parameters estimated from a Swedish data set of a breast CSP (Abrahamsson & Humphreys, 2016;Abrahamsson et al., 2020) (see Section 5.1.1 for exact values).Following the approach proposed by Abrahamsson et al. (2020), we predicted lead time for each woman whose cancer was detected on a screening test based on her tumor size and screening history.For interval and undeterminable cases, lead time was set to be equal to zero.Once lead time was estimated we could predict age at which cases would be diagnosed with cancer if they were not included in the program ( =0 ).Screen-detected cases that were censored or died due to other causes prior to the predicted age were flagged as overdetected cases ( ℎ = {2}).The same was done if the predicted age was beyond 70 years as these cases would not be included in our study in world 0 (see Appendix A for details).Finally, we estimated the survival functions  4 () and  5 () as described in Section 4.1.The whole procedure was repeated 10 times and the results were averaged to ensure stable results.

6.2.3
Estimation of  6 () Finally, to estimate  6 () we used the estimator proposed in Section 4.2.2.The estimator had to be expanded to account for the additional group of undeterminable cases.To this end we combined interval and undeterminable cases into one group, as it was assumed that undeterminable cases, like interval cases, have the same probability of occurring in world 0 and in world 1 and have the same survival probabilities in both worlds (see Appendix A).Consequently, Equations ( 13), ( 16), and ( 19) had to be expended to include these cases as well (see Appendix A).Therefore, to estimate  6 () the following quantities had to be estimated (see Equation 19): • : Lead time approach was used to estimate the proportion of overdetected cases (see Equation 18), and consequently weight  (Equation 17), which means that no additional modeling to control for confounding was needed in this case since the weight was estimated using only the data from group  = 1.• ( −  > |  ∈ {interval, undeterminable},  = 1): Based on the data from  = 1 we have estimated the survival probability of interval and undeterminable cases using Kaplan-Meier estimator.• ( −  > |  = 1,  = 0): To account for nonrandom assignment, we included age at diagnosis, date at diagnosis, and proxy of socioeconomic status in the Cox model (Cox, 1972) and used this model to predict individual survival probability for cancer cases in group  = 0 (the relationship between hazard and age at cancer diagnosis was modeled using cubic splines).We then used the distribution of the covariates from cancer cases in group  = 1 to get the predicted marginal survival probability of cancer cases in group  = 0 that are comparable to cancer cases in group  = 1.Here, some care had to be taken as covariates for screen-detected cases are lead-time biased.Hence, we used the same lead times as generated to estimate  4 () and  5 () to obtain lead-time corrected values of age and date at diagnosis and to remove overdetected cases.
The estimated quantities (the weight and the two survival functions) are finally joined to get estimates of  6 () (Equation 19).The whole procedure was repeated 10 times and the results were averaged to ensure stable results.

Results
The results are presented in Figure 10a,b.As we can see in Figure 10b, () is positive, which suggests that screendetected patients that received early treatment have higher survival probability compared to their counterparts who received delayed treatment (their time to death measured from (hypothetical) symptomatic diagnosis).After 5 years, the difference in survival probability between the two groups stabilizes at approximately 5 percentage points ( = 1.06).The difference is smaller compared to the naive estimates obtained by comparing the survival probability between screendetected and interval cases.The magnitude of the bias associated with contrast () is especially large after 5 years.By decomposing the bias, we can see that the three biases are of different importance and are not constant over time.In our case, lead time is the most important source of bias during the first few years after cancer diagnosis.Length time bias increases over time and outweighs the other two.Our results therefore suggest that length time bias is substantial and should not be neglected.

DISCUSSION
The first aim of our study was to develop a comprehensive notation that would allow us to formalize different questions that may emerge when evaluating CSPs.The proposed notation is based on the notation of potential outcomes F I G U R E 1 0 Empirical results.(a) Six survival functions required to estimate contrasts and biases defined in this study (see Table 3).Note that  2 () =  3 () under assumptions AS3 and AS4.(b) The black and red step curves represent the contrasts () and (), respectively.Shaded regions depict the decomposition of biases, with their areas corresponding to the magnitude of each bias.(Goetghebeur et al., 2020;Hernan & Robins, 2020).This framework had previously been used to define the effect of early treatment on mortality (Baker et al., 2004;Saha, 2022;Saha et al., 2021).Here, we focused on the survival time.Two insights were crucial for developing the notation.The first thing to realize was that by considering both real and counterfactual worlds, one can divide subjects into four hypothetical subgroups which allowed us to define appropriate control group.
The second insight was that since the survival time is defined as the time elapsed between age at diagnosis and age at death both events should be treated as potential outcomes.The advantage of our notation is that many different contrasts can be defined using the same building blocks by merging the survival experiences of different hypothetical subgroups with different diagnosis times.Note, however, that it is essential to exercise caution and only use the information of hypothetical subgroups as known at the time of diagnosis.When evaluating CSP, different comparisons might be of interest.We focused on the comparison of survival probability of screen-detected cases as observed in real world to the survival probability of their counterparts not included to the program.We showed exactly how this contrast is different from the naive comparison between screen-detected and interval cases, and formally defined the three biases associated with the naive comparison (referred to as lead time bias, length time bias, and bias due to overdetection).In addition, we presented a proof that the total bias that arises with the naive comparison is equal to the sum of the three biases.This property is not self-evident and has been so far concealed through the use of natural language.The practical consequence of this finding is that each bias can be studied separately, for example, simulation studies can be used to uncover how different biological mechanisms and properties of the CSP affect each bias.Note, however, that the definition of each bias may change depending upon the contrast of interest and the naive estimator we compare it to.
Regarding estimation, we realized that based on existing estimators it is possible to estimate only one part of the contrast of interest-the survival probability of the exposure group (screen-detected cancers).In this study, formal notation was used to define the appropriate control group and to identify the building blocks needed to estimate their survival.We have shown how the estimation can be carried out using both parametric or nonparametric approaches (see Section 4.2), depending on the available data.A simulation study was conducted to demonstrate that the nonparametric estimator is unbiased.Whether the parametric version is also unbiased relies heavily on the correct specification of the model used to predict lead time.
Our approach requires data from subjects invited to CSP as well as data from subjects not invited to CSP.Such data can be obtained for example from a randomized experiment, from observational studies with staggered introduction of the CSP, or from observational studies where the comparison is made between prescreening and postscreening period.If data are collected from the observational study, it is important to consider whether subjects invited to the program are comparable to those not invited to the program.This might not be the case if subjects in one group have better access to health-care services (perhaps due to higher socioeconomic status) or if the two groups are observed in a completely different time period since treatment and care for cancer patients might improve over time.Comparable results can be achieved by collecting data on confounders which can be included in a model to balance the two groups as illustrated in the example.
Empirical data from the Slovenian breast CSP were used to illustrate our approach.The results show that when time to death is measured from symptomatic diagnosis screen-detected patients with early treatment have higher estimated survival probabilities compared to their counterparts who receive delayed treatment.The estimated difference in 5-year survival was approximately 5 percentage points.Based on our (illustrative) results, we could therefore say that early treatment provided through screening has small but positive effect on survival.Note that these estimates were importantly lower compared to the one obtained simply by comparing the survival probability between screen-detected and interval cases.A side effect of our approach is also the estimation of the biases.Our results suggest that the magnitude of length time bias is substantial and should not be neglected as has been sometimes implied in previous literature (Adami et al., 2017;Kalager et al., 2012).
To simplify the notation and to allow for estimation, several assumptions were made in this paper.The main assumption was that CSP may change the course of life only for screen-detected cases.The validity of this assumption may be questioned in some CSPs.The important advantage of our approach is that by clearly stating the assumptions in our work we enable the discussion on this topic.
Further work should be done to estimate the variance of the proposed method.The simplest approach would be to use bootstrap.Also, since lead time has so far been estimated using only parametric models an interesting avenue for future studies would be to estimate it nonparametrically, for example, by comparing the distribution of age at diagnosis between those invited to the program and those not invited to the program.By comparing the estimates obtained using parametric and nonparametric approach, one could further increase the validity of the obtained results.

CONCLUSION
In this work, we focused on estimating the effectiveness of early treatment provided by CSP using survival analysis.We developed a comprehensive notation that can be used to define various hypothetical quantities that emerge in this field.In our case, the notation was used to define the contrast of interest and the three biases that arise when comparing the survival probability between screen-detected and interval cases.With respect to the estimation, we developed a new estimator that allowed us to estimate the survival of the control group, that is, the survival of cancer cases that would be screen-detected among those not included to the program.The proposed estimator was evaluated using simulations.Finally, by joining the proposed estimator with existing methods we showed that based on certain assumptions the contrast of interest can be estimated without neglecting any of the biases and that each bias can be estimated separately.Our approach was illustrated using simulations and empirical data from Slovenian breast CSP.cancer, die from other causes or are censored.In this appendix, we relax both assumptions and adapt the notation as well as the proposed estimator accordingly to accommodate for more realistic scenarios.
As before, we will assume that cancer screening starts at age  and is done at roughly equal intervals of length .However, to better reflect reality, we now assume that subjects are only invited to screening tests until age  and are allowed to deviate from the screening protocol, potentially missing some tests or not participating in the program at all.To describe the number of screening tests for each subject, we use , and to denote their age at each screening test, we use   , where  ∈ 0, 1, … , .Note that  1 ≥  and   ≤ .We set  0 equal to  −  to simplify the notation.The number of screening tests and age at each test can vary among subjects.
Subjects invited to cancer screening program (CSP) ( = 1) can now be divided into four groups based on their mode of cancer detection: • No cancer cases (  = 0) • Screen-detected cancers: cancer cases detected on screening test (  = 1,  =   ) • Interval cases: cancer cases detected in the interval of length  following a previously negative screening test (  = 1,   <  <   + ) • Undeterminable cases: cancer cases for which time since last (negative) screening test is larger than , and cancer cases who do not attend screening tests at all (  = 1,  ≥   + ) A new group of undeterminable cases has thus been created.Similarly to interval cases, these cases are also detected based on symptoms and are (presumably) not affected by the screening program.As a result, the same assumptions that apply to interval cases, AS3 and AS4, now extend to undeterminable cases as well.These assumptions are revised as follows: (AS3)  =1 =  =0 if  =1  ∈ {interval, undeterminable} …if subject invited to the program is diagnosed with cancer based on symptoms, his age at diagnosis is identical to what it would have been, had he not been invited to the program (AS4)  =1  =  =0  if  =1  ∈ {interval, undeterminable} …if subject invited to the program is diagnosed with cancer based on symptoms, his age at death is the same as if he had not been invited to the program Note, however, that even though both groups comprise of cancer cases detected based on symptoms and share these two assumptions there is no reason to assume that a group of undeterminable cancers is in any way equal or similar to a group of interval cancers (e.g., with respect to their survival probability).
For the sake of completeness, we have also updated the definition of lead time and reexamined the relationship between hypothetical and observed detection mode (see Table A1):

(
A) With respect to death we use the following notation: •   …age at death from cancer •   …age at death from other causes •   …age at censoring which might preclude death •  = min(  ,   ) …age at death •   = ( <   ) …censoring status (for death) (B) With respect to cancer diagnosis we use the following notation: •  …age at cancer diagnosis •   = min(  ,   ) …age at censoring which might preclude cancer diagnosis •   = ( <   ) …censoring status (for cancer diagnosis) CSP does not affect the time of being lost from follow-up (AS3)  =1 =  =0

F
I G U R E 4 Marginal distributions of tumor size at cancer diagnosis and tumor growth rate in world 0 and world 1 for groups  ℎ ∈ {1},  ℎ ∈ {2}, and  ℎ ∈ {2}.The distribution of tumor diameter at the time of cancer diagnosis is not shown for  ℎ ∈ {2} cases in world 0, as the diagnosis of cancer in these cases is not recorded in world 0.

F
Simulation results.(a) Six survival functions required to estimate contrasts and biases defined in this study (see Table3 ).Note that  2 () =  3 () under assumptions AS3 and AS4.(b) The black and red step curves represent the contrasts () and (), respectively.Shaded regions depict the decomposition of biases, with their areas corresponding to the magnitude of each bias.

F
I G U R E 9 Overview of empirical data for individual cases by event for groups  = 0 and  = 1.Each point represents a case with the -axis showing the date of event, and the -axis indicating the date of birth.

TA B L E A 1
Relationship between observed and hypothetical detection modes.Cases with  ℎ ∈ {2} are referred to as overdetected cases.

TA B L E 1
Relationship between observed and hypothetical detection modes.Cases with  ℎ ∈ {2} are referred to as overdetected cases.List of random variables (the random variables are split according to those that are equal in both worlds (left column) and those that are not (right column)). Note.
TA B L E 4