Causal inference in survival analysis under deterministic missingness of confounders in register data

Long-term register data offer unique opportunities to explore causal effects of treatments on time-to-event outcomes, in well-characterized populations with minimal loss to follow-up. However, the structure of the data may pose methodological challenges. Motivated by the Swedish Renal Registry and estimation of survival differences for renal replacement therapies, we focus on the particular case when an important confounder is not recorded in the early period of the register, so that the entry date to the register deterministically predicts confounder missingness. In addition, an evolving composition of the treatment-arm populations and suspected improved survival in later periods lead to informative administrative censoring, unless the entry date is appropriately accounted for. We investigate the consequences of these issues for causal effect estimation following multiple imputation of the missing covariate data. We analyse the performance of different combinations of imputation models and estimation methods for the population average survival. We further evaluate the sensitivity of our results to the nature of the censoring and to misspecification of the fitted models. We find that an imputation model including the cumulative baseline hazard, the event indicator, the covariates, and interactions between the cumulative baseline hazard and the covariates, followed by regression standardization, leads to the best estimation results overall in our simulations. Standardization has two advantages over inverse probability of treatment weighting here: it can directly account for the informative censoring by including the entry date as a covariate in the outcome model, and it allows for straightforward variance computation using readily available software.


INTRODUCTION
A substantial amount of the causal inference literature is devoted to methods for valid causal effect estimation from observational data in the presence of confounding. However, when confounders are subject to missingness, classical methods for confounding adjustment, such as inverse probability of treatment weighting (IPTW) or standardization, are not readily applicable, and need to be adapted by borrowing from the toolbox for handling missing data. In this article, we focus on estimating marginal causal survival probabilities under deterministic missingness of confounders in register data. By deterministic missingness, we mean that there exists a measured variable whose value deterministically controls whether a confounder is missing or not. Because of the deterministic nature of the missingness mechanism of the confounder, methods that would proceed by weighting the complete cases by the inverse probability of missingness are not applicable, since the positivity assumption is violated. This a priori rules out an entire class of estimators. An additional complication is given by the nature of the administrative censoring. This is a problem that arises naturally in high-quality register data, where the censoring time of an individual who did not experience an event by the study end date is given by the time interval between their date of entry to the register and the study end date. As a consequence of this, methods that account for censoring via weighted regression are also ruled out.
For non-survival outcomes, the combination of multiple imputation of missing confounders with propensity score methods has been studied in simulation.2,3 The best performance seems to be obtained when propensity scores (PS) and the corresponding marginal effects of interest are estimated in each imputed data set, and then averaged over all imputed data sets. Analogous explorations for time-to-event outcomes have received considerably less attention in the literature.
Our motivating data set consists of patients registered in the Swedish Renal Registry4 who have been diagnosed with end-stage renal disease, and have undergone renal replacement therapy (dialysis or kidney transplantation). The registry was established in 1991 and contains information on demographic variables (age, sex, region), comorbidities at the date of entry to the register, cause of kidney disease, treatment assigned, and survival status. Our interest lies in estimating the effect of the type of renal replacement therapy assigned on survival. However, information on comorbidities only started to be systematically registered in the year 1998. Comorbidities are likely confounders of the effect of treatment on survival, as they clearly have an effect on the outcome, but also have an effect on the treatment assignment (patients with certain comorbidities may be less likely to be offered a transplant). Moreover, medical care has improved over time, and protocols for treatment assignment have themselves been modified, such that patients assigned a kidney transplantation at earlier dates tended to be younger and generally healthier, with fewer comorbidities than those offered a transplant at later dates. This places us between a rock and a hard place, since ignoring data from the first 7 years of the register will eliminate patients with the longest follow-up, representing 26% of the individuals in the data set and 36% of the total number of deaths registered, while choosing not to adjust for comorbidities leaves marginal causal effect estimates vulnerable to confounding bias.
Our aim is to investigate the different consequences of these issues on causal effect estimation following multiple imputation of the missing covariate data. We analyse the performance of different combinations of the covariate imputation models and outcome estimation methods. The covariate imputation models condition on different representations of the survival outcome, together with the baseline covariates, and interactions between these. The estimation methods are based on IPTW and standardization. We further evaluate the sensitivity of our results to the nature of the censoring and misspecification of propensity and outcome models. The simulation scenarios we construct are designed to separate different dimensions of the complex data structure and reveal the possible extent of bias induced by each of these.
The article is organized as follows. In Section 2, we describe our motivating data set and make explicit the issues that may affect valid causal effect estimation. Section 3 describes our estimand of interest, together with the assumptions under which we can expect valid estimation results. Section 4 describes the imputation models and estimation methods considered. In Sections 5 and 6, we compare the estimation results obtained for simulated data, and our motivating data set, respectively. We end with a discussion in Section 7.

MOTIVATING DATA SET
Patients enter the Swedish Renal Registry when diagnosed with renal disease at chronic kidney disease stage 4. When renal replacement therapy (RRT) becomes necessary, based on the current health status of the patients, availability of a transplant organ and guidelines in effect at the time of diagnosis, one of two RRTs is assigned: kidney transplant or dialysis. A majority of patients start with dialysis. Of these patients, many receive a transplant at a later time, but we will focus our attention on the treatment assigned initially. Receiving a kidney transplant without previous dialysis is referred to as a pre-emptive kidney transplantation (PKT).5 We are interested in estimating the marginal causal effect on survival of receiving a PKT vs starting dialysis, and adopt an intention-to-treat analysis.
The data consist of all patients who entered the register between 1991 and 2017. Follow-up corresponds to the period between entering the register and 2017 or death, whichever occurs first. Due to the high quality of the register, censoring is almost exclusively administrative, with the rare exceptions being due to emigration. Apart from survival information, the register only contains information on the patients at the time of entry to the register, such as age, gender, region, comorbidities and cause of kidney disease. Comorbidities only started being registered in 1998, the eighth year of the register's existence, and include diabetes, hypertension, ischemic heart disease, cerebrovascular disease and peripheral artery disease. All of these variables are potential confounders of the relationship between exposure and outcome. In particular, age and comorbidities are both important prognostic factors, as well as determinants of treatment assignment. Any missingness of comorbidities and censoring outside the deterministic framework outlined here is negligible.
For the sake of clarity, we will ignore it in the further developments. For more details on this data set, we refer to Olarte Parra et al.6 Due to changing transplant eligibility criteria over time, the subpopulation consisting of patients who received a PKT exhibits changes over time. In particular, PKT patients who entered the register at earlier times tended to be younger and healthier, with fewer comorbidities (see Figure 1). This suggests an interaction effect on the treatment assignment between date of entry to the register and comorbidities, and between date of entry and age. Similarly, due to improvements in care in recent years, survival times are expected to be longer for patients who entered the register at later times, mostly in the transplant subpopulation. This suggests an interaction effect between date of entry and treatment on the survival outcome.
Another problematic issue is the administrative censoring. As a consequence of improving medical care, true survival times may increase over calendar time of entry, but are not fully observed for later-period patients. In particular, for patients who enter the register in 2012, for example, no event past 5 years can be recorded when the follow-up period ends in 2017. Our choice of estimand determines how sensitive results are to what happens in the unobserved domain, beyond the censoring time. The 5-year survival chance in the cohort entering prior to 2012 is immune to our administrative censoring. For other estimands, involving survival at later times or in the more recent cohort, we must rely on additional untestable assumptions.
We thus have an extrapolation need, both in the time (outcome) domain and in the covariate (confounder) domain. During the first study period we miss comorbidity information, and during the later study period we miss survival information. Removing data at either end of the calendar time is a suboptimal solution, since dropping subjects from the earlier period will remove subjects with the longest follow-up time available, while dropping subjects from the later period will remove valuable recent information on the influence of comorbidities on survival, which is arguably more relevant for forming subsequent policy recommendations.

THEORETICAL FRAMEWORK
In this section, we fix notation, describe the estimand of interest, and discuss assumptions made for the purpose of estimation.
Let T represent time to death, and A be a binary variable representing treatment assigned at time t = 0. Further, let T(a) represent the potential survival time under treatment A = a, for a ∈ {0, 1}. Due to right-censoring, the variable T is not observed for all subjects. Instead, if we let C denote the censoring time, we observe T̃ = min(T, C), together with an event indicator D = I(T ≤ C). Write X^⊤ = (X_0, X_1, …, X_K) for a given vector of baseline covariates. This includes confounders of the association between treatment and survival outcome, as well as between censoring and survival. Without loss of generality, let X_0 represent the year of entry into the study, which is a variable observed for all subjects. Suppose further that X_0 deterministically controls the missingness pattern of other variables. We will focus on the particular situation where there exists a threshold τ such that for all X_0 ≤ τ, covariates X_j are missing, for j ∈ J, for some J ⊂ {1, …, K}. Any other covariates are assumed to be completely observed for all subjects. Write X^⊤ = (X_0, X_1^⊤, X_2^⊤), where X_1^⊤ is completely observed and X_2^⊤ is missing for X_0 ≤ τ. The postulated causal diagram illustrating the relationships between the variables is given in Figure 2. In what follows, the term confounder refers to the relationship between treatment and survival.
The goal is to estimate the causal survival difference θ(t_0) := p{T(1) > t_0} − p{T(0) > t_0}, for a given time point t_0. In order to be able to estimate this quantity, we make the following assumptions:

Assumption 1 (Consistency). T = T(A); that is, the observed survival time equals the potential survival time under the treatment actually received.

Assumption 2 (Positivity). For all observable X, we have p(A = a | X) > 0, for a ∈ {0, 1}.

Assumption 3 (Conditional exchangeability). T(a) ⫫ A | X, for a ∈ {0, 1}.

F I G U R E 2 Postulated causal diagram representing the relationships between the variables: X_0, entry year; X, baseline variables; A, treatment; T, survival time; C, censoring time; T̃, observed survival time. The relationships between the variables contained in X ⧵ X_0 are left unspecified. Note that T̃ = min(T, C), which is a deterministic relationship.

Under Assumptions 1-3, the causal survival difference is identified as

θ(t_0) = E{S(t_0 | A = 1, X)} − E{S(t_0 | A = 0, X)},   (1)

where S(t_0 | A, X) = p(T > t_0 | A, X) is the conditional survival function. To further ensure consistent estimation of S(t_0 | A, X), both missingness of X_2 and censoring must be taken into account and handled appropriately. We make the following assumptions:

Assumption 4 (Extrapolation assumption for missing covariates).
p(X_2 | X_0, X_1, A) = g_c(X_0, X_1, A; γ),

for some function g_c, parametrized by a vector of coefficients γ, which allows the functional form to be transported from the domain for which X_0 > τ to the domain for which X_0 ≤ τ.
Assumption 5 (Non-informative censoring given covariates). C ⫫ T | X_C, where X_C is a vector of covariates contained in (X, A).
When censoring is exclusively administrative, as it most often is in high-quality registers, X_C must include X_0, unless we believe that the impact of entry time on the hazard is fully captured by the other baseline covariates. If the hazard depends on the entry date in ways that may differ with internal time t, then Assumption 5 is not enough to "know" the hazard, h(t | X, A), for t > δ, where δ is the maximum follow-up time available until the end of the study. Indeed, there are no data on survival past δ. We must be willing to extend the formal expression for this hazard to times beyond δ, where we have no data. We thus make the following assumption:

Assumption 6 (Extrapolation assumption for hazard). h(t | X, A) = g_h(t, X, A; ω), for some function g_h, parametrized by a vector of coefficients ω, which allows the functional form to be transported from the domain for which t ≤ δ to the domain for which t > δ.
Assumptions 4 and 6 respond to two different types of positivity violations induced when the entry date influences the missingness of covariates and the censoring in a deterministic way conditional on X_0. Assumption 4 ensures that the parameters governing the functional form describing the dependence between (X_2, T) and X_0, X_1 and A can be estimated from the observed domain, X_0 > τ. If this assumption is violated, the dependence between (T, X_2) and X_0, X_1 and A may differ radically between the observed and the unobserved domains, and we would have no way of knowing or appropriately accounting for this in the analysis. For example, if p(X_2 | X_0) increases linearly for X_0 ≤ τ and then stabilizes at a constant value for X_0 > τ, no amount of data observed for X_0 > τ will allow us to consistently estimate the linearly increasing trend on the domain X_0 ≤ τ, since that is entirely unobserved, with no remnants of it present in the observed domain. Assumption 6 ensures that, for those subjects who enter the study close to the study end-date, it is still possible to consistently estimate survival probabilities at times beyond their largest possible follow-up time. For example, if this assumption is violated, we would be unable to estimate a 3-year survival probability for subjects who enter the study two years before the study end-date. Illustrative examples of violations of Assumptions 4 and 6 are shown in Figure 3.
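To make the first type of violation concrete, the following R sketch (with purely illustrative numbers, not the paper's data-generating values) shows why a model fitted only on the observed domain X_0 > τ cannot recover an increasing trend that is confined to X_0 ≤ τ.

```r
## A minimal sketch of a violation of Assumption 4: p(X2 = 1 | X0) increases
## linearly for X0 <= tau and is constant afterwards, but X2 is only recorded
## for X0 > tau, so a model fitted to the observed domain extrapolates a flat
## trend backwards. All numbers are illustrative.
set.seed(1)
n   <- 5000
tau <- 7
x0  <- runif(n, 1, 28)                                   # entry year
p_true <- ifelse(x0 <= tau, 0.10 + 0.04 * x0, 0.10 + 0.04 * tau)
x2  <- rbinom(n, 1, p_true)                              # partially recorded covariate
obs <- x0 > tau                                          # X2 observed only here

fit  <- glm(x2 ~ x0, family = binomial, subset = obs)    # fitted on the observed domain only
newd <- data.frame(x0 = c(3, 5, 20))
cbind(newd,
      predicted = predict(fit, newd, type = "response"),
      truth     = ifelse(newd$x0 <= tau, 0.10 + 0.04 * newd$x0, 0.10 + 0.04 * tau))
## For x0 = 3 and x0 = 5 the predictions miss the increasing trend entirely:
## nothing in the observed domain (x0 > tau) carries information about it.
```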

ESTIMATION METHODS
In the case of data with no missing values and non-informative censoring, we may fit a standard Cox proportional hazards model h(t | A, X) = h_0(t) exp(f(A, X; β)), where h_0(t) is the baseline hazard and f(A, X; β) is a function of A and X, for example, a linear combination of A and X, together with further interactions between A and X, parametrized by a vector of coefficients β. We would then estimate conditional survival as Ŝ(t | A, X; β̂) = exp{−Ĥ_0(t) exp(f(A, X; β̂))}, where Ĥ_0(t) is an estimator for the cumulative baseline hazard, for example, Breslow's estimator. We may then estimate the causal survival difference at time t_0 using Equation (1). Alternatively, we may fit a propensity score model π(X; α) for p(A = 1 | X), use the inverse of the predicted probabilities to produce weighted observations and derive Kaplan-Meier curves, or fit a weighted Cox regression model. Other estimation methods are available, in principle, but may not be applicable due to the nature of the administrative censoring. Some of these are discussed in Section 4.4.
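For concreteness, the two complete-data routes can be sketched with the survival package as follows. The data frame d and its columns (time, status, A, X0, X1, X2) are hypothetical stand-ins, and the model formulas are only examples of f(A, X; β) and π(X; α), not the exact models used later.

```r
## Minimal sketch of the two complete-data approaches on a hypothetical data
## frame `d` with columns time, status, A, X0, X1, X2.
library(survival)

## (a) Outcome modelling + standardization via Equation (1)
cox <- coxph(Surv(time, status) ~ A + X0 + X1 + X2 + A:X0 + X2:X0, data = d)
surv_std <- function(a, t0) {
  d_a <- transform(d, A = a)                      # set everyone to treatment a
  sf  <- survfit(cox, newdata = d_a)              # Breslow-type baseline hazard
  mean(summary(sf, times = t0)$surv)              # average conditional survival
}
theta_std <- surv_std(1, 5) - surv_std(0, 5)

## (b) Propensity score weighting + weighted Kaplan-Meier curves
ps <- fitted(glm(A ~ X0 + X1 + X2 + X0:X1 + X0:X2, family = binomial, data = d))
w  <- ifelse(d$A == 1, 1 / ps, 1 / (1 - ps))
km <- survfit(Surv(time, status) ~ A, data = d, weights = w)
s  <- summary(km, times = 5)
theta_ps <- s$surv[s$strata == "A=1"] - s$surv[s$strata == "A=0"]
```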
Since X is not fully observed in our case, the way the missingness is handled will impact the estimation of the causal effect and its variance. Because of the deterministic missingness and the corresponding positivity violation, we cannot apply inverse probability of missingness methods to handle the missing data in the confounders. We thus turn to multiple imputation for this step. The estimation results obtained within each imputed data set are combined using Rubin's rules.7

Multiple imputation
When imputing missing covariate data, it is common to include the outcome in the imputation model, alongside the other covariates. This has generally been found to produce better results than when the outcome is omitted.8 A time-to-event variable is often represented in the imputation model by the event indicator, D, together with the observed follow-up time, T̃, or the log thereof. When assuming an underlying proportional hazards model for survival, replacing the latter with the cumulative baseline hazard H_0(T̃), which can be approximated by the Nelson-Aalen estimator of H(T̃), has been found to produce better results, with smaller bias and higher power.9 Including interaction terms between the cumulative baseline hazard estimator and the other covariates may improve results further.9
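A minimal sketch of such an imputation model, using the nelsonaalen helper from the mice package, is given below. The data frame d and its column names are hypothetical, and the specification mirrors the interaction-augmented model described above (the mice4-type model of Section 5) rather than any exact specification from the paper.

```r
## Minimal sketch of a cumulative-hazard-based imputation model for a single
## partially observed binary covariate X2 in a hypothetical data frame `d`
## with columns time, status, A, X0, X1, X2.
library(mice)

d$H0 <- nelsonaalen(d, time, status)   # Nelson-Aalen cumulative hazard at each follow-up time
imp_dat <- data.frame(
  X2 = factor(d$X2),                   # the only variable with missing values
  A = d$A, X0 = d$X0, X1 = d$X1, D = d$status, H0 = d$H0,
  H0_A = d$H0 * d$A, H0_X0 = d$H0 * d$X0, H0_X1 = d$H0 * d$X1   # interaction terms
)
imp      <- mice(imp_dat, m = 10, printFlag = FALSE)   # default: logistic model for the factor X2
imputed1 <- complete(imp, 1)                           # first completed data set
```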

Standardization
As before, let t_0 be a fixed time point, at which the causal survival difference is to be computed. In each imputed data set, we fit a Cox proportional hazards model, h(t | A, X) = h_0(t) exp(f(A, X; β)), compute the standardized survival curve using Breslow's estimator for the cumulative baseline hazard, and evaluate the survival difference at t_0 as

θ̂_std = (1/n) Σ_{i=1}^n {Ŝ(t_0 | A = 1, X_i; β̂) − Ŝ(t_0 | A = 0, X_i; β̂)}.   (2)

The variance of θ̂_std can be obtained using M-estimation.10 Let M be the number of imputations performed. For m = 1, …, M, we use Equation (2) to obtain an estimate θ̂_std^m in each imputed data set. The final estimate for θ is then given by

θ̂_std^MI = (1/M) Σ_{m=1}^M θ̂_std^m.   (3)

The variance of θ̂_std^MI is computed using Rubin's rules.7 In particular,

var(θ̂_std^MI) = Σ_{m=1}^M var(θ̂_std^m)/M + (1 + 1/M) Σ_{m=1}^M (θ̂_std^m − θ̂_std^MI)²/(M − 1).   (4)

We distinguish between two variance components in this expression: the "within"-variance, given by W̄ = Σ_{m=1}^M var(θ̂_std^m)/M, where var(θ̂_std^m) can be computed using M-estimation,10 for m = 1, …, M, and the "between"-variance, given by B = Σ_{m=1}^M (θ̂_std^m − θ̂_std^MI)²/(M − 1). Confidence intervals can then be constructed using the approximation11,12

(θ̂_std^MI − θ)/√var(θ̂_std^MI) ∼ t_d,   (5)

where t_d denotes the t-distribution with d degrees of freedom, and

d = (M − 1){1 + W̄/((1 + 1/M)B)}².   (6)
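The pooling steps in Equations (3)-(6) are straightforward to code. The sketch below assumes that the per-imputation estimates and their within-imputation variances (obtained, for example, by M-estimation or, in the next subsection, by the bootstrap) are already available as numeric vectors.

```r
## Minimal sketch of Rubin's rules pooling, Equations (3)-(6): `est` holds the
## M per-imputation point estimates, `var_within` their estimated variances.
pool_rubin <- function(est, var_within, conf = 0.95) {
  M    <- length(est)
  qbar <- mean(est)                                    # Equation (3)
  wbar <- mean(var_within)                             # "within"-variance
  b    <- var(est)                                     # "between"-variance
  tvar <- wbar + (1 + 1 / M) * b                       # Equation (4)
  df   <- (M - 1) * (1 + wbar / ((1 + 1 / M) * b))^2   # Equation (6)
  half <- qt(1 - (1 - conf) / 2, df) * sqrt(tvar)      # Equation (5)
  c(estimate = qbar, se = sqrt(tvar), lower = qbar - half, upper = qbar + half)
}

## Usage: pool_rubin(est = theta_m, var_within = var_m), with theta_m and var_m
## obtained by applying the standardization above to each of the M imputed sets.
```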

IPTW
The combination of multiple imputation of missing confounders with propensity score weighting has been studied for non-survival outcomes.2,3 Specifically, two main approaches have been considered and compared: the "within"-approach, where propensity scores and effect estimates are computed separately in each imputed data set, and are then combined according to Rubin's rules, and the "across"-approach, where the propensity scores obtained in each imputed data set are averaged, and a weighted regression is fitted to one of the data sets using this average propensity score. Granger et al3 compared these two approaches in simulation for a number of scenarios, with both continuous and binary outcomes, different confounding structures and varying proportions of missing data. Their overall results indicate that the "within"-approach is preferable to the "across"-approach in terms of bias, but SEs produced by both methods are unreliable. Leyrat et al2 also studied these two approaches in simulation with a binary outcome, with similar conclusions. Mitra et al recommend the "across"-approach, but the setup of the simulation leading to this result has been criticized.13 The few theoretical results available also support the use of the "within"-approach in terms of consistency and balancing properties.2,14

In survival settings with complete covariate data, where interest is in obtaining the effect of treatment on a survival outcome, weighting the sample by the inverse probability of treatment and then fitting a weighted Cox regression model with treatment as the only covariate has been found to produce the least biased marginal hazard ratio estimates among several different propensity score methods compared in simulation.15 When it comes to variance estimation, simulation results indicate that a bootstrap estimator has superior performance to naïve or robust sandwich-type variance estimators.16 For estimating absolute treatment effects on survival, the best performance seems to be obtained by weighting the sample using the propensity scores, fitting two separate null Cox proportional hazards models in the treated and untreated populations, and estimating the corresponding survival probabilities separately.17,18 To our knowledge, in the case of survival outcomes and missing covariate data, no guidance is available in the literature on combining multiple imputation with propensity score weighting.
In our analysis, within each imputed data set, we fit a model π(X; α) for p(A = 1 | X), for example, a logistic regression model. We then use the predicted probabilities to compute weighted Kaplan-Meier curves for A = 1 and A = 0, which we evaluate at time t_0. We denote the point estimate obtained in the mth imputed data set by θ̂_PS^m. The variance estimate is obtained by bootstrapping within imputed data sets, and combining the resulting M variance estimates using Rubin's rules. More precisely, for each imputed data set, we produce B bootstrap samples, which are then used to compute SEs. If we denote the point estimate obtained in the bth bootstrap sample drawn from the mth imputed data set by θ̂_PS^{m,b}, then the variance of θ̂_PS^m, for m = 1, …, M, is estimated as

var(θ̂_PS^m) = Σ_{b=1}^B (θ̂_PS^{m,b} − θ̄_PS^m)²/(B − 1), where θ̄_PS^m = Σ_{b=1}^B θ̂_PS^{m,b}/B.

The resulting M point estimates and variance estimates are then combined using Rubin's rules, analogously to Equations (3) and (4), to produce a point estimate θ̂_PS^MI, and its variance estimate, var(θ̂_PS^MI). This approach is chosen over the more computationally intensive version, where the whole estimation procedure is bootstrapped, that is, first B bootstrap samples are drawn, and then each of these is imputed M times. The two procedures have been shown to have similar performance,11 and we have chosen the former for convenience. Confidence intervals are constructed using the t-statistic described in Equations (5) and (6).
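A sketch of the point estimate and the bootstrap-within-imputation variance for a single imputed data set di is given below, with hypothetical column names and an illustrative propensity score formula, as before. The resulting pairs of estimates and variances would then be pooled with the Rubin's-rules sketch of Section 4.2.

```r
## Minimal sketch: IPTW point estimate and bootstrap variance within one imputed
## data set `di` (eg, di <- complete(imp, m)); column names are hypothetical.
library(survival)

iptw_diff <- function(dat, t0 = 5) {
  ps <- fitted(glm(A ~ X0 + X1 + X2 + X0:X1 + X0:X2, family = binomial, data = dat))
  w  <- ifelse(dat$A == 1, 1 / ps, 1 / (1 - ps))
  km <- survfit(Surv(time, status) ~ A, data = dat, weights = w)
  s  <- summary(km, times = t0)
  s$surv[s$strata == "A=1"] - s$surv[s$strata == "A=0"]
}

theta_m  <- iptw_diff(di)
boot_est <- replicate(200, iptw_diff(di[sample(nrow(di), replace = TRUE), ]))
var_m    <- var(boot_est)   # within-imputation variance, fed into pool_rubin()
```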

Other estimation methods
Variations of the estimation procedures described above are possible. However, some potentially desirable adjustments to these procedures are complicated by the nature of the administrative censoring. Under random censoring, these adjustments would not be problematic. Inverse probability of treatment weighting ensures that both treatment groups have a similar distribution of baseline covariates, but it remains true that certain values of these tend to come with shorter censoring times as well as better survival times. Thus, rather than fitting one propensity score model to the whole data set and computing one single weighted Kaplan-Meier curve for each of the two treatment arms, one may instead stratify by entry year, and average the results. Within strata, the maximum follow-up times do not differ that much, hence the impact of the possibly informative censoring is limited. However, because of the administrative censoring, a very fine stratification, for example, a 1-year stratification, is inefficient, since there are not enough events in the later period of the data, especially in the PKT arm. Moreover, the value of t_0 in the estimand of interest cannot exceed the available follow-up time for the designated strata, since the Kaplan-Meier curves for the latest time interval stratum will not reliably cover a follow-up time larger than the length of that interval.
Rather than computing Kaplan-Meier curves, one may instead fit a weighted Cox proportional hazards model. This may include only A as a covariate but, in order to account for the informative censoring, it may include both A and the entry year as covariates. Alternatively, one may include all covariates and relevant interaction terms in the Cox model. This is often done in the applied literature,19 but we are not aware of any theoretical result according to which this is expected to result in better performance in terms of bias and precision. Again, one might perform this analysis by stratifying, using an appropriate entry year interval. We remark that this route requires fitting two models (the propensity score model and the Cox model) rather than the propensity score model alone, so it involves more modeling assumptions, and does not bypass the need for standardization. Variance calculation can be performed by bootstrapping, and is expected to be computationally intensive.
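The following sketch illustrates such a weighted Cox fit with treatment and entry year, again on the hypothetical data frame d and with illustrative formulas. It also makes explicit that obtaining a marginal survival difference from this fit still requires averaging predicted survival over the covariate distribution, that is, standardization.

```r
## Minimal sketch of an IPT-weighted Cox model with treatment and entry year;
## the marginal survival difference is still obtained by standardizing over X0.
library(survival)

ps <- fitted(glm(A ~ X0 + X1 + X2 + X0:X1 + X0:X2, family = binomial, data = d))
w  <- ifelse(d$A == 1, 1 / ps, 1 / (1 - ps))
wcox <- coxph(Surv(time, status) ~ A + X0, data = d, weights = w, robust = TRUE)
theta_wcox <-
  mean(summary(survfit(wcox, newdata = transform(d, A = 1)), times = 5)$surv) -
  mean(summary(survfit(wcox, newdata = transform(d, A = 0)), times = 5)$surv)
```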
Another option for estimation is to compute pseudo-observations based on the Kaplan-Meier estimator at time point t_0.20,21 These are quantities of the form θ̂_i = n φ̂ − (n − 1) φ̂_(−i), where φ̂ is the estimate obtained by using the whole data set, and φ̂_(−i) is the estimate obtained by removing the ith subject. We may then use these as outcomes in a generalized linear model with, for example, the cloglog link, and then perform standardization. However, when censoring is not independent, which is one of the requirements for the pseudo-observations to be unbiased, one should account for this either by weighting22 or stratification.23 Here, one could, for example, stratify by entry date. However, this is problematic for small stratification time-intervals (eg, 1-year intervals), since in the later strata there will potentially be too few events contributing to the computation of the pseudo-observations and no subjects with follow-up time as long as the time-point of interest, t_0, when t_0 exceeds the stratification interval. Larger time-intervals for stratification could be used, but are undesirable, since they may not address the dependent censoring issue effectively.
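The pseudo-observation construction itself is simple, as the following naive sketch shows on the hypothetical data frame d; in our analyses the eventglm package computes the pseudo-observations and fits the subsequent GLM directly.

```r
## Minimal (and deliberately naive, O(n^2)) sketch of jackknife pseudo-observations
## for S(t0) based on the Kaplan-Meier estimator, on a hypothetical data frame `d`.
library(survival)

km_at <- function(dat, t0) {
  summary(survfit(Surv(time, status) ~ 1, data = dat), times = t0, extend = TRUE)$surv
}
t0     <- 5
n      <- nrow(d)
full   <- km_at(d, t0)
pseudo <- sapply(seq_len(n), function(i) n * full - (n - 1) * km_at(d[-i, ], t0))
## `pseudo` can now be used as the outcome in a GLM with, for example, a cloglog
## link, followed by standardization over the covariates.
```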

SIMULATION
We set up the simulation data to resemble our motivating data as closely as possible in their characteristics of interest.
Patients are enrolled for a period of 27 years, corresponding to the first 27 years of existence of the register, and are followed up until death or administrative censoring, whichever occurs first. To illustrate ideas more clearly, we only include two confounders: one that is continuous and measured for all subjects (which can be taken to represent age, for example), and one that is binary and subject to missingness in the earlier part of the register (which can be taken to represent a comorbidity). The sample size is set at 20 000, which is similar to the original data set. In Section 5.1, we describe the scenarios under which we generate data. These scenarios allow us to investigate the consequences of the deterministic missingness, the informative administrative censoring and the changing composition of the treatment arms on the performance of the different imputation and estimation methods described in Section 4. In Section 5.2, we provide details on how the methods described in Section 4 are applied to the simulated data, and specify the variables included in each of the models we fit. In Section 5.3, we report on the comparative performance of the different analysis methods.

Data generation
Let
A = treatment (binary: 0 is dialysis, 1 is PKT),
X_0 = year of entry into the register (continuous),
X_1 = age at entry into the register (continuous),
X_2 = comorbidity (binary).
The variable X_2 is not recorded for patients registered in the period X_0 ≤ τ. We set τ = 7 and generate data according to two scenarios, by allowing p(T | A, X_0, X_1, X_2) to vary. In both scenarios, we generate X_0 ∼ Unif(1, 28) and X_1 ∼ 15 + 60 ⋅ Beta(3, 1). Further, we let p(X_2 = 1 | X_1) = expit(−6.5 + 0.1 X_1), where expit(⋅) is the inverse-logit function. This results in approximately 740 patients enrolled annually, with a mean age of 60 years. The treatment variable, A, is generated from a logistic regression model with age, comorbidity, year of entry, and interactions between year of entry and age and between year of entry and comorbidity in the linear predictor (the model treated as correctly specified in Section 5.2). With these parameters, we have p(A = 1) = 0.17, p(X_2 = 1) = 0.42 and p(X_0 ≤ τ) = 0.23. The survival times of Scenario I are generated from a proportional hazards model with Weibull baseline hazard h_0^(1)(t) = 2.5 t^1.5 / 300 and a linear predictor containing treatment, age and comorbidity. This is the baseline scenario, where the hazard does not depend on the date of entry to the register, and there are no interactions between treatment, comorbidity or age and date of entry on survival times. For Scenario II, the baseline hazard is h_0^(2)(t) = 2.5 t^1.5 / 120, and the linear predictor additionally contains the year of entry and its interactions with treatment and with comorbidity. Note that the shape parameter of the Weibull distribution corresponding to the baseline hazard is fixed at 2.5 in both scenarios, but the scale parameter takes values 1/300 and 1/120, respectively. This is done to keep survival times realistic. The survival times were generated following Bender et al,24 by first generating a continuous uniform random variable, U ∼ Unif(0, 1), and setting the survival time to T = [− log(U)/{λ exp(L)}]^(1/ν), where λ and ν are the scale and shape parameters of the Weibull distribution corresponding to the baseline hazard, respectively, and L denotes the linear term in the hazard. Censoring is administrative, and the study end is set at 27 years after the start of the register, paralleling the motivating data set.
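The sketch below illustrates the structure of the Scenario I generation. The regression coefficients are placeholders rather than the values actually used in the paper, and the censoring construction is one interpretation of administrative censoring at 27 years.

```r
## Minimal sketch of a Scenario I-type data-generating mechanism; all regression
## coefficients below are placeholders, not the paper's values.
set.seed(2023)
n   <- 20000
tau <- 7
X0  <- runif(n, 1, 28)                          # year of entry
X1  <- 15 + 60 * rbeta(n, 3, 1)                 # age at entry
X2  <- rbinom(n, 1, plogis(-6.5 + 0.1 * X1))    # comorbidity
A   <- rbinom(n, 1, plogis(-1 - 0.03 * X1 - 0.5 * X2 + 0.02 * X0 * X2))  # placeholder treatment model

## Weibull survival times via inversion (Bender et al): h0(t) = 2.5 t^1.5 / 300,
## ie, shape nu = 2.5 and scale lambda = 1/300.
nu <- 2.5; lambda <- 1 / 300
L  <- -0.8 * A + 0.03 * (X1 - 60) + 0.4 * X2    # placeholder linear predictor (no X0 terms in Scenario I)
U  <- runif(n)
Tsurv <- (-log(U) / (lambda * exp(L)))^(1 / nu)

Cens   <- 28 - X0                               # administrative censoring at the study end
time   <- pmin(Tsurv, Cens)
status <- as.integer(Tsurv <= Cens)
X2[X0 <= tau] <- NA                             # comorbidity not recorded in the early period
d <- data.frame(time, status, A, X0, X1, X2)
```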
For this simulation, we generate 500 populations of size 20 000.Some of the trends in the simulated data are depicted in Figures 4 and 5, for one simulated population.

Analyses
We perform a series of analyses on the simulated data, combining different imputation and estimation methods to obtain the causal survival difference between the two treatments at times t_0 = 5 and 10 years. We start by analysing the full data set, before missingness in the comorbidity variable, X_2, is induced. We also include an analysis where the missing comorbidity is ignored, and thus X_2 is omitted in all models fitted, and a complete-case analysis, where patients with missing comorbidities are excluded from the data set. To account for the missing data, we investigate the use of four imputation models, which we refer to as mice1, mice2, mice3, and mice4, respectively, and which are detailed in Table 1. The imputation models mice1 and mice3 use the logarithm of the observed time-to-event and the event indicator to represent the survival outcome; mice1 includes all other available covariates, and mice3 additionally includes interaction terms between the logarithm of the time to event and the covariates. The imputation models mice2 and mice4 use the Nelson-Aalen estimator of the cumulative hazard and the event indicator to represent the survival outcome; mice2 includes all other available covariates, and mice4 additionally includes interaction terms between the Nelson-Aalen estimator and the covariates.
The estimation methods included in the comparison were selected following general recommendations available in the literature, as described in Section 4. The Cox proportional hazards model used for the standardization procedure includes treatment, age, comorbidity, year of entry, and interaction terms between year of entry and comorbidity, and between year of entry and treatment. The propensity score model used for the inverse probability of treatment weighting procedure includes age, comorbidity, year of entry, and interaction terms between year of entry and age, and between year of entry and comorbidity. Both of these models are "correct," in the sense of corresponding to the true data generating mechanisms. These models are fitted to the data obtained by using mice1, mice2, mice3, and mice4 as imputation models. We evaluate the sensitivity of these estimation methods to misspecification of the fitted models, by omitting the interaction terms. For this, we only use the data imputed using the mice4 model. The models used are detailed in Table 1. We also include variations of these procedures in our comparison, as described in Section 4.4. Instead of fitting one propensity score model to the whole data set, we stratify based on entry year, using a 5-year interval. In particular, the subjects who enter the register in the last 5 years form one stratum, those who enter the register in the previous 5 years form another, and so on. The last interval covers the first 7 years of the register (since the total follow-up time is 27 years). In each stratum we fit the propensity score model, construct weighted Kaplan-Meier curves for the two treatment arms, and evaluate the survival difference at t_0 = 5. We then average the results over the strata. Note that, with administrative censoring, this does not allow us to compute the 10-year survival difference. We also include three versions of weighted Cox models (where the inverse probability of treatment weights are obtained by fitting one propensity score model to the whole data set). Each of these Cox models includes a different set of covariates: the first includes treatment as the only covariate, the second includes treatment and entry year as covariates, and the third uses the full model, which includes treatment, age, comorbidity, year of entry, and interaction terms between year of entry and comorbidity, and between year of entry and treatment. We only use data imputed using mice4 for these additional analyses. For an overview of all analyses performed, see Tables 1 and 2 of the Supplementary Material.

T A B L E 1 The imputation models considered
mice1: logistic model for X_2 against log(T̃), D, A, X_0, X_1
mice2: logistic model for X_2 against Ĥ_0(T̃), D, A, X_0, X_1
mice3: logistic model for X_2 against log(T̃), D, A, X_0, X_1, and interactions log(T̃) × {A, X_0, X_1}
mice4: logistic model for X_2 against Ĥ_0(T̃), D, A, X_0, X_1, and interactions Ĥ_0(T̃) × {A, X_0, X_1}
Abbreviations: A, treatment; D, event indicator.
To gauge the sensitivity of our methods to the administrative censoring, we performed the analyses for Scenario II twice: once by setting the administrative censoring time at 27 years after the register was established (as in the motivating data set), and then again by allowing for random independent censoring, following an exponential distribution with rate 1/15. Administrative censoring produces approximately 25% and 26% censored individuals in Scenarios I and II, respectively. Random censoring produces approximately 31% censored individuals in Scenario II. Our motivating data set contains 38% censored individuals.
For random censoring, in addition to the analyses described above, we also perform stratified IPTW using a 1-year interval, and use pseudo-observations as outcome in a generalized linear model with the cloglog link function. Again, we only use data imputed using the mice4 model here.
The true values of the marginal survival differences at times t_0 = 5 and t_0 = 10 are computed by generating 1000 data sets of size 10^6, with both potential outcomes, no censoring and no missing variables, and averaging the difference in survival probabilities obtained. We evaluate the methods in our comparison by computing the bias, the mean SE and the empirical SE of the survival difference estimates over the 500 populations. The number of imputations is set at 10 and the number of bootstrap samples used to compute the variance for IPTW methods is set at 200. We conducted our analyses in R, using the following packages: mice25 for multiple imputation, survival26 for survival analysis, boot27 for bootstrapping, stdReg10 for standardization and eventglm28 for the pseudo-observations analysis. A single instance of the analyses on a single imputed data set involving standardization takes as little as 5 to 6 s when run on a computer with a 2.8 GHz Quad-Core Intel Core i7 processor. However, inverse probability of treatment weighting methods are considerably more time-consuming, the most computationally intensive step being the bootstrap procedure for the variance computation. This may take up to 3 to 4 min if computations are run on a single core. The total running time of an analysis depends on the sample size, the number of imputations and, where applicable, the number of bootstrap samples used for variance computation.
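In code, this evaluation amounts to a Monte Carlo computation of the true value followed by simple summaries over the 500 estimates. In the sketch below, gen_potential is a hypothetical helper standing in for a generator of both potential survival times; est, se and truth are the corresponding vectors and scalar for one method.

```r
## Minimal sketch of the performance evaluation. gen_potential() is hypothetical:
## it should return a data frame with both potential survival times T1 and T0.
## (This Monte Carlo step is computationally heavy at the stated sizes.)
truth <- mean(replicate(1000, {
  po <- gen_potential(1e6)
  mean(po$T1 > 5) - mean(po$T0 > 5)
}))

## Summaries over the 500 simulated populations, for one method:
bias         <- mean(est) - truth
mean_se      <- mean(se)
empirical_se <- sd(est)
```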

Results
The results of the analyses performed are summarized in Figure 6 and Tables 3 and 4 of the Supplementary Material. For survival probabilities per treatment arm and SE estimates, see Sections 2 and 3 of the Supplementary Material.
A series of trends are apparent across scenarios. First, performing the analysis using the full data, before the missingness is induced, results in the most accurate survival difference estimation when standardization is used. Second, using the cumulative baseline hazard and the event indicator to represent the survival outcome in the imputation model results in similar or better estimation results across scenarios and estimation methods than using the log of the survival time and the event indicator. Including interaction terms between the cumulative baseline hazard and the other covariates improves results even further. Third, the variances of the estimates obtained using propensity score weighting are larger than those of the estimates obtained using standardization.

F I G U R E 6 Difference between the estimated and the true causal survival difference for time points t_0 = 5 and t_0 = 10 over 500 simulated data sets across methods and scenarios. compl, complete-data analysis; full, full data analysis; ignore, analysis ignoring missing covariate in fitted models; mice1, mice2, mice3, mice4, imputation models used; mice4-mis, analysis with misspecified outcome or propensity score model; PS, propensity score weighting; PSCox, weighted Cox with treatment as the only covariate; PSCox-entry, weighted Cox with treatment and entry date as covariates; PO, pseudo-observations analysis; PSCox-full, weighted Cox with full model; PSstrat1, PS weighting with 1-year entry date stratification; PSstrat5, PS weighting with 5-year entry date stratification; std, standardization

Scenario I
In this scenario, the true survival differences at 5 and 10 years are similar: 0.258 and 0.255, respectively. Using the full data, before missingness is induced, produces the most accurate results, both in combination with standardization and with inverse probability weighting. Using only the complete data in the analysis also gives accurate survival difference estimates, when using standardization as well as propensity score weighting. This is because, in Scenario I, the survival time is independent of the entry date to the register, so that the populations before and after 1997 are comparable. Ignoring the comorbidities in the models fitted leads to slight underestimation of the survival difference at t_0 = 5 and overestimation of the survival difference at t_0 = 10 when using standardization. With propensity score weighting, the survival difference is slightly underestimated, both for t_0 = 5 and t_0 = 10. The relatively small bias caused by ignoring the comorbidity could be explained by the fact that age absorbs part of the effect of the comorbidity on survival, as they share similar patterns across time.
All imputation models, with the possible exception of mice1, perform well, both in combination with standardization and with propensity score weighting, although, as previously mentioned, the estimates obtained upon standardization have considerably smaller variance. Misspecifying the propensity score model by ignoring interaction terms biases results upwards substantially. Omitting the interaction terms from the outcome model is not a true misspecification here, since the linear term in the hazard for Scenario I did not contain any interactions to start with.
Neither performing stratified inverse probability weighting using 5-year intervals, nor fitting weighted Cox models with treatment or with treatment and entry year as covariates, seems advisable. However, if the full Cox model is fitted, results improve in accuracy, albeit at the cost of a slightly higher variance than simply fitting a Cox model and standardizing to begin with.

Scenario II
Applying standardization to the full data set produces accurate estimates, both under administrative and random censoring. The propensity score procedure applied to the full data under administrative censoring produces a mean estimate that is close to the true value for t_0 = 5. However, for t_0 = 10, the survival difference is substantially underestimated. Similar patterns are obtained when using the imputed data. We note that this problem is absent under random censoring. This suggests that propensity score weighting is not able to appropriately account for the association between censoring and survival time under administrative censoring, a problem that is amplified for later entry times by the presence of interaction terms between the entry year and the treatment and the comorbidities in the hazard, and leads to a bias as large as 0.10. The complete data analyses produce estimates that are generally biased upwards, both with standardization and propensity score weighting, under administrative censoring as well as random censoring. The exception to this is the inverse probability of treatment weighted estimate for t_0 = 10 under administrative censoring. The upward bias can partly be explained by the fact that these data only contain the later period of the register, where the PKT arm tends to include higher values of age and comorbidities, and the difference in survival probabilities between treatment arms is larger than at the start of the register. Since the subjects with the worst survival experiences are excluded from the study, the survival model uses as basis for extrapolation subjects who have considerably longer survival. This affects estimation in the PKT arm to a greater extent, since it is the arm whose survival probability changes most over time (see Figure 1 of the Supplementary Material). The apparent downward bias for inverse probability of treatment weighting at t_0 = 10 under administrative censoring can be understood as the result of two competing biases of opposite directions: the downward bias induced by the informative censoring and the upward bias induced by the extrapolation being based on a population with generally better survival. This seems to be confirmed by the upward bias under random censoring for t_0 = 10.
Ignoring the comorbidities and applying standardization biases the survival difference estimates upwards for t_0 = 5 and downwards for t_0 = 10, both under administrative and random censoring. Again, the relatively small bias caused by ignoring the comorbidity and using standardization could be explained by the fact that age absorbs part of the effect of the comorbidity on survival, as they share similar patterns across time. The changing direction of the bias between t_0 = 5 and t_0 = 10 is more difficult to explain. Ignoring the comorbidity results in the implicit assumption that it remains constant across time. In reality, the incidence of the comorbidity increases, especially in the PKT arm. This, coupled with the fact that comorbidity negatively affects survival, would bias the survival difference estimate downwards. However, survival generally improves across time. It is difficult to track the relative magnitudes of these two trends and determine which of them is dominant at any one given time point. When propensity score weighting is applied, the bias of the survival difference is minimal, except for the survival difference at t_0 = 10 under administrative censoring. The larger bias obtained here is most likely driven by the informative censoring, since the bias seems to be resolved under random censoring.
The best estimation results overall under administrative censoring are obtained when the missing data are imputed using mice3 or mice4, and standardization is applied. The difference between the patterns obtained under administrative censoring and random censoring here is minimal, which suggests that standardization appropriately accounts for the informative censoring.
Misspecifying either the outcome model or the propensity score model results in substantial bias, under both administrative and random censoring. This suggests that it may be preferable to use a rich model. Stratified propensity score weighting using 5-year or 1-year intervals does not perform better than its unstratified counterpart. Weighted Cox models using treatment or treatment and entry year as covariates perform poorly, even under random censoring. Using the full Cox model in the weighted regression produces estimates that are comparable to fitting the full Cox model to begin with, and then standardizing.
When random censoring is applied to Scenario II, the pseudo-observation estimation procedure is seen to produce the best results in terms of bias, with only a small increase in variance in comparison to standardization.

DATA APPLICATION
The data consist of 19 531 observations, corresponding to patients who entered the Swedish Renal Registry between January 1991 and December 2017. The shortest follow-up time is 1 day and the longest follow-up time is 9853 days (approximately 27 years). Of these patients, 5142 (26%) entered the register before 1998 and have missing comorbidity information, 1097 (6%) received a PKT, and 12 073 (62%) died during the study period. For a more detailed description of the data set, we refer to Olarte Parra et al.6 An important difference between the simulated data sets presented in the previous section and the real data set is that in the real data set, patients who start dialysis may still receive a transplant at a later date. This simplification in the simulation was done for clarity of presentation, and may result in different survival patterns in the real data than those in the simulation setting, especially in the dialysis arm, as we explain below.
For the outcome regression model, we fit a Cox proportional hazards model, with covariates given by treatment, entry year, sex, age, region, primary kidney disease, comorbidities, and interactions between treatment and comorbidities and between entry year and comorbidities. For the propensity score model, we fit a logistic regression model for treatment against entry year, sex, age, region, primary kidney disease, comorbidities, and interactions between entry year and comorbidities and between entry year and age. We note that we have rescaled the entry year to take values between 1 and 27.
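In R, the two fitted models correspond to formulas of the following form. The data frame srr and the variable names (entry, pkd, comorb, and so on) are hypothetical stand-ins for the register's actual column names, and comorb abbreviates the set of comorbidity indicators.

```r
## Minimal sketch of the outcome and propensity score models for the register
## data; data frame and variable names are hypothetical stand-ins.
library(survival)

outcome_fit <- coxph(Surv(time, status) ~ A + entry + sex + age + region + pkd +
                       comorb + A:comorb + entry:comorb, data = srr)
ps_fit <- glm(A ~ entry + sex + age + region + pkd + comorb +
                entry:comorb + entry:age, family = binomial, data = srr)
```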
The estimation results are summarized in Table 2 and Figure 7. Using only the complete data generally produces larger estimates than ignoring the comorbidities or imputing the data, regardless of whether standardization or propensity score weighting is applied for estimation. Ignoring the comorbidities produces similar, but slightly higher estimates than imputing the data. Imputing the data using the four imputation models gives rise to very similar point estimates when combined with either standardization or propensity score weighting. However, estimates obtained by using propensity score weighting are generally higher than those obtained by using standardization and have higher estimated SEs. The exception to this is the dialysis arm, where, upon multiple imputation, estimates using standardization are slightly higher than those obtained using propensity score weighting, and where standardization produces larger SEs. Note, however, that it is not the SEs of the standardized estimates that are much larger for dialysis compared to PKT, but the SEs of the propensity score weighted estimates that are smaller. This is consistent with results obtained in simulation (see Tables 5-6 and Figures 3-5 in the Supplementary Material), which also suggest that the SEs obtained as part of the standardization procedure are conservative. A possible explanation for this phenomenon could be the homogeneity and size of the dialysis population compared to the PKT population.

For all analyses, the estimated survival probability under PKT is larger than that under dialysis, both at t_0 = 5 and at t_0 = 10. The survival difference is seen to increase slightly from t_0 = 5 to t_0 = 10 when standardization is used for estimation. However, with propensity score weighting, the estimated survival difference is roughly the same at t_0 = 5 and t_0 = 10 (or even shows a small decrease). The exception to this is the complete-data analysis, where the estimated survival difference markedly increases from t_0 = 5 to t_0 = 10. Including the early cohort using covariate imputation and standardization, the survival probability decreases from approximately 0.78 to 0.66 between t_0 = 5 and t_0 = 10 under PKT, and from approximately 0.56 to 0.39 under dialysis. Under imputation and propensity score weighting, the survival probability decreases from approximately 0.89 to 0.71 between t_0 = 5 and t_0 = 10 under PKT, and from approximately 0.54 to 0.37 under dialysis.
Based on the results of our simulation, we would choose to report the results of the analysis using mice4 as imputation model, and standardization for survival probability estimation. We would then conclude that the survival probability difference between receiving a PKT and receiving dialysis is 0.232 (95%CI: 0.204-0.261) at t_0 = 5 and 0.273 (95%CI: 0.238-0.308) at t_0 = 10. These results are similar to those obtained by Olarte Parra et al.

DISCUSSION
We compared the performance of a number of imputation models and estimation methods to estimate the difference in marginal survival probabilities between receiving a PKT or dialysis as treatment for end-stage renal disease in the presence of conditional deterministic missingness of a confounder. In particular, the date of entry to the register determines whether comorbidities have been recorded as part of the baseline covariates for subjects. Not only does the date of entry determine missingness in an important confounder, but it also induces an association between censoring and survival times. In addition, because of the nature of the administrative censoring, subjects entering in the later period of the register will have shorter follow-up time. These issues require making extrapolation assumptions concerning both the distribution of the confounder and the hazard, to the unseen domains, in order to allow for consistent estimation of survival probabilities. Regression standardization is shown to have the best performance across simulation scenarios when combined with imputation models that include interaction terms between the covariates and H_0(T̃) (approximated by the Nelson-Aalen estimator of H(T̃)) or the log of the observed survival time. We recommend using a rich survival model, and making sure the extrapolation assumptions made are sensible. We note that the uncertainty in the estimation of H_0(T̃) is not taken into account when estimating the variance in our study, and neither is our ignorance about the correctness of the extrapolations. The fact that the administrative censoring is controlled by the date of entry results in a disadvantage for propensity score models, where no direct adjustment for date of entry is made. When the influence of the entry year on survival is strong (as in Scenario II), propensity score weighting is seen to substantially underestimate the survival difference even when covariates are not subject to missingness. We note that the bias of the survival difference we observe in simulations is mostly driven by the PKT arm (see Figures 1 and 2 in the Supplementary Material). This is an expected behavior, since the PKT population is not only smaller in size, but is also the one that experiences the most changes in composition and survival experience. Under random censoring, pseudo-observations are also an option for estimation, and perform remarkably well. Note that administrative censoring behaves like random censoring in registers only when the study population and its adjusted survival stay constant over calendar time of entry.
The aspects that we have highlighted here are not limited to register-based studies, but are also of relevance for any study with administrative censoring where the selection criteria change over the course of patient accrual, so that the composition of the population under study changes over the course of follow-up, in a manner that is associated with the survival outcome. For example, during the course of the COVID pandemic, hospitalization criteria were subject to continuous change, creating heterogeneity in the study population and the potential for bias in survival estimates. Moreover, while we believe that the extrapolation assumptions for the missing covariates and the hazard (Assumptions 4 and 6) are plausible in our motivating data set, we can imagine that there are settings where one or both of these assumptions may be violated. This would be the case, for example, if τ or δ marked an event that dramatically changed the composition of the population or the context of the study, such as a ground-breaking technological advancement with direct medical applications or a pandemic. Sudden changes in trend at time points where we start missing observations (in a deterministic way) are problematic for any analysis method. One way to address the difficulties that may arise is by limiting the scope of the study. For missing covariates before τ, we could choose to limit the study to entries past τ. With fewer subjects and even fewer events, we would lose substantial power. For missing follow-up beyond δ, if a detrimental change emerges at δ, we will overestimate survival chances, especially for later entries into the registry. To avoid such bias, we would need to limit inference to a shorter time window of follow-up, and stop including patients for whom this follow-up cannot be achieved.
There are a number of variations to the deterministic missingness framework presented here that one may encounter in practice. In the notation of Section 3, the set of variables X_2 may be missing for X_0 ≥ τ or for τ_1 ≤ X_0 ≤ τ_2, for given values of τ, τ_1, and τ_2. The latter situation could arise if there is a time interval in the study during which certain variables could not be recorded, for example, for financial reasons, and the τ-, τ_1-, τ_2-boundaries could well be center-specific. More generally, for every component X_2k of X_2, we may have distinct known time intervals (τ_k1, τ_k2) during which variable X_2k has not been recorded. In these cases, extrapolation issues need to be addressed at both ends of the time interval to ensure the observed data distribution is compatible with the extrapolation into the unobserved ends of the domain. Other settings where we may encounter deterministic missingness of covariates are given by cases of linkage failure between registers for certain groups of people, or multi-center studies with poor protocol coordination, given, for example, by regional or hospital-specific differences in the uptake of new (and possibly more sophisticated) measures.

F I G U R E 1
Trends across years of entry in the Swedish Renal Registry per treatment arm (dialysis vs PKT). (A) Unadjusted survival curves. (B) Percentage observed events for patients during the study per year of entry to the register. (C) Mean age at disease onset per year of entry to the register. (D) Percentage patients with hypertension at disease onset per year of entry to the register. The horizontal axes in panels (B-D) represent the years of existence of the register and take values between 1 and 27, corresponding to calendar years 1991 and 2017, respectively

F I G U R E 3
Illustrations of violations of Assumptions 4 and 6. The gray areas represent domains on which we do not observe data. (A) The probability of X_2 conditional on X_0 increases linearly until X_0 = τ, after which it becomes constant; however, we do not observe any data before X_0 = τ and are unable to estimate the linearly increasing trend. (B) The hazard is constant until the study end time (ie, until t = δ), after which it increases linearly with time; however, we do not observe any data beyond t = δ and are unable to estimate the linearly increasing trend

F I G U R E 4
Trends across years of entry in the simulated data per treatment arm (dialysis vs PKT). (A) Mean age at disease onset per entry year. (B) Percentage patients with comorbidity at disease onset per entry year. The horizontal axis represents the years of existence of the register and takes values between 1 and 27

F I G U R E 5
Trends across years of entry in the simulated data per treatment arm (dialysis vs PKT) for Scenarios I and II. (A, B) Mean time of survival in uncensored data per entry year. (C) Percentage observed events during the study per entry year, with administrative censoring, for Scenario I. (D) Percentage observed events during the study per entry year, with administrative censoring, for Scenario II. (E) Percentage observed events during the study per entry year, with random censoring, for Scenario II. The horizontal axis represents the years of existence of the register and takes values between 1 and 27

Panel titles: Scenario II, administrative censoring: causal survival difference, t_0 = 5 (true value: 0.187) and t_0 = 10 (true value: 0.312); Scenario II, random censoring: causal survival difference
T A B L E 2
Estimates of the potential survival probability at t_0 per treatment arm for the methods described in Sections 4.2 and 4.3, applied to the Swedish Renal Registry data. Note: Exposure, A = 1, corresponds to receiving a PKT, and non-exposure, A = 0, corresponds to dialysis. Abbreviations: 95%CI, 95% confidence interval; compl, complete-data analysis; ignore, analysis ignoring missing covariate in fitted models; mice1, mice2, mice3, mice4, imputation models used; s.e., standard error