A flexible approach for causal inference with multiple treatments and clustered survival outcomes

When drawing causal inferences about the effects of multiple treatments on clustered survival outcomes using observational data, we need to address implications of the multilevel data structure, multiple treatments, censoring, and unmeasured confounding for causal analyses. Few off‐the‐shelf causal inference tools are available to simultaneously tackle these issues. We develop a flexible random‐intercept accelerated failure time model, in which we use Bayesian additive regression trees to capture arbitrarily complex relationships between censored survival times and pre‐treatment covariates and use the random intercepts to capture cluster‐specific main effects. We develop an efficient Markov chain Monte Carlo algorithm to draw posterior inferences about the population survival effects of multiple treatments and examine the variability in cluster‐level effects. We further propose an interpretable sensitivity analysis approach to evaluate the sensitivity of drawn causal inferences about treatment effect to the potential magnitude of departure from the causal assumption of no unmeasured confounding. Expansive simulations empirically validate and demonstrate good practical operating characteristics of our proposed methods. Applying the proposed methods to a dataset on older high‐risk localized prostate cancer patients drawn from the National Cancer Database, we evaluate the comparative effects of three treatment approaches on patient survival, and assess the ramifications of potential unmeasured confounding. The methods developed in this work are readily available in the Rpackage riAFTBART.


INTRODUCTION
In cancer research, decision makers are starting to rely more heavily on real world evidence because clinical trials can be enormously expensive, time consuming and restrictive.The increasing availability of observational data sources like large registries and electronic health records provides new opportunities to obtain real world comparative effectiveness evidence.Recent efforts have been made to evaluate the comparative effectiveness of multiple treatment approaches on patient survival for high-risk localized prostate cancer using the large-scale national cancer database (NCDB). 1,2,3,4The complex data structures, however, pose three main challenges for statistical analyses that have not been fully addressed in the extant literature.
First, multiple (i.e., more than two) active treatments are involved.For high-risk localized prostate cancer, there are three popular treatment approaches: radical prostatectomy (RP), external beam radiotherapy (EBRT) combined with androgen deprivation (AD) (EBRT+AD) and EBRT plus brachytherapy with or without AD (EBRT+brachy±AD).Each treatment option has historically been performed on different types of patients as various health and demographic factors are closely linked to treatment choice.Second, treatment and patient information on a large, national population is collected from various treating facilities.The NCDB hospital registry data are collected in more than 1,500 Commission on Cancer accredited facilities.The participating institutions are not selected at random, and there can be substantial institutional variation in treatment effect.Third, some important confounders (pre-treatment variables predicting both treatment and outcome) may not be collected in the observational data.Two known main confounders in high-risk prostate cancer are the number of positive cores and magnetic resonance imaging findings, which are often not collected in the NCDB data or other large cancer registries alike.
Despite numerous recent advances in causal inference, the literature for handling data structures of this type -which arise frequently in population cancer research -is sparse.Causal inference techniques traditionally focus on a binary treatment.There is now a substantial body of research on causal inference methods with multiple treatments and a continuous outcome 5,6,7 or a binary outcome. 8,9,10Ennis et al. 1 and Zeng et al. 4 described propensity score weighting based methods for drawing inferences about causal effects of multiple treatments on censored survival outcomes, but neither work considered the multilevel data structure.There are two main reasons why it might be important to account for the cluster-level, or institutional variation when estimating treatment effect in the general population.First, if there are substantial institutional effects, then a causal analysis ignoring institution would be based on an incorrect model, which can lead to invalid inferences about treatment effects.Second, neither participating institutions nor patients in the registry data were selected at random.With substantial institutional variation, it is unclear exactly what treatment effect would be seen in the general patient population across various institutions with different outcomes.Finally, inferring causal links from observational data inevitably involves the untestable assumption of no unmeasured confounding, which holds that all pre-treatment variables are sufficient to predict both treatment and outcome.If there are unmeasured confounders, it is important to evaluate how departures from the no unmeasured confounding assumption might alter causal conclusions.Sensitivity analysis is useful to address this causal assumption, and is recommended by the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) guidelines. 11Existing sensitivity analysis approaches have largely focused on a binary treatment.Hu et al. 12 recently proposed a flexible Monte Carlo sensitivity analysis approach in the context of multiple treatments and a binary outcome.There is still sparse literature on sensitivity analysis methods that simultaneously accommodate multiple treatments and multilevel censored survival outcomes.
To fill these research gaps, we propose a flexible approach for drawing causal inferences about multiple treatments while respecting the multilevel survival data structure, and develop an interpretable sensitivity analysis approach to evaluate how the drawn causal conclusions might be altered in response to the potential magnitude of departure from the no unmeasured confounding assumption.We propose a random-intercept accelerated failure time (AFT) model utilizing Bayesian additive regression trees (BART) 13 , termed as riAFT-BART.In this model, we use the random intercepts for cluster-specific main effects capturing the variation across the institutions, and leverage the flexibility of the BART model 8,14,15,16 to accurately capture arbitrarily complex relationships among survival times, treatments and covariates.We then develop an efficient Markov chain Monte Carlo algorithm to draw posterior inferences about the population survival effects of multiple treatments.We further propose an interpretable sensitivity analysis approach leveraging our riAFT-BART model in the context of multiple treatments and clustered survival outcomes.Finally, we apply the proposed methods to the NCDB data and elucidate the causal effects of three treatment approaches (RP, EBRT+AD and EBRT+brachy±AD) on patient survival among older and high-risk prostate cancer patients and the impact of unmeasured confounding, as well as examine the institutional effects.
The rest of the paper is organized as follows.Section 2 describes notation and proposes the riAFT-BART method for the estimation of causal effects.Section 3 describes a corresponding sensitivity analysis approach using the model developed in Section 2. Section 4 develops a wide variety of simulation scenarios to examine the practical operating characteristics of our proposed methods against three alternative methods, and presents findings.In Section 5, we apply our methods to NCDB data to estimate the causal effects of three treatment approaches on patient survival among high-risk localized prostate cancer patients, and perform a sensitivity analysis to evaluate how sensitive the estimated causal effects are to possible unmeasured confounding.Section 6 concludes with a discussion.

Notation, definitions and assumptions
Consider an observational study possessing a two-level data structure that has clusters (institutions), each having treated individuals, indexed by = 1, … , , = 1, … , .The total number of individuals in the study is = ∑

=1
. Our goal is to infer the causal effect of treatment ∈ = { 1 , … , } on time to failure , where is the total number of treatment options.For each individual in cluster , there is a vector of pre-treatment measured covariates , and let be the individual's failure time, which may be right censored at .The observed outcome consists of = min( , ) and the censoring indicator Δ = ( < ).Let be the cluster indicators.There are no cluster-level covariates in our study, but our work can be extended to include them in .We proceed in the counterfactual framework.The counterfactual failure time under treatment for individual in cluster is defined as ( ), ∀ ∈ .We similarly define ( ) as the counterfactual censoring time under treatment .Throughout, we maintained the standard assumptions for drawing causal inference with observational clustered survival data: 3,17,18 ) is referred to as the "no unmeasured confounding" assumption.Because this is an untestable assumption, we will develop a sensitivity analysis approach in Section 3 to gauge the impact of violations of this assumption.Assumption (A3) requires that the treatment assignment is not deterministic within each strata formed by the covariates. 19This assumption can be directly assessed by visualizing the distribution of estimated generalized propensity scores.Finally, Assumption (A4) states that the counterfactual survival time is independent of the counterfactual censoring time given pre-treatment covariates, cluster-level variables and treatment variable.This condition directly implies ⨿ | , , , and is akin to the (conditionally) independent censoring assumption in the traditional survival analysis literature. 20e define the causal estimands directly in terms of counterfactual survival times.Alternatively, one can define causal estimands based on functionals (e.g., median) of the counterfactual survival curves. 3In this paper, we focus on the average treatment effect (ATE) defined either over the sample or the population.Consider a pairwise comparison between treatments and ′ .Common sample estimands are the sample average treatment effect (SATE), Common population estimands are the population average treatment effect (PATE), is another estimand that preserves some of the properties of the previous two. 14As our methods are developed in a Bayesian framework, CATE is a natural estimand to use in this paper. 14We obtain the sample marginal effects by averaging the individual conditional expectation of the counterfactual survival times [ ( ) − ( ′ ) | , ] across the empirical distribution of { , } , =1, =1 . 8Another causal estimand of interest is the average treatment effect on the treated (ATT).By averaging the differenced counterfactual survival times over those in the reference group, we can define the ATT counterparts of all three estimands described above.For brevity of exposition, we focus on CATE in this paper, but our methods can be straightforwardly extended for the ATT effects.For example, the conditional average treatment effect among those who received treatment is the size of the reference group .

The riAFT-BART model for clustered survival data
For clustered survival data, we propose the following random-intercept AFT model utilizing the likelihood-based machine learning technique BART: where ( , ) is an unspecified function relating treatments and covariates to survival times , 's are the random intercepts for cluster-specific main effects capturing the institutional variation, and is the residual term.We use BART to flexibly model the unknown function by a sum of shallow trees ( , , where  ℎ is the ℎth binary tree structure,  ℎ = ( 1ℎ , … , ℎ ℎ ) is the set of ℎ terminal node parameters associated with tree structure  ℎ .For a given value and in the predictor space, the assignment function ( , ;  ℎ ,  ℎ ) returns the parameter ℎ , ∈ {1, … , ℎ } associated with the terminal node of the predictor subspace in which { , } falls.We will place regularizing priors on { ℎ ,  ℎ } to keep the impact of each individual tree on the overall fit small and prevent overfitting. 13,14,21We assume a meanzero normal distribution for and independence between and with variance 2 and 2 , respectively.Here we adopt the parameter expansion technique 22 and introduce a redundant parameter for the variance of to improve computational performance of our proposed Markov chain Monte Carlo (MCMC) algorithm (Section 2.3).
Model (1) has two main advantages.First, it allows for direct specification of treatment effect on changes in life expectancy, which substantially facilitates the interpretability of an sensitivity analysis.Second, unlike the proportional hazards regression, the AFT model formulation can naturally incorporates BART to not only flexibly capture arbitrarily complex functional form of ( , ) but also provide coherent inferences based on a probability model and proper representations of uncertainty intervals via the posterior.
We decompose the joint prior distribution as Following Chipman et al. 13 and Hendersen et al., 23  ], which is sensible. 23As suggested in Chipman et al., 13 , we use an inverse gamma distribution ( 2 , 2 ) as the prior for 2 and the default value for ( = 3), and defer to the defauls for other hyperparameters of the BART trees { ℎ ,  ℎ }.We place a prior (1, 1) on 2 and . 22For initial values, we first set an initial random intercept ̂ (0) to be the mean of the lognormal residuals from a parametric AFT model with as the predictors for each cluster .We then set the initial value (0) to be the standard deviation of the pooled model residuals over the clusters, and initialize as the value such that < (0) ; , (0) = 0.9.We use data augmentation to deal with right censoring. 23,24Working with the centered responses , when Δ = 0, we impute the unobserved and centered survival times from a truncated normal distribution: in each Gibbs iteration, where log ∼ ( , ) + , 2 .The centered complete-data survival times are

Posterior inferences for treatment effects
Here we employ a Metropolis within Gibbs procedure for posterior inferences about treatment effects on patient survival.Using the centered complete-data survival times , , the joint posterior is We obtain the posterior of 2 as Complete derivation of the posterior distributions are provided in Web Section S1.We now describe our Metropolis within Gibbs procedure to draw from the posterior distribution of our proposed riAFT-BART model (1).A single iteration of our sampling algorithm proceeds through the following steps: Algorithm 1 A single iteration of riAFT-BART sampling algorithm 1. Update , 2 and from their respective posterior distributions.

Using log
, − as the responses and { , } as the covariates, update BART sum-of-trees model via parameters ℎ and 2 , using the Bayesian backfitting approach of Chipman et al. 13 Directly update ( , ) using the updated BART model, for = 1, … , , = 1, … , .Because we use the centered responses log( ) = log( ) − ̂ in posterior computation, we add ̂ back to the posterior draws of ( , ) in the final output.To draw posterior inferences about the CATE effects via riAFT-BART, we note that under the causal assumptions (A1)-(A4), This allows us to estimate treatment effect via outcome modeling.Specifically, where is the ℎ draw from the posterior distribution of .Inferences can be obtained based on the posterior average treatment effects, (1∕ )

Overview
The estimation of causal effects with observational data relies on the weak unconfoundedness assumption (A2), which cannot be verified empirically.Violations of this assumption can lead to biased treatment effect estimates.One widely recognized way to address concerns about violations of this assumption is sensitivity analysis.In fact, the STROBE guidelines recommend observational studies be accompanied by sensitivity analysis investigating the ramifications of potential unmeasured confounding. 11Many sensitivity analysis methods have been developed, including Rosenbaum's Γ, 25 external adjustment, 26 confounding functions 27,12 and the E value, 28 to name a few.These methods differ in how unmeasured confounding is formulated and parameterized.Sensitivity analysis approaches in the context of multiple treatments and clustered censored survival outcomes are an underdeveloped area.
With multilevel data, there can be unobserved confounders at both cluster-and individual-level.It has been shown in the literature that with propensity score based methods, the fixed-effects model for propensity score estimation automatically controls for the effects of unmeasured cluster-level confounders. 29,30,31In situations where the cluster sizes are small, a random-effects propensity score model may provide more accurate effect estimates, but is reliant on inclusion of important cluster-level covariates as regressors.As the cluster size increases, results from the random-effects model converge to those from a corresponding fixed-effects model.Although there is sparse literature on whether accounting for the clustered structure in potential outcome models would protect against misspecification due to cluster-level confounders, the outcome model is connected to the propensity score model in that the sufficient statistics (treatment group means of covariates) that must be balanced to eliminate confounding differences under both models are the same. 30Li et al. 30 conducted a simulation to show that ignoring the clustered data structure in both the propensity score and outcome models would lead to biased ATE estimates; respecting the structure in at least one of the models gives consistent estimates.Based on these grounds and that we are dealing with large clusters of a national database, we believe our riAFT-BART model (1) -which will converge to a fixed-effects model with large cluster sizes -will represent heterogeneity in cluster-level unmeasured confounding by the random effects { }.We then assume by conditioning on { } and { }, the potential outcome and treatment at cluster-level are independent, and introduce our sensitivity analysis approach for individual-level unmeasured confounding.
Our sensitivity analysis approach is along the line of work by Hu et al., 12 and is based within the framework of confounding function. 27,32The confounding function based methods have the advantage of avoiding introducing a hypothetical unmeasured confounder and making an assumption about its underlying structure, on which there is a lack of consensus, and are preferred when the primary interest is in understanding the total effect of all unmeasured confounders. 32,12

Confounding function adjusted treatment effect estimates
For notational brevity, we suppress the subscript denoting individual.Following Brumback et al. 32 and Hu et al., 12 we first define the confounding function for any pair of treatments ( , ′ ) as This confounding function directly represents the difference in the mean potential log survival times under treatment between those treated with and those treated with ′ , who have the same level of .Under the assumption of no unmeasured confounding, given measured individual-and cluster-level covariates and , the potential outcome is independent of treatment assignment.Had they received the same treatment , their mean potential survival times would have been the same, or ( , ′ , , ) = 0, ∀{ , ′ } ∈ .When this assumption is violated and there exists unmeasured confounding, the causal effect estimates using measured confounders will be biased.The bias in the estimated treatment effect ̂ , ′ takes the following form: Bias where Given known confounding functions , we can construct the confounding function adjusted estimators by first modifying the actual survival time , and then estimating the causal effect by fitting our riAFT-BART model to modified outcomes.In this way, the bias in equation ( 5) will be effectively removed from the adjusted effect estimate.Because the survival times may be right censored and we deal with right censoring using data augmentation, the outcome modification can be implemented on the complete-data survival times.We propose a Monte Carlo sensitivity analysis approach along the line of work by Hu et al., 12 which was developed for binary outcomes.We extend their work to accommodate multilevel censored survival outcomes.Our sensitivity analysis proceeds with steps listed in Algorithm 2.
Algorithm 2 Sensitivity analysis algorithm 1. Fit a multinomial probit BART model MBART ( | , ) to estimate the generalized propensity scores, from the prior distribution of each of the confounding functions ( , , , ), for each ≠ ∧ ∈ .
4. For each treatment , adjust the centered complete-data survival times (Step 2 of Algorithm 1) as follows: for each of ; ≠ ∧ ∈ }.

5.
Run Algorithm 1 for riAFT-BART on each of 1 × 2 sets of observed data with log .Estimate the combined adjusted causal effects and uncertainty intervals by pooling posterior samples across model fits arising from the 1 × 2 data sets.
Note that in Step 1 of Algorithm 2, we can fit a flexible fixed-effects multinomial probit BART model for generalized propensity scores.Steps 2 and 3 constitute a nested multiple imputation, 33 which is used to draw samples for the product term = | = , = ( , , , ) in equation ( 6).
Step 4 "corrects" the complete-data survival times to adjust the treatment effect estimate for individual-level unmeasured confounding.This is because the causal effect is defined as the between-group difference in mean potential outcomes and is estimated based on the observed outcomes.To correct the bias in equation ( 5) due to individual-level unmeasured confounding, we adjust the actual survival time of an individual who received treatment as Because the survival time can be right censored, we can replace the centered complete-data survival time log , used in our riAFT-BART sampling Algorithm 1 for treatment effect estimation with adjusted log as in equation (6).Web Section S2 provides a detailed justification of this strategy for obtaining confounding function adjusted causal effect estimates.Finally, we obtain the overall estimates of the adjusted causal effect and sampling variance as the posterior mean and variance of the pooled posterior samples across the 1 × 2 data sets. 34We used 1 = 30 and 2 = 30 when implementing our sensitivity analysis Algorithm 2 in both simulation (Web Section S3) and case study (Section 5).
The confounding functions are not identifiable from the observe data.We can assume the form of confounding function to represent our prior beliefs about the possible direction and magnitude of the effect of unmeasured confounding.We follow strategies discussed in earlier work 27,32,35,12 to specify the signs and bounds of the confounding functions.For example, by assigning ( , ′ , | , ) > 0 and ( ′ , , | , ) < 0, we assume the unmeasured factors tend to lead clinicians to systematically prescribe to healthier patients relative to ′ , because patients treated with will on average have longer potential survival time to both and ′ than patients treated with ′ .Web Table 1 presents interpretations of confounding functions with other specifications of signs.When setting the bounds, we assume the unmeasured confounding would account for less than units of the remaining standard deviation unexplained by measured confounders .Using the NCDB data in our case study as an example, a specification of (1, 2 | , ) ≤ ̂ = 0.90 and (2, 1 | , ) ≥ − ̂ = −0.90assumes that patients assigned to RP will on average have exp(0.90)= 2.45 months longer potential survival times than patients assigned to EBRT+AD to both treatment options; and therefore clinicians tend to prescribe RP to healtier patients.This bound of = 1 unit of remaining standard deviation is a plausible assumption.In the NCDB data, the median survival time was 7.7 and 7.8 years for EBRT+AD and RP group respectively and was not reached for the EBRT+brachy±AD treatment group (Web Figure 1).As suggested by Hu et al. 12 we draw the values of confounding functions from the uniform distribution.

Comparison methods
Through a contextualized simulation, we investigate the practical operating characteristics of our proposed method riAFT-BART.We also adapt the popularly used inverse probability weighting method into the setting of clustered and censored survival data to form two comparison methods: inverse probability of treatment weighting with the random-intercept Cox regression model (IPW-riCox) and doubly robust random-intercept additive hazards model (DR-riAH).In addition, we consider another outcome modeling based method, random-intercept generalized additive proportional hazards model (riGAPH), 36,3 that is flexible at capturing nonlinear relationships.We use the counterfactual survival curve as the basis to objectively compare methods.Note that we can derive the individual survival curve corresponding to our riAFT-BART model (1) as We can define the causal estimands by contrasting the conditional survival probability up to a fixed time * , or by comparing the conditional restricted mean survival time (RMST). 37For some arbitrary time bound * , the RMST can be represented as the area under the survival curve ( ) up to * , RMST = ∫ * 0 ( ) .In our simulation, we present results based on both metrics, because in our motivating prostate cancer research question, 5-year survival and RMST are of most clinical relevance. 1or the DR-riAH method, following suggestions by Li et al., 30 we obtain the DR-riAH treatment effect estimator based on the survival probability at * as where is the estimated generalized propensity score, the observed survival probability is the Kaplan-Meier estimator and the predicted counterfactual survival probability ( ′ ) > * is calculated from the random-intercept additive hazards model. 38For CATE effects based on RMST, we replace the observed (counterfactual) survival probability with the area under the observed (counterfactual) survival curve.
To assess the performance of each method, we compare the relative bias defined as , where 0 , ′ is the true treatment effect, and the frequentist coverage probability for ̂ , ′ in terms of 5-year survival probability and 5-year RMST among 250 data replications.
When implementing the comparison methods, for weighting based methods IPW-riCox and DR-riAH, we used Super Leaner 39 to estimate the stabilized inverse probability of treatment weights for improved modeling flexibility and accuracy of the estimated weights.Super Learner was implemented via the R package SuperLearner with SL.library = c("SL.xgboost","SL.bartMachine","SL.gbm").We fitted a weighted random-intercept Cox regression model using the coxme function of R package coxme to obtain the IPW-riCox estimator.For DR-riAH, we fitted the random-intercept additive hazards model using the aalen function from R package timereg to compute the counterfactual outcomes used in the DR-riAH estimator.To implement ri-GAPH, we used the gam function from R package mgcv and two helper functions (as_ped and add_surv_prob) from R package pammtools.RMST was calculated by the trapezoidal rule using the R function rmst of RISCA package.For all methods, the same confounders available to the analyst were used in the linear forms in the corresponding models.

Simulation design
Our data generating processes are contextualized in NCDB data settings.We generate = 20 clusters, each with a sample size of = 500, and the total sample size is = 10000.We simulate 10 confounding variables, with five continuous variables independently generated from the standard normal distribution ∼ (0, 1), = 1, … , 5, two categorical variables independently generated from the multinomial distribution ∼ Multinomial(1, .3,.3,.4),= 6, 7 and three binary variables 8 ∼ Bern(0.6), 9 ∼ Bern(0.4), 10 ∼ Bern(0.5)generated for each individual in cluster .Throughout we consider three treatment groups.The treatment assignment mechanism follows a random intercept multinomial logistic regression model, where ∼ (0, 1 2 ), denotes the nonlinear transformations and higher-order terms of the predictors , and 1 , 2 and 1 , 2 are respectively vectors of coefficients for the untransformed versions of the confounders and for the transformed versions of the confounders captured in .The intercepts 01 and 02 control the ratio of units across three treatment groups, for which we use 6:3:1 to mimic the ratio of individuals in the NCDB data.
We generate the potential survival times from a Weibull survival curve, where is a treatment-specific vector of coefficients for and for , ∀ ∈ {1, 2, 3}.We explicitly avoid generating survival times from a lognormal AFT model to assess the robustness of the assumption of lognormal residuals in our riAFT-BART model formulation in equation ( 1).The parameter is set to 2 and exp(0.7 + 0.5 1 ) to respectively produce proportional hazards (PH) and nonproportional hazards (nPH).Three sets of non-parallel response surfaces are generated as for ∈ {1, 2, 3}, where is a random variable following the uniform distribution on the interval [0, 1], ∼ (0, 4 2 ), = {3000, 1200, 2000} for = 1, 2, 3. Observed and uncensored survival times are generated as = ∑ ∈{1,2,3} ( ) ( = ).We further generate censoring time independently from an exponential distribution with the rate parameter selected to induce two different censoring proportions: 10% and 40%.
The data generating processes will produce four configurations: (PH vs. nPH) × (10% censoring proportion vs. 40% censoring proportion).Detailed model specifications for treatment assignment, as in model (8), and for outcomes, as in model (10), are given in Web Table 2. Web Figure 2 presents the Kaplan-Meier survival curves stratified by treatment for each of four data configurations in our simulation.The assessment of covariate overlap displayed in Web Figure 3 suggests there is moderate to strong overlap across three simulated treatment groups, which represents the overlap in the NCDB dataset (Web Figure 4).
We conduct an illustrative simulation, using one individual-level binary measured confounder and one individual-level binary unmeasured confounder, to examine how our sensitivity analysis approach performs in comparison to two causal analyses: (i) including unmeasured confounders and (ii) ignoring unmeasured confounders.Simulation design details are provided in Web Section S3.

Results
Figure 1 displays boxplots of relative biases in three treatment effect estimates ̂ 1,2 , ̂ 1,3 and ̂ 2,3 based on 5-year RMST, among 250 simulations under four data configurations.Our proposed method riAFT-BART boasts the smallest biases and variability in all treatment effect estimates across all simulation scenarios, followed by DR-riAH and riGAPH; while IPW-riCox yields the largest biases and variability.When the censoring proportion increases, all four methods show decreased performance.The violation of proportional hazards has the largest impact on the IPW-riCox method, demonstrated by the largest bias increase in CATE estiamtes, but only a small impact on riAFT-BART.Even though our method does not require proportional hazards, the elevated data complexity in the non-proportional hazards setting -the shape parameter in the outcome model (10)  is covariate-dependent -may have contributed to the slight increase in the biases.The bias assessment based on 5-year survival probability is provided in Web Figure 5, which conveys the same messages as the RMST based results.
Table 1 presents the frequentist coverage probability of each of four estiamtors for each simulation configuration.The proposed method riAFT-BART provides nominal frequentist coverage probability under proportional hazards with 10% censoring proportion.Even in the most complex data settings with nonproportional hazards and 40% censoring proportion, riAFT-BART still provides close-to-nominal frequetist coverage probability.By comparison, the IPW-riCox estimator is the least efficient producing unsatisfactory coverage probability; DR-riAH and ri-GAPH deliver similar coverage probabilities around 0.8.Web
Table 3 examines the frequentist coverage probability of the estimators based on 5-year survival probability, and we observed the same differences across the estimators demonstrated in the RMST based results.
Web Figure 6 suggests that our riAFT-BART algorithm converges well by plotting 3500 posterior draws of the variance parameters and , and cluster-specific parameter and the random intercepts for clusters = 1, = 10 and = 20.
The illustrative simulation for sensitivity analysis, displayed in Figure 2 , empirically supports our proposed sensitivity analysis Algorithm 2. Given the known confounding functions, our sensitivity analysis estimators are similar to the results that could be achieved had the unmeasured confounders been made available to the analyst.The naive analysis where we ignored the unmeasured confounding produced substantially biased estimators.

APPLICATION TO PROSTATE CANCER
We applied the proposed method riAFT-BART to estimate the comparative effectiveness of three treatment approaches, RP, EBRT+AD and EBRT+brachy±AD, on patient survival among high-risk localized prostate cancer patients who are older than 65 years of age.We then applied the proposed sensitivity analysis approach to evaluate how the causal conclusions about treatment effects would change in response to various degrees of departure from the no unmeasured confounding assumption.
The analysis dataset was drawn from the NCDB and included 23058 high-risk localized prostate cancer patients who were older than 65 when diagnosed between 2004 and 2015.Among these patients, 14237 received RP, 6683 undertook EBRT+AD with at least 7920 cGy EBRT dose 3 and 2138 were treated with EBRT+brachy±AD.Included in the dataset are pre-treatment  4.
Under the assumption of no unmeasured confounding, the treatment effect estimates, shown in Table 2 , suggest that RP is the most beneficial treatment, which would on average lead to a ratio of 1.5 (1.3, 1.7) in the expected survival time compared to EBRT+AD and a ratio of 1.2 (1.1, 1.4) compared to EBRT+brachy±AD.Between the two radiotherapy approaches, EBRT +AD leads to a shorter expected survival time that is 0.9 (0.7, 1) times the expected survival time for EBRT+brachy±AD.Figure 3 presents the posterior mean of the predicted counterfactual survival curves for each of three treatment groups.After correcting for confounding and accounting for the variability in location effects, our riAFT-BART estimators suggest increased treatment benefit associated with RP and reduced survival for EBRT+AD compared to the unadjusted Kaplan-Meier estimators.A perusal of the posterior distribution of the random intercepts 's displayed in Figure 4 suggests that there was substantial variability in the location effect.Hospitals in New England had significantly better outcomes (longer expected survival times) than hospitals in East Central area.
It is possible that some important confounders were not collected in the NCDB data.For example, patient functional status has been shown to be strongly associated with both treatment choices and survival among men with prostate cancer. 40,41The performance status measured by Eastern Clinical Oncology Group (ECOG) score may also be a confounder as it both affects the likelihood of clinicians choosing AD 42 and predicts survival for prostate cancer. 43Futhermore, magnetic resonance imaging findings and the number of positive biopsy cores are both likely confounders as they are related to patient selection for RP or brachytherapy and indicate the degree of agressiveness of high-risk prostate cancer. 2,44o evaluate the sensitivity of the treatment effect estimates to these unmeasured confounders, we first leverage the subjectarea literature to specify the confounding functions.For the sign, we assume that the unmeasured factors guiding clinicians to prescribe RP lead them systematically to prescribe it to relatively healthier patients.This is because magnetic resonance imaging findings supportive of resectability were used for patient selection for RP, 2 and RP was recommended to those with lower number of positive biopsy cores, better functional score and better performance status. 40,45Between the two radiotherapy based treatment approaches, on the one hand, unhealthier patients with lower functional scores or ECOG scores were not recommended to use AD in treatment as they would not tolerate strong side effects induced by AD. 42,46 On the other hand, brachytheray has been recommended as a boost to EBRT only to relatively unhealthier patients with multiple positive biopsy cores. 44Based on these pieces of evidence, we assume clinicians may have a preference to recommend EBRT+AD over EBRT+brachy±AD to healthier patients, but other directions of unmeasured confounding may also be plausible.
We next postulate the bounds of confounding functions based on published scientific work.Lehtonen et al. 43 shows that ECOG score has a large effect (Cohen's > 1) and number of positive biopsy cores has a small effect (Cohen's < 0.2) on prostate cancer patient survival.Loosely translating Cohen's to the proportion of the total variation in the outcome explained by a given covariate, 47 we assume the unmeasured confounders approximately account for 25% of the total variance in overall survival.Following Chan et al., 48 we fitted an AFT log-normal model to the NCDB data with fixed effects for locations, and computed the 2 to be 60%.This suggests that the unmeasured confounders will explain no more than 40% of the variation in the outcomes.Based on these grounds, we assume that the unmeasured confounding would account for = 0.75 units of the remaining standard deviation unexplained by measured variables.The remaining standard deviation in the outcome unexplained by measured covariates was estimated to be ̂ = 0.90 (months) via our riAFT-BART model.We hence set the bounds of the confounding functions to be ± ̂ = ±.675(months).
Table 2 displays the sensitivity analysis results in comparison to treatment effect estimates under the weak unconfoundedness assumption.We assume relatively healthier patients were assigned to RP:  1 for interpretations).Results show that the significant treatment benefit associated with RP over EBRT+AD is robust to different magnitudes and directions of unmeasured confounding.However, the significant gain in the expected survival time offered by RP in the causal analysis assuming weak unconfoundedness is negated in the presence of unmeasured confounding.Turning to the comparative effect between EBRT+AD and EBRT+brachy±AD, under the assumption of no unmeasured confounding, EBRT+brachy±AD had a survival benefit bordering on being statistically significant over EBRT +AD.Assuming unmeasured factors guiding clinicians to prescribe EBRT+AD lead them systematically to prescribe it to relatively healthier patients "tips" the result over to significant benefit for EBRT+brachy±AD.

SUMMARY AND DISCUSSION
The increased availability of large-scale healthcare databases allows researchers to conduct comparative effectiveness analysis of modern treatment approaches for high-risk cancer patients.Recent efforts have been made to investigate the effects of surgical and radiotherapy based treatments on patient survival using the national cancer databases.However, the multilevel data structure presented in these databases and its implications for causal analyses require special statistical consideration but have not been well studied.In addition, there is the lack of tools for evaluating the sensitivity of treatment effect estimates to the presence of unmeasured confounding in the context of multiple treatments and multilevel survival data.
Motivated by these research gaps, our work makes two primary contributions to the causal inference literature.First, we develop a flexible causal modeling tool and MCMC algorithm for causal inferences about effects of multiple treatments on patient survival while respecting the multilevel data structure.The proposed riAFT-BART model flexibly captures the relationship between survival times and individual-level covariates and cluster-specific main effects, while providing proper representations of uncertainty intervals via the posterior based on a probability model.Second, leveraging the flexible riAFT-BART model, we develop an interpretable sensitivity analysis algorithm to address the causal assumption of no unmeasured confounding.In our sensitivity analysis approach, we present the confounding function on the basis of expected survival time, which can be easily interpreted, and propose methods to adjust the estimation of causal effects by effectively removing the bias due to posited levels of unmeasured confounding.
Applying our proposed methods to NCDB data on older high-risk localized prostate cancer patients, we confirmed the survival benefit of RP relative to EBRT+AD.Inferences about other two pairwise treatment effects were inconclusive because they were impacted by the potential unmeasured confounding.Our causal analysis also demonstrates that there is substantial variability in the effects of hospital locations, which reinforces the importance of examining the cluster-level variation when estimating treatment effect in the general population using data with hierarchical structure.
There are several important avenues for future research.First, our riAFT-BART model can be extended to include the random slopes and accommodate the cluster-level covariates.Second, developing a sensitivity analysis for cluster-level unmeasured confounding could be a worthwhile and important contribution.Third, although our simulation results suggest that our methods are robust to the normality assumption for the random intercepts and residuals, it may be worthwhile to develop nonparametric priors to further improve the modeling flexibility.Finally, to address the causal assumption of positivity, we can extend the work by Hill and Su 49 and Hu et al. 8 to develop a strategy to identify a common support region for inferential units.
Web-based Supplementary Materials for "A flexible approach for causal inference with multiple treatments and clustered survival outcomes" by Hu et al.
3) For the posterior of α k , used for parameter expansion 4) For the posterior of For the posterior of µ lh , we first draw the Gibbs sample of b k , α i , τ 2 , σ 2 seperately from their respective posterior distribution.Then using the updated b k , we obtain Ỹ S2 Technical details for sensitivity analysis in Section 3.2 The derivation of the bias formula Bias(a j , a j | x, v) in equation ( 5) follows the proof of Theorem 2.1 in ?.
Under the weak unconfoundedness assumption (A2), ignoring individual-level unmeasured confounding will lead to the following bias in the estimate of the causal effect CAT E a j ,a j , To simplify notation, we will use we have Similary, Hence, Repeatedly using E [T (a l ) | a l , x, v] = E [T | a l , x, v] , ∀l ∈ {1, . . ., J} to rewrite the last two items of the RHS of the equation (1), we have Let p = 1 − p j − p j .By rewriting p j E T (a j ) − T (a j ) | a j , x, v in equation ( 2), we have We now show how we arrive at equation ( 6) for adjusting the responses so that the bias in equation ( 5) due to individual-level unmeasured confounding will be effectively removed from the confounding function adjusted treatment effect estimate.
Because the causal effect is defined as the between-group difference in mean potential outcomes and is estimated based on the observed outcomes.To correct the bias in equation ( 5) due to individual-level unmeasured confounding, we adjust the actual survival time T of an individual who received treatment a j as Let By applying the law of total expectation to E [T (a j ) | x, v], we can rewrite the second quantity of the RHS of equation ( 3) as This implies that we will replace T with T CF as log T CF = J m =j p m c(a j , a m , x, v), which is equation (6).Now we prove that replacing T with T CF removes the bias in equation (5).Consider the causal effect between any pair of treatments a j and a j .Using the adjusted survival times T CF , the estimate of the causal effect is

S4 Supplementary Tables and Figures
Web Table 1: Interpretation of assumed priors on c(a j , a j , | x, v) and c(a j , a j | x, v) for average treatment effect based on log survival time.

Prior assumption
Interpretation and implications of the assumptions c(a j , a j | x, v) c(a j , a j | x, v) > 0 < 0 Individuals treated with a j will on average have longer potential survival time to both a j and a j than individuals treated with a j ; i.e. healthier individuals are treated with a j .< 0 > 0 Contrary to the above interpretation, unhealthier individuals are treated with a j .< 0 < 0 The potential survival time to a j (a j ) is shorter among those who choose it than among those who choose a j (a j ).Thus, the observed treatment allocation between these two approaches is undesirable relative to the alternative which reverses treatment assignment for everyone.> 0 > 0 Contrary to the above interpretation, the observed treatment allocation between these two approaches is beneficial relative to the alternative which reverses treatment assignment for everyone.
Web Table 2: Specifications of treatment assignment model (A-model) in equation ( 8) and outcome generating model (T -model) in equation (10); and coefficients ξ L 1 , ξ L 2 and ξ N L 1 , ξ N L 2 of the A-model and β L a j and β N L a j of the T -model, a j ∈ {1, 2, 3}.We set ξ 01 = 0.9 and ξ 02 = −1.0 to generate the 6:3:1 ratio of unit across three treatment groups.

A-model T -model
Variables a j = 1 a j = 2 a j = 1 a j = 2 a j = 3

FIGURE 3
FIGURE 3The posterior mean of the counterfactual survival curves for each of three treatment groups in NCDB data.The solid curves are the average by treatment group of the individual-specific survival curves estimated following equation(7).The dashed survival curves are the Kaplan-Meier estimates for each treatment group.Solid gray curves are estimates of individual-specific survival curves for 15 randomly selected patients from three treatment groups.

FIGURE 4 TABLE 2
FIGURE 4The location effect in terms of the log survival time in months represented by the posterior mean and credible intervals of the random intercept , = 1, … , 9.

Web Figure 2 : 3 =Web Figure 6 :
Kaplan-Meier survival curves for three treatment groups generated in our simulation study in Section 4. Panels A-D respectively represent scenarios corresponding to proportional hazards with 10% censoring, proportional hazards with 40% censoring, non proportional hazards with 10% censoring and non proportional hazards with 40% censoring.Web Figure 5: Relative biases among 250 replications for each of four methods, IPW-riCox, DR-riAH, riGAPH and riAFT-BART, and three treatment effects CAT E 1,2 , CAT E 1,3 and CAT E 2,3 based on 5-year survival under four data configurations: (proportional hazards vs. nonproporitonal hazards) × (10% censoring proportion vs. 40% censoring proportion).The true treatment effects under proportional hazards are CAT E 0,P H 1,2 = 0.31, CAT E 0,P H 1,3 = 0.16 and CAT E 0,P H 2,−0.15.The true treatment effects under nonproportional hazards are CAT E 0,nP H 1,2 = 0.32, CAT E 0,nP H Assessing convergence of the chain by plotting 3500 posterior draws of the variance parameters τ and σ, and cluster-specific parameter α k and the random intercepts b k for clusters k = 1, k = 10 and k = 20.

TABLE 1
The coverage probability for three treatment effect estimates ̂ 1,2 , ̂ 1,3 and ̂ 2,3 based on 5-year RMST under four data configurations: (proportional hazards vs. nonproporitonal hazards) × (10% censoring proportion vs. 40% censoring proportion).age,raceand ethnicity, insurance status, income, education level, clinical T stage, year of diagnosis, prostate-specific antigen and gleason score, and geographic locations of treating facilities.There are nine hospital locations, which were considered as the clusters in our analysis.Detailed descriptions of the individual-and cluster-level variables (hospital locations) are presented in Web Table4.Covariates are deemed to have good overlap across three treatment groups based on the estimated generalized propensity scores shown in Web Figure : liangyuan.hu@rutgers.eduS1 Posterior distributions of µ lh , σ 2 , b k , α k and τ 2 in riAFT-BART 1) For the posterior of σ 2 , since we have σ 2 ∼ IG ν 2 , νλ 2 , we obtain email

Table 4 :
Descriptions of pre-treatment variables and hospital locations (clusters) for each of three treatment groups in NCDB data.