A new method for clustered survival data: Estimation of treatment effect heterogeneity and variable selection

We recently developed a new method, the random-intercept accelerated failure time model with Bayesian additive regression trees (riAFT-BART), to draw causal inferences about the population treatment effect on patient survival from clustered and censored survival data while accounting for the multilevel data structure. The practical utility of this method goes beyond the estimation of the population average treatment effect. In this work, we exposit how riAFT-BART can be used to solve two important statistical questions with clustered survival data: estimating treatment effect heterogeneity and performing variable selection. Leveraging likelihood-based machine learning, we describe how posterior samples of the individual survival treatment effect can be drawn from riAFT-BART model runs and used in an exploratory treatment effect heterogeneity analysis to identify subpopulations whose treatment effects may differ from the population average. There is sparse literature on methods for variable selection among clustered and censored survival data, particularly ones using flexible modeling techniques. We propose a permutation-based approach using each predictor's variable inclusion proportion, supplied by the riAFT-BART model, for variable selection. To address the missing data issue frequently encountered in health databases, we propose a strategy that combines bootstrap imputation and riAFT-BART for variable selection among incomplete clustered survival data. We conduct an expansive simulation study to examine the practical operating characteristics of our proposed methods, and provide empirical evidence that they perform better than several existing methods across a wide range of data scenarios. Finally, we demonstrate the methods via a case study identifying predictors of in-hospital mortality among severe COVID-19 patients and estimating the heterogeneous treatment effects of three COVID-specific medications.
The methods developed in this work are readily available in the R package riAFTBART.


Introduction
Large-scale healthcare datasets garnered from multiple facilities are increasingly available, offering fertile ground for innovative investigation into emerging medical research questions. The inherent hierarchical data structure, in conjunction with the censored survival outcome, which is of strong clinical interest, poses considerable challenges for statistical analysis. Hu et al. (2022) recently developed a flexible approach, riAFT-BART, to estimate the population average treatment effect on patient survival while accounting for the clustering structure of the observations. The method riAFT-BART stands for random-intercept accelerated failure time model with Bayesian additive regression trees (BART) (Chipman et al., 2010). The utility of this new method, which leverages Bayesian machine learning, goes beyond the estimation of the population average treatment effect. In this article, we exposit how riAFT-BART can be used to solve two important statistical questions with clustered survival data: estimating treatment effect heterogeneity and performing variable selection.
An immediate extension of the new method, riAFT-BART, beyond the population average treatment effect is the estimation of the heterogeneous treatment effect. Prior work (Hu et al., 2021) has shown that, compared to algorithm-based machine learning, leveraging the likelihood-based machine learning technique BART can provide more accurate estimation of the individual survival treatment effect, which in turn can be used to assess treatment effect heterogeneity. To the best of our knowledge, no work has yet discussed how BART can be used to estimate the heterogeneous effects of multiple treatments on patient survival using clustered survival observations.
Another important utility of riAFT-BART is for variable selection with clustered survival data, on which there is sparse literature. Fan and Li (2002) proposed a family of variable selection methods for the gamma frailty model based on a nonconcave penalized likelihood approach using the Newton-Raphson method. Androulakis et al. (2012) extended the penalized Cox or gamma frailty model to a class of frailty models and proposed a generalized form of the likelihood function to allow the use of different continuous distributions for the frailty term. Utazirubanda et al. (2021) proposed group LASSO with gamma-distributed frailty for variable selection in Cox regression. However, no corresponding software is available for these methods. Bender et al. (2018) proposed a piece-wise exponential additive mixed model (PEAMM) for clustered survival data, and Marra and Wood (2011) described a practical approach by which shrinkage penalties can be added to the smoothing parameters in PEAMM and variable selection can be performed using restricted maximum likelihood estimation. Rondeau et al. (2003) proposed to conduct variable selection by maximizing the penalized likelihood using the Marquardt algorithm (Marquardt, 1963), which is a combination of the Newton-Raphson algorithm and the steepest descent algorithm. Ha et al. (2014) proposed a penalized h-likelihood estimation approach for variable selection of fixed effects using semiparametric frailty models. These methods rely on parametric assumptions about the exact relationships between covariates and survival outcomes. Misspecifying the parametric forms of covariate-outcome relationships can reverberate through the variable selection procedure and yield unsatisfactory results (Hu et al., 2021; Bleich et al., 2014).
Flexible machine learning modeling techniques can help relax the parametric assumptions and improve variable selection results. Ishwaran et al. (2010) derived the distribution of the minimal depth of a maximal subtree, which measures the predictiveness of a variable in a survival tree, and used it for high-dimensional variable selection using random survival forests. Lagani and Tsamardinos (2010) proposed an algorithm based on the theory of Bayesian networks and the Markov blanket for variable selection suitable for high-dimensional and right-censored survival data. However, neither of these two machine learning based methods accommodates clustered survival observations.
In this work, we describe ways in which riAFT-BART can be used to (i) estimate treatment effect heterogeneity, and (ii) select important predictors in the presence of missing data with clustered survival observations. The rest of the paper is organized as follows. Section 2 describes the use of riAFT-BART for estimating individual survival treatment effects and exploring subgroups that may experience enhanced or reduced treatment effects relative to the population average treatment effect. Section 3 proposes a variable selection method using riAFT-BART when covariates are subject to missing data. In Section 4, we conduct an expansive simulation to assess the practical operating characteristics of riAFT-BART in estimating treatment effect heterogeneity and performing variable selection. Section 5 illustrates the methods using a COVID-19 dataset drawn from the Mount Sinai Medical Center, and Section 6 provides a discussion.

Notation
We maintain the notation used in Hu et al. (2022). Consider a nonexperimental study with K clusters, each with n_k individuals, indexed by i = 1, ..., n_k, k = 1, ..., K. The total sample size is N = Σ_{k=1}^{K} n_k. There are J possible treatment options, denoted by A ∈ A = {a_1, ..., a_J}. Denote the pre-treatment measured, L-dimensional covariates by X_ik = {X_ik1, ..., X_ikL} for each individual i in cluster k, whose failure time T_ik may be right censored at C_ik. The observed outcome consists of Y_ik = min(T_ik, C_ik) and the censoring indicator Δ_ik = I(T_ik < C_ik). Let V_k be the cluster indicators. Proceeding with the counterfactual framework, the counterfactual failure time under treatment a_j for individual i in cluster k is defined as T_ik(a_j), ∀a_j ∈ A, and the counterfactual censoring time under treatment a_j is defined as C_ik(a_j). Throughout, we maintain the standard assumptions for drawing causal inferences with observational clustered survival data (Hu et al., 2022, 2021; Hu and Hogan, 2019; Chen and Tsiatis, 2001; Arpino and Cannas, 2016): consistency, weak unconfoundedness, positivity and covariate-dependent censoring. Detailed assumptions and their implications can be found in Hu et al. (2022).

Method
Hu et al. (2022) adapted BART into a random-intercept accelerated failure time model,

log T_ik = f(A_ik, X_ik) + b_k + ε_ik,  (1)

where f(A_ik, X_ik) is an unspecified function relating the treatment assignment A_ik and pre-treatment confounders X_ik to the survival times T_ik, the b_k's are random intercepts capturing cluster-specific main effects, and ε_ik is the residual term. The unknown function f is flexibly modeled by BART via a sum of shallow trees (Hu et al., 2020; Hu and Gu, 2021). Mean-zero normal distributions and independence are assumed for b_k and ε_ik. A redundant parameter α_k was used for the variance of b_k as a parameter expansion technique (Gelman et al., 2008) to improve computational performance. Regularizing priors are placed on the tree parameters to keep the impact of each individual tree on the overall fit small and thus prevent overfitting (Chipman et al., 2010). Sensible priors are used for the parameters σ², τ² and α_k (Hu et al., 2022; Henderson et al., 2020). To deal with right censoring, data augmentation is used; that is, the unobserved survival times are imputed from a truncated normal distribution in each Gibbs iteration. A Metropolis-within-Gibbs procedure (Hu et al., 2022) (Web Section 1) was proposed to draw posterior inferences about the population average treatment effect on patient survival, defined as θ_{a_j,a_j'} = E[log T_ik(a_j)] − E[log T_ik(a_j')]. Via outcome modeling, inferences about the pairwise population average treatment effect between a_j and a_j' can be obtained from the D posterior average treatment effects,

(1/N) Σ_{k=1}^{K} Σ_{i=1}^{n_k} { f_d(a_j, x_ik) − f_d(a_j', x_ik) },  d = 1, ..., D,

where f_d is the dth draw from the posterior distribution of f.

Turning to the estimation of treatment effect heterogeneity, we first define the individual survival treatment effect (ISTE) for individual i in cluster k as

ζ_{a_j,a_j'}(x_ik) = E[log T_ik(a_j) | X_ik = x_ik] − E[log T_ik(a_j') | X_ik = x_ik],  (3)

which will be used as the building block for estimating the heterogeneous treatment effect. From the fitted model (1), we can estimate ζ_{a_j,a_j'}(x_ik) as

ζ̂_{a_j,a_j'}(x_ik) = f̄(a_j, x_ik) − f̄(a_j', x_ik),

where f̄ is the mean of the D draws from the posterior predictive distribution of f. The ISTE estimates {ζ̂_{a_j,a_j'}(x_ik), i = 1, ..., n_k, k = 1, ..., K} can then be used to explore the degree of treatment effect heterogeneity. Following the "virtual twins" approach (Foster et al., 2011), the {ζ̂_{a_j,a_j'}(x_ik)} would be regressed on the predictors via a tree diagram to identify the predictor subspaces that explain away differences in {ζ̂(x_ik)}. In a similar spirit, the "fit-the-fit" strategy has been used to identify possible subgroups, defined by combination rules of covariates, that have differential treatment effects (Logan et al., 2019; Hu et al., 2021). Mathematically, we wish to find a covariate subspace S_{a_j,a_j'} = {S^1_{a_j,a_j'}, S^2_{a_j,a_j'}, ..., S^H_{a_j,a_j'}}, where H is the number of combination rules of covariates, indexed by h ∈ {1, ..., H}, such that the average treatment effects across individuals in this region are considerably different from the population average treatment effect. Denote the measure of subgroup treatment effects between a_j and a_j' as M(S_{a_j,a_j'}) = {M(S^1_{a_j,a_j'}), ..., M(S^H_{a_j,a_j'})}. We adopt the "fit-the-fit" strategy to estimate S_{a_j,a_j'} and M(S_{a_j,a_j'}). The algorithm proceeds with the following steps:
1. Add each candidate predictor separately to the random forests model (Breiman, 2001) using {ζ̂_{a_j,a_j'}(x_ik)} as the outcome, and record the model fit measured by R²; include the predictor corresponding to the largest R² in the model;
2. Add each of the remaining candidate predictors separately to the previous step's model and retain the predictor corresponding to the largest R² improvement;
3. Repeat step 2 until the percentage improvement in R² is less than 1%. At this point, the final model is established.
The R² is the coefficient of determination in regression analysis, which measures the proportion of variance in the dependent variable that can be predicted from the independent variable(s). Different from prior work, we use random forests instead of a single tree to regress the ISTE estimates on predictors for improved modeling accuracy and stability. To interpret the tree ensembles in the final random forests model yielded in step 3, we apply the inTrees algorithms (Deng, 2019) to translate the composite rules into a simple rule model. The inTrees algorithms extract, measure and rank rules, prune redundant rules, detect frequent variable interactions, and summarize rules into a simple tree diagram model that can be used for predicting new data. The branch decision rules in the final tree diagram model sending individuals to the terminal nodes suggest the predictor subspace Ŝ_{a_j,a_j'}; and subgroup treatment effects that are substantially different from the population effect θ_{a_j,a_j'} can then be estimated by averaging the ISTE estimates among individuals falling into each of the terminal nodes,

M̂(Ŝ^h_{a_j,a_j'}) = (1/|Ŝ^h_{a_j,a_j'}|) Σ_{x_ik ∈ Ŝ^h_{a_j,a_j'}} ζ̂_{a_j,a_j'}(x_ik),

where |Ŝ^h_{a_j,a_j'}| is the size of Ŝ^h_{a_j,a_j'}.
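The stepwise "fit-the-fit" search can be sketched as follows. As a minimal, self-contained illustration we use an ordinary least squares fit (solved by hand via the normal equations) as a stand-in scorer for the random-forests regression of the ISTE estimates on predictors, since only the greedy R²-improvement logic is being demonstrated; the variable names in the usage example are hypothetical.

```python
def r2_linear(cols, y):
    """R^2 of an OLS fit of y on the given columns (plus an intercept),
    solved via the normal equations with Gaussian elimination.  This is
    only a stand-in scorer for the random-forests fit used in the paper."""
    n, p = len(y), len(cols) + 1
    rows = [[1.0] + [c[i] for c in cols] for i in range(n)]
    A = [[sum(r[a] * r[b] for r in rows) for b in range(p)] for a in range(p)]
    v = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(p)]
    for c in range(p):                      # forward elimination with pivoting
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv], v[c], v[piv] = A[piv], A[c], v[piv], v[c]
        for r in range(c + 1, p):
            f = A[r][c] / A[c][c]
            A[r] = [ar - f * ac for ar, ac in zip(A[r], A[c])]
            v[r] -= f * v[c]
    beta = [0.0] * p
    for r in range(p - 1, -1, -1):          # back substitution
        beta[r] = (v[r] - sum(A[r][c] * beta[c] for c in range(r + 1, p))) / A[r][r]
    yhat = [sum(b * x for b, x in zip(beta, row)) for row in rows]
    ybar = sum(y) / n
    ss_res = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))
    ss_tot = sum((yi - ybar) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def fit_the_fit(predictors, iste, tol=0.01):
    """Greedy forward selection on R^2: repeatedly add the predictor
    giving the largest R^2 (improvement), and stop once the relative
    gain falls below tol (1% in the algorithm above)."""
    selected, remaining, best = [], list(predictors), 0.0
    while remaining:
        scores = {v: r2_linear([predictors[s] for s in selected + [v]], iste)
                  for v in remaining}
        top = max(scores, key=scores.get)
        if selected and best > 1e-12 and (scores[top] - best) / best < tol:
            break
        selected.append(top)
        remaining.remove(top)
        best = scores[top]
    return selected, best
```

For instance, regressing a synthetic ISTE vector that depends only on one candidate predictor recovers that predictor first and stops once no candidate improves the fit by 1%.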

Variable Selection
Variable selection has been a longstanding statistical problem, studied from both the classical and Bayesian perspectives. In the field of causal inference, variable selection methods can aid in confounder selection (VanderWeele, 2019). Variable selection methods are largely based on parametric models, in which the exact relationships between the response and predictor variables need to be explicitly specified. However, incorrectly specifying the parametric models may produce unsatisfactory variable selection results (Bleich et al., 2014; Hu et al., 2021; Lin et al., 2022); for example, noise variables may be selected and important variables ignored. Flexible machine learning models can better represent functional forms of arbitrary complexity, thus mitigating parametric assumptions and attendant errors in variable selection. We develop a way in which the new method riAFT-BART can be used for variable selection among clustered survival data. Notation defined in Section 2.1 is largely retained, with the exception of the treatment assignment A_ik, as the focus now is variable selection.

Random-intercept accelerated failure time model with Bayesian additive regression trees
We propose a permutation-based approach for variable selection. Following prior work (Bleich et al., 2014; Hu et al., 2021; Lin et al., 2022), we use the variable inclusion proportion (VIP) of each predictor variable as the foundation for determining how important a predictor variable is. The VIP score, supplied by the BART model, is the frequency with which each predictor is used in defining the splitting rules, relative to the total count of such rules appearing in the model. Each predictor's proportion can be used to rank the variables based on their relative importance. However, these proportions should not be interpreted as prescriptive guidelines for variable selection; instead, they serve as indicative measures of variable significance within the model (Chipman et al., 2010; Hu et al., 2020). We note that in the riAFT-BART model (1), the predictor variables X_ik, along with the treatment assignment A_ik, are included in the unknown function f(A_ik, X_ik), which is modeled by BART. Thus, in each iteration of the riAFT-BART sampling algorithm (Hu et al., 2022), we can extract the VIP for each predictor variable in the step where a BART model is fit to regress log y^{cent,c}_ik − b_k on {A_ik, X_ik}, where y^{cent,c}_ik denotes the centered and complete survival time. Note that right censoring is addressed via data augmentation. The detailed riAFT-BART sampling algorithm appears in Web Section 1. For each predictor variable X_l, l ∈ {1, ..., L} (we suppress subscripts i and k for brevity of notation), we compute its VIP by averaging the VIP scores across D posterior draws (Bleich et al., 2014),

VIP_{x_l} = (1/D) Σ_{d=1}^{D} VIP^{(d)}_{x_l},

where VIP^{(d)}_{x_l} is the dth draw from the posterior distribution of the VIP for X_l. In our simulation (Section 4), we set D = 4500 with the first 1000 draws discarded as burn-in. Trace plots were used to check the convergence of the posterior distribution of the VIP for X_l.
Building on prior literature (Bleich et al., 2014; Huang and Ma, 2010), we propose a nonparametric, permutation-based approach for variable selection, by which we first permute the event times together with the censoring indicators and then establish the thresholds for variable selection using the VIPs from the observed data and the permuted data. Specifically, we first create P permutations of the vectors of observed survival times and event indicators (y, Δ) = {(Y_ik, Δ_ik) : i = 1, ..., n_k, k = 1, ..., K}: (y*_1, Δ*_1), (y*_2, Δ*_2), ..., (y*_P, Δ*_P). Then, for each of the permuted outcome vectors (y*_p, Δ*_p), we fit an riAFT-BART model with (y*_p, Δ*_p) as the response, the original {X_ik} as predictor variables, and V_k as the cluster indicator. This permutation strategy removes any dependency between the predictors {X_ik} and the outcomes {(Y_ik, Δ_ik)} while keeping possible dependencies among the predictor variables. From the riAFT-BART run using each permuted outcome vector (y*_p, Δ*_p), we retain the VIP for each predictor variable. We use VIP*_{x_l,p} to denote the VIP score from riAFT-BART for predictor X_l from the pth permuted outcomes, and the notation VIP*_{X,p} for the vector of all L VIPs from the pth permuted outcomes, (VIP*_{x_1,p}, ..., VIP*_{x_L,p}). As suggested by Bleich et al. (2014), we use the VIPs across all P permutations, VIP*_{X,1}, ..., VIP*_{X,P}, as the "null" distribution for the VIPs from the unpermuted outcomes (y, Δ), VIP_{x_1}, ..., VIP_{x_L}. The permutation null distribution is used to determine which variables should be selected, or deemed important. We adopt the "local" threshold strategy, which has been shown in several works (Bleich et al., 2014; Hu et al., 2020, 2021; Lin et al., 2022) to have better performance than more stringent strategies such as "global max" for large-scale healthcare data in which the ratio of the sample size to the number of predictors is relatively large. More stringent strategies compare the VIP of a predictor, VIP_{x_l}, to the permutation distribution across all predictor variables, VIP*_{X,1}, ..., VIP*_{X,P}. By contrast, a local threshold compares VIP_{x_l} with its own permutation null distribution, VIP*_{x_l,1}, ..., VIP*_{x_l,P}. Specifically, we select predictor X_l only if VIP_{x_l} is greater than the 1 − α quantile of the permutation distribution VIP*_{x_l,1}, ..., VIP*_{x_l,P}. Following prior work (Hu et al., 2021; Lin et al., 2022), we set α = 0.05 in our simulation and case study, because this threshold value has been shown to yield good variable selection results, balancing the selection of useful predictors against the selection of noise predictors.
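The "local" thresholding rule can be sketched as follows, given VIP scores already extracted from riAFT-BART fits on the observed and permuted outcomes (the fits themselves are done in R; the VIP numbers in the test are hypothetical):

```python
import math

def local_threshold_select(vip_obs, vip_perm, alpha=0.05):
    """'Local' permutation threshold: keep predictor X_l only when its
    observed VIP exceeds the empirical (1 - alpha) quantile of its OWN
    permutation null distribution.

    vip_obs  : dict {name: VIP from the unpermuted fit}
    vip_perm : dict {name: list of P VIPs, one per permuted-outcome fit}
    """
    selected = []
    for name, vip in vip_obs.items():
        null = sorted(vip_perm[name])
        # nearest-rank empirical (1 - alpha) quantile of the P permutation VIPs
        idx = min(len(null) - 1, max(0, math.ceil((1 - alpha) * len(null)) - 1))
        if vip > null[idx]:
            selected.append(name)
    return selected
```

A "global max" variant would instead pool all predictors' permutation VIPs into one null distribution; the local rule above is the less stringent strategy the paper adopts.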

Piece-wise Exponential Additive Mixed Model
For a PEAMM (Bender et al., 2018), the hazard rate at time t for individual i in cluster k with covariates x_ik and cluster indicator V_k is given by

log λ(t | x_ik, V_k) = g_0(t_q) + Σ_m g_m(x_ik,m) + b_k,  for t ∈ (κ_{q−1}, κ_q],  (5)

where g_0(t_q) represents the log-baseline hazard rate, the g_m(x_ik,m) are basis functions of natural cubic splines associated with the covariates x_ik, b_k is the random intercept capturing the cluster-specific effect, and κ_{q−1} and κ_q are the start and stop times of the censored survival dataset represented in the (start, stop] counting-process form. For variable selection using PEAMM, following the practical approach of Marra and Wood (2011), a shrinkage penalty term can be added to each basis term g_m(x_ik,m) of model (5), so that when the whole set of spline basis terms for a predictor x_ikl is shrunk to zero, the variable x_ikl is selected out of the model. The shrinkage penalties, along with all other model coefficients, can be estimated using restricted maximum likelihood estimation.
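The (start, stop] counting-process layout that PEAMM operates on can be illustrated with a small helper; this is a simplified sketch of the expansion performed by tools such as pammtools' as_ped (the record layout and cut points here are hypothetical):

```python
def as_ped_sketch(records, cuts):
    """Expand (time, status) survival records into (start, stop] rows:
    one row per interval the subject is at risk in, with event = 1 only
    in the interval containing an observed (uncensored) event time."""
    rows = []
    for rid, (time, status) in enumerate(records):
        for start, stop in zip(cuts[:-1], cuts[1:]):
            if time <= start:          # no longer at risk in this interval
                break
            rows.append({
                "id": rid,
                "start": start,
                "stop": min(stop, time),
                "event": int(status == 1 and start < time <= stop),
            })
    return rows
```

For example, a subject with an event at t = 2.5 under cut points (0, 1, 2, 3] contributes three rows, with the event indicator set only in the last one; a subject censored at t = 1.0 contributes a single event-free row.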

Frailty Models by Regularization Methods
Given the common unobserved frailty u_k for cluster k, the conditional hazard function at time t for individual i in cluster k is

λ_ik(t | u_k) = u_k λ_0(t) exp(x_ik^T β),  (6)

where λ_0(t) is an unspecified baseline hazard function and β = (β_1, β_2, ..., β_L)^T is a vector of regression parameters for x_ik. The common unobserved frailty u_k is assumed to follow a gamma distribution because of its mathematical convenience. Rondeau et al. (2003) proposed to conduct variable selection by maximizing the penalized likelihood

pl(β, λ_0) = l(β, λ_0) − α ∫ λ_0''(t)² dt,

where l(β, λ_0) is the full marginal log-likelihood for the frailty model (6) and α is the smoothing parameter controlling the trade-off between the data fit and the smoothness of the baseline hazard function. The regression parameters can be estimated by the robust Marquardt algorithm (Marquardt, 1963), a combination of the Newton-Raphson algorithm and the steepest descent algorithm. Ha et al. (2014) instead proposed to perform variable selection by maximizing the penalized profile h-likelihood

h_p(β) − N Σ_{l=1}^{L} B_γ(|β_l|),

where h_p is the profile h-likelihood and B_γ(|β_l|) is a penalty function that controls model complexity using the tuning parameter γ, with the optimal value determined by the Bayesian information criterion (BIC). Three penalty functions were considered for B_γ(|β_l|): the least absolute shrinkage and selection operator (Tibshirani, 1996), the smoothly clipped absolute deviation (Fan and Li, 2001), and the h-likelihood penalty (Lee and Oh, 2014).

Backward stepwise selection
We also implement a classical variable selection approach through a recursive procedure of backward stepwise selection based on statistical testing, as described in Wood et al. (2008). The backward stepwise selection starts with a random-intercept Cox regression model (Cortinas Abrahantes and Burzykowski, 2005) including all potential predictors and removes from the model the predictor that has the least impact on the fit, as determined by the α significance level. The forward selection process then checks whether removed variables should be added back into the model at the α(1 − ε) significance level, for small ε. We use α = 0.05 as suggested by Wood et al. (2008).
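The backward-then-recheck control flow can be sketched generically. In practice the p-values come from refitting the random-intercept Cox model at each step with specialised survival software; here that fit is abstracted as a caller-supplied, hypothetical `pvalue_fn`:

```python
def backward_stepwise(predictors, pvalue_fn, alpha=0.05, eps=0.05):
    """Backward elimination with a re-entry check, in the spirit of
    Wood et al. (2008): drop the least significant predictor while its
    p-value exceeds alpha, then re-admit any dropped predictor whose
    p-value falls below alpha * (1 - eps).

    pvalue_fn(current_set) must return {name: p-value} for the model
    fit on current_set (a stand-in for the random-intercept Cox fit).
    """
    current, dropped = list(predictors), []
    while current:
        pvals = pvalue_fn(current)
        worst = max(current, key=lambda v: pvals[v])
        if pvals[worst] <= alpha:      # everything remaining is significant
            break
        current.remove(worst)
        dropped.append(worst)
    for name in list(dropped):         # forward re-entry check
        pvals = pvalue_fn(current + [name])
        if pvals[name] < alpha * (1 - eps):
            current.append(name)
            dropped.remove(name)
    return current
```

With a fixed p-value table {x1: 0.001, x2: 0.20, x3: 0.60}, the procedure drops x3, then x2, keeps x1, and re-admits nothing.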

Variable selection with missing data
Large-scale healthcare data are susceptible to the missing data issue. Missing values in predictor variables pose challenges for variable selection. Long and Johnson (2015) proposed a general resampling approach that combines bootstrap imputation with a randomized-lasso-based variable selection procedure, and used simulations to demonstrate that this approach had better performance than several existing methods for variable selection in the presence of missing data. Recent work by Hu et al. (2021) shows that combining bootstrap imputation with flexible machine learning based variable selection methods can substantially improve variable selection results over methods that rely on parametric modeling of the covariate-outcome relationship. In this work, we combine bootstrap imputation and variable selection, and compare each of the methods considered on its ability to select predictor variables that are truly associated with the outcome, or useful predictor variables, among incomplete clustered survival data. To ensure consistency with previous literature, such as Wood et al. (2008), Bleich et al. (2014), and Hu et al. (2021), and to facilitate comparison of our methods and results with similar studies, we adopt four performance metrics frequently employed in the variable selection literature. A good method should have high values of precision, recall and F_1, and a low Type I error.
(1) Precision = TP/(TP + FP), where TP and FP are respectively the numbers of true positive and false positive selections. A true positive selection is when a useful predictor was selected; a false positive selection is when a noise predictor was selected. The precision of a variable selection method is the proportion of truly useful predictors among all selected predictors.
(2) Recall = TP/(TP + FN), where FN is the number of false negative selections. A false negative selection is when a truly useful predictor was not selected. The recall of a variable selection method is the proportion of truly useful variables selected among all useful variables. This is sometimes referred to as the power in the literature (Wood et al., 2008).
(3) F_1 = 2 × Precision × Recall/(Precision + Recall). The F_1 score balances a method's ability to avoid selecting irrelevant predictors (precision) with its ability to identify the full set of useful predictors (recall).
(4) Type I error. The Type I error measures the overall probability of a method incorrectly selecting noise predictors: the mean of the probabilities that the method incorrectly selects each of the noise predictors.
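The four metrics can be computed from a single selection run as follows (the predictor names in the test are hypothetical):

```python
def selection_metrics(selected, useful, all_predictors):
    """Precision, recall, F1 and Type I error for one variable-selection
    run, following the definitions above."""
    selected, useful = set(selected), set(useful)
    noise = set(all_predictors) - useful
    tp = len(selected & useful)          # useful predictors selected
    fp = len(selected & noise)           # noise predictors selected
    fn = len(useful - selected)          # useful predictors missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # Type I error: proportion of noise predictors incorrectly selected
    type1 = fp / len(noise) if noise else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "type1": type1}
```

In a simulation, these per-run values would then be averaged across data replications.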
Given incomplete data, we first conduct bootstrap imputation in two steps: (i) generate B bootstrap datasets of the original data; (ii) conduct a single imputation for each bootstrap dataset using an imputation method of choice. Following Hu et al. (2021), in our simulations and case study we used B = 100 and performed standard imputation using the R package mice. Variable selection via each of the methods considered was performed on all B bootstrap imputed datasets, and the final set of predictors consisted of those selected in at least πB imputed datasets, where π ∈ (0, 1) is a fraction threshold for selecting a predictor.

Treatment effect heterogeneity
We carried out a contextualized simulation to evaluate the performance of our proposed method in estimating the heterogeneous treatment effect using observational data with clustered survival outcomes and multiple treatments. For methods comparison, we adapted the popular inverse probability weighting method to the setting of clustered and censored survival data to form two comparison methods: inverse probability of treatment weighting with the random-intercept Cox regression model (IPW-riCox) and the doubly robust random-intercept additive hazards model (DR-riAH) (Cai and Zeng, 2011). Additionally, we considered the PEAMM (Bender et al., 2018), another outcome modeling based method flexible at capturing nonlinear relationships. Because not all methods directly model the survival times T_ik, to objectively compare methods we use the individual-specific counterfactual survival curve as the basis. That is, the ISTE ζ_{a_j,a_j'}(x_ik) is now defined in a similar way as in equation (3) but using functionals of the counterfactual survival curve: either the survival probability at a fixed point in time or the conditional restricted mean survival time (RMST) (Royston and Parmar, 2013). We compare each method's ability to accurately estimate the ISTE ζ_{a_j,a_j'}(x_ik) and the average subgroup treatment effect M(S^h_{a_j,a_j'}). The performance metrics used to evaluate the methods were the relative bias (or percent bias), the root mean squared error (RMSE), and the precision in the estimation of heterogeneous effects (PEHE) (Hu et al., 2021). The relative bias evaluates the bias in the estimation of the treatment effect on the relative scale with respect to the true treatment effect (which is constant across different methods). The PEHE measures the difference between the true and estimated survival probabilities or RMST across all data points; a smaller value of PEHE indicates higher estimation accuracy. Although the methods examined in this study rely on different models, each can be utilized to estimate survival probabilities and RMST. These common outcomes serve as a basis for an impartial comparison of the methods.
For the weighting based methods IPW-riCox and DR-riAH, we used Super Learner (Van der Laan et al., 2007) to estimate the stabilized inverse probability of treatment weights. Super Learner was implemented using the R package SuperLearner, with the library argument set to SL.library = c("SL.xgboost", "SL.bartMachine", "SL.gbm"). We used the R package coxme to implement IPW-riCox and timereg to implement DR-riAH. For PEAMM, the gam function from the R package mgcv and two helper functions (as_ped and add_surv_prob) from the R package pammtools were used. For all methods, all confounding variables available to the analyst were included in the corresponding models in their original forms.

Simulation design
We simulated datasets with structures similar to the COVID-19 data used in our case study (Section 5). We generated K = 10 clusters, each with a sample size of n_k = 200; the total sample size is N = 2000. We simulated 7 confounding variables for each individual i in cluster k: five continuous variables X_ikl ∼ N(0, 1), l ∈ {1, ..., 5}, and two categorical variables X_ikl ∼ Multinomial(1; 0.3, 0.3, 0.4), l = 6, 7. We generated J = 3 treatment groups with unequal sample sizes; the ratio of individuals across treatment groups was 5:3:1, similar to the ratio across the three treatment groups in our case study. The treatment assignment mechanism follows a random-intercept multinomial logistic regression model, where the random intercept τ_k ∼ N(0, 1²). The parameter setup induces a moderate level of confounding, leading to moderate covariate overlap. The sparsity of overlap can be visualized by the distributions of the true generalized propensity scores, shown in Web Figure 1. It should be noted that the overall extent of covariate overlap closely resembles that observed in the COVID-19 dataset (refer to Web Figure 8). The specific one-to-one correspondences are as follows: A = 1 emulates treatment 1, dexamethasone; A = 3 emulates treatment 2, remdesivir; and A = 2 emulates treatment 3, a combination of dexamethasone and remdesivir. Following strategies (Hu et al., 2021; Lu et al., 2018) recommended for assessing the performance of methods in estimating heterogeneous treatment effects, we subclassified the simulated individuals into 40 subgroups based on the distribution of the true generalized propensity scores (Feng et al., 2012), and calculated the biases and RMSE for each subgroup across 250 data replications. Note that the generalized propensity scores for each individual sum to one. The subclassification is based on intervals of the true generalized propensity scores for treatment 1 and treatment 2, representing a full range of assignment propensity to each treatment group. Subclasses of generalized propensity scores are presented in Web Table 1.
We simulated three sets of true counterfactual survival times from the Weibull survival distribution, which can be parameterised as both an accelerated failure time model and a Cox proportional hazards model (Hu et al., 2022), with individual-specific survival function S_ik(t) for individual i in cluster k, where λ_{a_j} is a treatment-group-specific parameter, η is the shape parameter, and q_{a_j}(X_ik, b_k) represents a general functional form of the covariates and the random intercept. Using inverse transform sampling, we generated counterfactual survival times by solving S_ik(T_ik(a_j)) = U for T_ik(a_j), for a_j ∈ {1, 2, 3}, where U is a random variable following the uniform distribution on the interval [0, 1] and λ_{a_j} = {5000, 800, 1200} for a_j = 1, 2, 3. Observed and uncensored survival times are generated as T_ik = Σ_{j=1}^{J} T_ik(a_j) I(A_ik = a_j).
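Inverse transform sampling for this design can be sketched as follows, assuming one common Weibull parameterisation, S(t) = exp{−(t/λ)^η e^q} (the paper's exact parameterisation may differ; λ, η and q mirror the symbols above, and `q_fn` is a hypothetical helper):

```python
import math
import random

def weibull_time_from_u(u, lam, eta, q):
    # Solve S(T) = u for T under the ASSUMED parameterisation
    #   S(t) = exp(-(t / lam)**eta * exp(q)),
    # which gives T = lam * (-log(u) * exp(-q)) ** (1 / eta).
    return lam * (-math.log(u) * math.exp(-q)) ** (1.0 / eta)

def draw_counterfactual_times(n, lam, eta, q_fn, rng):
    """n counterfactual survival times for one treatment arm; q_fn(i)
    returns the covariate/random-intercept term q for subject i."""
    return [weibull_time_from_u(rng.random(), lam, eta, q_fn(i)) for i in range(n)]
```

The inversion can be checked by plugging a drawn time back into the survival function: S(T) must reproduce the uniform draw.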
We considered three treatment effect heterogeneity settings, (a)-(c), with an increasing level of complexity, in which the random intercept b_k ∼ N(0, 4²). Across all three heterogeneity settings, X_ik3, X_ik4, X_ik5 and X_ik6 were confounders related to both the treatment assignment mechanism and the potential outcome generating process. In scenario (a), nonlinear covariate effects are present in treatment groups 1 and 2. In scenario (b), there are nonlinear covariate effects in all three treatment groups. Scenario (c) is similar to scenario (b) except that nonoverlapping sets of covariates are included in the three treatment groups; this additional complexity represents an additional source of treatment effect heterogeneity. The parameter η was set to 2 and to exp(0.7 + 0.5 x_ik1) to induce proportional hazards and nonproportional hazards, respectively. We further generated the censoring time C independently from an exponential distribution with the rate parameter chosen to induce a censoring proportion of 50%, which is close to the censoring proportion observed in the COVID-19 data in our case study. The Kaplan-Meier survival curves by treatment group for all three heterogeneity settings, under both proportional and nonproportional hazards, are presented in Web Figure 2.

Simulation results
Table 1 summarizes the overall precision measure PEHE (based on 3-week survival probability) for each of the four methods under the three heterogeneity settings and both proportional and nonproportional hazards. First, the flexible methods riAFT-BART and PEAMM and the doubly robust weighting method DR-riAH yielded substantially smaller PEHE than the parametric singly robust method IPW-riCox across all simulation scenarios. Second, as the heterogeneity setting became more complex, the flexible methods riAFT-BART and PEAMM tended to perform relatively better, evidenced by a decreasing mean and variation of the PEHE. All methods showed decreasing precision in estimating the ISTE, demonstrated by increasing PEHE values, when the proportional hazards assumption fails to hold. This is likely due to the more complex data generating process, in which the shape parameter η depends on the covariate x_ik1. The decrease in estimation precision was smallest for riAFT-BART. Overall, across all scenarios, the proposed method riAFT-BART had the smallest PEHE, that is, the highest accuracy in estimating the ISTE. Similar observations can be made from the PEHE measures based on 3-week RMST, which are provided in Web Table 2.
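The PEHE metric used above is conventionally the root mean squared difference between the true and estimated individual treatment effects; a minimal sketch, with the effects here being differences in 3-week survival probability:

```python
import math

def pehe(true_iste, est_iste):
    """Precision in Estimation of Heterogeneous Effects: the root mean
    squared difference between true and estimated individual survival
    treatment effects."""
    assert len(true_iste) == len(est_iste)
    n = len(true_iste)
    return math.sqrt(sum((t - e) ** 2 for t, e in zip(true_iste, est_iste)) / n)
```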
Figure 1 displays the relative bias and RMSE results among 40 generalized propensity score subgroups, each averaged across 250 simulation runs. Three pairwise treatment effects were estimated by averaging the ISTE ζ_{aj, aj′}(x_ik) (based on 3-week survival probability) across individuals in each subgroup. Under both proportional hazards and nonproportional hazards, the proposed method riAFT-BART boasts the smallest biases and RMSE across all three treatment effects, particularly for subpopulations near the center of the generalized propensity score distribution. In sharp contrast, IPW-riCox, which encodes parametric and linearity assumptions and assumes proportional hazards, yielded the largest biases and RMSE. The increasing complexity of the treatment effect heterogeneity did not impact the performance of the more flexible methods PEAMM and riAFT-BART, but induced larger biases and RMSE for the parametric methods IPW-riCox and DR-riAH. The violation of the proportional hazards assumption, which introduced an additional data complexity (the shape parameter η depends on the covariate x_ik1), had a negligible effect on the performance of riAFT-BART, but all three comparison methods deteriorated. It is worth noting that all methods produced large biases and RMSE for subpopulations in the tail regions of the generalized propensity score distribution, suggesting that causal inferences drawn about these subpopulations may not be reliable. Results based on 3-week RMST, shown in Web Figure 3, convey the same messages.
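The subgroup summaries behind these boxplots can be sketched as follows; this is an assumed implementation that sorts units on the generalized propensity score and averages the estimation error within equal-size bins, with the score and per-unit errors treated as given inputs.

```python
def gps_subgroup_bias(gps, errors, n_groups=40):
    """Split units into n_groups equal-size subgroups by sorting on the
    generalized propensity score, then average the estimation error
    (estimate minus truth) within each subgroup. The last subgroup absorbs
    any remainder when the sample size is not divisible by n_groups."""
    order = sorted(range(len(gps)), key=lambda i: gps[i])
    size = len(order) // n_groups
    biases = []
    for g in range(n_groups):
        idx = order[g * size:(g + 1) * size] if g < n_groups - 1 else order[g * size:]
        biases.append(sum(errors[i] for i in idx) / len(idx))
    return biases
```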

Variable selection
We adopt the simulation settings used in Hu et al. (2021) to evaluate the performance of variable selection methods among incomplete, clustered survival data. We used the multivariate amputation approach to generate missing data scenarios with the desired missingness percentages under the missing at random mechanism (Schouten et al., 2018). The multivariate amputation procedure comprises three main steps. First, the complete data are randomly divided into a specified number of subsets, each allowing for the specification of any missing data pattern. Second, weighted sum scores are calculated for the individuals in each subset; these scores are determined by the relationship between the variable to be amputated and the variables on which the missingness in the amputated variable depends. Finally, a logistic distribution function, as described by Van Buuren (2018), is applied to the weighted sum scores to compute the probability of missingness, which is then used to determine whether each data point becomes missing. We investigated the practical operating characteristics of our proposed variable selection method using riAFT-BART, and compared it to PEAMM, frailty models with penalized likelihood (FrailtyPenal), frailty models with penalized profile h-likelihood (FrailtyHL), and backward stepwise selection with the random-intercept Cox regression model (riCox). Four performance metrics were used to evaluate the variable selection results: precision, recall, F1 and type I error (Section 3.3). The multivariate amputation procedure was implemented via the ampute function of the R package mice (Schouten et al., 2018). Imputation was performed using the R package mice (Van Buuren and Groothuis-Oudshoorn, 2010). To implement our proposed method using riAFT-BART, we used 4500 posterior draws with the first 1000 discarded as burn-in. The permutation distributions of the VIP scores were constructed from 100 riAFT-BART model runs. To implement the variable selection procedure based on PEAMM, we used the helper function as_ped from the R package pammtools to reformat the data and the gam function from the R package mgcv for variable selection. Variable selection by frailty models was implemented using the frailtyPenal function of the R package frailtypack for penalized marginal likelihood, and the frailty.vs function from the R package frailtyHL for penalized profile h-likelihood. For backward stepwise selection, the random-intercept Cox regression model was fitted using the coxme function from the R package coxme.
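The permutation-based selection rule itself can be sketched generically. In the sketch below, the variable inclusion proportions (VIPs) are treated as given inputs: in practice the observed VIPs come from a riAFT-BART fit to the real outcomes, and the null VIP vectors from riAFT-BART runs on permuted outcomes; the quantile rule and alpha level are illustrative assumptions.

```python
def select_by_permutation(vip_observed, null_vips, alpha=0.05):
    """Permutation-based variable selection on variable inclusion
    proportions. vip_observed[j] is predictor j's VIP from the model fit
    to the real outcome; null_vips is a list of VIP vectors from model
    runs with permuted outcomes. Predictor j is selected when its observed
    VIP exceeds the empirical (1 - alpha) quantile of its null draws."""
    selected = []
    for j, v in enumerate(vip_observed):
        null = sorted(run[j] for run in null_vips)
        cutoff = null[min(len(null) - 1, int((1 - alpha) * len(null)))]
        if v > cutoff:
            selected.append(j)
    return selected
```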

Simulation design
The simulation scenarios are motivated by the data structures observed in the COVID-19 data set used in our case study. We considered K = 10 clusters, each with a sample size of n_k = 200. We generated 8 useful predictors truly related to the survival outcomes, X_ik1, . . ., X_ik8, and 20 noise predictors, X_ik9, . . ., X_ik28. We simulated X_ik1 and X_ik2 independently from Bern(0.5) and X_ik3 and X_ik4 from the standard normal distribution N(0, 1); X_ik5, X_ik6, X_ik7 and X_ik8 were designed to have missing values under the missing at random mechanism. For each missing at random predictor variable, we specified the true form in which the predictor depends on the other predictors. Among the 20 noise predictors, 10 were generated as continuous variables X_ik9, . . ., X_ik18 i.i.d. ∼ N(0, 1), and the other 10 as binary variables X_ik19, . . ., X_ik28 i.i.d. ∼ Bern(0.5). Similar to the outcome generating process described in Section 4.1.1, we generated the observed, uncensored survival times from the Weibull survival distribution, where λ = 3000, U is a random variable following the uniform distribution on the interval [0, 1], and b_k ∼ N(0, 4²). We further generated the censoring time C independently from an exponential distribution with the rate parameter selected to induce a censoring proportion of 50%. The shape parameter η was set to 2 and to exp(0.7 + 0.5x_ik1) to produce proportional hazards and nonproportional hazards, respectively. This data generating process was designed to make it difficult for any method to accurately model the survival outcome. We considered an outcome model with arbitrary data complexity reflecting common situations in health datasets: (i) discrete predictors with strong (X_ik1) and moderate (X_ik2) associations; (ii) both linear and nonlinear forms of continuous predictors (X_ik6 and X_ik7); (iii) nonadditive effects (X_ik4 X_ik8). After generating the fully observed data, we amputated the four predictors designed to have missing values, X_ik5, X_ik6, X_ik7 and X_ik8, under the missing at random mechanism using the multivariate amputation approach. Additional technical details for the amputation procedure are available in Schouten et al. (2018), and additional amputation details for the simulation appear in Web Section 2. The resultant simulation data have an overall missingness proportion of 40%. The process of data generation was replicated 250 times.
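The weighted-sum-score and logistic-probability logic of amputation can be sketched for a single target column as follows. This is a simplified illustration of the idea behind mice::ampute, not a reimplementation of it; the column names, weights, and the logit shift used to target the missingness proportion are assumptions of the sketch.

```python
import math
import random

def amputate(rows, target_col, weight_cols, weights, prop, rng):
    """Impose missing-at-random values on one column: compute a weighted sum
    score from the columns the missingness depends on, pass the centered
    score through a logistic function shifted by logit(prop) so the
    expected missingness is roughly prop, and blank the value with that
    probability."""
    scores = [sum(w * row[c] for c, w in zip(weight_cols, weights)) for row in rows]
    mean_score = sum(scores) / len(scores)
    shift = math.log(prop / (1.0 - prop))  # logit(prop) centers P(missing) near prop
    out = []
    for row, s in zip(rows, scores):
        p_miss = 1.0 / (1.0 + math.exp(-((s - mean_score) + shift)))
        new_row = dict(row)
        if rng.random() < p_miss:
            new_row[target_col] = None  # amputated value
        out.append(new_row)
    return out

# Toy usage: missingness in x5 depends on x1 and x2 (MAR), targeting roughly 40%.
rng = random.Random(0)
data = [{"x1": rng.gauss(0, 1), "x2": rng.gauss(0, 1), "x5": 1.0} for _ in range(500)]
amputated = amputate(data, "x5", ["x1", "x2"], [1.0, 1.0], prop=0.4, rng=rng)
```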

Simulation results
Table 2 summarizes the variable selection results for each method. Performance in the presence of incomplete data is reported at the most suitable threshold value of π, determined by the F1 score (see Web Figures 4-5), and is compared with performance on the fully observed data (prior to amputation) and on complete cases only. It is noteworthy that the optimal threshold value of π for riAFT-BART is 0.1, corresponding to the highest F1 score; the F1 score declines at higher values of π, attributable to reduced recall.
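The four metrics can be computed directly from the selected set; the sketch below assumes precision, recall and F1 are taken over the useful predictors and that type I error is the fraction of noise predictors selected, consistent with how the metrics are described here.

```python
def selection_metrics(selected, useful, noise):
    """Variable selection metrics: precision, recall and F1 over the useful
    predictors, and type I error as the fraction of noise predictors
    that were (incorrectly) selected."""
    sel, useful, noise = set(selected), set(useful), set(noise)
    tp = len(sel & useful)
    precision = tp / len(sel) if sel else 0.0
    recall = tp / len(useful)
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    type1 = len(sel & noise) / len(noise)
    return precision, recall, f1, type1
```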
The proposed method riAFT-BART has the best performance across all three data panels (fully observed, with missing data, complete cases), evidenced by the highest precision, recall and F1 score and the lowest type I error. The second best performing method is PEAMM, owing to its modeling flexibility relative to the parametric methods. In the presence of missing covariate data, all methods had a slightly deteriorated performance when performing variable selection in combination with bootstrap imputation, but the performance of variable selection among only complete cases was considerably lower. As shown in Figure 2, the proposed method riAFT-BART tended to select the most useful predictors and the fewest noise predictors, followed by the other flexible method, PEAMM, under both proportional hazards and nonproportional hazards. Among the parametric methods, the penalized likelihood methods for frailty models had slightly better performance than riCox. All methods tended to select more noise predictors and fewer useful predictors under nonproportional hazards; the decrease in performance due to the elevated data complexity associated with nonproportional hazards was only marginal for riAFT-BART but considerable for the less flexible methods, particularly FrailtyHL and riCox.
Each method's ability to select each of the useful predictors is visualized in Figure 3. All methods had good power (> 0.8) for selecting the strong discrete predictor (X_ik1), but unsatisfactory power (< 0.5) for detecting the discrete predictor (X_ik2) that has only a moderate relationship with the outcome.
For continuous variables, all methods performed well in selecting X_ik3, which has only a linear relationship with the log survival time, but riAFT-BART produced substantially higher power for selecting the predictor X_ik4, which has a complex nonadditive form in model (8). Among the four predictors with missing data, X_ik5, . . ., X_ik8, the advantage of riAFT-BART is demonstrated by its higher power for selecting predictors with complex functional forms that are difficult for the other methods to capture. A direct comparison of the proportional hazards setting with the nonproportional hazards setting shows that all methods performed better under proportional hazards; when the proportional hazards assumption is violated, however, the effects are least detrimental for PEAMM and riAFT-BART.
As our proposed method riAFT-BART pivots on the predictors' VIPs, we examined the convergence of the Markov chain by plotting the 4500 posterior draws, with the first 1000 discarded as burn-in, of the VIPs for four useful predictors and four noise predictors in Web Figure 6. The chains converged well.
All simulation procedures were conducted in R on an iMac with a 4 GHz Intel Core i7 processor. The dataset comprised 10 clusters (K = 10), each with a sample size of 200 (n_k = 200), and included 8 useful predictors and 20 noise predictors. In terms of processing time, riAFT-BART required approximately 58 minutes, PEAMM around 42 minutes, FrailtyPenal about 28 minutes, FrailtyHL roughly 30 minutes, and riCox about 8 minutes. All methods were run with the default memory size setting of 100 MB.

Case study: Predictors for in-hospital mortality and heterogeneous survival treatment effects among post-ICU COVID patients
In this case study, we first applied the proposed variable selection methods to select important predictors for in-hospital mortality using a comprehensive COVID-19 dataset, and then evaluated the treatment effect heterogeneity on patient survival using the selected predictors by implementing our proposed method riAFT-BART. The COVID-19 dataset was drawn from six hospitals (clusters) of the Epic electronic medical records system at the Mount Sinai Medical Center. Included in this dataset were 1955 patients who were diagnosed with COVID-19 between March 10, 2020 and February 26, 2021 and were in intensive care (ICU) due to COVID-19 during their hospital stay (Wang et al., 2020). Post ICU admission, each patient received COVID-specific medications: dexamethasone, remdesivir, or the combination of dexamethasone and remdesivir (dexamethasone+remdesivir). The cluster sizes and the number of patients in each treatment group are provided in Web Table 4. We used the following 28 baseline (time of ICU admission) variables: age, sex, self-reported race and ethnicity, smoking status; binary comorbidities including hypertension, coronary artery disease, cancer, diabetes, asthma, and chronic obstructive pulmonary disease; vital signs including temperature, systolic blood pressure, diastolic blood pressure, patient oxygen level (definition in Web Table 3), heart rate, the fraction of inspired oxygen, body mass index, and oxygen saturation; risk scores (sequential organ failure assessment score and Glasgow Coma Scale); and laboratory test results including partial pressure of oxygen, D-dimer value and inflammatory markers such as lactate dehydrogenase, ferritin, C-reactive protein, creatinine and white blood cell count. Vital signs, risk scores and lab results were measured during the first 24 hours of ICU admission. Among the 28 candidate predictor variables, 17 have missing data, with missingness proportions ranging from 1.4% to 28.6%. Only 1187 (60.7%) patients have fully observed data for all covariates, which motivated the overall missingness proportion of 40% used in our simulation study (Section 4.2). Detailed summary statistics for the baseline variables appear in Web Table 4.
The Kaplan-Meier survival curves by treatment group are displayed in Web Figure 7, with crossovers suggesting nonproportional hazards. Assessed by the distribution of the generalized propensity scores shown in Web Figure 8, there is moderate covariate overlap across the three treatment groups in the COVID-19 data.
We first implemented all five variable selection methods. All 28 candidate predictors were included in the imputation models, and the optimal threshold value of π for each method, that is, the value that produced the best F1 score in the simulation study, was used to select the final set of predictors. As summarized in Table 3, FrailtyHL selected the most predictors (9), riAFT-BART and riCox each selected 8 predictors, PEAMM selected 7 predictors, and FrailtyPenal selected the fewest (6). Age and lactate dehydrogenase were the only two predictors selected by all five methods. The variables selected by the two flexible methods riAFT-BART and PEAMM were largely in agreement, with the exception that riAFT-BART but not PEAMM selected race and D-dimer, and PEAMM but not riAFT-BART selected C-reactive protein.

Table 3: Variable selection results by each of the 5 methods, with the best selection threshold value of π.
To further distinguish between methods based on their ability to select the most relevant predictors, we computed the predictive performance of the five methods by modeling the outcome variable on their respectively selected predictor variables. For survival outcomes, we used the concordance statistic as the metric of predictive performance (Harrell Jr et al., 1996). The concordance statistic is the fraction of concordant pairs among all pairs of individuals; a pair of patients is called concordant if the patient with the higher (lower) predicted survival probability at a given time point had a longer (shorter) time to death. In this case study, we chose 21 days post ICU admission as the target time point, because it is a sufficiently long time to evaluate the effects of COVID-19 treatments on in-hospital mortality for severe patients. As the COVID-19 dataset is subject to missing covariate data, we followed the strategy described in Lin et al. (2022) to calculate the cross-validated concordance statistic. We first split the data into two halves and implemented each of the five variable selection methods on one half of the data. We then imputed the other half with a single imputation and recorded the concordance statistic. Finally, we repeated the previous two steps 250 times to obtain the distribution of the cross-validated concordance statistics. Figure 4 shows that the proposed method riAFT-BART yielded the highest predictive performance, with a median concordance statistic > 0.7.
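The concordance computation described above can be sketched as follows. This is a simplified Harrell-style fraction that restricts to usable pairs (the earlier time must be an observed death and the times must differ) and ignores some censoring-weighting subtleties, so it is an illustration rather than the exact estimator used in the paper.

```python
def concordance_at_t(pred_surv, time, event):
    """Concordance fraction at a fixed time point: among usable pairs, a
    pair is concordant when the patient who died earlier has the lower
    predicted survival probability at that time point. Ties in predictions
    count as half-concordant."""
    conc = usable = 0.0
    n = len(time)
    for i in range(n):
        for j in range(i + 1, n):
            a, b = (i, j) if time[i] < time[j] else (j, i)
            if time[a] == time[b] or not event[a]:
                continue  # unusable pair: tied times, or earlier time censored
            usable += 1
            if pred_surv[a] < pred_surv[b]:
                conc += 1.0
            elif pred_surv[a] == pred_surv[b]:
                conc += 0.5
    return conc / usable if usable else float("nan")
```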
Supported by both the simulation study and the case study, riAFT-BART provided the best variable selection results. We therefore used the 8 predictors selected by riAFT-BART as confounding variables and estimated the heterogeneous treatment effects on patient survival using the methodology described in Section 2. To address the missing data issue when drawing causal inferences about treatment effects, we fitted the riAFT-BART model (1) on each of the 100 bootstrap imputed datasets (previously constructed for variable selection), and drew statistical inferences about the ISTE ζ_{aj, aj′}(x_ik), defined in equation (3), by pooling the posterior samples of the ISTE, f_d(a_j, x_ik) − f_d(a_j′, x_ik) as in equation (2), across the model fits arising from the multiple datasets (Zhou and Reiter, 2010; Hu et al., 2022). Inferences about the population average treatment effects θ_{aj, aj′} can be obtained from the posterior samples of f_d(a_j, x_ik) − f_d(a_j′, x_ik) across the sample population (Hu et al., 2022). The population average treatment effects, summarized in Web Table 5, suggest that there was no statistically significant difference in treatment effect when comparing dexamethasone with remdesivir or with the combined use of dexamethasone and remdesivir. However, there was a statistically significant treatment benefit associated with dexamethasone+remdesivir compared to remdesivir alone. Web Figure 9 demonstrates that substantial variability in the institutional effect was captured by the riAFT-BART model: the Mount Sinai main hospital had considerably better outcomes than the Mount Sinai Queens hospital.
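The pooling step amounts to concatenating the posterior draws across the imputed datasets and summarizing the mixed draws; a minimal sketch, with the 95% interval taken from pooled empirical quantiles as an illustrative summary choice:

```python
def pool_posterior(draws_per_dataset):
    """Pool posterior draws of a treatment effect across bootstrap-imputed
    datasets by simple concatenation (the mixing idea of Zhou and Reiter,
    2010), then summarize with the pooled mean and an approximate 95%
    credible interval from pooled quantiles."""
    pooled = sorted(d for draws in draws_per_dataset for d in draws)
    n = len(pooled)
    mean = sum(pooled) / n
    lower = pooled[int(0.025 * n)]
    upper = pooled[min(n - 1, int(0.975 * n))]
    return mean, (lower, upper)
```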
Turning to the treatment effect heterogeneity, Figure 5 summarizes the combination rules of covariates S_{aj, aj′} that have a differential subgroup treatment effect M(S_{aj, aj′}), defined in equation (4), relative to the sample population treatment effect. When comparing remdesivir and dexamethasone + remdesivir across the sample population, there was a significant treatment benefit, suggested by a longer expected survival time, for the combined use of dexamethasone and remdesivir. Our analysis suggests that combining dexamethasone and remdesivir offered enhanced treatment benefit for relatively healthier COVID-19 patients with lower oxygen saturation (< 95.5%) and a normal white blood cell count (< 11.4 K/µL), demonstrated by the leftmost branch of the tree diagram in Figure 5. By contrast, the rightmost branch suggests that there was no statistically significant treatment benefit associated with either treatment choice for comparatively unhealthier patients with a higher level of oxygen saturation (> 95.5%).

Discussion
We describe two important practical utilities of riAFT-BART, a new Bayesian machine learning method for analyzing clustered and censored survival data. riAFT-BART was developed to accurately estimate the population average treatment effect on patient survival while accounting for the variation in institutional effects, addressing the implications that the complex data structures of large-scale clinical datasets generated from multiple institutions carry for causal inference about the population treatment effect (Hu et al., 2022).
Moving from the broad applicability of a population study to personalized medicine requires an understanding of treatment effect heterogeneity. Treatment effect heterogeneity has traditionally been identified by first enumerating potential effect modifiers with subject-matter experts and then estimating the average treatment effect within each subgroup (Kent et al., 2020). This approach is particularly suitable for confirmatory treatment effect heterogeneity analyses in randomized clinical trials. In observational data with a large number of pre-treatment covariates, such a priori specification, which separates the issues of confounding and treatment effect heterogeneity, is often practically infeasible. We described a way in which the new method riAFT-BART can be used to perform an exploratory treatment effect heterogeneity analysis to generate scientifically meaningful hypotheses and support treatment effect discovery. Through a comprehensive simulation representative of complex confounding and heterogeneity settings with right-censored survival data, we provided new empirical evidence for the better performance of our proposed method in comparison with three existing methods. In the application to a large COVID-19 dataset, we found that oxygen saturation and white blood cell count were two key patient factors that modulated the comparative treatment effect between remdesivir and dexamethasone + remdesivir. We further exploited the posterior samples of the ISTE from riAFT-BART and the random forests model to explore clinically meaningful relationships between COVID-19 medications, patient profile and expected survival time. The results could facilitate treatment effect discovery in subpopulations and may inform personalized treatment strategies and the planning of future confirmatory randomized trials. Based on the simulation results, caution is warranted when drawing causal inferences about subpopulations in the tail regions of the generalized propensity score distribution, as the reliability of these inferences may be compromised.
Drawing on prior work (Hu et al., 2021) that combines bootstrap imputation and BART for variable selection among incomplete data, we develop a way in which riAFT-BART can be used to select important predictors among incomplete, clustered and censored survival observations. This strategy is general enough to accommodate any missing data pattern and is highly flexible, as it leverages BART to model the relationship between survival times and covariates. As demonstrated by both the simulations and the case study, this elevated modeling flexibility gave rise to substantially better variable selection results, particularly in terms of identifying predictors of complex functional forms, compared with existing variable selection methods suitable for clustered survival data. In general, we recommend smaller values of the selection threshold, such as π = 0.1 or π = 0.2, when employing riAFT-BART for variable selection. Furthermore, the optimal selection threshold value can be determined through cross-validation: the most suitable threshold value ought to yield a set of selected variables with better cross-validated predictive performance.
There are several important avenues for further research. First, the riAFT-BART model could be extended to include random slopes and accommodate cluster-level covariates. Correspondingly, a variable selection method could be developed to identify important predictors at both the individual level and the cluster level. Second, using flexible modeling techniques alone will not address violations of the no unmeasured confounding assumption required for causal inference. Developing a formal sensitivity analysis approach (Hu et al., 2022) for unmeasured confounding would be a worthwhile and important contribution. Finally, our bootstrap imputation-based variable selection procedure can be computationally demanding on a large dataset. Although the bootstrap resampling can be computed in parallel on multiple cores when such resources are available, it would be worthwhile to develop strategies that could offer substantial computational savings.

Web-based Supplementary Materials for "A new method for clustered survival data: Estimation of treatment effect heterogeneity and variable selection"

In each Gibbs iteration, log Z_ik ∼ N(f(A_ik, X_ik) + b_k, σ²). Using the centered complete-data survival times y^cent,c_ik, the joint posterior is formed, where W_h is the hth binary tree structure and M_h = (μ_1h, . . ., μ_{c_h}h)^T is the set of c_h terminal node parameters associated with tree structure W_h. For given values of A_ik and X_ik in the predictor space, the binary tree function returns the parameter μ_lh, l ∈ {1, . . ., c_h}, associated with the terminal node of the predictor subspace into which {A_ik, X_ik} falls. We can draw the values of the BART sum-of-trees model parameters, μ_lh and σ², directly from the fitted BART model; the posterior of α_k is used for parameter expansion. Complete derivations of the posterior distributions are also provided in Web Section S1. A single iteration of the riAFT-BART sampling algorithm proceeds through the following steps:
1. Update b_k, τ² and α_k from their respective posterior distributions.
2. Using log y^cent,c_ik − b_k as the responses and {A_ik, X_ik} as the covariates, update the BART sum-of-trees model parameters μ_lh and σ² using the Bayesian backfitting approach. Directly update f(A_ik, X_ik) using the updated BART model, for i = 1, . . ., n_k, k = 1, . . ., K.

For each {i, k}: because we use the centered responses log(y^cent_ik) = log(y_ik) − μ̂_AFT in the posterior computation, we add μ̂_AFT back to the posterior draws of f(A_ik, X_ik) in the final output.

Supplementary figures and tables
Web Figure 1: VIP_d, d = 1, . . ., D.

Figure 1 Relative biases (Panel A) and root-mean-squared errors (RMSE) (Panel B) among 40 generalized propensity score subgroups under 6 data configurations: (heterogeneity settings a, b, c) × (proportional hazards (PH) and nonproportional hazards (nPH)), for each of four methods: IPW-riCox, DR-riAH, PEAMM and riAFT-BART. Three pairwise treatment effects were estimated by averaging the individual survival treatment effect (based on 3-week survival probability) across individuals in each subgroup. Each boxplot visualizes the distribution of relative biases or RMSE for the 40 subgroups, each averaged across 250 simulation runs.

Figure 2 The distribution, across 250 data replications, of the numbers of selected noise predictors and useful predictors for each of five methods: riAFT-BART, PEAMM, FrailtyHL, FrailtyPenal and riCox, with clustered survival data generated under both proportional hazards (PH) and nonproportional hazards (nPH). The total number of useful predictors is 8 and the total number of noise predictors is 20. There are K = 10 clusters, each with a size of 200; the total sample size is 2000. The overall proportion of missingness is 40%.

Figure 4
Figure 3 Power of each of five methods: riAFT-BART, PEAMM, FrailtyHL, FrailtyPenal and riCox, for selecting each of the 8 useful predictors with clustered survival data generated under proportional hazards (PH) and nonproportional hazards (nPH), based on 250 data replications. There are K = 10 clusters, each with a size of 200; the total sample size is 2000. The overall proportion of missingness is 40%. Filled symbols represent the PH setting, and open symbols correspond to the nPH setting.

Figure 5 Final random forests model fit to the posterior mean of the individual survival treatment effect comparing remdesivir and dexamethasone + remdesivir. Values in each node correspond to the posterior mean, in terms of the difference in log survival days, for the subgroup of individuals represented in that node. Uncertainty intervals were obtained by pooling the posterior samples arising from the multiple imputed datasets. WBC: white blood cell.
Overlap assessment for data simulated under moderate covariate overlap. Each panel presents boxplots by treatment group of the true generalized propensity scores for one of three treatments, and for every unit in the sample. The left panel presents treatment 1, the middle panel treatment 2, and the right panel treatment 3.
Web Figure 3: Relative biases (Panel A) and root-mean-squared errors (RMSE) (Panel B) among 40 generalized propensity score subgroups under 6 data configurations: (heterogeneity settings a, b, c) × (proportional hazards (PH) and nonproportional hazards (nPH)), for each of four methods: IPW-riCox, DR-riAH, PEAMM and riAFT-BART. Three pairwise treatment effects were estimated by averaging the individual survival treatment effect (based on 3-week restricted mean survival time) across individuals in each subgroup. Each boxplot visualizes the distribution of relative biases or RMSE for the 40 subgroups, each averaged across 250 simulation runs.
Web Figure 4: The precision, recall, F1 score and type I error for each of five methods: riAFT-BART, PEAMM, riCox, FrailtyHL and FrailtyPenal, with data generated under proportional hazards, based on 250 data replications. Imputation was performed on 100 bootstrap samples of each replication dataset, using the imputation method mice. There are K = 10 clusters, each with a size of 200; the total sample size is 2000. The overall proportion of missingness is 40%.
Web Figure 5: The precision, recall, F1 score and type I error for each of five methods: riAFT-BART, PEAMM, riCox, FrailtyHL and FrailtyPenal, with data generated under nonproportional hazards, based on 250 data replications. Imputation was performed on 100 bootstrap samples of each replication dataset, using the imputation method mice. There are K = 10 clusters, each with a size of 200; the total sample size is 2000. The overall proportion of missingness is 40%.
Web Figure 7: The Kaplan-Meier survival curves for the three treatment groups. The three treatment options are A = 1: dexamethasone, A = 2: remdesivir, and A = 3: dexamethasone + remdesivir.
Web Figure 8: Overlap assessment for the three treatment groups in the COVID-19 dataset. Each panel presents boxplots by treatment group of the generalized propensity scores, estimated by Super Learner, for one of three treatments, and for every individual in the sample. The left panel presents treatment 1 = dexamethasone, the middle panel treatment 2 = remdesivir, and the right panel treatment 3 = dexamethasone + remdesivir.
Web Figure 9: Examining the effect of hospital site on patient survival in terms of log survival days, represented by the posterior means and credible intervals of the random intercepts b_k, k = 1, . . ., 6, for the COVID-19 case study.

Table 2
Simulation results for each variable selection approach performed on the fully observed data and among incomplete data with 40% overall missingness in the covariates, under proportional hazards and nonproportional hazards. Imputation was conducted via mice. For each of the five methods, we show results corresponding to the best threshold value of π (based on F1).
With riAFT-BART, a Metropolis-within-Gibbs procedure was employed for posterior inferences about treatment effects on patient survival. The observed responses y_ik were first centered via the following two steps: (i) fit a parametric intercept-only accelerated failure time model assuming log-normal residuals, and estimate the intercept μ̂_AFT and the residual scale σ̂_AFT; (ii) transform the responses as y^cent_ik = y_ik exp(−μ̂_AFT). We use data augmentation to deal with right censoring. Working with the centered responses y^cent_ik, when ∆_ik = 0 we impute the unobserved, centered survival times z_ik from a normal distribution truncated below at the observed censoring time on the log scale.
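The truncated-normal imputation step can be sketched with Python's standard-library NormalDist via inverse-CDF sampling; this is an illustrative translation of the step described above, not the paper's implementation, with mu standing in for f(A_ik, X_ik) + b_k.

```python
import random
from statistics import NormalDist

def impute_censored(log_c, mu, sigma, rng):
    """Data-augmentation draw for a right-censored unit: sample the latent
    log survival time from N(mu, sigma^2) truncated below at the observed
    log censoring time log_c, by mapping a uniform draw on [F(log_c), 1)
    through the normal inverse CDF."""
    nd = NormalDist(mu, sigma)
    lo = nd.cdf(log_c)                   # probability mass below the truncation point
    u = lo + (1.0 - lo) * rng.random()   # uniform on [F(log_c), 1)
    return nd.inv_cdf(u)
```

Every draw from this sampler lies above the censoring time by construction, which is exactly the constraint the data augmentation step must enforce.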

Table 4:
Baseline variables of the COVID-19 data.Summary statistics are represented as mean and standard deviation (SD) for continuous variables and N (%) for discrete variables.
Abbreviations: BP = blood pressure; SOFA = Sequential organ failure assessment