Conditional Cross-Design Synthesis Estimators for Generalizability in Medicaid

While much of the causal inference literature has focused on addressing internal validity biases, both internal and external validity are necessary for unbiased estimates in a target population of interest. However, few generalizability approaches exist for estimating causal quantities in a target population when the target population is not well-represented by a randomized study but is reflected when additionally incorporating observational data. To generalize to a target population represented by a union of these data, we propose a class of novel conditional cross-design synthesis estimators that combine randomized and observational data, while addressing their respective biases. The estimators include outcome regression, propensity weighting, and double robust approaches. All use the covariate overlap between the randomized and observational data to remove potential unmeasured confounding bias. We apply these methods to estimate the causal effect of managed care plans on health care spending among Medicaid beneficiaries in New York City.


BACKGROUND
When estimating causal effects, randomized data often yield unbiased estimates of causal effects for the population represented by the study (i.e., internal validity). However, these estimates may not reflect causal effects in the target population (i.e., external validity) and, furthermore, may not represent subsets of the target population. Observational data may be more representative of the target population and, hence, have external validity, but they are potentially affected by unmeasured confounding. These challenges arise in settings ranging from clinical trials that exclude certain patient subsets (Prentice et al., 2005) to policy evaluation studies that aim to inform deployment in a different population (Attanasio et al., 2003; Kern et al., 2016). While much of the causal inference literature has focused on addressing internal validity biases, both internal and external validity are necessary for unbiased estimates.
Although generalizability and transportability methods exist for extending inference from a randomized study to a target population, few leverage a combination of randomized and observational data to address each data source's shortcomings (Degtiar and Rose, 2021). Approaches that do combine individual-level randomized and observational data face limitations when the target population does not fully overlap with the randomized data and the observational data have unmeasured confounding. Existing techniques extrapolate from the randomized data beyond their support (Attanasio et al., 2003; Hill, 2011; Kern et al., 2016), assume the included observational data have no unmeasured confounding (Kern et al., 2016), or allow for unmeasured confounding but assume treatment effects are identical within strata of effect modifiers, which may not hold with continuous effect modifiers (Rosenman et al., 2020). Cross-design synthesis methods combine randomized and observational data, often relying on a binary flag that determines eligibility to be in the randomized subset of the data, which requires overlap membership to be known (Begg, 1992; Kaizar, 2011; Greenhouse et al., 2017). Bayesian calibrated risk-adjusted modeling, currently only deployed in the context of Cox proportional hazards survival regression, necessitates a third external data source that has strong overlap with both the randomized and observational data (Varadhan et al., 2016; Henderson et al., 2017). An existing 2-step regression approach assumes that the randomized covariate distribution is subsumed in the observational data distribution and does not directly extend to estimating population treatment-specific means rather than average treatment effects.
We present a novel class of methods, which we refer to as conditional cross-design synthesis (CCDS) estimators, that address several limitations of existing estimators that incorporate outcome information from randomized and observational data. All CCDS approaches estimate a conditional bias term from the overlapping support between randomized and observational data that is then used to 'debias' observational data estimates. These techniques are robust to unmeasured confounding in the observational data and positivity violations for selection into the randomized data. The estimators include outcome regression, 2-stage outcome regression, inverse probability weighting, and double robust augmented inverse probability weighting approaches. Our implementation allows for the incorporation of ensemble machine learning to estimate the regression components of the various estimators, minimizing reliance on misspecified parametric regressions.
We apply our class of CCDS estimators to a study in New York City (NYC) Medicaid Managed Care (MMC), which provides health insurance to most New York Medicaid beneficiaries. Beneficiaries who do not choose a health plan are randomly assigned to one. However, the 7% of NYC beneficiaries who are randomized are not representative of the broader NYC Medicaid population. Of the remaining 93% of enrollees who actively chose their health plan (i.e., the observational beneficiaries), some are not well-represented by any randomized beneficiaries. This motivates our CCDS approaches that combine randomized and observational data to estimate health plan-specific causal effects on health care spending in the full NYC Medicaid population.
Section 2 defines notation and the estimand of interest. Section 3 reviews standard generalizability assumptions, describes our relaxation of two of the assumptions through the combination of randomized and observational data, and identifies the estimand of interest under our relaxed assumptions. Section 4 presents the novel CCDS estimators and the limited available alternative approaches. We evaluate all estimators through a simulation study in Section 5 that highlights settings where each CCDS estimator can be anticipated to perform well. Section 6 applies these methods to our NYC Medicaid study examining the impact of managed care plans on health care spending. Section 7 concludes with a discussion.

Notation
The target population of interest is represented by a target sample. A portion of the target sample is randomized to the intervention (i.e., managed care plans) and the remaining individuals are observational. Hence the target sample is a union of randomized and observational data. We observe $n = n_{\text{RCT}} + n_{\text{obs}}$ independent draws from an underlying probability distribution $P \in \mathcal{M}$, where $\mathcal{M}$ is a statistical model, namely, a collection of possible probability distributions. Each of these draws consists of an outcome $Y \in \mathbb{R}$, the intervention $A \in \mathcal{A}$, the vector of covariates $X \in \mathcal{R} \subset \mathbb{R}^d$, where $\mathcal{R}$ is the region of support in the target population's covariate distribution, and an indicator for selection into the randomized group $S \in \mathcal{S} = \{0, 1\}$. Thus, the observational unit for the target sample is $O = (Y, A, X, S)$.
The data generating processes which result in randomized and observational data realizations differ. The randomized data consist of $n_{\text{RCT}}$ i.i.d. realizations conditional on selection into the randomized group, $S = 1$. The observational unit for the randomized data is $O_{\text{RCT}} = (Y, A, X, S = 1) \sim (Y, A, X \mid S = 1) \equiv P_{\text{RCT}}$. Similarly, the observational data consist of $n_{\text{obs}}$ i.i.d. draws conditional on selection into the observational study, $S = 0$. The observational unit for the observational data is thus $O_{\text{obs}} = (Y, A, X, S = 0) \sim (Y, A, X \mid S = 0) \equiv P_{\text{obs}}$.
Figure 1: Overlap and Nonoverlap Regions in the Target Population. The region of support in the target population's covariate distribution, $\mathcal{R}$, is the union of the randomized group's support ($\mathcal{R}_{\text{RCT}}$) and the observational group's support ($\mathcal{R}_{\text{obs}}$). Partial overlap exists between $\mathcal{R}_{\text{RCT}}$ and $\mathcal{R}_{\text{obs}}$: $\mathcal{R}_{\text{overlap}}$ corresponds to the region of overlap (i.e., region of common support in the covariate distributions) between data sources; $\mathcal{R}_{\text{obs-only}}$ corresponds to the region only represented in the observational data and $\mathcal{R}_{\text{RCT-only}}$ corresponds to the region only represented in the randomized data.

Estimand
As per the potential outcomes framework, let $Y^a$ be the potential outcome if intervention $a$ were assigned. The estimands of interest for our intervention are the target population treatment-specific means (PTSMs): $E(Y^a)$ for all $a \in \mathcal{A}$, as have been explored in prior analyses with multiple unordered treatments (Rose and Normand, 2019). In contrast, study treatment-specific means (STSMs) are mean counterfactual outcomes for a given treatment over a given study population: $E(Y^a \mid S = s)$ for all $a \in \mathcal{A}$, $s \in \mathcal{S}$. Because no given health plan serves as a natural "control" comparator, treatment-specific means rather than the target population average treatment effect (PATE: $E(Y^a) - E(Y^{a'})$ for two treatments $a \neq a'$) are of interest.

Defining and Determining Overlap and Nonoverlap Regions
Covariate distributions differ between randomized and observational groups: $P(X \mid S = 1) \neq P(X \mid S = 0)$. Furthermore, $P(X = x \mid S = 0) = 0$ and $P(X = x' \mid S = 1) = 0$ for some $x, x' \in \mathcal{R}$. Namely, a portion of the observational data is not well-represented in the randomized data and potentially a portion of the randomized data may not be well-represented in the observational data. However, there is a region of overlap between randomized and observational covariate distributions. Overlap refers to common support across randomized and observational populations in the distribution of outcome predictors associated with study selection (or effect modifiers associated with study selection if the estimand of interest had been an average treatment effect): $\mathcal{R}_{\text{overlap}} = \{x \in \mathcal{R} : P(X = x \mid S = 1) > 0 \text{ and } P(X = x \mid S = 0) > 0\}$ (Figure 1). Regions of nonoverlap therefore correspond to regions of the covariate distribution where either only observational individuals ($\mathcal{R}_{\text{obs-only}}$) or only randomized individuals ($\mathcal{R}_{\text{RCT-only}}$) would be observed, i.e., regions where units in one study population are not eligible to be in the other study population.
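The target sample covariate distribution ($\mathcal{R}$) can therefore be decomposed as: $\mathcal{R} = \mathcal{R}_{\text{overlap}} \cup \mathcal{R}_{\text{obs-only}} \cup \mathcal{R}_{\text{RCT-only}}$. Thus, $\mathcal{R}_{\text{obs}} = \mathcal{R}_{\text{overlap}} \cup \mathcal{R}_{\text{obs-only}}$ and $\mathcal{R}_{\text{RCT}} = \mathcal{R}_{\text{overlap}} \cup \mathcal{R}_{\text{RCT-only}}$. $\mathcal{R}_{\text{RCT-only}}$ and $\mathcal{R}_{\text{obs-only}}$ may be null sets. Let $R$ be an indicator for being in the respective region, e.g., $R_{\text{overlap}} = 1(\text{membership in } \mathcal{R}_{\text{overlap}})$.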
At times, rather than a union of a randomized and observational study being representative of the target population, a reweighted union of the two studies may be representative, such as when working with a random sample of observational data for computational efficiency (which we do for our analysis), or when data are collected through survey sampling. In this case, through reweighting, one can map the randomized and observational study regions of covariate support, $\mathcal{R}_{\text{RCT}}$ and $\mathcal{R}_{\text{obs}}$, into a transformation, $\mathcal{R}_{\text{RCT}} \to \mathcal{R}^*_{\text{RCT}}$ and $\mathcal{R}_{\text{obs}} \to \mathcal{R}^*_{\text{obs}}$, in which the decomposition above of $\mathcal{R}^* = \mathcal{R}^*_{\text{overlap}} \cup \mathcal{R}^*_{\text{obs-only}} \cup \mathcal{R}^*_{\text{RCT-only}}$ holds. Note that this includes the possibility of the target population being represented by just the observational data.
While the above definition of overlap corresponds to a population feature, nonoverlap can also occur due to having a finite sample; by chance, the data may be sparse in some region of the covariate distribution even though that region has support. In practice, we will account for overlap as both a population and sample feature, determining regions of the covariate space that have common support and observed data from both groups. To estimate the region of overlap, R overlap , we extend a data-driven approach for determining areas of treatment overlap based on propensity scores for treatment assignment (Nethery et al., 2018). We adopt a similar approach for the propensity score for study selection π S = P (S|X), but on the logit scale to give more granularity to very low and very high propensity scores. The region of overlap consists of points in the logit of the propensity score for selection that have at least β observations from each study group within an interval of size α around that point (Nethery et al., 2018).
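The interval rule described above can be sketched in a few lines. This is a minimal illustration and not the implementation of Nethery et al. (2018): the function name `estimate_overlap`, the choice to centre an interval of width α on each unit's own logit propensity score, and the toy inputs are all assumptions made for the example.

```python
import numpy as np

def estimate_overlap(logit_ps, s, alpha, beta):
    """Flag units that fall in the estimated overlap region.

    Assumption for this sketch: a unit is in the overlap region if an
    interval of width `alpha` centred on its logit propensity score for
    selection contains at least `beta` units from each study group
    (s == 1 randomized, s == 0 observational).
    """
    logit_ps = np.asarray(logit_ps, dtype=float)
    s = np.asarray(s)
    in_overlap = np.ones(logit_ps.shape[0], dtype=bool)
    for group in (0, 1):
        group_scores = np.sort(logit_ps[s == group])
        # Count group members inside [score - alpha/2, score + alpha/2].
        lower = np.searchsorted(group_scores, logit_ps - alpha / 2, side="left")
        upper = np.searchsorted(group_scores, logit_ps + alpha / 2, side="right")
        in_overlap &= (upper - lower) >= beta
    return in_overlap

# Hypothetical logit propensity scores: two observational-only units,
# four units with neighbours from both groups, two randomized-only units.
logit_ps = np.array([-3.0, -2.9, -0.1, 0.0, 0.1, 0.05, 2.9, 3.0])
s = np.array([0, 0, 0, 1, 0, 1, 1, 1])
overlap = estimate_overlap(logit_ps, s, alpha=0.5, beta=1)
```

Larger values of β or smaller values of α make the rule more conservative, since more nearby observations from both groups are then required before a point counts as supported.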
1b. Mean conditional exchangeability in the randomized group and constant conditional bias in the observational group.
5b. Overlap between study samples: there exists a non-null set $\mathcal{R}_{\text{overlap}}$ such that $P(X = x \mid R_{\text{overlap}}) > 0 \Rightarrow P(S = s \mid X = x) > 0$ with probability 1 for all $s \in \mathcal{S}$.
Assumption 1b corresponds to the same conditional bias relationship holding in the overlap region as in the observational group's nonoverlap region. See Appendix 1 in the supplementary material for a derivation and further motivation for these weakened identifiability assumptions, in addition to a restatement of Assumption 1b with respect to the unmeasured confounders that are implicitly being integrated over.
More specifically (and more weakly), Assumption 1b must hold in expectation over the $X$ covariate distribution in the observational data (mean constant conditional bias). Assumption 1b states that the relationship between bias and measured covariates is unrelated to being in the overlap vs. nonoverlap regions, i.e., that the distribution of unmeasured confounders does not differ between $\mathcal{R}_{\text{overlap}}$ and $\mathcal{R}_{\text{obs}}$, conditioning on $X$ and $A$. This assumption is strictly weaker than the no unmeasured confounding assumption in that the assumption of no unmeasured confounding is nested within Assumption 1b: with no unmeasured confounding, $b(a, x) = 0$. Constant conditional bias can be seen as an extension of the assumption made for cross-design synthesis (Kaizar, 2011), except that constant conditional bias is allowed to depend on measured covariates and a dichotomization of the covariate distribution support into overlap and nonoverlap regions replaces predefined eligibility determining overlap region membership. We hence assume that the covariates $X$ capture all factors that would lead to differential bias in the overlap versus nonoverlap regions. This suggests that we can estimate bias in the overlap region and use those estimates to extrapolate to and correct for bias in the observational group's nonoverlap region.
Assumption 1b is untestable, just as is the assumption of no unmeasured confounding; it would fail if the processes that drove unmeasured confounding differed between overlap and nonoverlap regions in a way that was not captured by measured covariates, or if the distribution of the unmeasured confounder differed between those regions in such a way as to create different conditional expectation relationships. This could occur, for example, if an unmeasured confounder drove overlap region membership. If the constant bias assumption is not reasonable for a given setting, one can alternatively perform sensitivity analysis to obtain bounds on PTSMs (Appendix 2).
In practice, Assumption 5b's region of overlap should be sufficiently large to learn the bias term, i.e., sufficiently large for Assumption 1b to hold. Empirical violations of Assumption 5b are partially testable using π S ; the existence of overlap in the propensity score distributions between randomized and observational groups provides evidence for this assumption. Observational group propensity scores may also be close to zero and lack overlap with randomized group propensity scores when the observational group size far exceeds the randomized group.
Of note, the $X$ needed for Assumptions 1b and 2 and the $X$ needed for Assumptions 4 and 5b may differ. As a result, the region of overlap should exist with respect to outcome predictors but should be large enough to ensure that Assumption 1b holds. It is therefore reasonable to use an $X$ matrix that contains all outcome predictors and confounders to assess all assumptions. Thus, as described earlier, our $X$ is the union of the covariate sets needed for all assumptions to hold.

Identification
Under the modified assumptions above, the causal estimand of interest can be identified by a CCDS functional of the observed data, $\psi_{\text{CCDS}}(a)$. See Appendix 3 for the proof and Appendix 4 for alternative functionals that identify the PTSM, derived through different decompositions of the data.
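One way to write a functional with the three-part structure this paper describes (a randomized-group term, a preliminary observational-group term, and a bias correction learned in the overlap region) is sketched below. It is a reconstruction from the stated assumptions, not a quotation of the display derived in Appendix 3. The symbol $b(a, x)$ for the conditional bias follows the notation used with Assumption 1b, and expressing it as a difference of overlap-region regressions is an assumption of this sketch.

```latex
\begin{align*}
b(a, x) &= E(Y \mid S = 0, A = a, X = x, R_{\text{overlap}} = 1)
         - E(Y \mid S = 1, A = a, X = x, R_{\text{overlap}} = 1), \\
\psi_{\text{CCDS}}(a) &= P(S = 1)\, E\big[ E(Y \mid S = 1, A = a, X) \,\big|\, S = 1 \big] \\
 &\quad + P(S = 0)\, E\big[ E(Y \mid S = 0, A = a, X) - b(a, X) \,\big|\, S = 0 \big].
\end{align*}
```

Under no unmeasured confounding, $b(a, x) = 0$ and the second line reduces to the usual standardized observational mean.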

ESTIMATORS
We develop four novel estimators that combine randomized and observational data to estimate PTSMs relying on our CCDS framework. The novel estimators consist of outcome regression, 2-stage outcome regression, inverse probability weighting, and double robust augmented inverse probability weighting approaches.

CCDS Outcome Regression Estimator
The CCDS outcome regression (CCDS-OR) estimator uses outcome regressions to estimate the combination of the conditional distributions in $\psi_{\text{CCDS}}(a)$, where $\hat{R}_{\text{overlap}}$ is estimated as described in Section 2.3. The estimator has three terms: the first corresponds to treatment-specific mean estimates for the randomized subset of the target sample, the second provides preliminary estimates for the observational subset of the target sample, and the third debiases the preliminary observational data estimates. Implementation considerations for regression choices and a conditional treatment-specific mean version of the estimator are presented in Appendix 5.
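The three-term structure described above (randomized-group predictions, preliminary observational-group predictions, and a debiasing term learned in the overlap region) can be illustrated with a small sketch. It assumes ordinary least squares in place of the ensemble regressions, takes the overlap indicator as given, and uses hypothetical names (`fit_ols`, `predict_ols`, `ccds_or`) and a noise-free toy data set with a constant observational bias of 3. It is not the authors' implementation.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit with an intercept; returns the coefficient vector."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

def predict_ols(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

def ccds_or(y, a, X, s, r_overlap, arm):
    """Outcome-regression sketch of a CCDS-style treatment-specific mean."""
    rct = (s == 1) & (a == arm)
    obs = (s == 0) & (a == arm)
    q_rct = fit_ols(X[rct], y[rct])                              # randomized data
    q_obs = fit_ols(X[obs], y[obs])                              # observational data
    q_rct_ov = fit_ols(X[rct & r_overlap], y[rct & r_overlap])   # overlap only
    q_obs_ov = fit_ols(X[obs & r_overlap], y[obs & r_overlap])   # overlap only
    term_rct = predict_ols(q_rct, X[s == 1])      # term 1: randomized subset
    prelim_obs = predict_ols(q_obs, X[s == 0])    # term 2: observational subset
    # Term 3: bias learned in the overlap region, predicted for S = 0 units.
    bias = predict_ols(q_obs_ov, X[s == 0]) - predict_ols(q_rct_ov, X[s == 0])
    return (term_rct.sum() + (prelim_obs - bias).sum()) / len(y)

# Toy data (hypothetical): one covariate, two arms, overlap for 0 < x < 1.
idx = np.arange(60)
x = np.linspace(-1.0, 2.0, 60)
r_overlap = (x > 0) & (x < 1)
s = np.where(x >= 1, 1, np.where(x <= 0, 0, idx % 2))
a = (idx // 2) % 2 + 1
y = 1 + 2 * x + 3 * (s == 0) + 0.5 * (a == 2)   # +3 is confounding bias in S = 0
X = x.reshape(-1, 1)
estimate = ccds_or(y, a, X, s, r_overlap, arm=1)  # true mean for arm 1 is 2.0
```

Because the toy bias is constant and the outcome is exactly linear, the debiasing term removes the confounding completely here; with real data the quality of the correction depends on how well the overlap-region regressions are estimated.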

2-stage CCDS Outcome Regression Estimator
To avoid overfitting to overlap region trends, the 2-stage CCDS estimator replaces the debiasing term, the third term, with a 2-stage regression, with $\hat{g}(X)$ an estimator of a regression function described below. The debiasing term $\hat{b}(S_i = 1, a, X_i)$ is estimated for the observational data from this fixed weighted regression.
Namely, Stage (1) estimates an intermediate bias term using randomized overlap data: bias estimates are the difference in predicted counterfactual outcomes using regressions fit to the overlap region of the observational vs. randomized data, creating predictions for the randomized overlap data. As there is no bias in expectation in the randomized overlap data, any estimated bias stems from the regression $\hat{Q}(S_i = 0, A_i = a, X_i)$. Stage (2) then fits a weighted regression with the intermediate bias estimates from Stage (1) as the outcome. This second stage focuses on the relationship between the bias estimates in the overlap region and measured covariates. The debiasing term $\hat{b}(S_i = 1, a, X_i)$ is then estimated using the observational data and the fixed $\hat{g}(X)$ fit in Stage (2).
The weight, $\hat{w}_{\text{bias}}$, standardizes the randomized data to the observational data so that the bias term is estimated for the covariate distribution of interest. The weights follow from $P(S = 0) = E[P(S = 0 \mid X)] = E\left[\frac{1(S = 1, R_{\text{overlap}} = 1)\, P(S = 0 \mid X)}{P(R_{\text{overlap}} = 1 \mid S = 1, X)\, P(S = 1 \mid X)}\right]$. Reweighting will frequently not face issues when positivity of selection violations occur because $S = 1$ data are used to estimate the bias term and thus should not have many values close to zero for $\hat{P}(S_i = 1 \mid X_i)$, which is in the denominator of the weight. Thus, while weighting is not required in such a 2-stage approach, the weights add robustness compared to an unweighted approach without common drawbacks of weighting, such as variance inflation due to unstable weights.
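The two stages and the weight described above can be sketched as follows. This is a minimal illustration under stated assumptions: linear regressions replace the ensemble fits, the selection propensities `p_s1` and overlap-membership propensities `p_r1` are supplied as known arrays rather than estimated, and the function names and toy data are hypothetical.

```python
import numpy as np

def fit_wls(X, y, w=None):
    """(Weighted) least squares with an intercept via row rescaling."""
    design = np.column_stack([np.ones(len(y)), X])
    root_w = np.ones(len(y)) if w is None else np.sqrt(w)
    coef, *_ = np.linalg.lstsq(design * root_w[:, None], y * root_w, rcond=None)
    return coef

def predict(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

def two_stage_bias(y, a, X, s, r_overlap, p_s1, p_r1, arm):
    """Return debiasing-term predictions for the observational (S = 0) units."""
    rct_ov = (s == 1) & r_overlap
    obs_ov = (s == 0) & r_overlap
    q_rct = fit_wls(X[rct_ov & (a == arm)], y[rct_ov & (a == arm)])
    q_obs = fit_wls(X[obs_ov & (a == arm)], y[obs_ov & (a == arm)])
    # Stage 1: intermediate bias, predicted for the randomized overlap units.
    stage1 = predict(q_obs, X[rct_ov]) - predict(q_rct, X[rct_ov])
    # Weight: P(S=0|X) / {P(R_overlap=1|S=1,X) P(S=1|X)} on those same units.
    w_bias = (1 - p_s1[rct_ov]) / (p_r1[rct_ov] * p_s1[rct_ov])
    # Stage 2: weighted regression of the Stage 1 estimates on covariates.
    g = fit_wls(X[rct_ov], stage1, w_bias)
    return predict(g, X[s == 0])

# Toy data (hypothetical): constant observational bias of 3, overlap for 0 < x < 1.
idx = np.arange(60)
x = np.linspace(-1.0, 2.0, 60)
r_overlap = (x > 0) & (x < 1)
s = np.where(x >= 1, 1, np.where(x <= 0, 0, idx % 2))
a = (idx // 2) % 2 + 1
y = 1 + 2 * x + 3 * (s == 0) + 0.5 * (a == 2)
X = x.reshape(-1, 1)
p_s1 = np.where(x >= 1, 1.0, np.where(x <= 0, 0.0, 0.5))  # assumed known
p_r1 = np.where(r_overlap, 1.0, 0.0)                       # assumed known
bias_hat = two_stage_bias(y, a, X, s, r_overlap, p_s1, p_r1, arm=1)
```

The weights are only ever evaluated on randomized overlap units, which is why values of `p_s1` equal to zero elsewhere in the sample do not cause division problems in this sketch.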
Appendix 6 presents a 2-stage approach that does not restrict itself to the overlap region (2-stage whole data), which suffers from the same reliance on extrapolating beyond randomized group support as does using only the randomized data, highlighting the importance of focusing on the overlap region to debias observational data.

CCDS Inverse Probability Weighting Estimator
The CCDS inverse probability weighting (CCDS-IPW) estimator with stabilized weights uses propensity models to estimate PTSMs (see Appendix 7 for the proof). Here, positivity of selection violations will usually not lead to unstable weights since $\hat{P}(R_{\text{overlap},i} = 1 \mid S_i = 1, X_i)\,\hat{P}(S_i = 1 \mid X_i)$ only appears in the denominator for $\hat{w}_4$; these individuals, by overlap region construction, have propensity scores for selection bounded away from zero. Normalizing weights by their sum adds stability (Robins et al., 2000). Nonetheless, this method can face lack of efficiency and potentially unstable estimates, particularly from estimating the second bias term contribution weighted by $\hat{w}_4$, as the components are estimated using small subsets of the data relative to the overall sample: only individuals randomized in the overlap region on a given treatment arm. This problem is exacerbated with many treatment groups, particularly for rare treatments.
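The effect of normalizing weights by their sum can be shown with a generic weighted mean. The sketch below does not reproduce the CCDS-IPW weights themselves; it only illustrates sum-normalization together with a lower bound on denominators. Reading "trimmed at 0.001" (the value reported for the analyses in this paper) as a floor is an assumption, as are the function names and toy numbers.

```python
import numpy as np

def trim(p, floor=0.001):
    """Bound propensities (or their products) away from zero.

    Assumption: trimming is implemented as a floor on the denominator.
    """
    return np.maximum(np.asarray(p, dtype=float), floor)

def normalized_weighted_mean(y, w):
    """Weighted mean with weights normalized by their sum."""
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    return np.sum(w * y) / np.sum(w)

# Hypothetical outcomes and denominators; the third denominator is tiny.
y = np.array([1.0, 2.0, 3.0, 4.0])
p = np.array([0.5, 0.25, 0.0001, 0.5])
weights = 1.0 / trim(p)                  # 2, 4, 1000, 2 after trimming
estimate = normalized_weighted_mean(y, weights)
```

Even after trimming, the third unit dominates the estimate, which is the kind of instability the paragraph above attributes to weights built from small subsets of the data.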

CCDS Augmented Inverse Probability Weighting Estimator
Our double robust estimator provides consistent estimates when either the outcome regressions or product of propensity regressions are consistently estimated in each of the terms of $\psi_{\text{CCDS}}(a)$. The CCDS augmented inverse probability weighted (CCDS-AIPW) estimator is built from the outcome regressions and weights defined above. CCDS-AIPW is a double robust estimator that is asymptotically efficient when the propensity and outcome regressions are estimated consistently. See Appendix 8 for a derivation of the efficient influence function.

Inference
Confidence intervals and standard errors in our machine-learning-based analyses were calculated using a nonparametric bootstrap (Efron and Tibshirani, 1994). When using parametric regressions, a sandwich variance approach can be used to derive sampling variance, following M-estimation theory.
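A nonparametric bootstrap of the kind referenced above resamples whole observations with replacement and recomputes the estimator on each resample. The sketch below uses a percentile interval, which is an assumption since the interval construction is not specified here; the function name, the seed, and the toy data are also illustrative.

```python
import numpy as np

def bootstrap_ci(data, estimator, n_boot=500, level=0.95, seed=0):
    """Percentile bootstrap interval for `estimator` applied to rows of `data`."""
    rng = np.random.default_rng(seed)
    n = len(data)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        # Resample rows with replacement so each unit's variables stay together.
        resample = data[rng.integers(0, n, size=n)]
        stats[b] = estimator(resample)
    tail = (1 - level) / 2
    lower, upper = np.quantile(stats, [tail, 1 - tail])
    return lower, upper

# Toy usage: interval for a sample mean of hypothetical data.
toy = np.random.default_rng(1).normal(loc=0.0, scale=1.0, size=200)
lower, upper = bootstrap_ci(toy, np.mean)
```

In the setting of this paper, each row would hold one unit's (Y, A, X, S), and the estimator would refit all regressions on every resample, which is what makes bootstrapping machine-learning-based estimators computationally expensive.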

Comparison estimators
No existing methods address both the overlap and unmeasured confounding challenges specific to our data setting. While the estimator of Rosenman et al. (2020) addresses overlap and unmeasured confounding, it assumes that treatment effects are identical between randomized and observational groups within the same stratum of effect modifiers, which is unlikely to hold in our setting. We therefore compare against two simple approaches. The first (rand estimator) fits an outcome regression using randomized data to extrapolate to the entire target population, including outside its region of support (Kern et al., 2016). This extrapolation may yield bias when the relationship between covariates and potential outcomes differs in $\mathcal{R}_{\text{overlap}}$ compared to $\mathcal{R}_{\text{obs}}$ in a way that cannot be extrapolated from the randomized data. The second (obs/rand estimator) is similar to Kern et al. (2016) and Prentice et al. (2006), though those estimators fit one outcome regression to both randomized and observational data and estimate effects for either just the observational data or just the randomized data. The obs/rand estimator we deploy here fits an outcome regression using randomized data to estimate counterfactuals for the randomized data and fits an outcome regression using observational data to estimate counterfactuals for the observational data. This approach assumes there is no unmeasured confounding in the observational data. We used outcome regressions for rand and obs/rand estimators rather than approaches that incorporate propensities for selection, as the latter will result in denominators close to zero due to lack of overlap. See Appendix 9 for AIPW versions of the rand and obs/rand estimators.
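The two comparison approaches described above can be sketched with plain least squares. The function names, the linear regressions, and the noise-free toy data (with a constant observational bias of 3) are assumptions for illustration rather than the authors' code.

```python
import numpy as np

def fit_ols(X, y):
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

def predict_ols(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

def rand_estimator(y, a, X, s, arm):
    """Fit on randomized data only, then predict for the whole target sample."""
    rct = (s == 1) & (a == arm)
    return predict_ols(fit_ols(X[rct], y[rct]), X).mean()

def obs_rand_estimator(y, a, X, s, arm):
    """Separate fits per data source, each predicting for its own units."""
    rct = (s == 1) & (a == arm)
    obs = (s == 0) & (a == arm)
    pred_rct = predict_ols(fit_ols(X[rct], y[rct]), X[s == 1])
    pred_obs = predict_ols(fit_ols(X[obs], y[obs]), X[s == 0])
    return (pred_rct.sum() + pred_obs.sum()) / len(y)

# Toy data (hypothetical): true arm-1 mean is 2.0; S = 0 outcomes carry bias +3.
idx = np.arange(60)
x = np.linspace(-1.0, 2.0, 60)
s = np.where(x >= 1, 1, np.where(x <= 0, 0, idx % 2))
a = (idx // 2) % 2 + 1
y = 1 + 2 * x + 3 * (s == 0) + 0.5 * (a == 2)
X = x.reshape(-1, 1)
rand_est = rand_estimator(y, a, X, s, arm=1)
obs_rand_est = obs_rand_estimator(y, a, X, s, arm=1)
```

In this toy setting the rand estimator recovers the truth only because the outcome is exactly linear, so extrapolation is harmless. The obs/rand estimator absorbs the full confounding bias for the half of the sample that is observational.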

SIMULATION STUDIES
We designed a broad series of simulations to evaluate the finite sample performance of our novel CCDS estimators compared to alternative approaches for estimating PTSMs as well as the PATE, examining two treatment groups A ∈ {1, 2}. We assessed performance of these estimators in the presence of (1) complex data-generating mechanisms such that the randomized data do not extrapolate well outside their support, (2) unmeasured confounding in the observational data, and (3) positivity of selection violations. We also studied alternative data generating processes including different sample sizes, constant bias violations, unmeasured confounding settings, overlap settings, ratios of n RCT to n obs , positivity of selection violation settings, exchangeability of study selection violations, overlap region determination settings, propensity for selection relationships, alternative outcome models, and alternative regression fits. In total, we examined 84 different data generating scenario × regression choice combinations.
In the base case, we generated a target population of 1 million individuals from which we drew random samples of size $n = 10{,}000$, with data-generating mechanism $P(Y, S, A, X, U) = P(X)P(U \mid X)P(S \mid X, U)P(A \mid S, X, U)P(Y \mid S, A, X, U)$. The data had four independent measured confounders $X_1, \ldots, X_4 \sim N(0, 1)$; an unmeasured confounder $U \sim \text{Binom}(0.5)$; and selection into the randomized group driven by the strongest confounder such that there existed $\mathcal{R}_{\text{RCT-only}}$ ($S = 1$ if $X_1 > Q_{\text{Norm}}(0.9)$), $\mathcal{R}_{\text{obs-only}}$ ($S = 0$ if $X_1 < Q_{\text{Norm}}(0.5)$), and $\mathcal{R}_{\text{overlap}}$ ($S \sim \text{Binom}(0.5)$, otherwise). This study selection process resulted in approximately a 1:4 ratio of randomized to observational individuals. Treatment assignment was $A \sim \text{Binom}(0.6)$ for $S = 1$ and $A \sim \text{Binom}(\text{logit}^{-1}(-0.8 + 0.125X_1 + 0.1X_2 + 0.075X_3 + 0.05X_4 + 0.1(X_1 + 1)^3 + 0.625U))$ for $S = 0$. The outcome was generated from the same distribution for both groups. Estimators were fit with linear outcome regressions as well as an ensemble of 8 machine learning approaches. We implemented 2000 simulation iterations and 1000 bootstrap replications to generate confidence intervals. Propensities and their products used in weight denominators were trimmed at 0.001. We implemented the simulations in R, including the SuperLearner package and the pw_overlap function for overlap region estimation (Nethery et al., 2018). See Appendix 10 for the correspondence of our simulation design with identifiability assumptions, descriptions of alternative data-generating mechanisms, and further implementation details. Our code is available on GitHub: https://github.com/idegtiar1/CCDS.
Main Findings. Results across different regression specifications highlight the estimators' relative strengths and disadvantages (Figure 2). At the base case sample size of $n = 10{,}000$, CCDS-OR and CCDS-AIPW performance was almost identical.
These estimators suffered from large variability when fitting complex regressions in a small overlap region, which we observed in the correctly specified and ensemble settings. In contrast, the CCDS-OR and CCDS-AIPW estimators showed little bias and variance when fitting underspecified main terms regressions (underspecification avoids overfitting in a small overlap region). The 2-stage CCDS estimator decreased bias and variance when using correctly specified or ensemble regressions, relative to the (1-stage) CCDS-OR estimator and CCDS-AIPW. In the main terms setting, its estimates were identical to those of CCDS-OR due to linearity and additivity.
The CCDS-IPW had the smallest bias and RMSE throughout all settings, except when fitting main terms regressions where it grossly misspecifies the propensity for selection, resulting in large remnant bias for that setting. However, the estimator's superior performance was due to the outcome model having more variability compared to the propensity models; e.g., the propensity for selection was deterministically assigned by X 1 . With a more probabilistic relationship and smaller propensity scores, CCDS-IPW's bias increased (Appendix 10.10).
While the rand estimator performed well with correctly specified regressions, using only main terms regressions resulted in large bias due to poor extrapolation beyond the randomized data support. With more flexible ensemble approaches, the rand estimator suffered from both large bias and large variance. The obs/rand estimator was subject to unmeasured confounding bias, which was present even when correctly specified regressions were fit, though it had relatively low RMSE due to the large observational sample size. For the AIPW versions of the obs/rand estimators (Appendix 9), these errors tended to be larger. This may be due to the misspecification of both outcome and propensity regressions and using less data to fit each regression due to sample splitting (Appendix Figure S1).
Estimating Overlap. The last column of Figure 2 presents results from overlap region estimation using α = 0.01 × range(logit(π S )) and β = 0.01 × min(n RCT , n obs ). With these specifications, compared to the truth, the estimated overlap region had a similar number of observational individuals (38% vs. 35%) and randomized individuals (50% vs. 48%). Performance was similar or better when estimating the overlap region in this setting and across the various other data-generating mechanisms and overlap region hyperparameter specifications we examined (Appendix 10).
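The hyperparameter choices quoted above translate directly into a short calculation. The sketch below assumes the propensity scores for selection have already been estimated; the function name and toy values are hypothetical.

```python
import numpy as np

def overlap_hyperparameters(ps, s, alpha_frac=0.01, beta_frac=0.01):
    """alpha = alpha_frac * range(logit(ps)); beta = beta_frac * min(n_RCT, n_obs)."""
    ps = np.asarray(ps, dtype=float)
    s = np.asarray(s)
    logit_ps = np.log(ps / (1 - ps))
    alpha = alpha_frac * (logit_ps.max() - logit_ps.min())
    n_rct = int(np.sum(s == 1))
    n_obs = int(np.sum(s == 0))
    beta = beta_frac * min(n_rct, n_obs)
    return alpha, beta

# Hypothetical propensity scores for four units, two per study group.
ps = np.array([0.1, 0.5, 0.9, 0.2])
s = np.array([1, 0, 0, 1])
alpha, beta = overlap_hyperparameters(ps, s)
```

With realistic sample sizes β is at least several observations; in this four-unit toy it is fractional, which simply shows the arithmetic.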
Coverage. The obs/rand estimator showed 0% coverage across all settings while all CCDS estimators were able to achieve nominal coverage, except the CCDS-IPW estimator when using grossly misspecified linear regressions (Figure 3). The rand estimator attained 0% coverage in the main terms settings for the PTSMs but 95% coverage for the PATE, due to linear regressions correctly specifying the treatment effects but not the treatment-specific means in this data-generating mechanism; coverage remains low when the PATE does not extrapolate well from the randomized data. Thus, while the bias and RMSE of the CCDS estimators may or may not decrease compared to the obs/rand estimator (as shown in Figure 2) due to remnant estimation error from misspecifying regressions, which is particularly evident with ensemble approaches, the poor coverage of the obs/rand estimator indicates this can be a false indication of precision.
Alternative Data-Generating Mechanisms. CCDS estimator bias and RMSE shrank with more overlap and with increasing proportions of randomized data. As unmeasured confounding bias increased, there was no corresponding increase in bias across CCDS estimators with correctly specified regressions and only a slight increase with ensembles. However, variance increased, reflecting additional uncertainty in settings with more unmeasured confounding. Violating the constant conditional bias assumption increased bias for the rand and all CCDS estimators, with CCDS estimators generally performing better than the rand estimator. All estimators performed poorly when the exchangeability of study selection assumption was violated. The RMSE for CCDS-IPW was most impacted by a smaller ratio of randomized to observational individuals. Overall, results for the CCDS estimators were similar across alternative data-generating mechanisms. Further details can be found in Appendix 10.

MEDICAID STUDY
Medicaid, administered by the Centers for Medicare & Medicaid Services, provides insurance for low-income and disadvantaged Americans, covering a fifth of all individuals in the United States (Centers for Medicare & Medicaid Services, 2020). As described earlier, MMC provides health insurance plans for all but certain exempt groups (Medicaid, 2020), and beneficiaries who do not actively choose a health plan are randomized to one. Understanding the impact of these individual MMC health plans on health care spending is an open question. However, generalizing from the 7% of beneficiaries who are randomized to the full NYC Medicaid population may be hampered by a lack of overlap in parts of the covariate distributions between randomized and observational (active chooser) groups. Yet, data from observational beneficiaries may be subject to potential unmeasured confounding from variables not captured in the claims data.
We estimated the causal effects of enrollment into NYC MMC health plans on health care spending for all NYC Medicaid beneficiaries with at least 6 months of follow-up, applying our novel CCDS and comparison estimators. Health care spending was examined over 6 months on the log scale, as log(spending + 1), adjusting for baseline spending decile, age, documented sex, aid group, whether the beneficiary received social security income, neighborhood, and neighborhood poverty level. Further descriptions of the data are available elsewhere. We used all 65,591 randomized beneficiaries and a 10% random subset of observational beneficiaries within the study period (2008-2012) for computational efficiency, which totaled 98,232. Baseline spending was missing for 1% of beneficiaries and was imputed to be zero (the most likely reason for missingness was no spending) along with an indicator for missingness. Regressions were fit using a SuperLearner ensemble (of glm, glmnet with α = 0.5, gam, and nnet). Propensity scores and their products used in weight denominators were trimmed at 0.001. To assess simultaneous 95% coverage, a conservative Bonferroni adjustment was made to the bootstrap confidence intervals, which used 500 replications: each marginal confidence interval was constructed at the 1 − 0.05/k level, where k = 10 plans.
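The Bonferroni adjustment described above amounts to widening each marginal interval: with k = 10 plans and a family-wise level of 0.05, each interval is built at the 1 − 0.05/10 = 0.995 level. The sketch below assumes percentile bootstrap intervals and uses hypothetical function names and placeholder bootstrap draws.

```python
import numpy as np

def bonferroni_level(family_alpha, k):
    """Marginal confidence level giving simultaneous coverage of 1 - family_alpha."""
    return 1 - family_alpha / k

def percentile_interval(boot_stats, level):
    """Percentile interval (an assumption; other bootstrap intervals exist)."""
    tail = (1 - level) / 2
    lower, upper = np.quantile(boot_stats, [tail, 1 - tail])
    return lower, upper

level = bonferroni_level(0.05, 10)                 # 0.995 for 10 health plans
boot_stats = np.arange(401, dtype=float)           # placeholder bootstrap draws
lower, upper = percentile_interval(boot_stats, level)
```

At the 0.995 level the interval endpoints are the 0.25th and 99.75th percentiles of the bootstrap distribution, so with 500 replications they are determined by only a handful of extreme draws.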
Compared to randomized beneficiaries, observational beneficiaries differed across all measured factors: the latter were slightly younger (34.3 vs. 35.5 years old), spent less at baseline ($2796 vs. $3052), were more likely to have a documented sex of female (59% vs. 40%), were less likely to live in Manhattan (13% vs. 20%) and more likely to live in Queens (28% vs. 19%), came from different aid groups, and were less likely to be eligible for social security income (2% vs. 9%) (Appendix Table S3 in Appendix 11). Effect heterogeneity within the randomized data was driven by aid group status, supplemental security income eligibility, and neighborhood effects; within the observational data it was driven by neighborhood effects and receiving aid for children, all of which were imbalanced across randomized and observational beneficiaries, highlighting the need for generalizability approaches.
Overall, across all measured covariates, observational beneficiary characteristics were imbalanced across health plans, and these characteristics were also associated with health care spending, providing empirical evidence that these variables may be confounders. While randomized beneficiaries were not representative of their observational counterparts, there was considerable covariate overlap, as measured by the propensity score for selection into the randomized subset of the data, though overlap was weakest where the observational data were most concentrated (Appendix Figure S11 in Appendix 11). Using the conservative overlap hyperparameters α = 0.01 × range(logit(π S )) and β = 0.01 × n RCT resulted in 60% of the target sample within the overlap region. The standardized mean difference in the propensity score for selection was 1.1 standard deviations, which far exceeds 0.25, one proposed threshold indicating large extrapolation, and thus supports the need for CCDS estimators. Figure 4 presents STSMs for the randomized and observational study populations and PTSMs for the NYC Medicaid target population, including results for two CCDS estimators well suited to this setting. (All estimators are available in Appendix Figure S12 in Appendix 11.) Despite higher unadjusted mean spending in the randomized group, causal estimates of STSMs in the observational data were consistently higher than estimates of STSMs in the randomized data across all health plans. This discrepancy reflects both differences in population characteristics and potential unmeasured confounding in the observational data; neither estimate aligned with rand or CCDS estimates of PTSMs, which were consistently lower than randomized and observational STSMs. Results remained similar when accounting for country-month-year correlation (Appendix 11.1).
Given the substantial overlap between randomized and observational covariate distributions, it is not surprising that CCDS estimates were in a similar range to the rand estimates of PTSMs. However, the double robust CCDS-AIPW estimates were higher than rand estimates (12.5-16.7% difference in log spending) and confidence intervals were non-overlapping for all but plans D and I. CCDS-AIPW did not show large variability with ensemble regressions, unlike in the simulations. Obs/rand estimates and observational data AIPW STSMs (which largely aligned as the observational data comprised 93% of the data) were widely discrepant from other PTSMs, suggesting a large amount of unmeasured confounding bias in the observational data. Unlike in the simulation, CCDS-IPW confidence intervals were wider than those of other CCDS estimators, which is common to IPW estimators in practice, and also reflects the difficulty of estimating propensities for multiple treatments (Appendix Figure S12). While the rand PTSM estimator could provide reasonable estimates in this setting, where there is a fair amount of overlap between randomized and observational data, the CCDS estimators were able to incorporate all data and did not rely on extrapolating spending estimates beyond the support of the randomized data.

DISCUSSION
When observational and randomized data are both available, there is potential to overcome each data type's limitations through their combination. Namely, when some individuals in the target population are not well-represented in the randomized data and the observational data have unmeasured confounding, neither data type alone can successfully generalize to the target population represented by a union of randomized and observational data. This article proposes a class of novel estimators that can surmount positivity of selection assumption violations in the randomized data and unmeasured confounding in the observational data by using common support between the data sources to remove unmeasured confounding bias.
The proposed outcome regression, propensity score, and double robust CCDS estimators have varying strengths. When the functional forms of the true data generating processes can be approximated by simple linear regressions, the double robust CCDS-AIPW estimator with linear regressions is a suitable default approach for combining randomized and observational data. Even when linear regressions do not capture the full complexity of the data generating process, simulations showed that CCDS-AIPW and CCDS-OR with main terms regressions were able to recover unbiased estimates. However, when fitting more complex regressions, these estimators may lead to unstable bias extrapolations from the overlap region, although we did not see this drawback in our NYC Medicaid data analysis, which had a larger area of overlap. When more complex regression approaches are used, the 2-stage CCDS or CCDS-IPW may also be suitable, depending on whether there is more knowledge of the outcome relationship or the propensity for selection and treatment relationships and whether selection or treatments are rare or multinomial with small probabilities. The 2-stage approach improves performance compared to the CCDS-OR estimator by stabilizing initial estimates to alleviate overfitting to overlap region trends.
In the NYC Medicaid data, the study and target population causal estimates were markedly different. Novel and existing generalizability methods helped reconcile these discrepancies by specifying a target population for which inference was desired. There were also significant differences between PTSM rand and obs/rand estimates, showcasing the need to account for both potentially poor extrapolation from the randomized data and potential unmeasured confounding in the observational data. The proposed CCDS estimators provided evidence that the observational data remained subject to unmeasured confounding bias even after adjusting for measured factors.
Our CCDS framework is sensitive to the randomized data regression in the overlap region being an accurate reflection of the truth, as highlighted in the simulation results. When the overlap region is small, the conditional mean relationships estimated from the overlap region may be misspecified, leading to bias and large variability in estimates of unmeasured confounding bias. To assess goodness of fit, investigators can compare estimates to the truth in the randomized data overlap region. Regularization and cross-validation can reduce chances of overfitting to the data, particularly with more flexible regression approaches. Further practical challenges to applying CCDS estimators in other settings may include imperfect covariate correspondence between observational and randomized data sources. Our approach assumes that, after incorporating common covariates, there are no unmeasured outcome determinants (for estimating PTSMs) or effect modifiers (for estimating PATEs) that differ in distribution between randomized and observational groups. However, if this assumption is violated, CCDS estimators often performed better than using randomized data alone.
Future extensions to the CCDS estimation framework could consider addressing positivity of treatment assignment violations, combining more than two studies (with at least one randomized and one observational), alternative approaches for determining the overlap region that allow for the degree of information borrowing to depend on the similarity of randomized and observational observations, and overlap estimation that does not rely on an estimated propensity score for selection, such as a convex hull approach (King and Zeng, 2006) or estimating common causal support (Hill and Su, 2013). Randomized and observational data commonly face multiple challenges beyond those of positivity of selection violation and unmeasured confounding discussed here. These challenges include lack of independence between observations (e.g., clustering), missing data, and measurement error. Methods for addressing such challenges can be combined with our CCDS approaches.
Our CCDS estimators have relevance to many other settings. Positivity of selection violation and unmeasured confounding arise in other studies where the target population is composed of randomized and observational subsets, or more broadly when observational data are being combined with randomized data. For example, in comprehensive cohort studies, patients who refuse randomization are enrolled in a parallel observational study (Olschewski and Scheurlen, 1985), and when randomized controlled trials are embedded in electronic health record data, the observational data can provide information on patients included in and excluded from the trial (Kibbelaar et al., 2017). Policy evaluation studies can be combined with observational data from outside the evaluation geography to estimate scale-up impacts (Attanasio et al., 2003; Kern et al., 2016). Across these settings, CCDS estimators can be used to generalize to the target population represented by the union of the randomized and observational data. CCDS could also be applied when randomized data represent the target population but will be combined with observational data to increase power, such as in clinical trials that use a mix of randomized and historical controls (Ghadessi et al., 2020), or when, in the absence of a comprehensive target sample, a combination of randomized and observational studies may more fully represent the target population than either study alone (Prentice et al., 2005; Vaitsiakhovich et al., 2018).
Generalizability methods applied to a specified target population are necessary to obtain unbiased estimates for a policy-relevant population. The internal validity of randomized studies is insufficient to obtain unbiased causal estimates; external validity also needs to be considered. The CCDS estimators presented here provide several approaches for combining randomized with observational data to make inferences that rely neither on extrapolating beyond randomized data support nor on the assumption of no unmeasured confounding in the observational data.

Derivation of Assumption 1b
To overcome violations of Assumptions 1 (mean conditional treatment exchangeability, or no unmeasured confounding) and 5 (positivity of study selection), we can leverage information from the combination of randomized and observational data.
For unmeasured confounding bias, we begin by characterizing the conditional bias in the observational group:

$$b(a, x) = E(Y^a \mid S = 0, A = a, X = x) - E(Y^a \mid S = 0, X = x).$$

The conditional bias corresponds to the average difference in potential outcomes between observational group individuals on intervention a vs. marginally, conditioning on measured covariates X = x. We could alternatively have defined conditional bias relative to a specific alternative intervention, $E(Y^a \mid S = 0, A = a', X = x)$, or relative to all other interventions, $E(Y^a \mid S = 0, A \neq a, X = x)$; the same principles hold. Mean conditional treatment exchangeability holds if and only if b(a, X) = 0 for all a ∈ A.
By randomization, mean conditional treatment exchangeability holds for the randomized group: $E(Y^a \mid S = 1, A = a, X = x) = E(Y^a \mid S = 1, X = x)$. By mean conditional exchangeability for study selection, $E(Y^a \mid S = 1, X = x) = E(Y^a \mid S = 0, X = x)$, so the conditional bias can be written as $E(Y^a \mid S = 0, A = a, X = x) - E(Y^a \mid S = 1, A = a, X = x)$. However, overlapping support between randomized and observational groups only exists in R overlap , hence $E(Y^a \mid S = 0, A = a, X = x) - E(Y^a \mid S = 1, A = a, X = x)$ can only be identified in R overlap without further assumptions to warrant the extrapolation. One extrapolation approach would be to directly extrapolate from the randomized group to obtain potential outcomes in regions of non-support (see the rand estimator in Section 4.6 of the main manuscript for an estimation strategy based on this approach), or to extrapolate for the purposes of estimating bias in regions of non-support (see the 2-stage whole data estimator in Appendix 6 of this document), but estimation relying on these strategies is sensitive to parametric assumptions needed to extrapolate beyond the randomized data's support.

We instead make an alternative assumption, Assumption 1b: b(a, x) = b(a, x|R overlap = 1); namely, that the same conditional bias relationship that holds in the region of overlap also holds in the broader support of the observational group. When estimating PTSMs, more weakly, the constant conditional bias assumption need only hold in expectation over the X distribution in the observational data:

$$E_X[b(a, X) \mid S = 0] = E_X[b(a, X \mid R_{\text{overlap}} = 1) \mid S = 0].$$

The mean constant conditional bias assumption can also be restated with respect to the unmeasured confounders that are implicitly being integrated over. Assumption 1b states that, in expectation, the bias when integrating over the distribution of unmeasured confounders in R overlap is equivalent to the bias when integrating over the distribution of unmeasured confounders in R obs . Namely, with U corresponding to unmeasured confounders, the mean constant bias assumption equates these two integrals over the distribution of U for all a ∈ A.

Sensitivity Analysis Bounds
Making no constant conditional bias assumptions, we arrive at the following functional of the observed data and potential outcomes: Proof for (1): Using the law of iterated expectations, no unmeasured confounding in the randomized group, SUTVA assumptions, and positivity assumptions, we obtain the following: Identity (1) can be used as the basis for sensitivity analysis, substituting different plausible bias relationships for b(a, x), as was done by Brumback et al. (2004). In many settings, it is unlikely that b(a, x) would have different signs for different a (this would imply that within the same level of X, individuals would have the largest outcome on the treatment they ended up on compared to other treatments). Among the various possible functional forms for the bias term presented in Brumback et al. (2004), we could assume bias would depend on measured covariates and take the form b(a, x) = β_a X. Note the similarity to the 2-stage conditional cross-design synthesis (CCDS) approach where a slightly different formulation of the bias term is estimated from the overlap region.
Identity (1) highlights that the bias from the naive obs/rand estimator in Section 4.6 that averages across randomized and observational estimates for randomized and observational units respectively is therefore:

Lines (1) and (2) follow from the law of iterated expectations. Line (3) follows from Assumption 4 of conditional exchangeability for study selection; line (4) follows from the first part of Assumption 1b: E(Y a |S = 1, A = a, X) = E(Y a |S = 1, X); line (5) adds and subtracts the same term; line (6) then follows from the constant conditional bias part of Assumption 1b; line (7) follows from Assumptions 3 and 6 of SUTVA for treatment assignment and study selection; the final quantities are well-defined by the two positivity assumptions, 2 and 5.
One can alternatively identify treatment-specific means through different decompositions of the data in lines (2)-(3) (see Appendix 4). Each functional implies different estimation strategies that rely on different auxiliary regression models.
Under this alternative formulation of Assumption 1b, along with Assumptions 2-5b, we can identify the causal estimand as follows:

Proof for ψ 3 (a):

$$E(Y^a) = E(Y^a \mid S = 1)\,P(S = 1) + E(Y^a \mid S = 0, R_{\text{overlap}} = 1)\,P(S = 0, R_{\text{overlap}} = 1) + E(Y^a \mid R_{\text{obs-only}} = 1)\,P(R_{\text{obs-only}} = 1)$$

As in the proof for ψ CCDS (a), lines (1), (2), and (8) follow from the law of iterated expectations. Line (3) follows from Assumption 4 of conditional exchangeability for study selection; line (4) follows from the first part of Assumption 1b: E(Y a |S = 1, A = a, X) = E(Y a |S = 1, X) (and the redundancy of S = 1 and R RCT ); line (5) adds and subtracts the same term; line (6) then follows from the constant conditional bias part of Assumption 1b; line (7) follows from Assumptions 3 and 6 of SUTVA for treatment assignment and study selection; the final quantities are well-defined by the two positivity assumptions, 2 and 5. For lines (9)-(10), the same steps seen in the proof of ψ 2 (a) were repeated to arrive at the final functional.
Each of these three functionals (ψ CCDS , ψ 2 , ψ 3 ) suggests a slightly different estimation procedure that relies on different auxiliary regression models for different subsets of data. For example, the outcome regression estimators of ψ 2 and ψ 3 would be as follows:

Choices between the two estimators here and the one presented in the main paper should rely on such considerations as efficiency and which regressions may better fit the data (e.g., both ψ̂ 2-OR (a) and ψ̂ 3-OR (a) rely on preliminary observational estimates estimated from regressions fit to small subsets of the data). As a reminder, ψ CCDS (a) suggests an estimation procedure in which regressions are fit using: (a) all the randomized data to estimate potential outcomes in the randomized data, (b) all the observational data to estimate preliminary potential outcomes in the observational data, and (c) the randomized data in the overlap region and the observational data in the overlap region to estimate the debiasing term for preliminary observational data estimates.
In contrast, ψ 2 (a) suggests an estimation procedure in which regressions are fit using: (a) all the randomized data to estimate potential outcomes in the randomized and in the overlap region of the observational study, (b) the observational data in the nonoverlap region to estimate preliminary potential outcomes in the nonoverlap region of the observational study, and (c) the randomized data in the overlap region and the observational data in the overlap region to estimate the debiasing term. Correspondingly, ψ 3 (a) suggests an estimation procedure in which regressions are fit using: (a) all the randomized data to estimate potential outcomes in the randomized population, (b1) the randomized data in the overlap region to estimate potential outcomes in the overlap region of the observational study, (b2) the observational data in the nonoverlap region to estimate preliminary potential outcomes in the nonoverlap region of the observational study, and (c) the randomized data in the overlap region and the observational data in the overlap region to estimate the debiasing term.
The three estimators differ in the flexibility of their regression specifications: the latter estimators let the covariate-outcome relationship differ in the overlap vs. nonoverlap regions. However, this flexibility comes at the cost of less information borrowing across the entire covariate distribution.

CCDS-OR
Each of the outcome regressions in ψ̂ CCDS-OR (a) must appropriately capture treatment effect heterogeneity, such as through including all relevant interaction terms in a least squares regression or by using flexible nonparametric approaches that discover effect heterogeneity in a data-driven fashion, such as machine learning algorithms (keeping in mind that many such approaches do not have convergence rates that result in √n-consistency). When fitting more complex algorithms for the outcome regressions, there is a potential for overfitting to the trends in the overlap region when estimating the debiasing term (third term in ψ̂ CCDS-OR (a)), even with regularization and cross-validation.
The CCDS framework can also be used to estimate conditional PTSMs for the CCDS-OR and 2-stage CCDS-OR estimators via a weighted average of randomized and debiased observational conditional means, weighted by the relative proportion of randomized and observational individuals in the target population. The CCDS-OR conditional PTSM estimator is as follows: Rather than simply using n study /n, these weights can also be replaced by sampling weights.

2-stage CCDS
In the second stage of the 2-stage CCDS estimator, a simple ĝ() function such as ĝ(X) = X^T θ̂ can prevent overfitting to the overlap region and thus provide added stability for estimating bias, particularly when fitting more complex Q̂(S, A, R, X) regressions in the first stage. Substantive knowledge can also inform choice of the ĝ() function, such as knowledge of which measured covariates can serve as proxies for unmeasured confounders.
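A minimal Python sketch of this two-stage idea follows, under strong simplifying assumptions that are ours rather than the authors': a single covariate, least squares fits for the first-stage regressions, an overlap region supplied as a known interval, and the simplest possible second stage (an intercept-only ĝ, i.e., a constant bias). All function and variable names are hypothetical, and the toy data are noise-free.

```python
from statistics import fmean


def ols(xs, ys):
    """Simple least squares fit y = b0 + b1 * x; returns a prediction function."""
    mx, my = fmean(xs), fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    b0 = my - b1 * mx
    return lambda x: b0 + b1 * x


def two_stage_ccds_ptsm(units, a, overlap):
    """units: list of (s, treatment, x, y); overlap: (low, high) bounds on x."""
    lo, hi = overlap
    fit = {}
    for s in (0, 1):  # first-stage outcome regressions, one per study type
        rows = [(x, y) for (si, ai, x, y) in units if si == s and ai == a]
        fit[s] = ols([x for x, _ in rows], [y for _, y in rows])
    # Stage 1: preliminary bias estimates for randomized units in the overlap region.
    bias = [fit[0](x) - fit[1](x)
            for (s, _, x, _) in units if s == 1 and lo <= x <= hi]
    # Stage 2: intercept-only g-hat (constant bias); an assumption of this sketch.
    g_hat = fmean(bias)
    # Randomized units use their own fit; observational units are debiased.
    preds = [fit[1](x) if s == 1 else fit[0](x) - g_hat
             for (s, _, x, _) in units]
    return fmean(preds)


# Toy data: the true mean under plan "A" is 2 + x; observed outcomes in the
# observational group (s = 0) are shifted upward by 3 to mimic confounding.
units = [
    (1, "A", 0, 2), (1, "A", 1, 3), (1, "A", 2, 4), (1, "B", 1, 10),
    (0, "A", 1, 6), (0, "A", 2, 7), (0, "A", 3, 8), (0, "A", 4, 9),
    (0, "B", 2, 20),
]
estimate = two_stage_ccds_ptsm(units, a="A", overlap=(1, 2))
```

In this toy example the sketch removes the constant shift of 3 and recovers the mean of 2 + x over all nine units (34/9). With a linear ĝ and linear first-stage fits, the debiased observational predictions would instead coincide with direct extrapolation from the randomized fit.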
In studies with fewer treatment groups and thus potentially more data in each one, it may be beneficial to subset to randomized overlap region data in a given treatment group for bias estimation to make sure to fully capture treatment effect heterogeneity (though this approach precludes borrowing strength across treatment groups).
Step 2 of the 2-stage CCDS estimator then becomes:

CCDS-IPW
To circumvent unstable weights for CCDS-IPW and other novel estimators using weights, propensity scores and their products used in weight denominators can be trimmed. However, trimming weights effectively changes the estimand of interest, thus requiring a bias-variance tradeoff (Potter, 1993; Crump et al., 2009; Lee et al., 2011).

2-stage Whole Data Outcome Regression Estimator

Estimator
An alternative to the constant conditional bias assumption is to instead extrapolate from the randomized study to regions not supported in the randomized study covariate distribution, R obs-only . If we believe that we can reliably extrapolate from the randomized data for the purpose of debiasing term estimation (although we are not confident enough to directly extrapolate potential outcomes), the 2-stage whole data (WD) outcome regression estimator would provide more power than the 2-stage CCDS approach by not restricting debiasing term estimation to the overlap region:

$$(1)\quad \hat{b}(S_i = 1, a, X_i) = \hat{Q}_i(S = 0, A = a, X)\,1(S_i = 1) - \hat{Q}_i(S = 1, A = a, X)\,1(S_i = 1)$$

One could likewise subset to randomized data in a given treatment group for bias estimation. The 2-stage WD estimator then becomes:

A similar approach was taken by Kallus et al. to estimate target population conditional average treatment effects for a target population represented by the observational data, using $Y_i\,1(S_i = 1, A_i = a)/P(A_i = a \mid S_i = 1, X_i)$ and not $\hat{Q}_i(S = 1, A = a, X)\,1(S_i = 1)$ in Stage (1) and not weighting Stage (2). Therefore, the Kallus et al. 2-stage approach optimizes for mean squared error across the covariate distribution in the randomized group rather than the observational group, a covariate distribution that does not represent the one where we wish to minimize bias. However, it does not suffer from potentially increased variability due to the weights. The Kallus et al. 2-stage approach does not directly extend to estimating PTSMs.

Simulation Results
With correctly specified regressions, all novel estimators, including the 2-stage WD approach, were able to decrease unmeasured confounding bias. The 2-stage WD approach was the most efficient novel outcome regression estimator because it used more data to fit regressions compared to CCDS estimators (Figure S1). However, when (incorrectly) fitting main terms regressions, just as with the rand estimator, extrapolation became an issue for the 2-stage WD estimator. As expected, with linear additive regressions like the correctly specified and main terms regressions, the 2-stage WD estimator is numerically equivalent to the rand estimator. Using ensemble methods, the bias for the 2-stage WD estimator tended to be similar to that of the rand estimator. Because of the 2-stage WD estimator's sensitivity to model misspecification, we do not generally recommend using this estimator. The estimator's poor performance highlights the importance of focusing on the overlap region for estimating unmeasured confounding bias.
We can then identify each of the conditional distributions via the following propensity decomposition (using conditional probability laws and positivity assumptions). For example, for component (4): This weight stabilization makes estimation more stable and ensures that estimates are in the support of the outcome variable.
We can similarly identify each of the conditional distributions in (1) -(3) through the following propensity score decompositions:

CCDS Influence Function
To derive the influence function for ψ CCDS (a), we first derive the influence function for each of its four conditional means: (1) for $\chi_1(a) = E_X[E(Y \mid S = 1, A = a, X) \mid S = 1]$, (2) for $\chi_2(a) = E_X[E(Y \mid S = 0, A = a, X) \mid S = 0]$, where probabilities and expectations are taken under the true model and weights are as previously defined. The joint influence function will then be the reweighted (by P(S = 1) or P(S = 0)) sum of the 4 conditional mean influence functions:

develop one AIPW rand estimator and two obs/rand estimators relying on positivity of study selection and/or unmeasured confounding in the observational data. Their estimators split data into k folds and, within each fold, fit separate regressions in each treatment group and study type, which may result in larger variance for rare or multinomial treatments, as in our applied study. fit generalized additive models for each regression; for better comparability, we fit either parametric or ensemble machine learning regressions to match other estimators.

Further implementation details
Ensemble regressions were implemented using the SuperLearner package and consisted of SL.glm, SL.glm.interact, SL.glmnet with α = 0.5, SL.ranger with 300 trees and a minimum node size of 5% of the sample being fit, SL.nnet with 2 hidden units, SL.earth, SL.gam, and SL.kernelKnn. For primary results, we conservatively estimated the overlap region using α = 1% × range(logit(π S )) and β = 1% × min(n obs , n rand ); that is, at least 1% of observations in a given treatment group must fall within 1% intervals of the logit of the propensity score.
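One possible reading of this overlap rule can be sketched in Python. The algorithm is not spelled out here, so the windowing logic below (a unit is in the overlap region when at least β randomized and at least β observational units have a logit propensity score within α of its own) is an assumption, as are all names. The pairwise comparison is quadratic in the sample size and is meant only for illustration.

```python
def overlap_flags(logit_rand, logit_obs, alpha_frac=0.01, beta_frac=0.01):
    """Flag units in an estimated overlap region (hypothetical helper).

    logit_rand / logit_obs: logit propensity scores for selection, by study type.
    """
    all_logits = logit_rand + logit_obs
    alpha = alpha_frac * (max(all_logits) - min(all_logits))  # window half-width
    beta = beta_frac * min(len(logit_rand), len(logit_obs))   # minimum count

    def in_overlap(v):
        # Count units of each study type whose logit score lies within alpha of v.
        n_rand = sum(abs(v - u) <= alpha for u in logit_rand)
        n_obs = sum(abs(v - u) <= alpha for u in logit_obs)
        return n_rand >= beta and n_obs >= beta

    return ([in_overlap(v) for v in logit_rand],
            [in_overlap(v) for v in logit_obs])


# Toy scores with exaggerated hyperparameters so the small example is readable.
rand_flags, obs_flags = overlap_flags(
    [0.0, 0.4, 1.0, 2.0], [0.3, 1.2, 3.0, 5.0], alpha_frac=0.1, beta_frac=0.25
)
```

Here the window half-width is 0.5 and each unit needs at least one neighbor from each study type, so the randomized unit at 2.0 and the observational units at 3.0 and 5.0 fall outside the estimated overlap region.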

Further Descriptions of the Data-Generating Mechanism
The core data-generating mechanism had several features: positivity of selection violation (Figure S2b); confounders with varying strengths of confounding; relatively strong unmeasured confounding (U had the second largest impact on treatment and outcome values); a conditional outcome relationship in R overlap in the randomized data that did not extrapolate well to R obs-only unless the correct outcome regression was fit; and observed covariates that differed in distribution across randomized and observational data. As a result, randomized and observational data each displayed external validity bias for estimating PTSMs and PATEs, and observational data likewise displayed internal validity bias due to measured and unmeasured confounding (Table S1). With these specifications, there were discrepancies between true randomized and observational study population treatment-specific means (STSMs) and study population average treatment effects (SATEs). This data-generating mechanism also ensured that identifiability assumptions held, namely:
1. The randomized group had no unmeasured confounding and the distribution of U was the same in R overlap as R obs-only conditioning on measured covariates; thus, the constant conditional bias assumption was satisfied (bias was in fact constant, not just conditionally constant; it was equal to E(10U)).
2. Study/treatment groups had positive probabilities of receiving each treatment (Figure S2a).
3. Observations were independent.
4. The unmeasured covariate did not confound the relationship between outcome and study selection.
5. R overlap was not a null set (Figure S2).
6. The same outcome specification held for both randomized and observational data.

Overlap Region Specifications
We examined a range of overlap region specifications (Table S2), with α and β overlap region hyperparameters set on the propensity and log propensity scales. The overlap region specifications used in the base case (α = 1% × range(logit(π S )) and β = 1% × min(n obs , n rand )) were the closest to approximating the true overlap region, particularly for randomized data. Across the range of scenarios examined, which spanned underestimating to grossly overestimating the overlap region, bias and RMSE were minimally impacted (Figure S3).

Different Degrees of Overlap (Positivity of Study Selection Violation)
We changed the size of R overlap from the default of QNorm(0.5) = 0 ≤ X_1 ≤ 1.28 = QNorm(0.9) to QNorm(0.7) = 0.52 ≤ X_1 ≤ 1.28 = QNorm(0.9) for less overlap and QNorm(0.1) = −1.28 ≤ X_1 ≤ 1.28 = QNorm(0.9) for greater overlap. These changes resulted in different proportions of observational data falling in the overlap region (13% with less overlap, 38% by default, and 88% with greater overlap), but always retained 50% of randomized data in the overlap region. With greater overlap, all estimators besides obs/rand were able to shrink the bias close to zero (Figure S4). Novel estimators had a larger region in which to estimate bias and the rand estimator was able to extrapolate better because more of the target population was in its region of support. With less overlap, all novel estimators' bias remained minimal, although variance increased (most starkly for the CCDS-OR and CCDS-AIPW estimators with ensembles); rand model bias increased sharply for estimating PTSMs. The data-generating mechanism allowed PATEs to be extrapolated from the randomized data, thus a corresponding bias increase was not observed for PATEs. The CCDS-IPW's RMSE remained the least impacted among all novel estimators.
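The bounds above are consistent with reading QNorm as the standard normal quantile function (an assumption, since it is not defined in this section). A quick check in Python:

```python
from statistics import NormalDist

# Standard normal quantile function (inverse CDF); the name qnorm is ours.
qnorm = NormalDist().inv_cdf

# Quantiles used as overlap region bounds on X_1, rounded to two decimals.
bounds = {p: round(qnorm(p), 2) for p in (0.1, 0.5, 0.7, 0.9)}
```

The rounded values match the bounds quoted in the text: −1.28, 0, 0.52, and 1.28.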

Different Ratios of n RCT : n obs
The base-case ratio of n RCT : n obs was 1:4. We also examined 1:1 and 1:30 ratios. To maintain the same overlap region across all settings, we changed the overlap region bounds to QNorm(0.18) = −0.92 ≤ X_1 ≤ 2.05 = QNorm(0.98). Bias and RMSE largely decreased across all estimators as the ratio of randomized to observational increased (Figure S5). As the randomized observations comprised a larger portion of the target sample, the bias from using the rand and obs/rand estimators also decreased. With ensembles, the rate of bias and RMSE decrease for rand and novel estimators exceeded that of the obs/rand estimator, highlighting the large impact of having more randomized data when overlap is small. With correct specification, the rate of RMSE decrease exceeded that of the obs/rand estimators. Relative performance of the estimators largely remained the same.

Varying Sample Sizes
We examined sample sizes n = 2000, 10000, and 50000. With correctly specified regressions, all other estimators had bias lower than that of the obs/rand estimator, although the CCDS-OR and CCDS-AIPW estimators retained some bias due to fitting complex regressions in the small overlap region (Figure S6). With smaller sample sizes (n = 2000; n RCT = 500), the RMSE of all novel estimators and the rand estimator exceeded that of the obs/rand estimator. With ensembles, all novel estimators' bias was below that of the obs/rand estimator. RMSE, however, only dropped below that of the obs/rand estimator with n = 10000 for the PATE (except for the CCDS-IPW estimator where RMSE was lower even with n = 2000).

Varying Strengths of Unmeasured Confounding
We examined four settings: no unmeasured confounding (where U was included as a measured covariate) and three levels of unmeasured confounding. The coefficient for U was assigned three different values in P(Y | S, X, U): (0.1, 0.625, 1.5) and in P(A | S, X, U): (5, 10, 20). These settings represented low, default, and high confounding, respectively. Results were similar for no and low unmeasured confounding. As unmeasured confounding bias increased, with correctly specified regressions, there was no corresponding increase in bias across novel estimators. However, variance increased, reflecting more uncertainty in settings with greater unmeasured confounding (Figure S7). With ensembles, there was a small increase in bias with greater confounding. This increase was smaller for CCDS estimators than for the rand estimator.

Constant Conditional Bias Assumption Violation
To violate the constant conditional bias assumption, the amount of unmeasured confounding bias was varied in a way that is not predictable from the trends observed in the overlap region (the overlap region lower bound is at [...] corresponds to the base-case potential outcome). When the bias relationship observed in the overlap region differed from that outside the overlap region, bias and RMSE increased for each of the estimators relative to the amount of extra unmeasured confounding bias in the observational data that cannot be estimated from the overlap region (Figure S8). The rand estimator was not able to extrapolate well outside the overlap region even with a correctly specified regression. Novel estimators' bias remained below that of the obs/rand estimator as they removed the portion of unmeasured confounding bias that was estimable from the overlap region. The bias increase observed under violation of the constant conditional bias assumption thus reflects the extra unmeasured confounding bias that cannot be removed: these novel methods can only remove bias estimable from the overlap region, and they rely on the constant conditional bias assumption. With ensembles, the rand estimator's PATE bias and RMSE decreased with the constant bias violation, likely reflecting this specific data-generating mechanism, as this result was not observed for PTSMs. When estimating PTSMs, the rand estimator was the most impacted by the assumption violation.
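The logic of this violation can be illustrated with a deliberately simplified, noise-free toy example. This is only a sketch under stated assumptions, not the CCDS estimators themselves: the covariate grid, the linear truth, the bias values (1 and 3), and the helper debias_outside_overlap are all hypothetical, and the bias is taken to be constant in the covariate for simplicity.

```python
def debias_outside_overlap(obs_mean, rand_mean, overlap, outside):
    """Subtract the overlap-estimated bias from observational means outside overlap.

    obs_mean and rand_mean map a covariate value x to a conditional mean outcome.
    rand_mean is only evaluated inside the overlap region.
    """
    # Average observational-minus-randomized gap where both data sources exist.
    est_bias = sum(obs_mean(x) - rand_mean(x) for x in overlap) / len(overlap)
    # Extrapolate that gap to covariate values covered by observational data only.
    return {x: obs_mean(x) - est_bias for x in outside}

def truth(x):
    return 2.0 * x  # hypothetical true conditional mean outcome

overlap = [0.0, 0.25, 0.5, 0.75, 1.0]   # hypothetical overlap region
outside = [1.25, 1.5, 1.75, 2.0]        # observational data only

def obs_constant(x):
    return truth(x) + 1.0  # assumption holds: confounding bias is 1 everywhere

def obs_violated(x):
    # Violation: bias is 1 in the overlap region but 3 outside it.
    return truth(x) + (1.0 if x <= 1.0 else 3.0)

residuals = {}
for label, obs in (("holds", obs_constant), ("violated", obs_violated)):
    debiased = debias_outside_overlap(obs, truth, overlap, outside)
    residuals[label] = max(abs(debiased[x] - truth(x)) for x in outside)
    print(label, residuals[label])
```

In this toy setting the residual bias outside the overlap region is zero when the bias is the same inside and outside the overlap region, and equals the extra, non-estimable bias (here 3 − 1 = 2) when it is not, which mirrors the qualitative pattern described above.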

Exchangeability of Study Selection Violation
We examined violations of the exchangeability of study selection assumption through two approaches. In the first, P(S | X, U) was changed to be a function of U in the overlap region, P(S = 1 | U) = 0.125 + 0.25U, and remained deterministically 0 or 1 outside the overlap region. P(S = 1) remained at 0.20. When study selection was a function of unmeasured U, bias for PTSMs of all other estimators exceeded that of the obs/rand estimator (Figure S9). This was not observed for the PATEs due to U not being an unmeasured effect modifier.
In the second violation assessment, U was a function of X_1, which determines overlap region membership: U ∼ Binom(p_U), where p_U = expit(30X_1). Hence, U was an unmeasured effect modifier with different distributions in the randomized vs. observational data, and randomized estimates would not represent the truth in the overlap region. As a result, the CCDS estimators' bias increased with the assumption violation: when rand estimates are biased for the target population quantities in the overlap region, CCDS estimators are not able to properly debias (Figure S9). Likewise, the rand estimates' bias also increased in all but the ensemble estimating the PATE, where bias decreased, likely due to bias cancellation between treatment groups.
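To see why this construction ties U so closely to X_1, it helps to evaluate p_U = expit(30X_1) at a few covariate values. The sketch below assumes that Binom(p_U) denotes a single-trial (Bernoulli) draw; the X_1 values, the seed, and the helper names are illustrative only.

```python
import math
import random

def expit(z):
    """Numerically stable logistic function 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def draw_u(x1, rng):
    """Draw U ~ Bernoulli(p_U) with p_U = expit(30 * x1) (single-trial assumption)."""
    return 1 if rng.random() < expit(30.0 * x1) else 0

rng = random.Random(0)  # arbitrary seed for reproducibility
for x1 in (-0.5, -0.1, 0.0, 0.1, 0.5):
    print(x1, round(expit(30.0 * x1), 4), draw_u(x1, rng))
```

Because of the steep coefficient of 30, p_U is close to 0 for X_1 slightly below zero and close to 1 for X_1 slightly above zero, so U behaves almost like an indicator of X_1 > 0 and its distribution differs sharply across regions defined by X_1.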

Supplemental Medicaid Results
Summary characteristics of the randomized and observational groups in the Medicaid data are presented in Table S3. Plots of propensity for selection into the randomized group for both the randomized and observational data are displayed in Figure S11. STSMs and PTSMs for all health plans and estimators can be found in Figure S12.

Accounting for Country-Month-Year
Due to computational considerations, the primary analysis of the Medicaid study did not account for correlation by country-month-year, the unit of randomization. In sensitivity analyses, we accounted for this correlation using either (1) fixed effects, similar in spirit to the analysis by , or (2) random effects. Analyses were run paralleling the linear rand estimator, i.e., fitting a linear outcome regression on the randomized data. We recovered similar estimates for the randomized population and slightly higher point estimates for the target population compared to the primary analysis (Table S4). These correspond to a 67% (fixed effects estimate) and 63% (random effects estimate) difference between 6-month spending for the randomized population had it been in the highest- vs. lowest-spending plans and a 67% and a 65% difference, respectively, for the target population. In comparison, when not accounting for correlation by country-month-year, linear regression versions of the main analysis found a 70% difference for the randomized population and 75% difference for the target population.

Percent neighborhood poverty (mean (SD)) 0.24 (0.08) 0.23 (0.08) <0.001
NOTE: The p-values correspond to a t-test for continuous variables and a chi-squared test for categorical variables, with a continuity correction. Abbreviations: MA = Medicare Advantage; SD = standard deviation; SN = safety net; SSI = Supplemental Security Income; TANF = Temporary Assistance for Needy Families.

Table S4. STSMs and PTSMs Across Analyses with Differing Correlation Procedure for Country-Month-Year (CMY). All analyses fit a linear outcome regression using the randomized data.