Data integration in causal inference

Abstract Integrating data from multiple heterogeneous sources has become increasingly popular as a way to achieve a large sample size and a diverse study population. This article reviews developments in causal inference methods that combine multiple datasets collected under potentially different designs from potentially heterogeneous populations. We summarize recent advances on combining randomized clinical trials with external information from observational studies or historical controls, combining samples when no single sample has all relevant variables with application to two-sample Mendelian randomization, distributed data settings under privacy concerns for comparative effectiveness and safety research using real-world data, Bayesian causal inference, and causal discovery methods. This article is categorized under: Statistical Models > Semiparametric Models; Applications of Computational Statistics > Clinical Trials


| INTRODUCTION
The availability of multiple datasets collected by different designs from heterogeneous populations has brought emerging challenges and opportunities for causal inference. Integrating data from multiple sources to facilitate causal inference has become increasingly popular. For example, randomized clinical trials (RCTs) have been the gold standard for causal inference but often suffer from insufficient sample size and a homogeneous study population due to inclusion/exclusion criteria. Results from RCTs may not be generalizable to a real-world population. In contrast, observational studies typically offer a diverse sample representative of the target population with a large sample size, but often suffer from unmeasured confounding. Combining data from both designs allows one to extend causal inference from an RCT to a target population, to correct for bias in observational studies, and to improve efficiency (Colnet et al., 2020). Another prominent example is when no single dataset contains all relevant variables, that is, there are no complete data for any subject. In this case, identification becomes difficult even for parameters that are straightforward to identify with complete data (Ridder & Moffitt, 2007). This is typical in survey sample combination, where the variables collected in each survey may differ. It is also the case in two-sample instrumental variable methods, which are widely applied in Mendelian randomization studies where individual-level genetic data are not available due to privacy concerns (Angrist & Krueger, 1992).
In this article, we review selected literature on data integration methods in causal inference. Recent review studies focused on combining randomized and observational data (Colnet et al., 2020; Degtiar & Rose, 2021) and data combination in survey sampling (Ridder & Moffitt, 2007). We aim to provide a more systematic review and cover a range of research areas. We start with notation and introduce key assumptions and concepts that frequently appear in the literature in Section 2. We then summarize recent methodological advances in integrating data from RCTs and observational studies in Section 3, and combining data when no single sample has all relevant variables in Section 4. We briefly review the literature on data integration for causal discovery, distributed data analysis for privacy protection, and Bayesian methods for integrated causal inference in Section 5. We close with a discussion in Section 6.

| PRELIMINARIES
In this section, we briefly introduce the potential outcome framework and review key concepts in causal inference and data integration. Let A denote a binary treatment (1: treated, 0: untreated), Y denote an observed outcome, and X denote a vector of measured covariates. When all circumstances are the same except for the treatment status, any difference observed in the outcomes has to be attributed to the treatment. Correspondingly, for each subject we define a pair of potential outcomes, (Y(1), Y(0)), that would be observed if the subject had been given treatment, Y(1), and control, Y(0) (Rubin, 1974), under the stable unit treatment value assumption that there is no interference between units and no multiple versions of treatment (Rubin, 1980). As such, the observed outcome is equal to the potential outcome corresponding to the subject's treatment condition, that is, Y = Y(A) = AY(1) + (1 − A)Y(0).
A fundamental problem in causal inference is that for each subject, we can only observe one of the potential outcomes. Because it is impossible to compute the difference between Y(1) and Y(0) for a specific subject, we often specify a target population of interest and study the mean difference in that population, referred to as the average treatment effect (ATE). In practice, we cannot observe data on all subjects in the prespecified target population, but rather data on a sample of subjects referred to as the study sample. Let S be a binary indicator of whether a subject is selected into the study sample (1: sampled, 0: not sampled). It is important to note that the ATE is population-specific. In fact, we can define multiple ATEs, each with respect to a different target population: τ = E[Y(1) − Y(0)] and τ_s = E[Y(1) − Y(0) | S = s] for s = 0, 1. For example, the ATE is τ if the combined S = 1 and S = 0 sample is a random sample of the target population. The ATE estimated based on the study sample, that is, the S = 1 sample, is an estimate of τ_1, which is not necessarily equal to τ because the study sample is not necessarily a representative sample of the target population. Identification of the ATE, which is a function of the potential outcome distribution in a target population, involves expressing it as a function of the observed data distribution, such that distinct data-generating mechanisms lead to distinct values. To identify the ATE, ideally we would like to observe Y(a) for all subjects in the target population to compute E[Y(a)], for a = 0 or 1. However, both the sample selection mechanism and the treatment assignment mechanism lead to missingness in Y(a): generally the Y(a)'s are missing for all subjects in the S = 0 sample (sample selection); in the S = 1 sample, the Y(a)'s are unobserved for subjects in the other treatment arm with A = a′, a′ ≠ a (treatment assignment).
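To make the distinction between τ and τ_1 concrete, the following simulation (a hypothetical data-generating process, with numbers chosen purely for illustration) selects subjects into the study with probability decreasing in a covariate x that modifies the treatment effect, so the study-sample ATE τ_1 falls below the population ATE τ:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Covariate x modifies the treatment effect: individual effect is 1 + x.
x = rng.normal(size=n)
y0 = x + rng.normal(size=n)
y1 = y0 + 1.0 + x

# Selection into the study sample favors low-x subjects: P(S=1|X) decreasing in x.
s = rng.binomial(1, 1 / (1 + np.exp(x)))

tau = np.mean(y1 - y0)             # population ATE: close to 1.0
tau1 = np.mean((y1 - y0)[s == 1])  # study-sample ATE: noticeably smaller
```

Because the study over-represents subjects with small x, the in-sample contrast τ_1 understates the population effect τ, which is exactly the selection bias discussed above.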
Confounding bias, also referred to as violation of internal validity, occurs when factors that impact treatment assignment also predict the outcome, such that the observed Y(a)'s in the A = a arm cannot represent the missing Y(a)'s in the A = a′ arm. Selection bias, also referred to as violation of external validity, occurs when factors that impact sample selection also predict the outcome, such that the observed Y(a)'s in the S = 1 sample cannot represent the missing Y(a)'s in the S = 0 sample. A less stringent condition targeting treatment effect estimation, that is, the mean difference rather than the mean, defines selection bias as occurring when factors that impact sample selection also modify the treatment effect, that is, when E[Y(1) − Y(0) | X, S = 1] ≠ E[Y(1) − Y(0) | X] (Lesko et al., 2017; Stuart et al., 2011). Collider-stratification bias may also occur due to conditioning the analysis on the study sample, if S is a common consequence of the treatment (or a predictor of the treatment) and the outcome (or a predictor of the outcome; Greenland, 2003).
Two key assumptions about the treatment assignment mechanism are often imposed, which we refer to as treatment exchangeability and positivity. The treatment exchangeability assumption states that within a stratum of X, the Y(a)'s of subjects in the A = a arm can be exchanged with the Y(a)'s of subjects in the A = a′ arm:

Assumption 2. (Treatment exchangeability) Y(a) ⫫ A | (X, S = 1) for a = 0, 1.

Assumption 2 allows us to represent the conditional distribution of the unobserved potential outcome using that of the observed potential outcome. We thus have that for a = 0 or 1,

E[Y(a) | X, S = 1] = E[Y | A = a, X, S = 1]. (1)

Equation (1) has also been used as a weaker version of Assumption 2. Within each stratum of the covariates sufficient for treatment exchangeability, we also need nonzero probability of subjects in both treatment arms:

Assumption 3. (Treatment positivity) P(A = a | X, S = 1) > 0 for all a almost surely.
Often P(A = 1 | X, S = 1) is referred to as the propensity score. Note that Assumptions 2 and 3 are conditional on the study sample, thus the set of covariates X sufficient for Assumptions 2 and 3 to hold may include variables beyond the common causes of treatment and outcome, that is, the typical confounders. For example, a covariate that causes selection S and the outcome but is independent of the treatment can become a confounder if the treatment also causes selection. This is a consequence of collider-stratification bias, where conditioning on S results in a spurious association between the treatment and the covariate.
Besides conditions to ensure internal validity, another two key assumptions about the sample selection mechanism are often imposed to ensure external validity, which we refer to as selection exchangeability and positivity, in analogy to Assumptions 2-3 (Lesko et al., 2017; Stuart et al., 2011):

Assumption 4. (Selection exchangeability) Y(a) ⫫ S | X for a = 0, 1.

Assumption 5. (Selection positivity) P(S = 1 | X) > 0 almost surely.
Assumption 4 allows us to generalize the conditional distribution of the potential outcome from the study sample to a target population, such as the one represented by the S = 0 sample or by the combination of the S = 0 and S = 1 samples. Weaker versions of the selection exchangeability assumption include (I) mean conditional exchangeability, that is,

E[Y(a) | X, S = 1] = E[Y(a) | X], (2)

and (II) all treatment effect modifiers being measured, that is, E[Y(1) − Y(0) | X, S = 1] = E[Y(1) − Y(0) | X]. We further assume that variables required for selection exchangeability do not serve as study eligibility criteria that completely exclude certain subjects from the study sample. For example, suppose geographic location restricted study participation such that there is zero probability of selecting subjects in a certain area; then Assumption 5 requires that geographic location is not needed for Assumption 4, that is, conditional on X, geographic location is not associated with the outcome or does not modify the treatment effect.
Two problems are frequently studied: generalizability (Buchanan et al., 2018; Cole & Stuart, 2010; Dahabreh, Robertson, Tchetgen, Stuart, & Hernán, 2019; Stuart et al., 2011) and transportability (Bareinboim & Pearl, 2016; Hünermund & Bareinboim, 2019; Pearl & Bareinboim, 2014; Rudolph & van der Laan, 2017; Westreich et al., 2017). The distinction between the two concepts is well summarized in Degtiar and Rose (2021): generalizability focuses on the setting where the study sample is a subset of the target population, and transportability considers the setting where the study sample and the target population are partially or non-overlapping. An example of the generalizability problem is: suppose the target population is the trial-eligible population, and the combined S = 1 and S = 0 sample is a random sample of the target population, in which trial participants are in the S = 1 sample and non-participants are in the S = 0 sample. In this case, the target ATE is τ and we would like to generalize inference about τ_1 obtained from the trial data to τ. An example of the transportability problem is: suppose the target population is a real-world population, and the S = 0 sample is a random sample of the target population separately obtained from external data sources such as administrative healthcare databases or survey studies. In this case, the target ATE is τ_0 and we would like to transport inference about τ_1 to τ_0.
Both problems require some information on the S = 0 sample, and two scenarios are often considered: (S1) covariates are measured on all individuals in the S = 0 sample, that is, we observe (X, S = 0); (S2) covariates are measured only on a simple random subsample of the S = 0 sample.

| Generalizability and transportability methods
In this section, we review three common strategies for identification and estimation of E[Y(a)] (generalizability) and E[Y(a) | S = 0] (transportability) for a = 0 or 1. Correspondingly, the ATEs τ and τ_0 can be directly obtained from E[Y(a)] and E[Y(a) | S = 0] by definition. To illustrate the methods, we take scenario (S1) as an example, where we observe data from a total of n = n_1 + n_0 subjects. We summarize the methods under all scenarios in Table 1.
| Outcome regression

Let m_a(x) = E[Y | A = a, X = x, S = 1] denote the outcome regression function, which under Assumptions 2-5 satisfies E[Y(a) | X = x] = m_a(x), so that

E[Y(a)] = E[m_a(X)] and E[Y(a) | S = 0] = E[m_a(X) | S = 0]. (3)

Both f(x) and f(x | S = 0) are identifiable in scenario (S1), where we observe X on all individuals. Therefore, we can marginalize an estimate m̂_a(x) over the empirical distribution of X in the combined sample and in the S = 0 sample, respectively, which gives the corresponding outcome regression estimators (Dahabreh et al., 2019; Lesko et al., 2017). Equation (3) has been referred to as the g-formula (Greenland & Robins, 1986; Robins, 1986) or standardization (Vansteelandt & Keiding, 2011) in epidemiology, and can also be viewed as imputation in the missing data literature (Cheng, 1994).
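A minimal sketch of the outcome regression (g-formula) estimator under scenario (S1), using simulated data (all quantities hypothetical) and ordinary least squares for m̂_a(x): the arm-specific regressions are fit in the trial, then averaged over the empirical covariate distribution of the combined sample.

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n0 = 2_000, 8_000

# Hypothetical combined sample: the trial (S=1) under-represents high-x subjects.
x1 = rng.normal(-0.5, 1, n1)             # covariates in the trial
x0 = rng.normal(0.3, 1, n0)              # covariates in the S=0 sample
a = rng.binomial(1, 0.5, n1)             # randomized treatment in the trial
y = 1 + x1 + a * (1 + x1) + rng.normal(size=n1)  # trial outcome, effect 1 + x

# Fit m_a(x) = E[Y | A=a, X=x, S=1] by least squares within each trial arm.
def fit_arm(aa):
    X = np.column_stack([np.ones(np.sum(a == aa)), x1[a == aa]])
    beta, *_ = np.linalg.lstsq(X, y[a == aa], rcond=None)
    return beta

b1, b0 = fit_arm(1), fit_arm(0)
x_all = np.concatenate([x1, x0])         # empirical f(x) of the combined sample
m1 = b1[0] + b1[1] * x_all
m0 = b0[0] + b0[1] * x_all
tau_g = np.mean(m1 - m0)                 # g-formula estimate of tau
```

Averaging m̂_1 − m̂_0 over the combined sample, rather than the trial alone, is what shifts the estimate from τ_1 toward the target τ.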

| Inverse probability weighting
Inverse probability weighting is a commonly used alternative (Cole & Stuart, 2010; Lesko et al., 2017; Westreich et al., 2017). Note that the g-formula in Equation (3) can be re-expressed in a weighted form.

TABLE 1: Three estimation strategies (OR = outcome regression, IPW = inverse probability weighting, AIPW = augmented IPW) for the mean potential outcome in the target population, when either the combined sample (generalizability, corresponding to ATE τ) or the S = 0 sample (transportability, corresponding to ATE τ_0) is a random sample of the target population.

The propensity score, P(A = a | S = 1, X), is a known function designed by the investigator in an RCT, while the trial participation probability P(S = 1 | X) can be estimated in the combined sample because X is fully observed under (S1). We arrive at inverse probability-weighted estimators in which each subject's weight is the inverse of a product of the estimated treatment and trial participation probabilities. Although the propensity score is known, estimating the model parameters rather than using the true values can improve efficiency (Hahn, 1998; Lunceford & Davidian, 2004; Robins et al., 1994). Compared with the traditional IPW estimator using the trial data only, we further weight each trial participant by the inverse of the trial participation probability, P(S = 1 | X), to generalize the ATE from the S = 1 sample to the combined sample, while to transport the ATE from the S = 1 sample to the S = 0 sample, trial participants are weighted by the inverse of both the propensity score and the odds of trial participation, P(S = 1 | X)/P(S = 0 | X).
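The weighting logic above can be sketched as follows, in a hypothetical simulation where the true participation probability P(S = 1 | X) is plugged in for brevity (in practice it would be estimated, for example by logistic regression on the combined sample):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

x = rng.normal(size=n)
p_s = 1 / (1 + np.exp(x))                # P(S=1|X): trial under-samples high x
s = rng.binomial(1, p_s)
a = np.where(s == 1, rng.binomial(1, 0.5, n), 0)  # randomized within the trial
y = x + a * (1 + x) + rng.normal(size=n)          # outcome; used only where s == 1

# IPW for E[Y(a)] in the combined population: weight trial subjects in arm a
# by 1 / {P(A=a | S=1, X) * P(S=1 | X)} and average over all n subjects.
def ipw_mean(aa):
    idx = (s == 1) & (a == aa)
    return np.sum(y[idx] / (0.5 * p_s[idx])) / n

tau_ipw = ipw_mean(1) - ipw_mean(0)      # close to the population ATE
```

The extra factor 1/P(S = 1 | X) up-weights trial participants who resemble the under-represented part of the target population.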

| Augmented inverse probability weighting
So far, each of the estimators relies on estimating components of the likelihood such as m_a(X) and P(S = 1, A = a | X), which are not necessarily of scientific interest in themselves. Nonparametric estimation may not be feasible when X is of high dimension, while parametric working models may be prone to model misspecification. We can combine the two estimators to gain robustness. A common approach to deriving a robust estimator is to construct an estimating equation from the efficient influence function (EIF), evaluate it under a working model for the observed data distribution, and solve for the parameter of interest; this approach is widely used in missing data problems (Tsiatis, 2007). Any regular and asymptotically linear estimator is asymptotically equivalent to the sample average of an influence function, which is a function of the observed data with mean zero and finite variance, and the influence function with the smallest variance is referred to as the EIF (Tsiatis, 2007; Van der Vaart, 2000). Under a nonparametric model where the distribution of the observed data is unrestricted, the EIFs for E[Y(a)] and E[Y(a) | S = 0] are

U_a = {1(S = 1, A = a)/P(A = a, S = 1 | X)}{Y − m_a(X)} + m_a(X) − E[Y(a)],

U0_a = {1/P(S = 0)}[{1(S = 1, A = a)P(S = 0 | X)/P(A = a, S = 1 | X)}{Y − m_a(X)} + 1(S = 0){m_a(X) − E[Y(a) | S = 0]}]

(Dahabreh et al., 2019). As mentioned in Section 3.1.2, P(A = a | S = 1, X) is guaranteed to be correctly specified in an RCT; therefore P(A = a, S = 1 | X) is correctly specified as long as P(S = 1 | X) is. Hence the resulting AIPW estimators are doubly robust in the sense that they remain consistent when either the probability of trial participation P(S = 1 | X) or the outcome regression model m_a(X) is correctly specified. This can be seen from the following observation: the IPW estimator introduced in Section 3.1.2 can be obtained by misspecifying m_a(X) as zero in the EIFs above, while the OR estimator introduced in Section 3.1.1 can be obtained by setting the weight in the first term of both U_a and U0_a to zero.
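The double-robustness property can be illustrated with a small simulation (a hypothetical setup): the outcome model is deliberately misspecified, yet the AIPW estimator stays consistent because the weights, known by design in the trial, are correct.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

x = rng.normal(size=n)
p_s = 1 / (1 + np.exp(x))                # true P(S=1|X)
s = rng.binomial(1, p_s)
a = np.where(s == 1, rng.binomial(1, 0.5, n), 0)
y = np.where(s == 1, x + a * (1 + x) + rng.normal(size=n), 0.0)

def aipw_mean(aa, m_hat):
    # Sample analog of U_a: weighted residual for trial subjects in arm aa,
    # plus the outcome-model prediction m_hat for every subject.
    w = ((s == 1) & (a == aa)) / (0.5 * p_s)  # 1{S=1,A=a} / P(A=a,S=1|X)
    return np.mean(w * (y - m_hat) + m_hat)

m_wrong = x                              # misspecified m_a(X): ignores treatment
tau_aipw = aipw_mean(1, m_wrong) - aipw_mean(0, m_wrong)  # still near the truth
```

Symmetrically, with a correct outcome model and wrong weights, the second term of U_a keeps the estimator consistent.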
| Other methods for combining data from clinical trials and external data

Other doubly robust estimators include a targeted maximum likelihood estimator (Rudolph & van der Laan, 2017) and an augmented calibration weighting estimator (Dong et al., 2020). A sensitivity analysis that replaces Assumption 4 with a prespecified bias function has also been proposed (Dahabreh, Robins, Haneuse, Saeed, et al., 2019). Meta-analysis is often used to synthesize information about parameters from data collected in multiple trials, which allows the above methods to be extended to generalizing or transporting inferences from multiple RCTs to a target population (Dahabreh, Petito, et al., 2020; Dahabreh, Robertson, Petito, Hernán, & Steingrimsson, 2019; Manski, 2000; Steele et al., 2020). Identification under an arbitrary collection of observational and experimental data has been investigated (Lee et al., 2020). Combining probability and nonprobability samples with high-dimensional data has also been studied (Yang, Kim, & Song, 2020).

| Correcting for bias in observational study using validation or trial data
Internal validity, that is, Assumptions 2-3, naturally holds in RCTs due to randomization but not necessarily in observational studies due to potential unmeasured confounding. Borrowing strength from the internal validity of RCT data and the large sample size of observational data can mitigate bias and improve efficiency. In this vein, Yang, Zeng, and Wang (2020) considered estimation of the average treatment effect on the treated (ATT) in the scenario where X = (X_1, U) and U is unobserved. Data are obtained from an RCT, (Y, A, X_1, S = 1), and from an observational study, (Y, A, X_1, S = 0). In the RCT, X_1 is sufficient for Assumption 2, while in the observational study, the unmeasured confounding U leads to bias. A weaker version of Assumption 4 is further assumed. Yang, Zeng, and Wang (2020) proposed to model the unmeasured confounding bias via a function λ(X_1), which is equal to zero if U = ∅. Modeling this bias function allows one to improve efficiency in estimating the ATT by combining observational and RCT data. A similar idea was considered in Kallus et al. (2018), where a confounding bias correction term was learned by interpolating E[Y | A, X_1] between the RCT and observational data, and in Gui (2020), where RCT data were used to correct bias in an imperfect estimator based on an invalid instrumental variable defined on observational data.
In Athey et al. (2020), it was assumed that we observe data from an RCT, (W, A, X, S = 1), and from an observational study, (Y, W, A, X, S = 0), where W denotes a secondary outcome observed in both studies, Y denotes the primary outcome that is expensive to measure in the RCT, and the S = 0 sample is a random sample of the target population. Motivated by the observation that the treatment effects on the secondary outcome should be similar in the RCT and observational data if X is sufficient for Assumption 2, Athey et al. (2020) developed a control function method that uses differences in the estimated causal effects on the secondary outcome between the two samples to adjust estimation of the treatment effect on the primary outcome.
Yang and Ding (2019) considered the scenario where a small validation dataset with all confounders, (Y, A, X_1, U, S = 1), and a big main dataset with unmeasured confounders, (Y, A, X_1, S = 0), are available. Both are random samples of the target population, hence external validity is satisfied. The big main data can improve efficiency, and the small validation data can ensure consistency. For each dataset S = s, let τ̂_s, s = 0, 1, denote a consistent estimator of the ATE based on a user-specified estimation strategy adjusting for all confounders (X_1, U), and let τ̂_ep,s, s = 0, 1, denote an error-prone estimator using the same estimation strategy but with U uncontrolled. Clearly τ̂_0 cannot be computed because U is unmeasured in the main data. A key insight is that the difference of the two error-prone estimators, τ̂_ep,1 − τ̂_ep,0, is consistent for zero. By modeling the joint distribution of τ̂_1 and τ̂_ep,1 − τ̂_ep,0, they derived the most efficient consistent estimator of τ among all linear combinations τ̂_1 + α(τ̂_ep,1 − τ̂_ep,0), α ∈ ℝ. Other methods for controlling unmeasured confounding with validation data include propensity score calibration (Stürmer et al., 2005).
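A stylized numeric sketch of this control-variate idea (all variances and the bias value are hypothetical): draws of τ̂_1 and of the error-prone difference are simulated directly, the variance-minimizing α is computed from their joint distribution, and the combined estimator is markedly less variable than τ̂_1 alone while remaining centered at τ.

```python
import numpy as np

rng = np.random.default_rng(4)
reps, bias = 20_000, 0.5

# Stylized sampling errors of the estimators (hypothetical magnitudes):
e1 = rng.normal(0, 0.10, reps)   # error of the small validation data
u = rng.normal(0, 0.03, reps)    # extra error of the error-prone version
e0 = rng.normal(0, 0.01, reps)   # error of the big main data

tau1 = 1.0 + e1                                       # consistent but noisy
d = (1.0 + bias + e1 + u) - (1.0 + bias + e0)         # tau_ep,1 - tau_ep,0, mean ~ 0

alpha = -np.cov(tau1, d)[0, 1] / np.var(d)            # variance-minimizing alpha
tau_eff = tau1 + alpha * d                            # combined estimator
```

The shared bias cancels in d, so adding α·d shifts no mean but absorbs the sampling error that τ̂_1 and τ̂_ep,1 have in common.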

| Combining clinical trial with external control
Single-arm clinical trials are typically conducted for rare diseases, where it is difficult to recruit enough patients for an adequately powered two-arm trial, or for diseases with high unmet medical need that raise ethical concerns about a control arm (Abrahami et al., 2021; Cuffe, 2011; Viele et al., 2014). Historical or contemporaneous information on the control arm is often available from previous RCTs or observational studies. Such external controls have been used to emulate the control arm in the setting of single-arm trials, which can decrease costs and duration and improve power.
Formally, the single-arm trial data, (Y, A = 1, X, S = 0), are a random sample of the target population, while the external control data contain (Y, A = 0, X, S = 1). Our goal is to estimate E[Y(0) | S = 0] leveraging the external data, in order to contrast it with the mean response in the single-arm trial and thereby estimate the treatment effect. Traditional methods to account for differences in patient characteristics between the external control and the target population include meta-analysis (Hasegawa et al., 2017; Schmidli et al., 2014; Schmidli et al., 2020; Weber et al., 2018; Zhang et al., 2019) and matching (Schmidli et al., 2020; Signorovitch et al., 2010). Typically, a form of exchangeability across studies like Assumption 4 is assumed. Recently, Li and Song (2020) proposed to build an outcome regression model using the external control data under exchangeability, and then estimate E[Y(0) | S = 0] by standardization, which is similar to the identification strategy in Equation (3) with a = 0. Besides single-arm trials, external controls have also been used to improve efficiency in a traditional RCT with data on both arms available. Li, Miao, Lu, and Zhou (2020) showed that the semiparametric efficiency bound for estimating E[Y(1) − Y(0) | S = 0] is reduced by incorporating external control data, and proposed a doubly robust and locally efficient estimator that combines outcome regression and inverse probability of treatment weighting.

| NO SINGLE SAMPLE CONTAINS ALL RELEVANT VARIABLES
The data integration problems described so far have complete data on all relevant variables in at least one sample. A more challenging problem is when there are no complete data at any data source. This setting has been referred to as data combination (Ridder & Moffitt, 2007;Shu & Tan, 2020) or data fusion (Evans et al., 2018;Li, Miao, Cai, et al., 2020;Sun & Miao, 2018) in the literature. In the following, we will first introduce methods applicable to the general data combination problem in Section 4.1. We will use a new set of notation in Section 4.1 while notation in the rest of the article follows Section 2. We will then overview specific causal inference problems and methods in Sections 4.2 and 4.3.

| General data combination methods
We first introduce some new notation. Suppose that for each member of a population of interest, we can define a vector of relevant variables (Y, X, Z). A sample with complete data on (Y, X, Z) is unavailable; instead, two separate samples are available. In one sample we observe (Z, Y, S = 1), and in the other we observe (Z, X, S = 0), with Z shared by the two datasets. Suppose the S = 1 and S = 0 samples are of sizes n_1 and n_0, respectively, with total sample size n = n_1 + n_0; then the merged dataset combining the two samples can be viewed as one sample of size n in which either X or Y is missing for every unit, with S indicating the missingness pattern.

| Estimation of general parameters defined through moment restrictions
We assume that the S = 1 sample is drawn from the population of interest, while the S = 0 sample is an auxiliary sample independent of the S = 1 sample, which enables identification that could not be achieved by the S = 1 sample alone. We are often interested in a population parameter defined as the unique solution θ ∈ ℝ^k to the k × 1 vector of population moment conditions E[m(Y, X, Z; θ) | S = 1] = 0, which includes maximum likelihood estimation and the generalized method of moments as special cases. For example, θ is the ATT when S is the binary treatment indicator, (Y, X) are the potential outcomes under treatment and control, respectively, Z is a vector of pretreatment covariates, and m(Y, X, Z; θ) = Y − X − θ. Another example is the two-sample instrumental variable (IV) problem, where Z is a vector of IVs, X is the treatment (not necessarily binary), Y is the outcome, and m(Y, X, Z; θ) = Z(Y − X^T θ). We will detail the two-sample IV literature in Section 4.2. Typically, selection exchangeability (S ⫫ (Y, X) | Z) and positivity (P(S = s | Z) > 0) are assumed to identify θ by combining the two samples. Graham et al. (2016) and Shu and Tan (2020) proposed doubly robust and locally efficient estimators of θ, extending the semiparametric efficiency theory of Hahn (1998) and Chen et al. (2008). We illustrate the estimation strategies of Shu and Tan (2020) below. When Y = ∅, the moment restriction becomes E[m(X, Z; θ) | S = 1] = 0, in which X is unobserved in the S = 1 sample, so the two samples must be combined for estimation. Shu and Tan (2020) took the EIF in Chen et al. (2008) as the estimating function to obtain an AIPW estimator, which solves

0 = Σ_i [(1 − S_i){P̂(S = 1 | Z_i)/P̂(S = 0 | Z_i)}{m(X_i, Z_i; θ) − Ê[m(X, Z; θ) | Z_i]} + S_i Ê[m(X, Z; θ) | Z_i]]. (9)

The AIPW estimator is doubly robust in that it remains consistent when either the propensity score model P(S = 1 | Z) or the outcome regression model E[m(X, Z; θ) | Z] is correctly specified.
This can be seen from the following observation: an IPW estimator can be obtained by misspecifying E[m(X, Z; θ) | Z] as zero in Equation (9), while an outcome regression estimator can be obtained by setting P(S = 1 | Z)/P(S = 0 | Z) to zero in Equation (9). When Y ≠ ∅, Graham et al. (2016) and Shu and Tan (2020) further imposed a key identification assumption that the moment condition is separable, in the sense that m(Y, X, Z; θ) = m_1(Y, Z; θ) + m_0(X, Z; θ), where m_1 and m_0 each depend only on variables observed in one sample. We can see that E[m_1(Y, Z; θ) | S = 1] can be directly estimated from the S = 1 sample, while the challenge is to estimate E[m_0(X, Z; θ) | S = 1] by combining both samples. Motivated by the observation that estimating E[m_0(X, Z; θ) | S = 1] takes the same form as the case Y = ∅, Shu and Tan (2020) proposed an AIPW estimator that solves the estimating equation combining S_i m_1(Y_i, Z_i; θ) with U(S_i, X_i, Z_i; m_0(·), θ), where U(S, X, Z; m_0(·), θ) is the estimating function in Equation (9) with m(·) substituted by m_0(·). An alternative assumption often imposed is the conditional independence assumption, that is, Y ⫫ X | Z (Ogburn et al., 2020; Ridder & Moffitt, 2007). Under this assumption we have f(Y, X, Z) = f(Y | Z)f(X | Z)f(Z), where f(Y, Z) and f(X, Z) can each be estimated from one sample. Therefore, the sample moment conditions can be computed by combining the two samples.
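As a toy instance of the Y = ∅ case (a hypothetical simulation in which the true odds P(S = 1 | Z)/P(S = 0 | Z) and a correct working model for E[X | Z] are plugged in), consider estimating θ = E[X | S = 1] with m(X, Z; θ) = X − θ, where X is observed only in the S = 0 sample:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 50_000

z = rng.normal(size=n)
p_s = 1 / (1 + np.exp(-z))           # P(S=1|Z)
s = rng.binomial(1, p_s)
x = z + rng.normal(size=n)           # X depends on Z only, so S and X are
                                     # independent given Z (exchangeability)
# (x is simulated for everyone, but the estimator uses it only where s == 0)

odds = p_s / (1 - p_s)               # P(S=1|Z) / P(S=0|Z)
h = z                                # working model for E[X|Z] (correct here)

# Solve 0 = sum{(1-S)*odds*(m - E[m|Z]) + S*E[m|Z]} for theta, with m = X - theta:
theta_aipw = np.sum((s == 0) * odds * (x - h) + s * h) / np.sum(s)
```

The odds term transfers the X-information from the S = 0 sample into the S = 1 population, while the augmentation term h(Z) adds robustness and efficiency.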

| Statistical matching
Another set of methods for data combination problems is statistical matching, which has been proposed mainly under two scenarios. In the first scenario, a sufficient number of units are shared between the two data sources, that is, the two samples partially overlap. In this case, it is convenient to merge the two samples by linking the records relating to the same unit. There is a rich literature on record linkage, which is beyond the scope of this article (Deepak & Jurek-Loughrey, 2018; Fellegi & Sunter, 1969; Herzog et al., 2010; Komarova et al., 2018; Sayers et al., 2016; Winkler, 1999). In the second scenario, the two samples are selected from the same population but have no common unit. In this case, a statistical matching framework has been proposed in survey studies, which finds a matched pair of units according to the shared variable Z and then imputes the missing value for one unit using the observed value from its matched counterpart (D'Orazio, 2015; D'Orazio et al., 2006; Radner, 1980; Ridder & Moffitt, 2007). Validity of the statistical matching approach depends on the conditional independence assumption that, conditional on the shared variable Z, the potentially missing variables Y and X are independent. Under this assumption, matching on Z is sufficient for valid imputation of the missing variable in the S = 1 sample from its matched counterpart, and a similar argument holds for imputation in the S = 0 sample.

| Data combination in regression analysis

Evans et al. (2018) studied a different problem of estimating the coefficient θ of a correctly specified regression model E[Y | Z, X; θ] when both samples are i.i.d. random samples of the same population. Selection exchangeability and positivity were assumed as in Section 4.1.1, while no assumption of separable moments (Graham et al., 2016; Shu & Tan, 2020) or conditional independence (Ridder & Moffitt, 2007) introduced in the previous sections was made. In this setting, identification of θ can be difficult even under linear models, as discussed in Pacini (2019) and Miao et al. (2022). Evans et al. (2018) proposed a doubly robust estimator for θ that solves an estimating equation indexed by a user-specified function g(·) of the same dimension as θ. The doubly robust estimator remains consistent under misspecification of either f(X | Z) or P(S = 1 | Z): an IPW estimator can be obtained by misspecifying the imputation component, that is, by substituting E[Y | Z] with zero in the estimating equation, while an imputation estimator can be obtained by substituting P(S = 1 | Z) with 0.5.
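A minimal sketch of the hot-deck statistical matching discussed in this section, under the conditional independence assumption (hypothetical data; nearest-neighbor matching on the shared variable Z):

```python
import numpy as np

rng = np.random.default_rng(5)
n1, n0 = 3_000, 3_000

# Y and X each depend on Z only, so Y and X are independent given Z.
z1 = rng.normal(size=n1)
y = 2 * z1 + rng.normal(size=n1)     # donor file observes (Z, Y)
z0 = rng.normal(size=n0)
x = -z0 + rng.normal(size=n0)        # recipient file observes (Z, X)

# Hot-deck matching: impute Y for each recipient from the donor whose Z is closest.
order = np.argsort(z1)
pos = np.searchsorted(z1[order], z0).clip(1, n1 - 1)
left, right = order[pos - 1], order[pos]
nearest = np.where(np.abs(z1[left] - z0) <= np.abs(z1[right] - z0), left, right)
y_imputed = y[nearest]

# The synthetic file (Z, X, Y) reproduces E[Y | Z]; check the regression slope.
slope = np.cov(z0, y_imputed)[0, 1] / np.var(z0)
```

Under conditional independence the imputed file recovers relationships driven through Z, but it cannot recover any residual association between Y and X, which is exactly what the assumption rules out.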

| Two-sample instrumental variable and Mendelian randomization
An important setting of the data combination problem is two-sample instrumental variable methods. An instrumental variable (IV) is an exogenous variable satisfying three core assumptions: (I) the IV must be associated with the treatment; (II) the IV must not have a direct effect on the outcome that is not mediated by the treatment; and (III) the IV must be independent of unmeasured confounders. The IV approach is one of the most frequently used methods to mitigate unmeasured confounding, denoted as U. It turns out that the causal effect can be estimated by combining information from two data sources. Let Z denote an instrumental variable. Two-sample IV estimation concerns the scenario where (Z, A, X, S = 1) is available in one data source and (Z, Y, X, S = 0) is available in a separate data source, with (Z, X) shared by the two datasets. No complete data on all variables (Z, Y, A, X) are available. In the following, we suppress the measured covariates X to simplify notation; all arguments are implicitly conditional on X.
We first consider the case of a binary treatment and a binary IV. Assuming that U does not modify the causal effect of A at the individual level, the causal effect is identified by the Wald ratio

β = {E[Y | Z = 1] − E[Y | Z = 0]} / {E[A | Z = 1] − E[A | Z = 0]}.

Hence common IV methods estimate the effect of the treatment using the IV-outcome and IV-treatment associations. The numerator and denominator can be separately estimated from two distinct samples if both are random samples of the same target population. In the general case where A is not necessarily binary and could be a vector, the most common IV approach assumes Y = βA + ε_Y and A = γZ + ε_A, and the IV estimator is given by β̂_IV = ĉov(Z, A)^{-1} ĉov(Z, Y), where ĉov(·, ·) denotes the sample covariance matrix. In the one-sample setting, the IV estimator is equivalent to a two-stage least squares (2SLS) estimator obtained by first regressing A on Z, and then regressing Y on Â, the fitted values of A. Angrist and Krueger (1992) and Arellano and Meghir (1992) showed that the IV estimator can be obtained by computing ĉov(Z, A) from the S = 1 sample and ĉov(Z, Y) from the S = 0 sample, referred to as the two-sample IV estimator. Klevmarken (1982) and Angrist (1995) showed that 2SLS can also be carried out separately using the two samples, referred to as two-sample two-stage least squares (TS2SLS) estimation (Björklund & Jäntti, 1997). In the first stage, A is regressed on Z using the S = 1 sample, and the estimates are then combined with observations of Z in the S = 0 sample to form Â. In the second stage, Y is regressed on Â. Inoue and Solon (2010) pointed out that the equivalence of IV and 2SLS estimation in the one-sample setting does not hold in the two-sample setting. In fact, TS2SLS is more efficient than two-sample IV because it implicitly corrects for differences in the distribution of Z between the two samples.
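A sketch of TS2SLS with simulated data (the coefficients β = 2 and γ = 0.8 and the confounding structure are hypothetical): the first-stage coefficient is estimated in the sample observing (Z, A) and applied to Z in the sample observing (Z, Y).

```python
import numpy as np

rng = np.random.default_rng(6)
n1, n0, beta, gamma = 20_000, 20_000, 2.0, 0.8

def draw(n):
    z = rng.normal(size=n)
    u = rng.normal(size=n)                   # unmeasured confounder
    a = gamma * z + u + rng.normal(size=n)
    y = beta * a + u + rng.normal(size=n)    # naive OLS of y on a would be biased
    return z, a, y

z1, a1, _ = draw(n1)                         # sample 1: (Z, A) observed
z0, _, y0 = draw(n0)                         # sample 2: (Z, Y) observed

# First stage in sample 1, second stage in sample 2 (TS2SLS):
gamma_hat = np.cov(z1, a1)[0, 1] / np.var(z1)
a_hat = gamma_hat * z0                       # predicted treatment in sample 2
beta_ts2sls = np.cov(a_hat, y0)[0, 1] / np.var(a_hat)
```

Because Â is built from the sample-2 values of Z, the second stage implicitly rescales by the sample-2 distribution of Z, which is the source of the efficiency gain over the two-sample IV estimator noted above.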
The above classical two-sample IV methods often assume that the two samples are compatible, that is, drawn from the same observed data distribution f(Z, Y, A). However, in practice the common variable, that is, the IV, can have different distributions between the two samples, that is, f(Z | S = 1) ≠ f(Z | S = 0). Graham et al. (2016) modeled the selection probability P(S = 1 | Z) parametrically and developed a doubly robust and locally efficient estimator that can be applied in more general data combination problems. Similar methods proposed by Shu and Tan (2020), detailed in Section 4.1, were also applied to the two-sample IV problem. It is important to note that the estimator proposed by Graham et al. (2016) is based on the efficient influence function (EIF) derived under a correct model for P(S = 1 | Z) and is therefore doubly robust only under such restricted model specification of nuisance parameters, whereas the estimator of Shu and Tan (2020) is based on the EIF under a nonparametric model for the observed data and is doubly robust without such restrictions. Sun and Miao (2018) established sufficient conditions for nonparametric identification of the ATE allowing for heterogeneous samples, derived the efficiency bound for estimating the ATE, and proposed a multiply robust and locally efficient estimator for estimation and inference.
Using genetic variants as IVs, two-sample Mendelian randomization (MR) methods have also been studied recently, which leverage publicly available summary statistics on genetic instrument-treatment and genetic instrument-outcome associations typically obtained from genome-wide association studies (GWAS; Davey Smith & Ebrahim, 2003; Davey Smith & Hemani, 2014; Lawlor, 2016; Pierce & Burgess, 2013; Spiller et al., 2019; Zhu et al., 2018). Although simple and convenient, the traditional two-sample MR methods typically rely on valid instruments. Methods robust to invalid instruments have been studied (Bowden et al., 2015, 2016; Hartwig et al., 2017; Li, 2017; Sanderson et al., 2021; Zhao et al., 2020), and extensions to the setting of weak instruments have also been studied (Sanderson et al., 2021; Wang & Kang, 2019). Zhao et al. (2019) further considered the scenario when the sample compatibility assumption is violated and proposed methods that are robust to heterogeneous samples.
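As a concrete illustration of how summary statistics from two GWAS samples are combined, the sketch below implements the standard inverse-variance weighted (IVW) two-sample MR estimator built from per-variant Wald ratios. The summary statistics are made-up numbers for three hypothetical instruments, not real GWAS output.

```python
import numpy as np

def ivw_estimate(gamma, Gamma, se_Gamma):
    """Inverse-variance weighted (IVW) two-sample MR estimate.

    gamma:    per-variant instrument-treatment associations (from sample 1)
    Gamma:    per-variant instrument-outcome associations (from sample 2)
    se_Gamma: standard errors of the instrument-outcome associations
    """
    w = gamma**2 / se_Gamma**2     # first-order IVW weights
    ratio = Gamma / gamma          # per-variant Wald ratios
    beta = np.sum(w * ratio) / np.sum(w)
    se = np.sqrt(1.0 / np.sum(w))
    return beta, se

# Hypothetical summary statistics for three independent genetic instruments,
# generated as if the true causal effect were about 0.5.
gamma = np.array([0.12, 0.08, 0.15])
Gamma = np.array([0.060, 0.041, 0.074])
se_Gamma = np.array([0.010, 0.010, 0.010])

beta, se = ivw_estimate(gamma, Gamma, se_Gamma)
print(beta, se)
```

This first-order IVW estimator assumes all instruments are valid; the robust methods cited above (e.g., MR-Egger, weighted median) modify exactly this weighted combination step.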
A more general setting is studied in Li, Miao, Cai, et al. (2020) assuming K + 1 datasets. Specifically, assuming that within each dataset the treatment is randomly assigned given the measured covariates, that is, Y(a) ⫫ A | (X, S), and that E[Y | A, X; β] is linear and additive, Li, Miao, Cai, et al. (2020) showed that the coefficient of A, which is the ATE under the linear additive model, is identifiable by combining summary-level statistics obtained from the separate datasets.

| Distributed data setting
Meta-analysis has a long history in integrating the results from multiple clinical trials with no access to individual-level trial data (DerSimonian, 2015; DerSimonian & Laird, 1986). Recently, another widely studied topic is the analysis of distributed data where individual-level observational data are not shareable due to privacy concerns (Toh, 2020). This is increasingly needed in multidatabase or multicenter studies of comparative effectiveness and safety of medical products using real-world data such as electronic health records. Each data partner can share a summary-level dataset with the analysis center. A few methods have been proposed, and we summarize them ordered by the amount of information shared. The first method is to reduce the dimension of measured confounders using the propensity score or the prognostic score (Hansen, 2008; Rosenbaum & Rubin, 1983), then share individual-level treatment, outcome, and score with the analysis center to apply propensity score methods (Rassen & Schneeweiss, 2012; Shi et al., 2019). The second method is to aggregate subjects into cells defined by confounders or propensity score strata, then adjust for confounding based on counts of subjects in each cell (Cook & Goldman, 1989; Rassen et al., 2010). Propensity score matching within each data partner can be done prior to the aggregation (Toh et al., 2013; Yoshida et al., 2018). The third method is distributed regression (Zhang et al., 2013), and the fourth is meta-analysis of site-specific results (Toh et al., 2013).
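The first method above (sharing only the score, treatment, and outcome) can be sketched as follows. The data-generating process, the logistic first stage, and the pooled inverse-probability-weighted contrast at the analysis center are our own illustrative choices; real distributed analyses would use each site's validated propensity model and typically matching or stratification rather than raw weighting.

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_propensity(X, A, iters=25):
    # Plain Newton-Raphson logistic regression (intercept included),
    # returning fitted propensity scores P(A = 1 | X).
    Xd = np.column_stack([np.ones(len(A)), X])
    b = np.zeros(Xd.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-Xd @ b))
        H = Xd.T @ (Xd * (p * (1.0 - p))[:, None])
        b += np.linalg.solve(H, Xd.T @ (A - p))
    return 1.0 / (1.0 + np.exp(-(Xd @ b)))

def data_partner(n, rng):
    # Runs inside one site: the confounders X never leave the site.
    X = rng.normal(size=(n, 2))
    A = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.5 * X[:, 0] - 0.5 * X[:, 1]))))
    Y = 2.0 * A + X[:, 0] - X[:, 1] + rng.normal(size=n)  # true effect: 2
    ps = fit_propensity(X, A)
    return np.column_stack([ps, A, Y])  # only (score, treatment, outcome) shared

# The analysis center pools the shared columns from two data partners.
pooled = np.vstack([data_partner(20_000, rng), data_partner(30_000, rng)])
ps, A, Y = pooled[:, 0], pooled[:, 1], pooled[:, 2]

# Inverse-probability-weighted ATE using the site-computed scores.
ate = np.mean(A * Y / ps) - np.mean((1 - A) * Y / (1 - ps))
print(ate)  # close to the true effect 2
```

The key privacy property is that confounder adjustment happens entirely inside each site; the analysis center sees only a scalar summary of the confounders per subject.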

| Bayesian causal inference
The Bayesian framework can naturally facilitate the borrowing of prior information across data sources (Gelman, 2006; Hobbs et al., 2011; Ibrahim & Chen, 2000; Kaizer et al., 2018). Boatman et al. (2020) studied the problem of estimating causal effects from a primary source while borrowing from any number of supplemental sources when data on outcome, treatment, and confounders are available in all data sources. When some confounders are unmeasured in a large main dataset but are available in a small validation dataset, a missing data perspective has been used to impute the missing covariates (Gelman et al., 1998; Jackson et al., 2009; Murray & Reiter, 2016). When the number of missing covariates in the main study is large relative to the sample size of the validation study, Antonelli et al. (2017) proposed a Bayesian approach to estimate the ATE in the main study that combines Bayesian variable selection and missing data imputation, allowing for heterogeneous treatment effects between the main and validation studies. Comment et al. (2019) proposed to use informative priors on quantities related to the unmeasured confounding bias in a range of settings, including both static and dynamic treatment regimes as well as treatment-induced mediator-outcome confounding.

| Causal discovery
Data integration has also been studied in causal discovery, which aims to learn the causal relations between variables of a system, using multiple heterogeneous datasets that measure the system under different environments or experimental conditions and with different sets of variables. There are two main types of methods. The first type pools data from different experiments to learn a context-independent causal graph of the system (Cooper & Yoo, 1999; Eaton & Murphy, 2007; Peters et al., 2016; Tian & Pearl, 2001; Zhang et al., 2017). For example, Peters et al. (2016) provided an invariant prediction method built on the idea that the conditional distribution of the outcome given the direct causes is invariant across different experimental conditions. Mooij et al. (2020) proposed to take into account context variables that discriminate the different datasets in standard causal discovery methods applied to the pooled data. The second type derives statistics or constraints from each context separately without pooling data and combines them to learn a single graph (Claassen & Heskes, 2010; Tillman & Spirtes, 2011; Triantafillou & Tsamardinos, 2015).
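The invariance idea of Peters et al. (2016) can be illustrated with a deliberately simplified sketch: for each candidate set of predictors, fit one pooled regression and check whether the residual mean and variance look the same in every environment. The data-generating process and the crude mean/variance checks below are our own hypothetical choices; the actual method uses formal statistical invariance tests and reports the intersection of all accepted sets.

```python
from itertools import chain, combinations

import numpy as np

rng = np.random.default_rng(2)

def make_env(n, shift, rng):
    # X1 causes Y; X2 is a noisy descendant of Y whose noise level depends on
    # the environment, so regressions involving X2 are not invariant.
    X1 = rng.normal(loc=shift, size=n)
    Y = X1 + rng.normal(size=n)
    X2 = Y + (1.0 + 2.0 * shift) * rng.normal(size=n)
    return np.column_stack([X1, X2]), Y

envs = [make_env(5_000, 0.0, rng), make_env(5_000, 1.0, rng)]

def residuals_by_env(cols):
    # Pool both environments, fit OLS of Y on the chosen columns (plus an
    # intercept), and return the residuals split back by environment.
    Xs = [np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
          for X, y in envs]
    b, *_ = np.linalg.lstsq(np.vstack(Xs),
                            np.concatenate([y for _, y in envs]), rcond=None)
    return [y - Xd @ b for Xd, (_, y) in zip(Xs, envs)]

def looks_invariant(res):
    # Crude invariance check: similar residual mean and variance across envs.
    r0, r1 = res
    return abs(r0.mean() - r1.mean()) < 0.1 and 0.85 < r0.var() / r1.var() < 1.18

subsets = chain.from_iterable(combinations(range(2), k) for k in range(3))
accepted = [s for s in subsets if looks_invariant(residuals_by_env(s))]
print(accepted)  # only the true parent set {X1}, i.e., [(0,)]
```

Only the set containing the true direct cause yields residuals whose distribution is stable across environments, which is exactly the invariance property the method exploits.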

| DISCUSSION
In this article, we reviewed a collection of data integration methods in causal inference. A common perspective views data integration in causal inference as a missing data problem where the study sample is a subset of the target population. This problem is referred to as generalizability or verify-in-sample. We summarize the data missing patterns in Sections 3 and 4 in Table 2. Another setting increasingly recognized is when the study sample and the target population are partially or nonoverlapping, in which selection exchangeability requires that the variables that determine study inclusion/exclusion should not be predictive of the outcome or at least should not modify the treatment effect. This problem is referred to as transportability or verify-out-of-sample (Chen et al., 2008; Colnet et al., 2020; Degtiar & Rose, 2021). We summarized causal inference methods under both scenarios and their applications in important real-world problems, including combining clinical trials with external information, correcting for unmeasured confounding in observational studies using auxiliary or trial data, two-sample Mendelian randomization, and distributed data networks. The majority of the methods rely on some form of exchangeability/homogeneity across different data sources; hence, sensitivity analyses for violations of exchangeability assumptions should be routinely conducted. In addition, identification strategies in complex settings, such as when no single sample contains all relevant variables, have not been fully explored, and the connection to the covariate shift problem in machine learning has yet to be fully studied.

CONFLICT OF INTEREST
The authors have declared no conflicts of interest for this article.
TABLE 2 Data missing patterns in the major settings discussed in Sections 3 and 4. Note: For each variable in each sample, ✓ stands for observed, empty stands for unobserved, and ✓/O indicates different settings considered by different papers.

DATA AVAILABILITY STATEMENT
Data sharing is not applicable to this article as no new data were created or analyzed in this study.