Individualized causal discovery with latent trajectory embedded Bayesian networks

Bayesian networks have been widely used to generate causal hypotheses from multivariate data. Despite their popularity, the vast majority of existing causal discovery approaches make the strong assumption of a (partially) homogeneous sampling scheme. However, such assumption can be seriously violated, causing significant biases when the underlying population is inherently heterogeneous. To this end, we propose a novel causal Bayesian network model, termed BN‐LTE, that embeds heterogeneous samples onto a low‐dimensional manifold and builds Bayesian networks conditional on the embedding. This new framework allows for more precise network inference by improving the estimation resolution from the population level to the observation level. Moreover, while causal Bayesian networks are in general not identifiable with purely observational, cross‐sectional data due to Markov equivalence, with the blessing of causal effect heterogeneity, we prove that the proposed BN‐LTE is uniquely identifiable under relatively mild assumptions. Through extensive experiments, we demonstrate the superior performance of BN‐LTE in causal structure learning as well as inferring observation‐specific gene regulatory networks from observational data.

that factorizes with respect to the graph.When the graph is directed and acyclic, the graphical model is called a Bayesian network (BN).To interpret BN causally, we make the common causal Markov assumption, namely, a variable is independent of its non-effects given all of its direct causes.For instance, if one believes smoking causes lung cancer only through lung tissue damage (i.e., smoking → lung tissue damage → lung cancer), then it is not unreasonable to assume that given the condition of lung tissue (the direct cause of lung cancer), smoking status (a non-effect of lung cancer) does not contain additional information of the incidence of lung cancer.
It is well-known that the graph structure of a BN is identifiable up to Markov equivalence classes (MECs) with purely observational, cross-sectional data.Thus, traditionally, many methods (Chickering, 2002;Castelletti et al., 2018;Kalisch & Bühlmann, 2007) focus on learning MEC instead of specific BN, which can be subsequently used for bounding causal effects (Castelletti & Consonni, 2021;Maathuis et al., 2009).However, BNs belonging to the same MEC could encode drastically different causal hypotheses (e.g.,  →  vs.  → ).Additional modeling assumptions are thus required to uniquely identify BNs.Several approaches (Altomare et al., 2013;Ni et al., 2015;Shojaie & Michailidis, 2010) consider imposing a known causal ordering of variables.Nevertheless, the ordering may not be known in many real applications.Another series of works (Logsdon & Mezey, 2010;Ni et al., 2018;Oates et al., 2016;Zhang & Kim, 2014) have been proposed to identify BNs with the help of carefully chosen instrumental variables.However, the effectiveness of these methods depends crucially on the quality of instruments.The methods that are most relevant to ours make assumptions on the sampling distribution for identifiability.BNs for continuous data are often represented as sparse additive noise models.Under such representation, BNs are identifiable if, for example, the noises are non-Gaussian (Shimizu et al., 2006), the causal-effect relationships are nonlinear (Hoyer et al., 2008), or the noise variances are equal (Peters & Bühlmann, 2014).
One common assumption of the aforementioned methods is data/sampling homogeneity, that is, all observations are assumed to be independent and identically distributed (IID).However, in many applications, samples are not IID.For example, in cancer genomics, patients with the same type of cancer may have very different tumor genomic profiles, which can significantly affect the diagnosis, prognosis, and response to treatment (Dagogo-Jack & Shaw, 2018).This motivates us to develop a new causal discovery method that can account for-in fact, take advantage of-data heterogeneity in order to discover causal gene regulation at the patient level with observationspecific BNs, which is potentially useful for developing targeted therapy tailored to each individual patient.In addition to identifiability, another major challenge of estimating observation-specific BNs lies in the fact that there is essentially no replicate for each observation.
There are some existing works that explicitly address the data heterogeneity issue.In regression, varying coefficient/sparsity models (Hastie & Tibshirani, 1993;Ni et al., 2019) can be used to accommodate sampling heterogeneity.In graphical models, group-specific graphical models (Ni et al., 2018;Yajima et al., 2015) assume observations to be IID within each pre-determined group.Ni et al. (2019) proposed a graphical regression model, which incorporates subject-level covariates to account for data heterogeneity in BNs.Recently, Peters et al. (2016), Huang et al. (2020), andMooij et al. (2020) also utilized data heterogeneity to establish causality by making the assumption of causal invariance across environments.However, these methods do not allow inference of observationspecific causal graphs.Furthermore, they assume that the covariates/groups/environments are observed.
We propose a novel individualized causal discovery method, termed Bayesian networks with latent trajectory embedding (BN-LTE), that is, capable of inferring observation-specific BNs from purely observational, crosssectional, heterogeneous data.BN-LTE embeds the multivariate observation of interest onto a low-dimensional manifold through a latent variable (dubbed pseudotime, borrowed from the genomic literature) that is nonlinearly mapped to the conditional expectations of the multivariate data.The embedding helps explain the data heterogeneity through the latent trajectory.Conditional on the pseudotime and nonlinear mapping, a trajectory-dependent BN is proposed to characterize the causal information of the system on the observation level.Both the causal graph structure and direct causal effects can vary smoothly along the latent trajectory, which is very different from the causal invariance assumption (Peters et al., 2016).We establish causal structure identifiability theory of the proposed BN-LTE under relatively mild assumptions.A fully Bayesian inference procedure is developed based on Markov chain Monte Carlo (MCMC).Simulations and a real application demonstrate the practical utility of the proposed model.Note that this paper will focus on the causal graph structure learning rather than causal effect inference although the latter can be carried out given the inferred causal graph structure through the do-calculus (Pearl, 2009).

BACKGROUND ON BAYESIAN NETWORKS
Let  = ( 1 , … ,   )  ∈ ℝ  be a -dimensional random vector.A directed acyclic graph (DAG)  = (, ) consists of a set of nodes  = {1, … , } representing the elements of  and a set of arrows  ⊂  ×  representing the direct cause-effect relationships among .We sometimes refer to   and its corresponding node  interchangeably.Cycles are not allowed in the sense that one cannot return to the same node by following the arrows.For any  ≠  ∈ ,  is said to be a parent or direct cause of , and  is said to be a child or direct effect of , if there exists an arrow  → .The set of parents and children of  in  are denoted by   () and ℎ  (), respectively.Node  is a descendant of node  if there exists a set of arrows  =  0 →  1 → ⋯ →   =  and otherwise is a non-descendant.The set of non-descendants of  excluding  in  is denoted by   ().
A BN  = (, ) of  is a probability model of  of which the joint distribution  factorizes over the DAG , where   is the conditional probability distribution of   given its parents.The factorization implies Markov properties (i.e., conditional independence assertions) of  that are encoded and can be read off from  through the notion of d-separation (Pearl, 2009).A distribution  is faithful to a DAG  if  encodes all the conditional independence relationships of .Two BNs  and  ′ are said to be Markov equivalent if their corresponding DAGs encode the same Markov properties and an MEC contains BNs that are Markov equivalent.For example, Figure 1 shows three DAGs from BNs in the same MEC, which encodes the only independence relationship  2 ⟂  3 | 1 , but their causal interpretations are quite different.Therefore, constraintbased methods such as the PC algorithm (Spirtes, 2000), which rely on conditional independence relationships alone, cannot differentiate these graphs.Consequently, if one of them is the true causal model, the best conclusion one can draw is that one of them is true, which is unsatisfactory for many causal applications.Two BNs  = (, ) and  ′ = ( ′ ,  ′ ) are distribution equivalent (Spirtes & Zhang, 2016) (Spirtes, 2000).Exploiting the latter, existing works have shown that Markov equivalent BNs are not distribution equivalent when   's are non-Gaussian (Shimizu et al., 2006),   's are nonlinear (Hoyer et al., 2008), or all   's are Gaussian with the same variance (Peters & Bühlmann, 2014).Therefore, under any of these conditions, BNs (not just MECs) are uniquely identifiable with purely observational, cross-sectional data.

Sampling model
We introduce a latent variable  lying on the lowdimensional manifold  to characterize the heterogeneity of the observations.Borrowed from the genomic literature, we call the latent variable pseudotime.Without loss of generality, we assume the manifold  = [0, 1].Define a DAG function () = (, ()) of which the edge set () varies with the pseudotime .Accordingly, as in BNs, we assume that the distribution of  factorizes with respect to the structure of () and hence also varies with the pseudotime , where  is a generic notation for population-level model parameters.In words, at any given value of , the random vector  is modeled with a BN, where the conditional independence relationships can be read off from (), because BN factorization implies Markov properties (Lauritzen, 1996).We model the conditional probability distribution   in Equation ( 2) as where   () = 0 if  ∉  () ().The form of Equation ( 3) is motivated by both practical and theoretical considerations, which will be detailed later.Combining Equations ( 2) and ( 3 2) and (3).
BN-LTE encodes the directed Markov properties of  conditional on any fixed value of .For example, the local Markov property states that   and {  ∶  ∈  () ()} are conditionally independent given {  ∶  ∈  () ()} and .
The baseline function   (),  = 1, … , , captures the latent dynamic of the expectation of   when all its parents are set to zero.The direct causal effect   (), ∀ ≠ , is a function of , which not only encodes the constraint required by (), but also tracks the latent dynamic of the direct causal effect along the trajectory of pseudotime .For example, there exists an arrow  →  (i.e.,   is the direct cause of   ) at pseudotime  if and only if   () ≠ 0, and the sign and magnitude of non-zero   () indicate how the two variables are causally related.When the baseline and causal effect functions are all constant in , our model reduces to a standard Gaussian BN.
Let  1 , … ,   and  1 , … ,   be  realizations of  and , respectively.Since   is unobservable in many applications, it is treated as a latent variable to be inferred.Note that different from other population-level model parameters,   is an observation-specific parameter, with which the baseline and direct causal effect functions vary.BN-LTE, therefore, can be viewed as a continuous relaxation of group-specific causal graphical models (Huang et al., 2020;Mooij et al., 2020;Ni et al., 2018;Peters et al., 2016;Yajima et al., 2015), for which the latent variable   ∈ {1, … , } is discrete and indicates groups.Another major difference lies in the fact that   is latent in our model, whereas group indicator   is typically assumed to be known in group-specific causal graphical models.
The choices of the direct causal effect function   (⋅) and the baseline function   (⋅) in Equation ( 3) are yet to specified.Two critical properties are desired for   (⋅).Smoothness-similar -values should lead to similar causal models, for example, if   ≈   ′ , then (  ) ≈ (  ′ ) (recall (⋅) is determined by   (⋅)).Without smoothness, similar observations may have vastly different causal graphs, which is difficult to interpret in many applications including ours.Piecewise sparsity-() should be sparse for any  because a complete causal graph (i.e., all pairs of nodes are connected) is difficult to make sense of, and dissimilar -values may lead to distinct causal models, that is, if   ≉   ′ , it is possible that (  ) ≠ (  ′ ).To equip   (⋅) with these two properties, we model it with splines,   (⋅) = ∑   =1     (⋅), where  1 (⋅), … ,    (⋅) are cubic Bspline basis functions with evenly spaced interior knots.The finite support property of B-spline basis functions allows us to achieve the smoothness and piecewise sparsity of   (⋅).We demonstrate the modeling choice of   (⋅) and its implication on causal graphs in Figure 2. In this illustrative example with three nodes, since  12 ( 1 ) ≠ 0,  23 ( 1 ) ≠ 0, and  13 ( 1 ) = 0 (top panel), there are arrows 2 → 1 and 3 → 2 but a missing arrow between nodes 1 and 3 in ( 1 ).Furthermore, due to the piecewise sparsity of causal effect functions, graphs ( 1 ), ( 2 ), and ( 3 ) have distinct structures as shown in the bottom panel of Figure 2. The top panel of Figure 2 also illustrates the capability of Bsplines in identifying the sparse regions of the causal effect functions by encouraging the adjacent spline coefficients to be zero.We defer the discussion of priors that induce the desired piecewise sparsity to Section 3.3.For simplicity, we also model the baseline function   (⋅) with cubic B-spline, where  1 (⋅), … ,    (⋅) are the basis functions.For the choice of   and   , one possibility would be imposing priors (e.g., geometric distribution) and estimating them from data, which then, however, requires a reversible-jump MCMC for posterior inference.We will not pursue this direction but instead fix   and   to moderately large numbers.Our sensitivity analyses show that our method was relatively robust (see Section C.2).

Causal identifiability theory
There are two potential sources of non-identifiability of BN-LTE.First, as mentioned in Section 2, BNs are generally only identifiable up to MEC for purely observational data without additional assumptions.As shown in previous works (Hoyer et al., 2008;Peters et al., 2014;Shimizu et al., 2006), under certain distributional assumptions, Markov equivalent BNs are not distribution equivalent and are, therefore, distinguishable from each other.Second, the pseudotime may not be identifiable, which is reminiscent of many other latent variable models in the literature.For example, one can arbitrarily switch cluster labels without changing the distribution of a mixture model.Similarly, latent factors are only identifiable up to orthogonal transformations in the factor model.In the following, we characterize the identifiability property of the proposed BN-LTE.We first define distribution equivalence for the proposed BN-LTE.
We first show that the causal DAG of the proposed BN-LTE is identifiable.We define, for a fixed DAG , the preimage of DAG function () by () = { ∈  ∶ () = }, which is the set of pseudotimes that lead to the same fixed DAG structure .
All proofs can be found in Section A of the Supporting information.Note that we do not assume the functions   (⋅) and   (⋅) in Theorem 1 (also Theorem 2) to be additive splines as formulated in Section 3.1 and therefore the result is valid for general functions.Loosely speaking, the identifiability condition (4) of Theorem 1 states that BN-LTE is identifiable unless the quadratic transformation of heterogeneous direct causal effects on any specific node accidentally becomes homogeneous (i.e., constant in ).For example, the condition is violated and hence BNs are not identifiable if all direct causal effects are constant in .This is not surprising because BN-LTE in this case is reduced to a standard Gaussian BN, which is known to be non-identifiable.Because of its peculiarity, Equation ( 4) is deemed a mild condition.Faithfulness is a common assumption adopted by many existing causal discovery methods (Chickering, 2002;Peters & Bühlmann, 2014;Spirtes, 2000).Note that the model parameters that lead to its violation have Lebesgue measure zero (Meek, 2013).Faithfulness allows us to restrict our attention to BNs, whose DAGs have the same skeleton due to Verma and Pearl (1990).It may be possible to relax this assumption; however, accordingly, current condition (4) is likely to become more stringent.
Next, we show that we can identify the pseudotime  up to a monotonic transformation.Let (⋅) denote a vectorvalued function that stacks all baseline functions   (⋅) and non-zero causal effect functions   (⋅) for  = 1, … ,  and  ≠ .
Theorem 2. Assume the same conditions of Theorem 1 hold.In addition, if (⋅) is continuous and injective, then  is identifiable up to a monotonic transformation, that is, for any instantiations  1 , … ,   and  ′ 1 = ( 1 ), … ,  ′  = (  ) such The injectivity condition.Every function on the right panel either repeats or mirrors the corresponding one on the left panel so that the injectivity condition is violated.
The condition that assumes  ↦ () to be injective as a vector-valued function should not be interpreted as a requirement that each   (⋅), ∀ and non-zero   (⋅), ∀ ≠  must be injective.In fact, this condition can still hold even if none of the individual functions are injective.For example, on the left panel of Figure 3, neither  1 () = sin(2( − 0.5)) nor  2 () = cos(2( − 0.5)) is injective within (0, 1) but the resulting ( 1 (),  2 ()) is injective.In other words, the condition is easily met except for artificially constructed examples such as repeating or mirroring the original injective function, as demonstrated in the right panel of Figure 3.Our prior model introduced in Section 3.3 will favor a simpler/smoother model and hence effectively avoid these overly complex specifications.
While there are still infinitely many configurations of  1 , … ,   that would yield the same likelihood, the pseudotimes are useful in explaining the data heterogeneity, sorting the observations (because different configurations are merely monotonic transformation of each other), and, most importantly, enabling the causal identification of BN-LTE.The pseudotime in essence rearranges the indices of observations so that the baseline and causal effect functions are smoothly varying along the relabeled observations.In our motivating cancer genomic example, the ordering of observations is informative in characterizing the disease progression.If desired, prior information regarding the pseudotime can be incorporated to further improve its identifiability and interpretability.
The assumption of data heterogeneity in our work is different from the recent methods (Huang et al., 2020;Mooij et al., 2020;Peters et al., 2016;Pfister et al., 2019;Rothenhäusler et al., 2021) in several significant ways.
First, most of these methods (except Pfister et al., 2019) require  to be observed, whereas BN-LTE is applicable to heterogeneous data when  is latent.Second, their methods do not infer an observation-specific model-some of them allow the direct causal effects to vary with , but the graph structure is fixed across all observations.In our view, if the direct causal effect is assumed to vary with , it is quite natural to expect its statistical significance (i.e., graph structure) to vary with  as well.BN-LTE allows individualized inference of causal graphs.

Prior model
We discuss prior specifications of spline coefficients   's and   's, pseudotimes   's, and noise variances  2  's, which parameterize the proposed BN-LTE.

3.3.1
Prior of   's Strictly enforcing the injectivity condition in Theorem 2 is computationally tedious and practically unnecessary.
In practice, encouraging enough smoothness of   (⋅), ∀ seems sufficient to avoid the artificially constructed cases as mentioned in Section 3.2 because they are much rougher than the natural cases (Figure 3).We penalize the roughness of   (⋅), ∀ via penalized splines.Under the Bayesian framework, the roughness penalty corresponds to a random walk prior on the spline coefficients (Lang & Brezger, 2004), where  is a fixed second-order difference penalty matrix,  − is its pseudo-inverse, and   is the roughness parameter, which controls the strength of the penalty.A smaller   lead to smoother   (⋅).In order to encourage smoothness, we impose a gamma prior on the roughness parameter   ∼ gamma(  ,   ).Compared to the commonly adopted inverse-gamma prior, a gamma prior with   ≤ 1 evaluates to infinity at zero and thus encourages smooth curve fit.Note that gamma distribution is a special case of generalized inverse Gaussian distribution with gamma(  ,   ) = GIG(2  , 0,   ), which is conjugate to the Gaussian distribution.Hence, the full conditional distribution of   is still generalized inverse Gaussian, which facilitates the implementation of Gibbs sampler.

Prior of 𝑡 𝑗𝓁𝑘 's
Recall that the causal effect function evaluated at pseudotime   encodes the causal graph structure of observation : arrow  →  ∈ (  ) is included in graph (  ) for observation  if   (  ) ≠ 0. In order for the graph structure and direct causal effects to both vary smoothly with the pseudotime, we assume   (⋅) to be smooth and piecewise sparse.
For the latter, we assign a spike and slab prior (George & McCulloch, 1993) on the cubic spline coefficients to select sparse regions of   (⋅), that is, for  ≠  and  = 1, … ,   , where   ∈ {0, 1} is a latent indicator of the significance of   .Generally, at any point  ∈  (except at the knots), there are four spline basis functions that are non-zero for a cubic B-spline (e.g., see the top panel of Figure 2).Therefore, if these corresponding spline coefficients are all zero, then   (⋅) ≡ 0 in the interval  = (, ) defined by the two closest knots  and  near , which in turn implies a missing arrow  →  in ( ′ ), ∀ ′ ∈ .Furthermore, if   = 0 for all  = 1, … ,   , then the arrow  →  will be excluded in the entire population.A beta-Bernoulli prior is used for   ∼ Bernoulli() and  ∼ beta(  ,   ), subject to the constraint that the causal graph is acyclic at any   , which automatically controls multiplicity (Choi et al., 2020).Alternatively to the beta-Bernoulli prior, a Markov/autoregressive type of prior could have been used to encourage consecutive pairs (  ,  +1 ) of spline coefficients to be simultaneously 0 or 1, which in turn would induce smoother changes in the graph structure.But we find that the conditionally independent prior of   does not show lack of smoothness in both simulations and applications.Therefore, for simplicity, we do not pursue this direction.A gamma prior is assigned to the hypervariance  2  ∼ gamma(  ,   ).

Prior of 𝑧 𝑖 's
Pseudotimes characterize the heterogeneity of the observations and can be interpreted as the disease development trajectory in biomedical applications.Subjective priors could be used if prior information of the disease development trajectory is available.When lacking, we assume the Coulomb repulsive process prior (Wang & Dunson, 2015), which encourages  1 , … ,   to fill out the entire unit interval  = [0, 1], with a repulsive parameter  and  1 , … ,   ∈ [0, 1].This prior has been shown to have better empirical performance than simpler alternatives such as standard uniform distribution by avoiding identifiability issues due to scale-and translation-invariance (Campbell & Yau, 2015).If desired, some   can be fixed according to prior knowledge.For example, one can set   = 0 if the th observation is deemed to have pseudotime zero (e.g., known normal tissues in the human cancer application).

3.3.4
Prior of  2  's The model is completed with an inverse-gamma prior

Posterior inference
The posterior inference is carried out through MCMC.Most sampling steps are Gibbs due to conjugacy; details are provided in Section B of the Supporting information, where we also show how to process Monte Carlo samples to get the estimates of causal graphs, causal effect functions, pseudotimes, and other parameters of interest, as well as a diagnostic procedure for checking the identifiability condition (4).

Simulation studies
We empirically demonstrate the performance of the proposed model in terms of pseudotime and causal graph structure estimation with simulated datasets.Since there is no alternative approach for joint inference of pseudotime and observation-specific causal graph structure, we compared BN-LTE with methods that infer pseudotime, SCOR-PIUS (Cannoodt et al., 2016) and TSCAN (Ji & Ji, 2016), a method that infers population-level causal BN (GDS; Peters et al. 2014), and methods that infer observationspecific association (non-causal) networks, CSN (Dai et al., 2019) and LOGGLE (Yang & Peng, 2020).Details of our simulations are provided in Section C of the Supporting information; here, we provide a brief summary.We considered two data-generation settings, each of which contains four different combinations of sample size and dimension (, ) ∈ {(300, 30), (500, 30), (500, 50), (800, 50)}.In the first setting, we generated causal graphs such that they are acyclic at any point of the pseudotime; whereas in the second scenario, we constrained the global union graph to be acyclic.For both settings, BN-LTE was better than TSCAN and SCORPIUS in terms of pseudotime inference.
As for causal graph recovery, BN-LTE substantially outperformed GDS, CSN, and LOGGLE throughout the scenarios especially in controlling the false discovery rates.
In addition, we performed sensitivity analyses with respect to the choice of hyperparameters, the number of spline basis functions, deviation from Gaussian noises, and the presence of unmeasured confounders.BN-LTE was relatively robust within the tested scenarios.We also demonstrate how the degree of heterogeneity influences the practical identifiability.

Application
Breast cancer is a molecularly heterogeneous genetic disease with distinct tumor genomic features due to various factors such as disease stage, molecular subtype, etc.It has been shown that critical gene networks can change over the course of cancer development (Huang et al., 2009;Moustakas & Heldin, 2007).However, a thorough investigation of the evolution of gene networks is still lacking, which motivated our study.We demonstrate the capability of BN-LTE in characterizing heterogeneous patient-specific gene regulatory networks along the latent trajectory of disease development, using the gene expression data downloaded from the Cancer Genome Atlas (TCGA).The dataset contains  = 1,215 heterogeneous tissue samples with 113 normal and 1,102 tumor tissues.We selected  = 58 genes that are involved in critical breast cancer disease pathways, including ESTROGEN, MAPK, PI3K-AKT, NOTCH, WNT, and P53 signaling pathways (Al-Hussaini et al., 2011;Giltnane & Balko, 2014;Gasco et al., 2002;Howe & Brown, 2004;Miricescu et al., 2020;Yager & Davidson, 2006).We ran MCMC for 5,000 iterations and discarded the first half as burn-in.The Markov chain exhibited good mixing and showed no lack of convergence by MCMC diagnostics; details are provided in Section D of the Supporting information.The diagnostic procedure for the identifiability condition (4) also did not show any sign of violation.

Disease development trajectory
For visualization, we projected the expression data onto a two-dimensional space using principal component analysis, which is shown in Figure 4 (note that it appears in color in the electronic version of this paper).The tissue samples are color-coded by the inferred pseudotime.From the plot, the points that are close to each other tend to have similar colors, suggesting that the inferred pseudotime was capable of capturing the expression heterogeneity among the tissue samples.Moreover, most of the normal tissue This seemed consistent with the simulation results, where BN-LTE and TSCAN had similar performance in recovering the pseudotime, whereas SCORPIUS had less favorable performance.

Gene regulatory networks
We estimated the gene regulatory networks with posterior expected false discovery rate controlled at 0.05 (Müller et al., 2006).In Figure 5, we report the patient-specific gene regulatory networks along the disease development trajectory at 6 evenly spaced points as well as the union network across all observations.We caution readers that the union graph can only be interpreted marginally.That is, a path  →  →  in the union graph indicates  →  is present in some tissue samples and likewise for  → , but  →  →  may not be present in any of the estimated causal graphs.We omitted nodes/genes that are disconnected from the rest of genes in the estimated patient-specific graphs.Some of the identified regulatory relationships are supported by the biomedical literature.For example, DVL1 is known as an upstream regulator of AXIN1 in the WNT pathway, and the binding of DVL1 to AXIN1 has been shown to inhibit AXIN1-mediated downregulation of CTNNB1 (Salahshor & Woodgett, 2005).MAPK1 is a well-known downstream target of HRAS in the PI3K-AKT pathway.Researchers have also shown that FGF1 signals through the activation of transmembrane FGFR1 (Ornitz & Itoh, 2015).We depict the causal effect functions for these pairs of genes and their 95% credible bands in Section D.1 of the Supporting information.Interestingly, our study generated the hypothesis that DVL1 regulated AXIN1 at the initial stage, but the regulation was reversed as the disease develops.The validity and mechanism of such a switch are unknown and need to be further investigated experimentally.In addition, the RAS family has been shown to play a crucial role in regulating the metastasis of basal-type breast cancer cells.Basal-like breast cancer is a particularly aggressive molecular subtype and a major clinical challenge (Toft & Cryns, 2011).Our analysis showed the regulatory relationship between HRAS and MAPK1 appeared at a certain tumor state, which can be further explored as a potential target for precision medicine.In Section D.1 of the Supporting information, we show the estimated gene expression baseline dynamics.For comparison, we applied GDS, CSN, and LOGGLE, where the pseudotime used in LOGGLE was estimated from TSCAN.
The results are shown in Figure S.7 in Section D.3 of the Supporting information.The graphs obtained from GDS and LOGGLE shared some similarity with ours, but LOG-GLE is undirected without causal information and GDS only infers the population-level causal graph.The graph obtained from CSN is quite dense since it infers marginal associations (rather than conditional dependencies), and is therefore susceptible to spurious/indirect correlations.

DISCUSSION
We have introduced BN-LTE for generating individualized causal hypotheses for heterogeneous multivariate data.BN-LTE explicitly accounts for heterogeneity through latent trajectory embedding.We have established the causal identifiability theory of BN-LTE for observational, cross-sectional data.Through experiments, we have, in our opinion, demonstrated the superior performance of BN-LTE compared to state-of-the-art alternative methods.
There are a few directions that can be taken.First, BN-LTE can be modified by incorporating branching processes such as mixture of Gaussian processes (Lönnberg et al., 2017) or branching Gaussian processes (Boukouvalas et al., 2018) for inference of latent trajectories that include branches.Second, BN-LTE allows for convenient factorization (1) but cycles are not allowed.This restriction can be potentially relaxed with directed cyclic graphical models for modeling gene feedback loops.One challenge, however, is to investigate its identifiability; the proof of Theorem 1 is not directly applicable as it relies on the specific characterization of MEC for BNs.The characterization of MEC for directed cyclic graphical models is far more complex.Third, it should also be possible to relax the linearity and Gaussianity assumptions in Equation (3) as well as the no unmeasured confounders and faithfulness assumptions in Theorem 1.

F
I G U R E 2 An illustrative example of trivariate BN-LTE.Top: piecewise sparse direct causal effect functions (black curves) are approximated by weighted sum of B-spline basis functions (grey curves), where the magnitude and sign are indicated by curve thickness and direction (zero weights are represented by dashed curves).Bottom: the corresponding DAG function evaluated at three distinct points.BN, Bayesian networks; DAG, directed acyclic graph.

F I G R E 4
Pseudotime inference.Gene expression data are projected to a two-dimensional space.Normal tissues are marked as asterisks and pseudotimes are color-coded.This figure appears in color in the electronic version of this article, and any mention of color refers to that version.samples (marked as asterisks) are almost exclusively red colored; note that the information of normal versus tumor tissue samples was not used in the model fitting stage.The result indicates a good recovery of the underlying disease developmental status.For comparison, we applied SCOR-PIUS and TSCAN.The results are shown in Figure S.6 in Section D.2 of the Supporting information.From this plot, we notice that the result from TSCAN resembled ours in the sense that most of the normal tissue samples (marked as asterisks) are red colored.By contrast, SCORPIUS failed to put the normal tissues at the either extreme of the trajectory (blue points in the left panel of Figure S.6) and hence the estimated pseudotime cannot be interpreted as the disease developmental status.
Gene regulatory networks.Solid arrows are positive regulations, dashed arrows are negative regulations, and node size is proportional to its degree.(a) Union network across all observations.(b) Gene annotations.(c)-(h) Patient-specific networks at 6 selected points.
) +   , where   ≡ 0 if  ∉   () and   's are the noise terms.When   (  ) =     is linear,   is called the direct causal effect of   on   .Distribution equivalent BNs are Markov equivalent but in general not vice versa