Functional Bayesian networks for discovering causality from multivariate functional data

Multivariate functional data arise in a wide range of applications. One fundamental task is to understand the causal relationships among these functional objects of interest. In this paper, we develop a novel Bayesian network (BN) model for multivariate functional data where conditional independencies and causal structure are encoded by a directed acyclic graph. Specifically, we allow the functional objects to deviate from Gaussian processes, which is the key to unique causal structure identification even when the functions are measured with noises. A fully Bayesian framework is designed to infer the functional BN model with natural uncertainty quantification through posterior summaries. Simulation studies and real data examples demonstrate the practical utility of the proposed model.


Introduction
This article develops a novel functional Bayesian network for modeling directed conditional independence and causal relationships of multivariate functional data, which arise in a wide range of applications.For example, learning brain effective connectivity networks from electroencephalogram (EEG) records is crucial for understanding brain activities and neuron responses.Another example is longitudinal medical studies where multiple clinical variables are recorded at possibly distinct time points across variables and/or patients.Knowing causal dependence of these clinical variables may help physicians decide the right interventions.Functional data can also go beyond those defined on time domain e.g., spatial domain (environmental data, spatially-resolved genomics, etc).
Joint analysis of multiple functional objects has attracted great attention in recent years with focuses mainly on reducing dimensionality and capturing functional dependence.For instance, Kowal et al. (2017) and Kowal (2019) proposed to model time-ordered functional data through a time-varying parameterization for functional time series.Using basis transformation strategies, Zhang et al. (2016) built an autoregressive model for spatially correlated functional data, while Lee et al. (2018) modeled functional data in serial correlation semiparametrically.Chiou and Müller (2014) developed a linear manifold model characterizing the functional dependence between multiple random processes.
Functional Graphical Models In a similar but conceptually different manner, functional graphical models have been recently proposed to model conditional independence of multivariate functional data.Graphical models gives rise to compact probabilistic representation of high-dimensional data through the graph-encoded conditional independence constraints.One key challenge is that the graph is typically unknown and must be inferred from data.While graphical models have been extensively studied for vector-and matrixvariate data (Yuan and Lin, 2007;Wang and West, 2009;Leng and Tang, 2012;Ni et al., 2017), only recently have there been several developments for the functional data.Zhu et al. (2016) extended Markov and hyper Markov laws of decomposable undirected graphs for random vectors to those for random functions.Qiao et al. (2019) adopted the group lasso penalty on the precision matrix of coefficients extracted from the basis expansion of functions.Zapata et al. (2022) introduced the idea of partial separability to reduce the computational cost of Qiao et al. (2019).Qiao et al. (2020) further extended Qiao et al. (2019) and proposed to characterize the time-varying conditional independence of random functions through smoothing techniques.To relax the Gaussian process assumption of the aforementioned methods, Li and Solea (2018), Solea and Li (2022), and Lee et al. (2022) proposed models based on additive conditional independence and copula Gaussian models.
Despite these exciting developments of functional undirected graphical models, the work on functional directed graphical models is sparse.Generally, undirected graphs admit a different set of conditional independence constraints from directed graphs.For example, the directed graph in Figure 1a implies X 2 ⊥ X 3 but X 2 ⊥ X 3 |X 1 , yet there exists no undirected counterpart that admits the same set of conditional (in)dependence assertions.More importantly, causal discovery (i.e., generation of plausible causal hypotheses) is only possible with directed graphs given additional causal assumptions (Pearl, 2000).To the best of our knowledge, the functional structural equation model recently proposed by Lee and Li (2022) is the only work that infers directional relationships from multivariate functional data.However, as will become evident in Section 3 and 4, our model differs from theirs in several significant aspects.
Causal Discovery As hinted earlier, one of the two important problems we intend to address in this work is discovering causality from functional observations.Causal discovery is one of the first steps to investigate the physical mechanism that governs the operation and dynamics of an unknown system.Given the learned causal knowledge, subsequent causal inference (e.g., deriving the interventional and counterfactual distributions) can be conducted under the celebrated do-calculus framework (Pearl, 2000).Therefore, inferring causal relationships potentially has more significant scientific impacts than learning associations since it may help answer fundamental questions about the nature.Bayesian networks paired with causal assumptions are among the most popular approaches in identifying unknown causal structure represented by a directed acyclic graph (DAG).One pressing obstacle of using Bayesian networks to discover causality from purely observational data is that in general, only Markov equivalence classes (MEC) can be learned based on conditional independence constraints alone.Causal interpretations of members in the same MEC can be drastically different, and, generally, only bounds on causal effects can be calculated (Maathuis et al., 2009).For example, the three DAGs in Figure 1b constitute an MEC with the only conditional independence X 2 ⊥ X 3 |X 1 , but the causal directions are completely reversed in the last graph compared to the first one.
Since 2006, numerous researchers, however, have found that causal discovery (unique causal structure identification) is indeed possible with additional distributional assumptions on the data generating process, at least for finite-dimensional data.Examples include but are not limited to linear non-Gaussian models (LiNGAM, Shimizu et al. 2006), non-linear additive noise models (Hoyer et al., 2008), and linear Gaussian models with equal error variances (Peters and Bühlmann, 2014).See more related methods in a recent book of Peters et al. (2017).Although remarkable progresses have been made in the causal discovery area for traditional finite-dimensional data, what remains lacking is method capable of discovering causality from general, purely observational, multivariate functional data.We remark that given a known causal graph, there are existing approaches that can be used to infer causal effects.For example, Lindquist (2012) developed a causal mediation analysis framework where the treatment and outcome are scalars and the mediator is a univariate random function.Our scope is substantially different from this line of works in that we do not assume the causal graph to be known; in fact, learning the causal graph structure is precisely the focus of this paper.

Proposed Functional Bayesian Networks
We propose a novel functional Bayesian network model for multivariate functional data for which the conditional independence and causal relationships are represented by a DAG.As one would expect, the proposed functional Bayesian network factorizes over the DAG and respects all directed Markov properties (i.e., conditional independence constraints) encoded in the DAG via the notion of d-separation.Then for ease of exposition, we reformulate the proposed Bayesian network constructed in the functional space to an equivalent Bayesian network defined on the space of basis coefficients via basis expansion.Because in practice, functional data are almost always observed with noises, two essential ingredients are built in the proposed Bayesian networks to capture the functional dependence and to learn the causal structure.First, we capture the within-function dependence through a set of orthonormal basis functions chosen in a data-driven way.The resulting basis functions are interpretable and computationally efficient.Second, we encode the unknown causal structure by a structural equation model on the basis coefficients.Due to the equivalence of probability measures on the functional space and the space of basis coefficients, the conditional independence and causal relationships naturally transform back to the original random functions.To allow for unique DAG identification, we move away from the Gaussian process assumption often adopted by the existing functional graphical models and instead assume our random functions are generated from a discrete scale mixture of Gaussian distributions.We theoretically prove and empirically verify that the unique DAG identification is indeed possible even when the functions are observed with noises.
To conduct inference and uncertainty quantification from a finite amount of data, the proposed model is based on a Bayesian hierarchical formulation with carefully chosen prior distributions.Posterior inference is carried out through Markov chain Monte Carlo (MCMC).We perform simulation studies to demonstrate the capability of the proposed model in recovering causal structure and key parameters of interest.A real data analysis with brain EEG records illustrates the applicability of the proposed framework in real world.We also apply the proposed model to a COVID-19 multivariate longitudinal dataset (shown in Section D of the Supplementary Material).
The rest of the paper is structured as follows.We provide an overview of Bayesian networks in Section 2. The proposed functional Bayesian network is introduced in Section 3, which includes elaborations of the functional linear non-Gaussian model (Section 3.2) and the causal identifiability theory (Section 3.3).Section 4 is devoted to Bayesian inference of the proposed model.We provide simulation studies and applications in Sections 5 and 6, respectively.The main contributions of this paper are summarized in Section 7 with some concluding remarks.

Overview of Bayesian Networks
Throughout the paper, vectors and matrices are boldfaced whereas scalars and sets are not.

DAGs and Bayesian Networks
and a set of directed edges represented by a binary adjacency matrix E = (E j ) where E j = 1 if and only if → j for = j ∈ V .DAGs do not allow directed cycles j 0 → j 1 → • • • → j k = j 0 .Each node j ∈ V represents a random variable X j ∈ X j ; we may use j and X j interchangeably when no ambiguity arises.Each directed edge → j and the lack thereof represent conditional dependence and independence of X and X j , respectively.Note that although X j is often a scalar but it does not need to be.In fact, X j is a random function or an infinite dimensional random vector in this article.Denote pa G (j) = { ∈ V : → j} the set of parents of j in graph G.A Bayesian network (BN) B = (G, P ) on X is a probability model where the joint probability distribution P of X factorizes with respect to G in the following manner, where P j is the conditional distribution of X j given X pa G (j) under P .Let de G (j) = { ∈ V : j → • • • → } denote the descendants of j in G and let nd G (j) = V \de G (j)\{j} denote the non-descendants of j.The BN factorization (1) directly implies the local directed Markov property -any variable is conditionally independent of its non-descendants given its parents, In fact, the reverse is also true: if a distribution P respects the local Markov property according to a DAG G, then P must factorize over G as in (1).In summary, BN factorization and local Markov property are equivalent.We may omit the subscript G of pa G (j) and nd G (j) and simply write pa(j) and nd(j) instead when G is clear from the context.

Causal DAGs and Causal Bayesian Networks
A causal DAG G is a DAG except that the directed edges are now interpreted causally, i.e., we say X is a direct cause (with respect to V ) of X j and X j is a direct effect of X if → j.For simplicity, we will overload nd(j) and pa(j) to denote the noneffects and directed causes of j in a causal DAG.To define a causal BN, we begin by asserting the local causal Markov assumption (Spirtes et al., 2000;Pearl, 2000) -given a causal DAG G, a variable is conditionally independent of its noneffects given its direct causes.By noting the correspondence between noneffects and non-descendants, and between direct causes and parents in DAGs and causal DAGs, the local causal Markov assumption simply states that the distribution P of X respects the local Markov property of the causal DAG G, which in turn implies that P must also factorize over G (recall the equivalence between BN factorization and local Markov property).Therefore, a causal BN B = (G, P ) is a probability model where P factorizes with respect to a causal DAG G in the same way as in (1).

Structural Equation Representation of Bayesian Networks
A BN is often represented by a structural equation model (SEM), where the transformation f j depends on X only through its parents/direct causes X pa(j) , and the exogenous variables = ( 1 , . . ., p ) T ∼ P are assumed to be mutually independent.Denote the set of transformation functions as F = {f 1 , . . ., f p }. Since F and P induce the joint distribution P of X and it is not difficult to show that the induced distribution P factorizes over G, with a slight abuse of notation, we can rewrite the BN as B = (G, F, P ).
3 Functional Bayesian Networks

General Framework
Now we introduce the construction of BNs for multivariate functional data.Denote the space of square integrable functions on domain D with respect to measure µ as L 2 (D) = {h : We focus on a compact D ⊂ R (in fact, without loss of generality, D = [0, 1]) and the Lebesgue measure µ for simplicity.
) be a collection of p random functions.Denote H = p j=1 {(ω, j) : ω ∈ D j } the joint domain of Y and (L 2 (H), B(L 2 (H)), P ) its probability space.Similarly, for any subset A ⊂ [p], denote the joint domain where each node j ∈ V represents a random function Y j .To begin with, we give the formal definition of a functional Bayesian network.
Definition 1 (Functional Bayesian Networks).We say B = (G, P ) is a functional Bayesian network for a set of random functions Y if P factorizes with respect to DAG G, for any measurable sets D j ⊂ L 2 (D j ), ∀j ∈ [p], where P j is the conditional probability measure of Y j given Y pa(j) under P .
Just like the ordinary finite-dimensional BN, the functional BN factorization implies the local Markov property and vice versa.
Proposition 1. Functional Bayesian network factorization is equivalent to functional local directed Markov property.
Proof is trivial.For modeling convenience, we use orthonormal basis expansion of random functions to (equivalently) redefine the functional BN in the space of basis coefficients.Let {φ jk } ∞ k=1 be a sequence of orthonormal basis functions of L 2 (D j ) and expand Y j = ∞ k=1 Z jk φ jk , where Z jk = D j Y j (ω)φ jk (ω)dω.The resulting coefficient sequence Z j = (Z jk ) k=1,...,∞ lies in the space of square summable sequences 2 j = {h j : ∞ k=1 h 2 jk < ∞}.The within-function and the between-function covariance can then be expressed in terms of the covariance of the coefficient sequences, Because L 2 (D j ) and 2 j are isometrically isomorphic for each j, for any disjoint subsets Hence, if Y follows the proposed BN model B = (G, P ), then the coefficient sequences Z follows B Z = (G, P Z ) for some probability measure P Z of Z, and vice versa.Each node of the DAG G either represents a random function Y j or, equivalently, its corresponding coefficient sequence Z j .Moreover, the joint probability measure P of Y factorizes with respect to G if and only if the joint probability measure P Z of Z factorizes with respect to G.
Proposition 2. Suppose Y ∼ P and let Z be the corresponding coefficient sequences from orthonormal basis expansion.Then for any measurable sets D j ⊂ L 2 (D j ), ∀j ∈ [p] if and only if The proof directly follows the preceding paragraph.Again, just like the ordinary finitedimensional BN, if one makes the causal Markov assumption, the DAG G in the proposed functional BN can be interpreted causally.Hereafter, by default, we always make the causal Markov assumption (hence G is a causal DAG, the edge strength is interpreted as direct causal effect, etc) but all the results are simply reduced to those of a directed conditional independence model when the causal Markov assumption is dropped.

Functional Linear Non-Gaussian Bayesian Networks
Section 3.1 introduces a general framework for modeling directed conditional independence and causal relationships for multivariate functional data.In this subsection, we discuss in detail one specific case of the proposed general framework, namely the Functional Linear Non-Gaussian (FLiNG) BNs.Specifically, the FLiNG-BN assumes Z follows a linear SEM, where j is an infinite-dimensional exogenous vector, is an infinite-dimensional direct causal effect matrix from Z to Z j , and → j is present in G (i.e., Z is a direct cause of Z j ) if there exist k and k j such that B j (k j , k ) = 0. Neither causal effects nor the causal graph is assumed to be known; therefore the main goal of this article is precisely to infer them from observational data.Because L 2 (D j ) and 2 j are isometrically isomorphic for all j ∈ [p], the casual relationships of Z encoded in DAG G directly transfer to the the casual relationships of Y , i.e., Z is a direct cause of In practice, the random functions Y can only be measured on finite grids with random noises.In other words, we do not observe realizations of Y but instead we observe realizations of W = (W 1 , . . ., W p ) T where W j = (W j (1), . . ., W j (m j )), which is the set of measurements of Y j on a finite grid D j = {ω j (1), . . ., ω j (m j )} ⊂ D j with independent white noises e j (m) ∼ N (0, σ j ), ∀m ∈ [m j ], W j (m) = Y j (ω j (m)) + e j (m). (3) Note that D j can be different across j (also across realizations).
One seemingly inconsequential element of the FLiNG-BN but turning out to be crucial for discovering causality is the specification of the probability distribution of the exogenous variables j = ( jk ) ∞ k=1 in (2).A tempting choice may be Gaussian but it is the non-Gaussianity of j 's that allows causal identification as we will show in Section 3.3.Specifically, we assume jk to follow a finite scale mixture of Gaussian distributions, jk ∼ M jk m=1 π jkm N (0, τ jkm ), where M jk is the number of mixture components.The non-Gaussian exogenous variables lead to non-Gaussian coefficient sequences Z, which in turn lead to non-Gaussian-process distributed random functions Y .In addition to enabling causal identification, non-Gaussian-processes are robust against outlying curves (Zhu et al., 2011).For finite sample inference, we truncate the orthonormal basis at level K such that φ j = (φ j1 , . . ., φ jK ) T , as commonly done in existing functional data analysis literature.Consequently, (3) is turned into (4)

Causal Identifiability
The proposed functional BNs are useful representations of directed conditional independence and causal relationships for multivariate functional data.The big remaining question is the learning of the underlying (causal) DAGs from observational data.Constraint-based methods, which are often model-free, have been popular for DAG learning.For the proposed functional BNs, we, in principle, can also use constraint-based methods, which test for conditional independence of pairs of functions.However, conditional independence tests are notoriously difficult and inefficient even for scalar random variables.Furthermore, even if we have access to oracle conditional independence tests for random functions, we can only hope for identifying the best MEC by definition (recall that an MEC contains DAGs with exactly the same set of conditional independence relationships).This may be acceptable if one is only interested in learning conditional independence relationships.But as mentioned in Section 1, for causal discovery, this is clearly unsatisfactory because the directionality of a potentially large number of edges of Markov equivalent DAGs may be left undetermined and hence the causal interpretations of these edges are unclear.Because the proposed FLiNG-BN is a proper probability model, we can exploit certain feature of the model, namely the non-Gaussianity, to uniquely identify the underlying causal DAG.
Let P W denote the distribution of W induced from FLiNG-BN and the noises.We say that the causal DAG of FLiNG-BN is identifiable from W if there does not exist another BN B = (G , P ) where G = G and noise variances σ = (σ 1 , . . ., σ p ) such that the induced distributions on W , P W , is equivalent to P W , i.e., P W (W ) ≡ P W (W ).
Theorem 1 (Causal Identifiability).The causal DAG of FLiNG-BN is identifiable if the number of Gaussian mixture components M jk > 1, ∀j, k.
Theorem 1 signifies that by examining the probability distribution P W , to which we have access through the observational data alone, one can gauge the likelihood that a given causal DAG is the data generating DAG.With a finite dataset, we shall focus on weighing different candidate causal DAGs by their posterior probabilities.Here, we provide the outline of the proof; the complete proof is given in Section A of the Supplementary Material.Given a chosen set of basis functions, we show the result in the space of basis coefficients.The problem then transforms to prove that, given Z = BZ + and observe W = Z + e, there does not exist another equivalent parameterization Z = B Z + and W = Z + e .Since we assume each component of follows a Gaussian scale mixture, the induced distribution on W is a multivariate Gaussian mixture (with different precision matrices).We then prove the causal effect matrix B is uniquely identifiable from such mixture model by combining the identification of Gaussian mixture components, uniqueness of LDL decomposition, and proof of causal ordering identification.We demonstrate the identifiability result with a toy example.
We fit linear regression separately to all observations and to observations in each of the four groups with the true causal direction 1 → 2 (regressing W 2 on W 1 ) and the anti-causal direction 2 → 1 (regressing W 1 on W 2 ), which are shown in Figure 2. We observe that the fitted lines are almost identical in the causal direction for all groups whereas they can be quite different across groups in the anti-causal direction.Therefore, only the true causal graph gives a unique regression coefficient among all groups.Notice that if there is only one mixture component (i.e., degeneration to the Gaussian case), no comparison can be made between the causal and anti-causal directions since there will be only one regression line.
The next counter example illustrates the necessity of the non-Gaussian assumption for causal identification.

Bayesian Inference
The inference of the proposed FLiNG-BN framework can be carried out in either a frequentist (e.g., maximizing penalized likelihood) or a Bayesian (e.g., sampling from posterior distribution) fashion.Existing frequentist functional graphical models (Qiao et al., 2019(Qiao et al., , 2020;;Zapata et al., 2022;Solea and Li, 2022;Lee et al., 2022;Lee and Li, 2022) often estimate graphs in two separate steps -(i) estimate the basis coefficient sequence of each function marginally via functional principle component analysis, and (ii) learn an directed/undirected graph based on the estimated coefficient sequences.However, the eigenfunctions that marginally explain the most variation of each individual function do not necessarily explain well the conditional/causal relationships among a set of functions.Moreover, the estimation uncertainty is not propagated from the first step to the second, which may result in overly confident inference.To mitigate these potential drawbacks of the two-step approaches, we propose a fully Bayesian inference procedure that jointly infers basis coefficient sequences and the DAG structure.This joint inference approach constructs orthonormal basis functions adaptive to their conditional/causal relationships and allows for finite-sample inference and uncertainty quantification.

Adaptive Orthonormal Basis Functions
We assume the basis functions to be shared across all random functions (Kowal et al., 2017;Zapata et al., 2022), φ jk (ω) := φ k (ω), ∀j ∈ [p], which are more parsimonious than models based on function-specific basis functions.Moreover, the common basis functions put the basis coefficient sequences Z j , ∀j ∈ [p] on an equal footing (e.g., the magnitudes of basis coefficients are directly comparable) so that they are directly comparable and the BN on Z has a more coherent interpretation.In this case, the non-zero matrix block B j corresponds to a causal connection from Y to Y j .Loosely speaking, if we regard the basis functions as signal channels, then a significant non-zero B j (k j , k ) indicates that Y directly affects Y j through its signal transmission from the k -th channel to the k j -th channel.
As mentioned above, we do not pre-specify a fixed set of orthonormal basis functions but instead they are learned adaptively from data by further expanding them with spline basis functions (Kowal et al., 2017) T is a set of cubic B-spline basis functions with equally spaced knots and A k = (A k1 , . . ., A kL ) T , ∀k ∈ [K] are spline coefficients.Because A k 's are not fixed a priori, so are φ k 's.

Prior Model
We summarize our model and its entailed parameters using a DAG shown in Figure 3.The prior distributions of the model parameters are introduced in this section.We simulate posterior samples through Markov chain Monte Carlo (MCMC).Details are given in Section B of the Supplementary Material.
Prior on B-spline Coefficients A k The prior on A k serves three purposes.First, it forces φ k 's to be orthonormal, i.e., φ k (ω)φ h (ω)dω = I(k = h), ∀k, h ∈ [K].Second, it regularizes the roughness of φ k 's to prevent overfitting and sorts the orthonormal basis functions by increasing roughness.Third, it enables posterior inference on the orthonormal basis functions simultaneously with the graph estimation without having to fix them a priori.
We summarize the main steps of prior specification and refer the details to Kowal et al. (2017).First, to regularize the roughness of φ k in a frequentist framework, one would consider a penalized likelihood with the roughness penalty, where λ k > 0 is the regularization parameter and As a Bayesian counterpart, the regularization term is equivalent to a prior on the B-spline coefficients where Ω − is a pseudo-inverse (since Ω is rank-deficient by 2).Let Ω = U DU T be the singular value decomposition of Ω.To facilitate efficient computation, we follow Wand and Ormerod (2008) and reparameterize ) T where D P is the (L − 2) × (L − 2) submatrix of D corresponding to non-zero singular values and U P is the corresponding L × (L − 2) submatrix of U .The reparameterization induces a prior on Ãk = ( Ãk1 , . . ., ÃkL ) T ∼ N (0, S k ) where S k = diag(∞, ∞, λ −1 k , . . ., λ −1 k ) with the first two dimensions corresponding to the unpenalized constant and linear terms.In practice, one can replace ∞ by a large number, say 10 8 .
Second, we constrain the regularization parameters λ 1 > • • • > λ K > 0 to identify the ordering of basis functions, which sorts the basis functions by decreasing smoothness.Unlike the functional principal component analysis (PCA) where the principal components are ordered by the proportion of variance explained, the adopted Bayesian approach is less prone to rough functions.Given the ordering constraint, a uniform prior is imposed such that . ., K, and L K = 10 −8 .Finally, consider the orthonormal constraint with J = b(ω) bT (ω) dω.This constraint can be easily enforced by projection and normalization during the course of MCMC; see Section B of the Supplementary Material for details.
Priors on the DAG Adjacency Matrix E and Direct Causal Effects B The key problem we aim to address in this article is causal structure learning, i.e., inferring the adjacency matrix, E = (E j ) (recall E j = 1 if and only if → j).We propose to use a beta-Bernoulli-like prior E j ∼ Bernoulli(r) with r ∼ Beta(a r , b r ), subject to the acyclicity constraint, We set a r = b r = 1.Scott and Berger (2010) showed that the beta-Bernoulli prior allows automatic multiplicity adjustment in sparse regression problem.In our context, the marginal distribution of E with r integrated out equals The marginal distribution strongly prevents false discoveries by increasing the penalty against additional edges as the dimension p grows.For example, the marginal (6) favors an empty graph over a graph with one edge by a factor of p 2 − p, which increases with p.
Conditional on E, we assume independent matrix-variate spike-and-slab priors on the direct causal effects, where δ O (•) is a point mass at a K × K zero matrix O and N (•|O, γI, I) is a centered matrix-variate normal distribution with row and column covariance matrices γI and I where I is a K × K identity matrix.The hyperparameter γ indicates the overall causal effect size and is assumed to follow a conjugate inverse-gamma prior, γ ∼ IG(a γ , b γ ) with a γ = b γ = 1.
Prior on the Gaussian Scale Mixture We choose conjugate priors, which allows for straightforward Gibbs sampling.As default, we set α = 1 and a τ = b τ = 1.
Prior on Observation Noises We complete the prior specification with a conjugate inverse-gamma prior on the variance of observation noises, σ j ∼ IG(a σ , b σ ), ∀j ∈ [p] with a σ = b σ = 0.01.
Finally, we summarize the differences between the proposed FLiNG-BN and the work from Lee and Li (2022).First, Lee and Li (2022) assume their functions to be noiseless whereas we consider the scenario where functions are observed with noises.The causal identifiability theory is significantly more complicated when functions are noisy.Second, they assume their functions to be Gaussian whereas our functions are non-Gaussian; this difference leads to different learning algorithms and identifiability theory.Third, their inference is a two-step procedure based on causal ordering identification and sparse functionon-function regression, while the proposed Bayesian hierarchical model admits one-step inference procedure, which learns the graph structure by directly searching in the graph space without having to learn the causal ordering first.

Simulation Studies
We conducted simulation studies to evaluate the proposed FLiNG-BN model.We considered two scenarios.In the first scenario, the functions were observed on an evenly spaced grid; this is the scenario commonly studied in the existing functional undirected graphical models (Qiao et al., 2019) and is also similar to our later EEG application.In the second scenario, the functions were observed on an unevenly spaced grid, similar to the COVID-19 longitudinal application (the details are shown in Section C of the Supplementary Material).We compared the proposed FLiNG-BN with a functional undirected graphical model (FGLASSO; Qiao et al. 2019).We did not make comparison with Lee and Li (2022) due to lack of publicly available code at the time of submission.In addition, we compared FLiNG-BN with approaches based on two-step estimation procedures.In the first step, we extracted basis coefficients obtained from functional PCA using the package fdapace (Carroll et al., 2021).In the second step, given the estimated basis coefficients, we constructed causal graphs using either the LiNGAM (Shimizu et al., 2006) algorithm (termed FPCA-LiNGAM) or the PC (Spirtes and Glymour, 1991) algorithm (termed FPCA-PC).LiNGAM estimates a causal DAG based on the linear non-Gaussian assumption whereas PC generally returns only an equivalence class of DAGs based on conditional independence tests.Their implementations are available from R package pcalg (Kalisch et al., 2020).
To mimic the EEG data application, we simulated data from FLiNG-BN with all the combinations of sample size n ∈ {50, 100, 200}, number of functions p ∈ {30, 60, 90}, and grid size d ∈ {125, 250}.The grid spanned the unit interval [0, 1].We set the true number of basis functions to be K = 5.To generate basis functions, we first simulated the non-orthnormal functions φ U k , ∀k ∈ [K] from a set of L = 6 cubic B-spline basis functions with evenly spaced knots, φ U k = L =1 A k b , where A k 's were generated from a standard normal distribution.We then empirically orthonormalized (φ U 1 , . . ., φ U K ) to get the orthonormal basis functions (φ 1 , . . ., φ K ).The simulation true causal graph G was generated from the Erdős-Rényi model with connection probability 2/p, subject to the acyclicity constraint.Given the true graph G, each block of non-zero direct causal effects Table 1: Functions observed on evenly spaced grid.Average operating characteristics based on 50 repetitions are reported; standard deviations are given within the parentheses.Since LiNGAM is not applicable to cases where q > n with q = pK being the total number of extracted basis coefficients across all functions, the results from those cases are not available and indicated by -.

Applications
We applied the proposed FLiNG-BN to the brain EEG dataset downloaded from https: //archive.ics.uci.edu/ml/datasets/eeg+database(Zhang et al., 1995).The dataset consists of 122 subjects with 77 in the alcoholic group and 45 in the control group, and was previously used to demonstrate functional undirected graphical models by Zhu et al. (2016) and Qiao et al. (2019).The 64 electrodes placed on subjects' scalps (standard positions) measuring voltage values were sampled at 256 Hz for one second.Each subject completed 120 trials under one stimulus or two stimuli.See Zhang et al. (1995) for details of the data collection procedure.We averaged all trials for each subject under the one stimulus condition.We separately analyzed these two groups to find their commonalities and differences of brain activity.Hence, we had n = 77 or n = 45 subjects and p = 64 functions representing the brain EEG signals at different scalp positions recorded at d = 256 time points.We focused on EEG signals filtered at α frequency bands between 8 and 12.5 Hz using the eegfilt function in the EEGLAB toolbox from MATLAB (Delorme and Makeig, 2004).
To check the Gaussianity of the observed functions, we performed Shapiro-Wilk normality test (Shapiro and Wilk, 1965) to each of p = 64 scalp positions at each of d = 256 time points.The null hypothesis (i.e., the observations are marginally Gaussian) was rejected for many combinations of scalp position and time point and therefore, the non-Gaussianity of the proposed model is deemed appropriate.
Five orthonormal basis functions were selected for both the alcoholic and control group according to the procedure described in Section B of the Supplementary Material.We ran MCMC for 10,000 iterations, discarded the first half as burn-in, and retained every 10th iteration after burn-in.The estimated basis functions are shown in Figure 4.As evident from the plot, they are very similar across the two groups.The causal networks estimated by thresholding the posterior probability of inclusion at 0.9 are shown in Figure 5.The sparsity level is approximately 3.0% for the alcoholic group and 2.5% for the control group.Our results reveal several interesting patterns.First, the connection is relatively dense in the frontal region for both groups.Second, the alcoholic group has more directed connections detected in the left temporal and occipital regions.Third, most brain locations tend to connect to adjacent positions, while distant locations are much less connected.Figure 6 shows the common and differential networks for the two groups, where a substantial connectivity difference is observed between the two groups.
In addition, we demonstrated the proposed FLiNG-BN model with an additional application to COVID-19 multivariate longitudinal data, which have unevenly spaced measurements, in Section D of the Supplementary Material.

Discussion
In this paper, we have proposed a functional Bayesian network model for causal discovery from multivariate functional data.We have discussed in detail a specific case of functional Bayesian network, namely the functional linear non-Gaussian model, and proved the underlying causal structure is identifiable even if the functions are purely observational and observed with noises.A fully Bayesian inference procedure has been proposed to implement our framework.Through simulation studies and real data applications, we have demonstrated the ability of our model in causal discovery.We briefly discuss several possible directions to extend our current work.First, we may replace the underlying DAG with cyclic graphs, chain graphs, or ancestral graphs for more general causal and conditional independence structures.We have chosen a linear non-Gaussian SEM on the basis coefficients but this model can be replaced with a nonlinear SEM.Second, instead of fixing the number of basis functions, one could resort to increasing shrinkage priors (Bhattacharya and Dunson, 2011;Legramanti et al., 2020) to adaptively truncate redundant components.Finally, since we have two groups of observations in the EEG application, it would be interesting to jointly estimate the brain networks or directly estimate the differential network.

Figure 2 :
Figure 2: A toy example for demonstration of causal identification.The left (right) panel shows the linear regression of W 2 (W 1 ) on W 1 (W 2 ).Data are simulated from the causal graph 1 → 2. Colored lines are the fitted linear regressions for all observations and for the observations in groups C 1 -C 4 .

Figure 3 :
Figure 3: A DAG illustrating the model hierarchy.Single-line arrows are stochastic relationships and double-line arrows are deterministic relationships.The observed node W j is shown in rectangle and other nodes are shown in circles.

Figure 4 :
Figure 4: Estimated basis functions from brain EEG records that explained 90% of the variation.

Figure 5 :
Figure 5: Estimated causal brain networks from EEG records by FLiNG-BN with posterior probability of inclusion ≥ 0.9, separately for the alcoholic (left) and control (right) group.

Figure 6 :
Figure 6: Common (left panel) and differential (right panel) connections for the two groups.Black arrows indicate common connections, red arrows indicate connections detected by the alcoholic group only, and green arrows indicate connections detected by the control group only.