Methods and tools for causal discovery and causal inference

Causality is a complex concept whose development is rooted in several fields, such as statistics, economics, epidemiology, computer science, and philosophy. In recent years, the study of causal relationships has become a crucial part of the Artificial Intelligence community, as causality can be a key tool for overcoming some limitations of correlation‐based Machine Learning systems. Causality research can generally be divided into two main branches, that is, causal discovery and causal inference. The former focuses on obtaining causal knowledge directly from observational data. The latter aims to estimate the impact deriving from a change of a certain variable over an outcome of interest. This article covers several methodologies that have been developed for both tasks. The survey does not only focus on theoretical aspects, but also provides a practical toolkit for interested researchers and practitioners, including software, datasets, and running examples.

are pretty self-explanatory and can be found in several books covering causality, such as Pearl (2009), Peters et al. (2017), and Spirtes et al. (2000).
A graph G = (V, E) is defined by a set of nodes (or vertices) V and a set of edges E, where each edge carries a mark (or label) from a set M. In particular, an edge can be marked as directed, undirected, or bi-directed, respectively written as U → V (or V → U), U − V, and U ↔ V. Nodes U and V are said to be adjacent. The edge relation E is actually a partial function, that is, no more than one mark is assigned to a pair of adjacent nodes. A graph G is directed if all its edges are directed. It is a pattern if each edge is either directed or undirected. A node U ∈ V is a parent of another node V ∈ V if U → V ∈ E; the node V is then said to be a child of U. We write Pa(V) for the set of parents of V, and Ch(U) for the set of children of U. An (acyclic) path in G is a sequence of distinct vertices V_1, …, V_n such that an edge ({V_j, V_{j+1}}, M_j) between two consecutive vertices is in E, for j = 1, …, n − 1. When all the edges are directed as V_j → V_{j+1}, the path is called a directed path. In such a case, the node V_1 is called an ancestor of V_n, while V_n is called a descendant of V_1. The set of all the ancestors of V is denoted as An(V), while the set of descendants is written as De(V). Notice that, by convention, V ∈ An(V) and V ∈ De(V). A directed graph is called a directed acyclic graph (DAG) if there is no directed cycle, that is, there exists no pair of vertices V ≠ U with a directed path from V to U and from U to V.
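These notions translate directly into code. The following is an illustrative Python sketch (not from the article) that represents a directed graph as a set of edge pairs; Pa, Ch, An, De, and the acyclicity check follow the definitions above, including the convention that V ∈ An(V) and V ∈ De(V).

```python
def parents(E, v):
    """Pa(v): all nodes u with a directed edge u -> v."""
    return {u for (u, w) in E if w == v}

def children(E, u):
    """Ch(u): all nodes w with a directed edge u -> w."""
    return {w for (v, w) in E if v == u}

def descendants(E, u):
    """De(u): u together with every node reachable from u by a directed path."""
    de, frontier = {u}, {u}
    while frontier:
        frontier = {w for v in frontier for w in children(E, v)} - de
        de |= frontier
    return de

def ancestors(E, v):
    """An(v): v together with every node having a directed path to v."""
    an, frontier = {v}, {v}
    while frontier:
        frontier = {u for w in frontier for u in parents(E, w)} - an
        an |= frontier
    return an

def is_dag(E, V):
    """DAG check: no pair of distinct vertices with directed paths both ways."""
    return not any(u != v and v in descendants(E, u) and u in descendants(E, v)
                   for u in V for v in V)

# The kidney-stones DAG used later in the article: U -> T, U -> Y, T -> Y.
E = {("U", "T"), ("U", "Y"), ("T", "Y")}
```

For instance, `parents(E, "Y")` returns `{"U", "T"}`, and adding the edge `("Y", "U")` makes `is_dag` fail, since it creates a directed cycle.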
DAGs were used by Pearl (1985) as a graphical representation for a constrained joint probability distribution of a collection of random variables. Let us consider p random variables X = (X_1, …, X_p) with joint distribution P(X). Let P(X_i | S) be the marginal distribution of X_i conditional on S ⊆ X.
Definition 1. Given a DAG G = (X, E), the random variables X form a Bayesian network with respect to G if:

P(X) = ∏_{i=1}^{p} P(X_i | Pa(X_i)).  (1)

Bayesian networks are graphical representations of probabilistic relations among variables, where nodes represent the variables themselves, while edges represent the conditional dependencies among the involved variables. It is worth noticing that such a representation is convenient, as it allows one to clearly model how the variables are related in the system. For example, let us consider the well-known example about kidney stones, where one is interested in understanding how drug usage influences the recovery of patients. Let T be a binary treatment variable (treated/not treated), and Y the outcome (recovered/unrecovered), such that T → Y. Let us also consider some confounding variable U, such that U → T and U → Y. For instance, U could be the age of the patient, or the size of the kidney stones, or any other factor that might affect both the access to the treatment and the ability to recover. This set of information can be represented through the DAG shown in Figure 1. The factorization formula (1) is equivalent for DAGs (Pearl, 1989) to the following Markov condition, stating that a variable is conditionally independent (⊥⊥) of each of its non-descendants, given its parents.
FIGURE 1 The kidney stones example represented as a DAG: T is the treatment indicator, Y is the outcome, and U is a confounder

Definition 2. (Markov Condition) Given a DAG G = (X, E), the random variables X satisfy the Markov Condition if for every X ∈ X, X ⊥⊥ X \ (De(X) ∪ Pa(X)) | Pa(X).
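To make the factorization formula (1) concrete, the following Python sketch (with made-up probability values) encodes the kidney-stones DAG U → T, U → Y, T → Y and checks that the factorized product P(U)·P(T | U)·P(Y | T, U) defines a proper joint distribution.

```python
# Hypothetical conditional probability tables for the kidney-stones DAG.
# All numeric values below are made-up illustrations.
P_U = {0: 0.5, 1: 0.5}                       # P(U): the confounder
P_T_given_U = {0: {0: 0.3, 1: 0.7},          # P(T | U): access to treatment
               1: {0: 0.8, 1: 0.2}}
P_Y1_given_TU = {(0, 0): 0.5, (1, 0): 0.9,   # P(Y = 1 | T, U): recovery
                 (0, 1): 0.2, (1, 1): 0.6}

def joint(u, t, y):
    """Factorization (1): P(U, T, Y) = P(U) * P(T | U) * P(Y | T, U)."""
    p_y1 = P_Y1_given_TU[(t, u)]
    return P_U[u] * P_T_given_U[u][t] * (p_y1 if y == 1 else 1 - p_y1)

# A valid joint distribution must sum to 1 over all configurations.
total = sum(joint(u, t, y) for u in (0, 1) for t in (0, 1) for y in (0, 1))
assert abs(total - 1.0) < 1e-12
```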
The Markov Condition is not sufficient to read off arbitrary conditional (in)dependencies entailed by a Bayesian network. For this, we need d-separation. Let us first introduce the notion of blocking set.
Definition 3. (Blocking set) A path V_1, …, V_n in a DAG G is blocked by a set of nodes Z (containing neither V_1 nor V_n) if there exists a node V_k in the path such that one of the following conditions holds: (i) V_k is a non-collider, that is, V_{k−1} → V_k → V_{k+1} or V_{k−1} ← V_k ← V_{k+1} or V_{k−1} ← V_k → V_{k+1}, and V_k ∈ Z; (ii) V_k is a collider, that is, V_{k−1} → V_k ← V_{k+1}, and De(V_k) ∩ Z = ∅, that is, neither V_k nor any of its descendants is in Z.
Definition 4. (d-separation) In a DAG G, we say that two sets of nodes L and M are d-separated by a third set of nodes Z, where L, M, and Z are pairwise disjoint, if Z blocks all the paths between nodes in L and M. This is denoted as L ⊥_G M | Z.
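Definitions 3 and 4 can be checked mechanically on small graphs. The brute-force Python sketch below (illustrative only; it enumerates all paths, which is exponential in graph size, and restricts L and M to singletons) tests the blocking conditions of Definition 3 on every path.

```python
def descendants(E, u):
    # De(u): u plus all nodes reachable from u via directed edges.
    de, frontier = {u}, {u}
    while frontier:
        frontier = {w for v in frontier for (x, w) in E if x == v} - de
        de |= frontier
    return de

def all_paths(E, a, b, path=None):
    # All acyclic paths between a and b, ignoring edge direction.
    path = path or [a]
    if a == b:
        yield path
        return
    nbrs = {v for (u, v) in E if u == a} | {u for (u, v) in E if v == a}
    for n in nbrs - set(path):
        yield from all_paths(E, n, b, path + [n])

def blocked(E, path, Z):
    # Definition 3: some interior node blocks the path given Z.
    for k in range(1, len(path) - 1):
        into = (path[k - 1], path[k]) in E    # V_{k-1} -> V_k ?
        outof = (path[k + 1], path[k]) in E   # V_{k+1} -> V_k ?
        if into and outof:                    # collider V_{k-1} -> V_k <- V_{k+1}
            if not descendants(E, path[k]) & Z:
                return True
        elif path[k] in Z:                    # non-collider conditioned on
            return True
    return False

def d_separated(E, a, b, Z):
    # Definition 4, restricted to singleton sets {a} and {b}.
    return all(blocked(E, p, Z) for p in all_paths(E, a, b))
```

On the chain X → Y → Z of Figure 2, `d_separated` confirms that Y d-separates X and Z, while the empty set does not; on the collider X → Y ← Z the situation is reversed.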
For example, the variable Y d-separates X and Z in the DAG of Figure 2. The factorization formula (1) is also equivalent for DAGs to the Global Markov Condition (Pearl, 1989), stating that d-separated sets of variables are conditionally independent.

Definition 5. (Global Markov Condition) Given a DAG G = (X, E), the random variables X satisfy the Global Markov Condition if for every pairwise disjoint L, M, Z ⊆ X, if L ⊥_G M | Z then L ⊥⊥ M | Z.

The Faithfulness assumption reverses the direction of the above implication, so that conditionally independent variables are actually d-separated in the graph.
Definition 6. (Faithfulness) Given a DAG G = (X, E), the random variables X satisfy the Faithfulness assumption if for every pairwise disjoint L, M, Z ⊆ X, if L ⊥⊥ M | Z then L ⊥_G M | Z.
Finally, the assumption of Causal Sufficiency states that all the common causes of a pair of nodes are measured. Although most methods rely on this assumption, it cannot always be satisfied. Due to this, some methods model the existence of latent variables.
Definition 7. (Causal Sufficiency) For a pair of observed variables X and Y , all their common causes must also be observed in the data (and modeled in a graph G).
Edges in a Bayesian network model conditional probabilities (conditioning on observing), but they do not necessarily represent causal effects (intervention). Intuitively, a variable X has a causal effect on Y if manipulating X changes the distribution of Y. Causal Bayesian networks extend the factorization formula (1) to account for do-interventions:

P(X | do(W = w)) = ∏_{X_i ∉ W} P(X_i | Pa(X_i)) · 1_{W = w}.

The do-operator, denoted as do(W = w) and introduced by Pearl (1995), represents the symbolic operation of setting the variables W to the constant values w (atomic intervention). Intervention distributions P(X | do(W = w)) are not necessarily equivalent to conditional distributions P(X | W = w). Contrasted with this stochastic approach to causality (Pearl, 2009, sect. 1.4), there is the Laplacian approach based on functional equations. We recall next the Structural Causal Models.

FIGURE 2 Example of d-separation: Y d-separates X and Z

Definition 8. (Structural Causal Model) Given a DAG G = (X, E), a Structural Causal Model (SCM) defines the (endogenous) random variables X as functions

X_k ≔ f_k(Pa(X_k), U_k),  k = 1, …, p,

of their parents and of (exogenous) independent random variables U_1, …, U_p. An assignment of values to the exogenous variables uniquely determines the values of all endogenous variables. Thus, a probability distribution P_0 over U_1, …, U_p induces a unique probability distribution P over X (entailed distribution).
An SCM is a Causal Bayesian network if interventions are modeled as follows (Peters et al., 2017).
Definition 9. (Intervention distribution) The probability P(X | do(W = w)) over an SCM is the distribution entailed by the SCM obtained by replacing, for each X_k = w_k in W = w, the definition X_k ≔ f_k(Pa(X_k), U_k) with X_k ≔ w_k.
Let us reconsider the kidney stones example. Performing a do-operation on the treatment variable consists of removing all the incoming links of the node T. Graphically, this is depicted in Figure 3. The causal effect of drug use can then be measured by comparing the intervention distributions P(Y | do(T = 1)) and P(Y | do(T = 0)). Another well-known method for estimating causal effects is the potential outcome (PO) model, also known as the Rubin causal model (RCM), as defined in Holland (1986). Introduced in Rubin (1974), it is very popular in economics and social sciences, and logically equivalent to SCMs (Pearl, 2009). Unless stated otherwise, we restrict to a binary treatment. The PO model assumes random variables X, Y_0, Y_1, T, where X are observed covariates about atomic research objects, called units or individuals; Y_1 is the outcome after being treated; Y_0 is the outcome after not being treated; and T is the actual intervention, with T = 1 for treated and T = 0 for not treated.

Definition 10. (Potential outcome) The potential outcome for unit i given the treatment t ∈ {0, 1}, written Y_{t,i}, is the value of Y_t for the unit i after it is treated as t. The observed outcome is Y_i = T_i · Y_{1,i} + (1 − T_i) · Y_{0,i}.

The observed outcome is the one for the actual treatment of the unit. The counterfactual outcome is the one for the opposite treatment. In the kidney stones example, a patient (a unit) can either take the drug or not. Therefore, only one of the two potential outcomes is observed. The main problem in causal inference is to estimate the causal effect of the treatment over the outcome.
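To make the distinction between conditioning and intervening concrete, the following sketch (with made-up probabilities, exact enumeration over the confounder U) computes both P(Y = 1 | T = 1) and P(Y = 1 | do(T = 1)) for the kidney-stones model: because U affects both treatment and outcome, the two quantities differ.

```python
# Hypothetical model for U -> T, U -> Y, T -> Y (all numbers made up).
P_U1 = 0.5                                   # P(U = 1)
P_T1_given_U = {0: 0.2, 1: 0.8}              # confounding: U affects treatment
P_Y1_given_TU = {(0, 0): 0.3, (1, 0): 0.6,   # P(Y = 1 | T, U)
                 (0, 1): 0.7, (1, 1): 0.9}

def p_u(u):
    return P_U1 if u == 1 else 1 - P_U1

# Conditional: P(Y=1 | T=1) = sum_u P(Y=1 | T=1, u) * P(u | T=1),
# where P(u | T=1) is obtained by Bayes' rule.
p_t1 = sum(P_T1_given_U[u] * p_u(u) for u in (0, 1))
cond = sum(P_Y1_given_TU[(1, u)] * P_T1_given_U[u] * p_u(u)
           for u in (0, 1)) / p_t1           # -> 0.84

# Interventional: P(Y=1 | do(T=1)) = sum_u P(Y=1 | T=1, u) * P(u),
# since do(T=1) cuts the edge U -> T and leaves P(U) untouched.
do = sum(P_Y1_given_TU[(1, u)] * p_u(u) for u in (0, 1))   # -> 0.75
```

Conditioning inflates the apparent benefit of the treatment (0.84 vs. 0.75) because, in this made-up model, the patients most likely to recover are also the ones most likely to be treated.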

| CAUSAL DISCOVERY
In this section, we consider the problem of learning causal relationships among variables from observational data. We survey methods and tools to solve causal discovery, metrics to evaluate methods, and public datasets to experiment with.
FIGURE 3 Example of a do-operation performed on the variable T in the kidney stones example

Depending on how a causal algorithm is constructed, it can be classified as either constraint-based or score-based. This type of classification is usually applied to Bayesian-like methods, but it can be extrapolated to other methods, provided they have a similar structure.
Constraint-based algorithms employ independence tests to identify a set of edge constraints for the graph using observational data, for example, using the G² test (Spirtes et al., 2000). Further rules then determine the direction of the found relationships. In some cases, the rule phase is skipped, yielding undirected graphs. These graphs are usually local, meaning they only convey the (undirected) relationships of a particular node.
Score-based algorithms assign a relevance score to candidate graphs through some goodness-of-fit measure, such as the Bayesian Information Criterion (BIC). However, these algorithms are computationally expensive, since in principle they have to enumerate (and score) every possible graph among the given variables. In practice, greedy heuristics are applied to restrict the number of candidates.
The section is structured by considering observational data, namely cross-sectional, time-series, and longitudinal data. The causal discovery methods, in fact, significantly depend on the type of data under analysis.

| Cross-sectional data
The search for causal relationships in cross-sectional data is one of the most investigated causal tasks.

Definition 11. (Cross-sectional data) Observations of subjects at one point or period of time, or for which the analysis has no regard to differences in time among the observations.

This type of data is characterized by the fact that it is collected through the observation of several subjects at the same time, with time itself not being considered a study variable. These types of data are usually analyzed by comparing differences between subjects. Variables can have continuous, discrete, binary, or text data types. An excerpt is shown in Table 2.
The usage of this type of data for causal discovery has a significant downside. Since it represents a single point in time, it is not possible to exploit causal precedence (A can cause B only if A happens before B). This implies that, to find the direction of the relationships, it is necessary to apply an extra step. Various methods exist, covering all types of variables (binary, discrete, continuous, and mixed).
Perhaps the best-known constraint-based causal discovery algorithm is PC (named after its authors, Peter and Clark; Spirtes et al., 2000). It relies upon the faithfulness assumption to create the models, meaning that all independencies must obey the d-separation criterion (Section 2). Like most constraint-based methods, this methodology consists of two phases: searching for (in)dependencies (also called the skeleton phase) and orienting dependencies.
In the first phase, the algorithm starts with a fully connected undirected graph. For each pair of adjacent variables A and B, it tests the conditional independence A ⊥⊥ B | C for sets C of variables adjacent to A (or, equivalently, adjacent to B). Tests start with C = ∅ (unconditional independence) and iterate over sets of increasing size. If conditional independence holds, the undirected edge between A and B is removed. The orientation phase applies a number of rules to direct edges (Spirtes et al., 2000): 1. Consider variables A, B, C such that A − B − C, namely A and B, and B and C, are adjacent, but A and C are not adjacent, that is, it was found in the skeleton phase that A ⊥⊥ C | D for some D. If B ∉ D, we orient the edges as A → B ← C. The triple A, B, C is called a v-structure; 2. If there is a directed edge A → B, and B and C are adjacent (B − C), but A and C are not adjacent, then B − C is oriented as B → C; and 3. If there is a directed path between A and B and an undirected edge between A and B, orient A − B as A → B.
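The skeleton phase can be sketched in a few lines of Python. The sketch below is illustrative only: it replaces the statistical test (e.g., G²) with a conditional-independence oracle, and omits the orientation rules.

```python
from itertools import combinations

def pc_skeleton(V, indep):
    """PC skeleton phase. `indep(a, b, C)` is a conditional-independence
    test (here an oracle) returning True if a ⊥⊥ b | C."""
    adj = {v: set(V) - {v} for v in V}       # start fully connected
    sepset = {}                              # separating sets found
    for size in range(len(V) - 1):           # growing conditioning-set size
        for a in V:
            for b in list(adj[a]):
                for C in combinations(adj[a] - {b}, size):
                    if indep(a, b, set(C)):
                        adj[a].discard(b)    # remove the edge a - b
                        adj[b].discard(a)
                        sepset[frozenset((a, b))] = set(C)
                        break
    return adj, sepset

# Independence oracle for the chain A -> B -> C:
# the only (conditional) independence is A ⊥⊥ C | {B}.
def oracle(a, b, C):
    return {a, b} == {"A", "C"} and "B" in C

adj, sepset = pc_skeleton(["A", "B", "C"], oracle)
```

Running this recovers the chain's skeleton A − B − C, with {B} recorded as the separating set of A and C; rule 1 above would then leave the triple unoriented, since B belongs to the separating set.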
PC-stable (Colombo & Maathuis, 2014) tackles a known problem inherent to PC, known as order dependence: the output of PC depends on the order in which the variables are analyzed in the skeleton phase. This means that, for orderings order_1(V) = (A, B, C, D, E) and order_2(V) = (A, D, B, E, C), the resulting skeletons may not be the same. PC-stable tackles this by recording the edges to be discarded in a separate list instead of removing them right away at each iteration (with a given size of C); the recorded edges are only removed permanently at the next iteration. This way, edge removal is no longer affected by the order of the independence tests within an iteration.
Another variant is conservative PC. After creating the skeleton, this algorithm tests every potential v-structure X − Y − Z by checking X ⊥⊥ Z | N for subsets N of the neighbors of X and of Z. If Y belongs to some but not all of the separating sets found, X − Y − Z is marked as ambiguous and it is not directed. On the other hand, if Y is not in any separating set, the method continues as PC and orients the v-structure.
Although PC (and its variants) is a powerful tool to uncover causal relationships, it does not scale to high-dimensional data. Simplified variants address this: for example, in the PC-select (sometimes called PC-simple) method (Bühlmann et al., 2010), the second phase is removed, and the conditional independence tests are only applied to a target variable. Because the method does not include an orientation phase, the output is an undirected graph.
Another strategy to tackle high-dimensional data is to search for causal relations only locally to a target variable. The max-min parents and children algorithm (MMPC; Tsamardinos et al., 2006) adopts this approach, using a max-min heuristic over conditional independence tests.
Although PC is considered a benchmark algorithm for this type of data, it assumes causal sufficiency (Definition 7), meaning that it does not allow for open systems (systems with latent variables). For cases where the causal sufficiency assumption cannot be fulfilled, FCI can be used (Spirtes et al., 1995). This method applies the same two phases of PC: the skeleton and orientation phases. In the skeleton phase, FCI applies a conditional independence test to find all the potential causal relationships. It is in the second phase that FCI differs the most from PC: instead of assuming that a relationship must have a direction (Glymour et al., 2019), the method tests possible d-separations X ⊥⊥ Y | Z in the skeleton. If there is at least one variable in Z that d-separates the pair, then the edge is removed. After this, FCI applies several rules to direct the edges (Spirtes, 2001). FCI also differs from PC in the way it represents relationships. Instead of two types of edges (→ and −), current FCI implementations have four:
• X → Y, representing that X causes Y;
• X ↔ Y, representing that there are unmeasured confounders of both variables;
• X o→ Y, representing that either X causes Y or there are unmeasured confounders of both variables;
• X o−o Y, which can represent: (1) X causes Y, (2) Y causes X, (3) there are unmeasured confounders of both variables, (4) X causes Y and there are unmeasured confounders of both variables, or (5) Y causes X and there are unmeasured confounders of both variables.
The Anytime FCI is a slight modification of FCI that restricts the maximum number of variables in the separation set used to perform the conditional independence tests to a user-defined threshold.
The Adaptive Anytime FCI is similar to Anytime FCI in that it restrains the number of variables in the separation set. The critical difference is that, instead of the user defining this maximum, it is calculated by the algorithm as K = max_i (|adj(C_1, X_i)| − 1), where C_1 represents the initial skeleton, X_i a vertex of C_1, and adj(C_1, X_i) the list of adjacencies of X_i in C_1.
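The bound K can be computed directly from the skeleton's adjacency lists; a toy Python sketch (with a made-up skeleton):

```python
# C1: the initial skeleton, given as adjacency sets (hypothetical example).
C1 = {"A": {"B", "C"}, "B": {"A"}, "C": {"A", "D"}, "D": {"C"}}

# K = max_i (|adj(C1, X_i)| - 1)
K = max(len(adjacencies) - 1 for adjacencies in C1.values())
```

Here the largest neighborhood has two nodes, so K = 1: conditional independence tests are limited to separation sets of at most one variable.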
FCI and its variants can benefit from data preparation according to the Joint Causal Inference (JCI; Mooij et al., 2020) approach. This method extracts the context from several datasets, thus creating a pooled dataset where a traditional causal discovery method can be applied. This allows the generated model to encapsulate both information about the variables and the system from where these variables were measured. It is essential to understand that JCI is not a causal discovery method but a tool to prepare the data for it. The authors advocate its use with any causal discovery method but suggest its use with FCI specifically (hence FCI-JCI).
The Really Fast Causal Inference (RFCI; Colombo et al., 2012) is another FCI-like method that performs additional conditional independence tests before the v-structure phase: in this extra phase, the algorithm checks every unshielded triple X − Y − Z and examines X ⊥⊥ Y | Z and Y ⊥⊥ Z | X. If these hold and Y is not in the separating set of X and Z, then the triple is directed as X → Y ← Z.
The RFCI-BSC (Jabbari et al., 2017) is a modification of RFCI in which the Bayesian Scores Constraints (BSC) method is used as a conditional independence test.
The Greedy Equivalence Search (GES; Chickering, 2002) is a score-based Bayesian method. It scales to high-dimensional data since it does not consider all existing patterns. The algorithm first adds new edges between two nodes X and Y if these nodes are non-adjacent and there is no neighbor of Y that is not adjacent to X; in doing so, it also directs every edge of a neighbor T of Y that is not adjacent to X as T → Y. Second, the method removes the best-scoring link at each iteration using the following criterion: it deletes an edge X − Y or X → Y if there is a subset Z of neighbors of Y that is adjacent to X; it then transforms all edges Z − Y into Z → Y and all edges X − Z into X → Z.
The Greedy Interventional Equivalence Search (GIES; Hauser & Bühlmann, 2012) is an extension of GES. Besides adding and removing edges, this method has a third phase, in which the algorithm repeatedly turns edges of the graph, modifying the original graph without altering its skeleton: the new graph has the same number of edges and can be transformed into the original one by changing only one arrow.
The Fast Greedy Equivalence Search (FGS or FGES; Ramsey et al., 2017) is another modification of GES that uses parallelization to optimize the runtime of the algorithm.
GFCI (Ogarrio et al., 2016) is a combination of FGES and FCI. In this method, both the skeleton and orientation phases of the pair are used: first, the skeleton phase of FGES is applied to the data, and then FCI is used to refine the skeleton. The same happens in the orientation phase: initially, the algorithm orients the edges using FGES; this information is then given to FCI, which uses it to further correct the direction of the edges.

| Software tools
The three most known tools/libraries for causal discovery in cross-sectional data are pcalg, bnlearn, and Tetrad.
Beginning with pcalg, this package provides implementations of several causal methods, such as PC (original, conservative, and stable versions), GES, GIES, GDS, AGES, and FCI (original, Anytime FCI, Adaptive Anytime FCI, FCI-JCI, FCI+, and RFCI). Depending on the type of data used, this package offers default conditional independence tests for binary (G² test), discrete (G² test), and continuous (Fisher's z-transformation) data. Moreover, it is possible to adapt other conditional independence tests to this framework. For score-based methods (such as GES), pcalg includes the ℓ0-penalized Gaussian maximum likelihood estimator for both discrete and continuous data.
bnlearn is a widely known and used R package (Scutari, 2010). It provides implementations of PC-stable and MMPC, and it is possible to accommodate discrete, continuous, and mixed data by changing the conditional independence test. bnlearn implements several conditional independence tests. For discrete data, the following tests are available: mutual information (an information-theoretic distance measure), a shrinkage estimator for the mutual information (Hausser & Strimmer, 2009), and Pearson's χ² (the classical version for contingency tables). For continuous data, Pearson's linear correlation, Fisher's Z (a transformation of the linear correlation with asymptotic normal distribution), mutual information, and a shrinkage estimator for the mutual information (Ledoit & Wolf, 2003) are available. Finally, for mixed data, mutual information is available.
Finally, Tetrad (Ramsey et al., 2018) is one of the most complete graphical tools for cross-sectional causal discovery. This tool implements the following methods: FCI, RFCI-BSC, FGES, GFCI, PC, and RFCI. Tetrad's methods can be applied to continuous, discrete, and mixed data by choosing the corresponding independence test or score method. For constraint-based algorithms, Tetrad implements the following conditional independence tests. For discrete data, the conditional Gaussian test, χ² test, degenerate Gaussian likelihood ratio test, G² test, and probabilistic test are available. For continuous data, Tetrad offers conditional correlation independence, the conditional Gaussian test, the degenerate Gaussian likelihood ratio test, the Fisher Z test, and kernel conditional independence. Finally, the conditional Gaussian test and the degenerate Gaussian likelihood ratio test are available for mixed data. For score-based causal algorithms, Tetrad also offers several scoring methods. For discrete data, Tetrad offers the BDeu score, BIC score, conditional Gaussian BIC score, and degenerate Gaussian BIC score. For continuous data, Tetrad has the CCI score, extended BIC (EBIC) score, conditional Gaussian BIC score, and degenerate Gaussian BIC score. Finally, the conditional Gaussian BIC score and degenerate Gaussian BIC score are available for mixed data.
A summary overview of these frameworks can be found in Table 3.

| Datasets
We present here a few key datasets used in causal discovery. These datasets originate in the context of classification or regression problems. The LUCAS (LUng CAncer Simple set) benchmark is a synthetic binary dataset, proposed as a toy example for the challenges created by the Causality Workbench project (Guyon et al., 2011); it consists of 12 binary variables and 2000 instances, and represents 12 different causal relationships. The dataset represents several potential factors in the development of lung cancer, together with other unrelated factors. Because of its design, and consequently its causal properties, this dataset can be used to evaluate methods both in terms of forecasting and in terms of generated patterns. It was adopted in Ramanan and Natarajan (2020) to study how context-specific independencies can be used to learn causal models. A sample from this dataset is shown in Figure 4.
The SIDO (SImple Drug Operation mechanisms) is a real-world binary dataset representing a set of molecule descriptors tested against the AIDS HIV virus; it consists of 4932 variables and 12,678 instances. The authors did not make available any representation of the causal relationships present in the data. This dataset is part of a set of challenges proposed by the Causality Workbench project. The purpose of SIDO is to discover the causes of molecular activity in the descriptors. It was used by Yu et al. (2021) to study causal feature selection methodologies.
The Causal Protein-Signaling Networks in Human T Cells dataset, sometimes called the Sachs dataset (Sachs, 2005), is a widely used dataset that represents proteins and phospholipids present in human immune system cells; it consists of 11 discrete variables (with values {1, 2, 3}) and 5400 instances, and represents 17 different causal relationships. This dataset aims to discover potential connections between the molecules without the need for physical intervention on them. For example, the comparative analysis of causal discovery algorithms in Singh et al. (2018) relies on this dataset.
The breast cancer dataset is a well-known discrete dataset that entails information about cancer patients. This dataset is composed of 286 different instances and 9 variables, and its purpose is to diagnose the recurrence of breast cancer (Figure 5). Regarding causal discovery, this data was used by Dhir and Lee (2020) to prove the efficiency of their proposed algorithm.
Finally, the asia dataset (Lauritzen & Spiegelhalter, 1988) is a widely used synthetic dataset that represents the relation between tuberculosis, lung cancer and bronchitis, and visits to Asia (this dataset will be analyzed in more detail in Section 3.1.4). It was also used in the work of Ramanan and Natarajan (2020) to study how context-specific independencies can be used to learn causal models. More examples can be found in the R packages bnlearn and pcalg.

| Evaluation metrics
Several metrics are used to evaluate causal discovery methodologies. These metrics are usually called pattern metrics, as they search for common patterns between the ground-truth model that explains the data (or from which the data was generated) and the model generated by the method. Since the ground-truth model is generally represented in network form (DAGs, for example), these metrics are also related to network metrics. Despite this restriction, some models generated by non-Bayesian methods can be transformed into networks, as long as the generated model is a rule-like model (such as an association rule model) and all the generated relationships are simple (e.g., rules such as {A, B} → {C} are not allowed). Table 4 reports a collection of pattern metrics (Raghu et al., 2018). One such metric checks, for each pair X and Y, whether the parents of X in the generated model are a valid adjustment set (Pearl, 2009) in the true model: if they are, it is counted as a correct result; otherwise, as a mistake.
FIGURE 7 Comparison between the generated models and the true network for dataset asia. (a) Network generated by pcalg's, bnlearn's, and Tetrad's PC, showing the missing edges, the incorrectly directed or extra edges, and the correctly directed edges. All three algorithms generated equivalent graphs and pattern metrics.

Whenever a ground-truth model is not available, causal discovery methods can be evaluated in terms of their performance in classification or regression tasks. In these cases, the traditional classification performance metrics are adopted (Hossin & Sulaiman, 2015).

| Running example
We elaborate an illustrative example to clarify the functionalities of the libraries and the evaluation metrics. We restrict to the PC algorithm as implemented by pcalg, bnlearn, and Tetrad. The dataset under analysis is asia, a discrete synthetic dataset consisting of eight binary variables, each with values yes and no (Table 5). This dataset was generated from the graph of Figure 6, which represents all the relationships present in the data.
Beginning with pcalg, since asia is composed of binary variables, we will use the stable version of PC (PC-stable) with the binary G² test. The implementation of PC in bnlearn can reason on discrete, continuous, and mixed data by changing the conditional independence test. For this example, we propose using mutual information as the independence test (others can be applied). Finally, the PC implementation available in Tetrad is similar to the previous two. For this example, we propose the usage of the χ² test as the independence test.
The graphs generated by these three frameworks are equivalent and can be seen in Figure 7a.
To understand the evaluation metrics, we compare the generated graph with the true one. The structural Hamming distance (SHD) is obtained by summing the missing edges, the extra edges, and the incorrectly directed edges. The structural intervention distance (SID) reflects how a mistake in the generated graph can influence the estimated effects. A summary of these metrics for the example at hand is shown in Table 7b. In addition, the derived performance metrics associated with the patterns in the graphs (adjacency and arrowhead precision and recall) are also shown. These metrics apply the more traditional prediction metrics, like precision and recall, to the evaluation of patterns in graphs.
The adjacency and arrowhead precision and recall are calculated as follows. The adjacency precision is obtained by dividing the number of correctly predicted adjacencies by the number of predicted edges (correctly predicted edges + extra edges): 5/5 = 1 = 100%. The adjacency recall is obtained by dividing the number of correctly predicted adjacencies by the number of true edges (the number of edges in the original graph): 5/8 = 0.625 = 62.50%. The arrowhead (or directed edge) precision is calculated by dividing the number of correctly predicted directed edges by the number of predicted directed edges (correctly predicted directed edges + extra directed edges): 2/5 = 0.4 = 40.00%. Finally, the arrowhead (or directed edge) recall is calculated by dividing the number of correctly predicted directed edges by the number of true directed edges (the number of directed edges in the original graph): 2/8 = 0.25 = 25.00%. It is worth pointing out that the implementations of pcalg, bnlearn, and Tetrad are equivalent. Choosing one framework over another depends on the user's requirements. For instance, if a user is not used to programming, Tetrad is a natural choice, given its user-friendly interface. If the user needs a wide range of causal methods and the possibility to use any conditional independence test (even tests not available in the package), pcalg is the best choice. However, if the user only wants to apply the library methods in an easy and fast way, bnlearn is the best choice.
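These pattern metrics can be computed directly from edge sets. The Python sketch below (toy edges for illustration, not the asia graphs) implements adjacency and arrowhead precision/recall and a common variant of the SHD for directed graphs.

```python
def adjacency_precision_recall(true_edges, pred_edges):
    """Compare skeletons: edges as unordered pairs, ignoring direction."""
    t = {frozenset(e) for e in true_edges}
    p = {frozenset(e) for e in pred_edges}
    tp = len(t & p)                          # correctly predicted adjacencies
    return tp / len(p), tp / len(t)          # (precision, recall)

def arrowhead_precision_recall(true_edges, pred_directed):
    """Compare directed edges as ordered pairs."""
    tp = len(set(true_edges) & set(pred_directed))
    return tp / len(pred_directed), tp / len(true_edges)

def shd(true_edges, pred_edges):
    """SHD as defined above: missing + extra + incorrectly directed edges."""
    t, p = set(true_edges), set(pred_edges)
    ts = {frozenset(e) for e in t}
    ps = {frozenset(e) for e in p}
    missing = len(ts - ps)                   # adjacency absent in prediction
    extra = len(ps - ts)                     # adjacency absent in the truth
    wrong = sum(1 for e in p                 # present but with wrong arrow
                if e not in t and frozenset(e) in ts)
    return missing + extra + wrong

# Toy comparison: true chain A -> B -> C -> D versus a flawed prediction.
true = [("A", "B"), ("B", "C"), ("C", "D")]
pred = [("A", "B"), ("C", "B"), ("A", "C")]
```

For these toy graphs, two of the three predicted adjacencies are correct (precision and recall of 2/3), only one arrow is oriented correctly (1/3 for both arrowhead metrics), and the SHD is 3 (one missing edge, one extra edge, one reversed edge).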

| Time-series data
Time-series data include a sequence of observations about a single subject over multiple times.
Definition 12. (Time-series data) Observations about a single subject at multiple points or periods of time, indexed in time order. We write X_t for the observation of random variable X at time t.
This type of data is characterized by the fact that observations are collected in adjacent time periods, so there may be correlation between distinct observations. Data collected on a continuous basis usually do not satisfy the assumptions of conventional statistical methods, thus requiring different methods and tools. These data can be univariate (only one variable is measured) or multivariate (multiple variables are measured), and the variables can be continuous, discrete, binary, text, among other types, as seen in Table 6.
The search for causal relationships among variables in time-series data has seen an exponential increase in interest in recent years, with sequential data collection becoming a common practice. Causal discovery from this type of data can overcome some of the problems found in cross-sectional data. Furthermore, since there is a time component, we can assume causal precedence: events in the present cannot cause events in the past. Thus, when faced with an identified (undirected) dependence, it is safe to assume the relationship's direction as past → future.
Several methods are specifically designed to solve the task of finding causal relationships in sequential observational data. One of the best-known frameworks is Granger causality, proposed by Granger (1969). Intuitively, X Granger-causes Y if predicting Y based on its past observations and the past observations of X performs better than predicting Y based on its past only. Mathematically, this relationship can be formalized by testing, in the auto-regression Y_t = Σ_{j=1}^m a_j Y_{t−j} + Σ_{j=1}^m b_j X_{t−j} + ε_t, the null hypothesis that b_1 = … = b_m = 0. In this equation, m represents the model order (the maximum number of lags to be used), and the a_j's and b_j's are the contributions of the delayed observations of Y and X, respectively.
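As a minimal numerical sketch (not the lmtest implementation), the bivariate test can be run by fitting the restricted auto-regression (lags of Y only) and the full one (lags of Y and X) by ordinary least squares and comparing their residual sums of squares through an F-statistic; function and variable names below are our own:

```python
import numpy as np

def granger_f_stat(y, x, m):
    """F-statistic for H0: b_1 = ... = b_m = 0 in the auto-regression
    of y on its own lags (restricted model) vs. its own lags plus the
    lags of x (full model). Illustrative sketch using numpy OLS."""
    y, x = np.asarray(y, float), np.asarray(x, float)
    n = len(y) - m                                   # usable observations
    Y = y[m:]
    lags_y = np.column_stack([y[m - j:-j] for j in range(1, m + 1)])
    lags_x = np.column_stack([x[m - j:-j] for j in range(1, m + 1)])
    ones = np.ones((n, 1))
    X_r = np.hstack([ones, lags_y])                  # restricted design
    X_f = np.hstack([ones, lags_y, lags_x])          # full design
    rss = lambda X: np.sum((Y - X @ np.linalg.lstsq(X, Y, rcond=None)[0]) ** 2)
    rss_r, rss_f = rss(X_r), rss(X_f)
    df2 = n - X_f.shape[1]                           # residual df of full model
    return ((rss_r - rss_f) / m) / (rss_f / df2)
```

Under the null, the statistic is approximately F-distributed with (m, n − 2m − 1) degrees of freedom, so large values indicate that the lags of X add predictive power for Y.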
More recent approaches include TsFCI (Entner & Hoyer, 2010), which is an adaptation of FCI for time-series data. This method uses sliding windows to transform the original time series into different subsets of consecutive timestamps, disregarding the time component in each subset and treating them as cross-sectional. The method creates a model for each subset of data, using the models from previous timestamps as prior knowledge. Besides this, if a relationship disappears from the model m_t, this relation will be disregarded in later timestamps.
The PCMCI (Runge et al., 2019) is a causal graphical method designed to deal with linear and nonlinear time series. This algorithm is divided into two phases, each one corresponding to a different conditional independence test: the PC1 and MCI phases. In the PC1 phase, the algorithm applies the conditional independence strategy implemented by PC (skeleton phase) to uncover potential dependencies between each variable, in a specific timestamp, and all the other variables, in all the previous timestamps, for example, X_t ⊥⊥ Y_{t−1} | Z, X_t ⊥⊥ Y_{t−2} | Z, and so on, where t is the specific timestamp. Next, the method applies the MCI (momentary conditional independence) test (Runge et al., 2019) to further determine causal relationships between variables in different timestamps while taking into account autocorrelation and incorrect edge detections. PCMCI+ (Runge, 2020) is an extension of PCMCI, which admits the existence of contemporaneous links (a causal relationship between variables in the same timestamp). Because of this, PCMCI+ divides the skeleton search by type of relationship, namely lagged and contemporaneous relationships are found separately. LPCMCI (Gerhardus & Runge, 2020) is yet another PCMCI extension specifically designed to deal with latent variables. This method uses an FCI-like approach to represent the latent variables that are present in the relationships.
| Longitudinal data

This type of data is characterized by the collection of information about the same individual at different points in time. This means that, for each subject in a dataset, there is a set of time-series variables that characterize them. The variables in longitudinal data can be continuous, discrete, binary, text, among other types, as seen in Table 7.

| Software tools
There are several libraries offered in different programming languages to solve the task of finding causal relationships in time-series data. lmtest (Zeileis & Hothorn, 2002) is an R package known mainly for its implementation of the Granger causality test, as well as for the standard dataset ChickEgg that is used as an example later in Section 3.2.2.
NlinTS (Hmamouche, 2020) is another R package. Similar to lmtest, this package implements a version of the Granger causality. Besides this, NlinTS implements a nonlinear version of this test.
Tetrad, the tool presented in Section 3.1.1, also has implementations of several methods that deal with time-series data, including TsFCI, FASK, and TsGFCI.
Unlike the previous data types, to the best of our knowledge, there is no tool available at the moment that deals specifically with longitudinal data. However, a few theoretical frameworks have been proposed for this type of data. One such framework is Causal Inference over Mixtures (CIM; Strobl, 2019). This method infers the causal structure by creating a mixture of DAGs, using the Global Markov Condition (Definition 5). Explicitly designed for longitudinal medical data, it allows for cycles. Besides this, it applies the skeleton phase of Colombo and Maathuis (2014). The orientation phase proposed by the authors is similar to FCI. An overview of these libraries can be found in Table 8.

| Datasets
Compared to the case of cross-sectional data, the range of benchmark datasets for time-series causal discovery algorithms is smaller. The FLAIRS 2015 (Huang & Kleinberg, 2015) dataset is synthetically generated. It comprises 22 different subsets, each with a different causal structure, with lags between 1 and 3, 20 continuous variables, and 1000 time points each. The causal structures simulated in this dataset are common cause, common effect, and random relationships. A sample from this dataset is shown in Figure 8. The FinanceCPT (Kleinberg, 2012) is a simulated dataset composed of 25 portfolios (variables) with 10 causal structures each and 4000 daily time periods. The structures simulated in this dataset are: no dependency between portfolios, 20 random relationships with lag one, 40 random relationships with lag one, 20 random relationships with random lags 1-3, 40 random relationships with random lags 1-3, and many-to-one relationships at lag one.
The PROMO dataset is a simulated time-series dataset composed of daily measurements of 1000 promotion variables and 100 product sales over 3 consecutive years. Its purpose is to identify which promotions affect sales. This dataset is part of a challenge proposed by the Causality Workbench project (Guyon et al., 2011).
The ChickEgg (Thurman & Fisher, 1988) is a time-series dataset with information collected annually about the number of chickens and eggs between 1930 and 1983. It consists of two variables (number of chickens and eggs per year) and three lags (this dataset will be analyzed in more detail in Section 3.2.4).
Regarding longitudinal datasets available for causal discovery, there are very few examples. Moreover, it is challenging to find ground-truth data to evaluate approaches because this area is relatively unexplored. One example is the National Footprint Accounts 2018 dataset, which collects data on the Ecological Footprint and biocapacity of countries across the world over more than 50 years. The objective of this dataset is to understand the causes of the produced footprint values. A sample from this dataset is shown in Table 7.

| Evaluation metrics
The pattern metrics presented in Section 3.1.3 can be applied to time-series methods as well if there is a ground-truth model that represents the causal relationships present in the data. Table 9 shows a set of performance metrics specific to time-series data (Moraffah et al., 2021), to be used when this information is not available.  The accuracy is a metric used to evaluate classification models and can be defined as the fraction of correct predictions made by the model.
The mean and median errors are metrics that encapsulate the fraction of times the model got some response wrong. This error can be calculated in several ways, the simplest being 1 − accuracy.
The Euclidean distance (Iglesias & Kastner, 2013) is another symmetric metric, which calculates the distance between two time series (the predicted series and the ground truth). This metric is usually used for regression problems.
The longest common subsequence (Bagnall et al., 2016) is an asymmetric metric that measures the number of correct predictions in sequence and reports the highest such number. This metric is usually used in regression problems, since it uses the Euclidean distance to calculate the difference between the predictions and the ground truth. This is performed by reducing the difference to 0 or 1 depending on the distance: if the Euclidean distance between two values is smaller than a defined threshold, they are considered equal, and hence the distance is 0; otherwise, the distance is 1.
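The thresholded matching just described can be sketched with the classic longest-common-subsequence dynamic program; the function below is an illustrative implementation (the threshold name eps is ours, not a particular library's API):

```python
def lcss(pred, truth, eps):
    """Longest common subsequence between two numeric series.

    Two values match when their absolute (Euclidean) distance is below
    the threshold eps, i.e., the 0/1 reduction described in the text.
    Classic O(n*m) dynamic program; illustrative sketch."""
    n, m = len(pred), len(truth)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if abs(pred[i - 1] - truth[j - 1]) < eps:   # "distance 0": match
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:                                       # "distance 1": skip
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[n][m]
```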
The Edit Distance with real penalty (Chen & Ng, 2004) is another distance metric that reports the number of edits that are needed to transform the series of predictions into the ground truth.
Finally, the Dynamic Time Warping (Berndt & Clifford, 1994) is a distance metric that calculates the difference between two time series, taking into account potential differences in the measurement timestamps (e.g., different frequencies). This is done by comparing each timestamp t from one of the time series with t + 1, t + 2, and so on, from the second time series.
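A textbook dynamic-programming sketch of DTW, using the absolute difference as the local cost (a simplifying choice on our part):

```python
def dtw(a, b):
    """Dynamic Time Warping distance between two numeric series.

    Allows timestamps to be stretched or compressed: each point of one
    series may align with several points of the other. Textbook dynamic
    program in the style of Berndt and Clifford (1994)."""
    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])             # local alignment cost
            D[i][j] = cost + min(D[i - 1][j],           # insertion
                                 D[i][j - 1],           # deletion
                                 D[i - 1][j - 1])       # match
    return D[n][m]
```

Identical series get distance 0, and so do series that differ only by repeating a value, since the warping path may match one timestamp against several.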
With regard to metrics for evaluating causal methods for longitudinal data, there are two options. First, the evaluation metrics presented in Section 3.1.3 can be applied if there is a ground-truth structure to compare with. Second, since time-series data is a particular type of longitudinal data, the evaluation metrics presented in this section can also be applied.

| Running example
We develop an illustrative example to demonstrate the libraries and evaluation metrics for time-series data. In this example, we consider the following methods: lmtest's Granger causality, Tetrad's tsFCI, and Tigramite's PCMCI. The dataset under analysis is ChickEgg (Table 10). There is no ground-truth graph, but it is well known that, from this dataset, it can be argued that eggs came before chickens (Egg → Chicken). To apply the evaluation metrics presented in Section 3.2.3, the data is split into a training set (70%) and a testing set (30%).
Beginning with lmtest, the Granger causality method needs no particular parameters besides the number of lags. To understand whether Eggs cause Chicken or the other way around, it is necessary to apply the test in both directions (Eggs cause Chicken, and Chicken cause Eggs). The output of the tests, reported in Code 1, shows that Eggs cause Chicken has a significant p-value (0.002996), while the other test does not. The Granger causality test is not a typical causal discovery method that creates a model entailing all the relationships in the data. Instead, to uncover relationships, it is necessary to test every combination of variables. Besides, this method reports only how the variables are related as a whole, and not by lags (unlike the methods presented further in this section). Hence, it is impossible to compare it further with the remaining ones.
Another potential method to uncover causal relationships in time-series data is Tetrad's tsFCI. This method can accommodate discrete, continuous, and mixed data. For the ChickEgg example, we use Fisher's Z test as the independence test. The generated graph can be seen in Figure 9a.
Finally, PCMCI from Tigramite can also be used. This algorithm also accommodates discrete and continuous data. For the ChickEgg example, we use the ParCorr test as the independence test. The generated graph is shown in Figure 9b. Table 11 presents the evaluation metrics from Section 3.2.3. These metrics are calculated over the test set by comparing the original and predicted values of the target variable. The metrics Edit Distance with Real Penalty, Euclidean Distance, and Dynamic Time Warping are all distances, meaning that the lower the value, the better the performance. The same applies to the Mean Squared Error. The Longest Common SubSequence represents the size of the largest sequence of values present in both the target's true values and predicted values. For this reason, the higher its value, the better the performance the model achieves.
From Figure 9b, we can notice that PCMCI successfully finds relationships between the egg variable at time t and the chicken variable at time t + 1. This is also reflected in the results of Table 11, where the performance of PCMCI is better than that of tsFCI.
It is worth noting that these metrics are not specific to causal discovery. Instead, they evaluate any classification or regression model. Since, first and foremost, a causal discovery model is a classification or regression model (depending on the underlying model used), it can be evaluated with such metrics.

| CAUSAL INFERENCE
Causal inference aims at estimating the causal effect of a specific variable (treatment) over a certain outcome of interest. Depending on the context of analysis, the causal effects can be quantified by different metrics, each focusing on different granularity levels. While we adhere here to the PO model, the same metrics can be re-stated in the context of the SCM framework (Aliprantis, 2015). In this section, we first present a number of causal effects of interest. Then, we discuss methods to estimate such causal effects starting from observational data, distinguishing whether or not they make the unconfoundedness assumption. Next, we cover software tools, datasets, and present a running example.

| Causal effects
At the highest granularity level, for each unit in the PO model, the Individual Treatment Effect models the effect of treatment on the unit.

Definition 14. (ITE) The Individual Treatment Effect of unit i is: ITE_i = Y_{1,i} − Y_{0,i}.
Despite recent efforts (Shalit et al., 2017; Yao et al., 2018), the estimation of the ITE is challenging, as only the observed potential outcome is available in the data; see Caron et al. (2020) for a review on ITE estimation. At the population level, the Average Treatment Effect is the expectation of the ITEs.

Definition 15. (ATE) The Average Treatment Effect is: ATE = E[Y_1 − Y_0].
The same definition can also be stated under the SCM framework. Assuming a binary treatment variable T, the Average Treatment Effect is equivalent to E[Y | do(T = 1)] − E[Y | do(T = 0)]. In some contexts, the effect of interest is restricted to the treated units, called the Average Treatment Effect on the Treated.

Definition 16. (ATT) The Average Treatment Effect on the Treated is: ATT = E[Y_1 − Y_0 | T = 1].
The ATE might fail to capture causal effects due to the presence of heterogeneity among the units. This is overcome by the Conditional Average Treatment Effect.

Definition 17. (CATE) Given a subset of covariates X, the Conditional Average Treatment Effect for X = x is: CATE(x) = E[Y_1 − Y_0 | X = x].
Notice that CATE boils down to ATE when X is the empty set. The objective of causal inference is to estimate causal effects starting from observational data, where only factual outcomes are available.
Based on the type of available data and assumptions, we distinguish three main settings: experiments, observational data with unconfoundedness, and observational data with no unconfoundedness.

| Experimental data
In experimental settings, the treatment variable T is under the control of the researchers. Units are assigned either to a treatment group or to a control group. For instance, patients either receive the drug or do not.
The experiment is defined as a Randomized Controlled Trial (RCT) whenever the treatment assignment is performed randomly. This is the gold standard for estimating causal effects in social sciences and clinical trials. Randomization ensures that the potential outcomes are independent of the treatment, that is, {Y_0, Y_1} ⊥⊥ T. A further assumption is, however, needed to estimate causal effects. Under the stable unit treatment value assumption (SUTVA), the observed outcome of a unit does not depend on the treatment assigned to other units (Cox, 1958). Under these assumptions, the ATE can then be estimated from a sample of observed units as a difference in means: ATE ≈ (1/n_1) Σ_{i: T_i = 1} Y_i − (1/n_0) Σ_{i: T_i = 0} Y_i, where n_t is the number of units i in the sample with T_i = t. An equivalent approach is given by running a simple OLS regression of Y_i on T_i. See Angrist and Pischke (2008, Chapter 2) for a comprehensive treatment.
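Under randomization and SUTVA, the estimator above is just a difference in sample means; a minimal numpy sketch:

```python
import numpy as np

def ate_diff_in_means(y, t):
    """Difference-in-means ATE estimator for randomized experiments:
    mean outcome of the treated minus mean outcome of the controls.
    y: outcomes; t: binary treatment indicators (0/1)."""
    y, t = np.asarray(y, float), np.asarray(t)
    return y[t == 1].mean() - y[t == 0].mean()
```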

| Observational data with unconfoundedness
While RCTs are the gold standard for retrieving causal effects, they might be costly, unethical, or even impossible to run. In an observational setting, a number of assumptions is required to assess the causal effects directly from an i.i.d. sample of observations. The first one requires the independence of the potential outcomes from the treatment, not in general, but for a given set of covariates.
Definition 18. (Unconfoundedness) Unconfoundedness w.r.t. a set of covariates X holds if: {Y_0, Y_1} ⊥⊥ T | X. Unconfoundedness is also referred to as ignorability, conditional independence (Lechner, 1999), or selection on observables (Barnow et al., 1980). Under the SCM framework, this can be assimilated to the backdoor criterion (Pearl, 2009). In the kidney stones example, if we know that people having larger kidney stones are more likely to receive the treatment compared to patients with smaller ones, then we can divide the population into smaller subgroups depending on the size of the kidney stones, and analyze each stratum as if the treatment were randomly assigned. The additional assumption of the overlap condition (or positivity) requires a non-deterministic treatment assignment, namely 0 < P(T = 1 | X = x) < 1. Unconfoundedness and overlap together are known as strong ignorability. Assuming strong ignorability, an estimate of the ATE consists of averaging the estimates of the CATEs over each stratum, a strategy known as stratification or blocking. The conditional expectations E[Y | T = t, X = x] are also referred to as conditional-response surfaces. Such an approach has some issues in the case of continuous covariates or covariates with many values. An alternative relies on propensity scores (Rosenbaum & Rubin, 1983). Complete overviews of propensity score methods for retrieving causal effects can be found in Austin (2011), Imbens (2004), and Imbens and Rubin (2015).
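A minimal sketch of the stratification estimator, averaging per-stratum effect estimates weighted by stratum size (variable names are ours; overlap within every stratum is assumed):

```python
import numpy as np

def ate_stratified(y, t, s):
    """Stratification (blocking) ATE estimator.

    Within each stratum of s, estimate the CATE as the difference in
    mean outcomes between treated and controls, then average the
    per-stratum effects weighted by stratum size. Assumes every stratum
    contains both treated and control units (overlap)."""
    y, t, s = np.asarray(y, float), np.asarray(t), np.asarray(s)
    n = len(y)
    ate = 0.0
    for stratum in np.unique(s):
        mask = s == stratum
        cate = y[mask & (t == 1)].mean() - y[mask & (t == 0)].mean()
        ate += (mask.sum() / n) * cate               # weight by stratum share
    return ate
```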
Definition 19. (Propensity score) For a set of covariates X, the propensity score is the conditional probability of getting treated: e(x) = P(T = 1 | X = x). The estimation of propensity scores requires two key decisions: the model or functional form of e(·) and the covariates in X. For binary treatments, the logistic regression model is commonly adopted (Caliendo & Kopeinig, 2008). Propensity scores are balancing scores, namely T ⊥⊥ X | e(X). Under the assumption of unconfoundedness, this implies that {Y_0, Y_1} ⊥⊥ T | e(X) (Imbens & Rubin, 2015). Therefore, strata can be defined w.r.t. the propensity score, rather than the covariates. This helps solve dimensionality concerns.
Another common usage of the propensity score is matching (Abadie & Imbens, 2016), which pairs treated units with untreated units based on their similarity w.r.t. a distance measure either on the covariate space, or directly on the propensity score space (Yao et al., 2021). The idea is to pair individuals with different exposure to treatment, but similar in terms of the propensity score. Matching ranges from one-to-one methods, where each treated (resp., untreated) unit is paired with the closest untreated (resp., treated) unit, to one-to-many methods, where each unit can have multiple matches. Denoting by J(i) the set of matched units for unit i, with M = |J(i)|, the ATE can be estimated from a sample of N units as: ATE ≈ (1/N) Σ_{i=1}^N (2T_i − 1) (Y_i − (1/M) Σ_{j ∈ J(i)} Y_j). The propensity score is also used to re-weight observational data (Hirano et al., 2003; Robins et al., 2000; Rosenbaum & Rubin, 1983) in such a way that the distribution of covariates is uniform across treatments. In the Inverse Probability of Treatment Weighting (IPTW; or Inverse Propensity Weighting [IPW]), the weights w(x, t) = t/e(x) + (1 − t)/(1 − e(x)) achieve that. Reweighting rebalances the data into a sort of experimental data. As long as unconfoundedness and the overlap condition hold, the causal effects can be estimated as in the RCT scenario (but now for a weighted sample). Some concerns regarding the variance of the estimators arise when units have propensity score values close either to 1 or 0. A popular solution is to stabilize the weights (Robins et al., 2000). Another methodology used for retrieving causal effects in this context is the G-formula (Robins, 1986). Both the G-formula and IPW fall under the marginal structural models family (Petersen et al., 2006). For a comparison of marginal structural models techniques, see Chatton et al. (2020).
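The IPTW estimator can be sketched in a few lines, taking (estimated) propensity scores as given; this is the plain Horvitz-Thompson form, without the weight stabilization mentioned above:

```python
import numpy as np

def ate_iptw(y, t, e):
    """Inverse Probability of Treatment Weighting ATE estimator:
    ATE ~ mean( T*Y/e(X) - (1-T)*Y/(1-e(X)) ).
    e holds (estimated) propensity scores, assumed strictly in (0, 1)."""
    y, t, e = (np.asarray(a, float) for a in (y, t, e))
    return np.mean(t * y / e - (1 - t) * y / (1 - e))
```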
Since under unconfoundedness the ATE can be characterized both in terms of propensity scores and of CATE, a natural question is whether combining both approaches can have some benefits. This concern is addressed in Robins et al. (1994), where the Augmented Inverse Propensity Weighting (AIPW) is proposed. The AIPW estimator is defined as: ATE ≈ (1/N) Σ_{i=1}^N [ μ_1(X_i) − μ_0(X_i) + T_i (Y_i − μ_1(X_i)) / e(X_i) − (1 − T_i)(Y_i − μ_0(X_i)) / (1 − e(X_i)) ], where the response surfaces μ_t(x) = E[Y | T = t, X = x] and the propensity score e(x) are estimated using regression models. The above estimator is also known as a doubly robust estimator, as it requires that only one of either the propensity score estimate or the response surface estimate be consistent to make the estimator consistent for the ATE (Scharfstein et al., 1999). Moreover, the AIPW estimator is optimal among the non-parametric estimators, in the sense that it attains the Cramer-Rao bound (Hahn, 1998). See Glynn and Quinn (2010) for an overview of AIPW.
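A direct transcription of the AIPW estimator, taking fitted response surfaces and propensity scores as inputs (their estimation, e.g., by regression models, is left outside this sketch):

```python
import numpy as np

def ate_aipw(y, t, e, mu0, mu1):
    """Doubly robust AIPW ATE estimator:
    ATE ~ mean( mu1 - mu0 + T*(Y - mu1)/e - (1-T)*(Y - mu0)/(1-e) ),
    where mu_t are fitted response surfaces E[Y | T=t, X] and e the
    propensity scores. Consistent if either model is correct."""
    y, t, e, mu0, mu1 = (np.asarray(a, float) for a in (y, t, e, mu0, mu1))
    return np.mean(mu1 - mu0
                   + t * (y - mu1) / e
                   - (1 - t) * (y - mu0) / (1 - e))
```

If the response surfaces are exactly correct, the correction terms vanish and the estimator reduces to the plug-in difference of surfaces; if instead the propensity scores are correct, the weighting terms restore consistency.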
Another robust alternative is the Targeted Maximum Likelihood Estimation method (Van Der Laan & Rubin, 2006), which involves four steps. In the first step, an estimation of the conditional expectation of the outcome given treatment and covariates, E[Y | T, X], is performed. In the second step, propensity scores are estimated. The third step adjusts the estimated conditional expectations using the estimated propensity scores. In the last step, the updated estimate of E[Y | T, X] is used to generate pairs of potential outcomes and, for each unit, the difference between the pairs of potential outcomes. The average of these differences is then an estimate of the ATE.
The methods presented assume homogeneous effects of treatment, which may not be the case in some contexts. Let us go back to the kidney stones example. In a (plausible) scenario where younger patients are more reactive to the treatment, the effectiveness of the drug will vastly change depending on the age of the patient. This is a case of heterogeneous effects. Here, the causal measure to focus on is the CATE. A semi-parametric approach for estimating the CATE can be developed based on linearity assumptions (Chernozhukov et al., 2018); see Nie and Wager (2021) for a generalization to a nonlinear setting.
Another popular nonparametric approach is Bayesian Additive Regression Trees (BART; Hill, 2011), where the conditional expectation of the outcome given treatment and covariates is estimated by an ensemble (a sum) of decision trees, that is, f(x, t) = Σ_q g_q(x, t), with g_q(x, t) denoting a Bayesian decision tree. Therefore, the CATE τ(x) can be defined as f(x, 1) − f(x, 0). In a similar fashion, other ensembles of trees and random forests have been investigated recently (Athey & Imbens, 2016; Wager & Athey, 2018). For a comprehensive review of these approaches, see Yao et al. (2021).

| Observational data without unconfoundedness
A practical problem for achieving unconfoundedness is to identify and collect all relevant covariates that make treatment assignments independent of the potential outcomes. A similar issue occurs for satisfying the overlap condition. Let us consider here approaches that do not assume unconfoundedness.
Instrumental variables (IV), first introduced by Wright (1928) and named by Reiersøl (1945), were developed in the structural econometrics setting to address endogeneity issues. For simplicity, assume no covariates X, and consider a linear structural model Y = α + β·T + ε of the outcome Y given the treatment T, with E[ε | T] = 0. In such a setting, an OLS regression can consistently estimate the parameter β. Conversely, if unconfoundedness does not hold, it might be that E[ε | T] ≠ 0, hence the estimation of the causal effect β through OLS regression is not consistent.
Other estimators to retrieve β when multiple instruments are available are the Two-Stage Least Squares (2SLS; Theil, 1961) and the Method of Moments (Baum et al., 2003). In summary, the following conditions are required for using the IV approach: the exogeneity condition, Z ⊥⊥ ε; the exclusion restriction, whereby the instrumental variable Z affects the outcome Y only through T; and the relevance condition, Cov(Z, T) ≠ 0. The exclusion restriction can be easily understood through a graphical example, where no arrows coming from Z point directly to Y, as shown in Figure 10a.
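A minimal 2SLS sketch for a single instrument and no covariates (with one instrument, the result coincides with the Wald ratio Cov(Z, Y)/Cov(Z, T)); variable names are ours:

```python
import numpy as np

def iv_2sls(y, t, z):
    """Two-Stage Least Squares with a single instrument, no covariates.

    Stage 1: regress T on Z; Stage 2: regress Y on the fitted T-hat.
    Returns the estimated coefficient beta on the treatment."""
    y, t, z = (np.asarray(a, float) for a in (y, t, z))
    Z = np.column_stack([np.ones_like(z), z])
    t_hat = Z @ np.linalg.lstsq(Z, t, rcond=None)[0]    # first stage fit
    X = np.column_stack([np.ones_like(t_hat), t_hat])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]      # coefficient on T-hat
```

In a simulation with an unobserved confounder, plain OLS of Y on T is biased, while the 2SLS estimate recovers the structural coefficient as long as the instrument is exogenous and relevant.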
Modeling the treatment effect as a linear model can be considered a strong assumption. However, as shown by the 2021 Nobel Prize laureates in Angrist et al. (1996) and Imbens and Angrist (1994), the IV strategy can also be used in the potential outcome framework, where units might not comply with the designation of the treatment. Let us think, for example, of an experiment where patients are randomly assigned to take a certain drug, but not all of them actually follow the assignment. In such a scenario, we can denote the outcome as Y_i, the treatment received as T_i (non-random), and the treatment to which each unit was assigned as Z_i (random). In this setting, we can frame the actual treatment T_i as a potential outcome of the assignment Z_i, that is, there are T_{0,i} and T_{1,i}. Moreover, the outcome is now a potential outcome of both T_i and Z_i, and it can be defined (with a slight abuse of notation) as {Y_{t,z}} with t, z ∈ {0, 1}. At this point, the three conditions of instrumental variables are met: exogeneity is ensured by the randomness of Z, the exclusion restriction holds as Z affects the outcome only through the treatment actually taken and, finally, the treatment assignment influences the actual taking of the drug, fulfilling the relevance condition. By looking at the actual treatment and the treatment assignment, we can identify four sub-populations in our sample: the compliers, who follow the treatment designation; the always-takers, who always receive the treatment no matter the assignment; the never-takers, who never receive the treatment no matter the assignment; and the defiers, who do the opposite of what they were assigned to. By adding the additional assumption of no defiers, the IV approach identifies the treatment effect for those who comply with the assignment. Such an effect is known as the Local Average Treatment Effect (LATE). See Imbens (2014) for an overview of other instrumental variables approaches.
Another popular technique dealing with the lack of unconfoundedness is the front-door criterion (Pearl, 1995), under the SCM framework. As depicted in Figure 10b, a few requirements need to be met in order to exploit it in the estimation of causal effects. In particular, M must block all the directed paths from T to Y, T must have no unblocked backdoor paths to M, and T must block all the backdoor paths from M to Y. In fact, M can be defined as a mediator variable. By estimating the effect of T on M and then of M on Y, it is possible to retrieve the effect of T on Y.
Another option arises whenever the treatment assignment depends only on the values of a specific variable. For instance, consider, in the kidney stones example, that patients are assigned to treatment only if the size of their kidney stones is bigger than a certain threshold. The Regression Discontinuity Design (RDD; Thistlethwaite & Campbell, 1960) precisely assumes that the treatment depends on a variable C, called the running variable, and a threshold c_0 as, for unit i: T_i = 1[C_i > c_0]. The core idea of this methodology is that observations close to the threshold are in principle pretty similar to each other, and they can be used to retrieve the causal effect of the treatment. In particular, two assumptions are made: (i) the probability of receiving the treatment jumps at the cutoff c_0, that is, lim_{c→c_0^+} P(T = 1 | C = c) ≠ lim_{c→c_0^−} P(T = 1 | C = c); and (ii) the potential outcomes are continuous at the cutoff, namely the limits lim_{c→c_0} E[Y_t | C = c] exist for t ∈ {0, 1}. Under such assumptions, the ATE at the cutoff can be calculated as: ATE_RD = lim_{c→c_0^+} E[Y | C = c] − lim_{c→c_0^−} E[Y | C = c], and it can be estimated by either parametric or nonparametric approaches (Cattaneo et al., 2020; Lee & Lemieux, 2010).
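A deliberately naive sharp-RDD sketch, comparing mean outcomes within a bandwidth h on either side of the cutoff; practical estimators (e.g., those surveyed by Cattaneo et al., 2020) instead fit local linear regressions on each side:

```python
import numpy as np

def rdd_local_means(y, c, c0, h):
    """Naive sharp-RDD estimate of the effect at the cutoff c0:
    difference of mean outcomes in a bandwidth h just above vs. just
    below c0. Illustrative sketch only; real tools use local linear
    fits to reduce bias from the slope of the running variable."""
    y, c = np.asarray(y, float), np.asarray(c, float)
    above = (c > c0) & (c <= c0 + h)
    below = (c <= c0) & (c > c0 - h)
    return y[above].mean() - y[below].mean()
```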
In the case of longitudinal data, namely when for each unit there are multiple observations over time, some further approaches are available. Let us consider a setting where the observed outcome Y_{i,l} and a binary treatment T_{i,l} are measured over time l = 1, …, L for each unit i = 1, …, N in a sample. Let us also suppose that the two variables are linearly related as: Y_{i,l} = Y_{i,l}(0) + τ·T_{i,l}, where Y_{i,l}(0) refers to the potential outcome under no treatment at time l and τ can be seen as the constant causal effect of the treatment. Assuming that the effect is not heterogeneous, that is, the treatment affects the units in the same manner in all periods, and that there are no treatment dynamics, that is, the outcome at time l depends only on the treatment at the same time, a causal estimate can be obtained from the model: Y_{i,l} = α_i + γ_l + τ·T_{i,l} + ε_{i,l}. This two-way model assumes that there are two fixed effects, one for each unit (α_i) and one for each period (γ_l). In the simplest possible scenario, that is, when there are just two periods (L = 2), some units never get treated, and the remaining ones are treated just in the second period, the OLS estimator of τ coincides with what is referred to as the difference-in-differences (DID) estimator of the ATE: the change in mean outcome over time for the treated group minus the change for the untreated group. A seminal example of this technique can be found in Card and Krueger (1994), where the authors studied the effect of a rise in the minimum wage on unemployment. This was done by comparing New Jersey workers, where a minimum wage raise was introduced, with Pennsylvania workers, where no raise occurred. A crucial assumption of the difference-in-differences approach is the parallel trends assumption: although treatment and comparison groups may have different levels of the outcome prior to the start of treatment, their trends in the outcome before the treatment should be the same. Moreover, as pointed out in Bertrand et al. (2004), extra care should be devoted to handling the error terms ε_{i,l}. For a comprehensive review of DID estimators, see Lechner (2011).
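In the two-period, two-group case, the DID estimator reduces to a difference of differences of group means, as in the sketch below (the group/period argument names are our own):

```python
import numpy as np

def did(y_treat_pre, y_treat_post, y_ctrl_pre, y_ctrl_post):
    """2x2 difference-in-differences estimator:
    (treated post - treated pre) - (control post - control pre).
    Relies on the parallel-trends assumption described in the text."""
    return (np.mean(y_treat_post) - np.mean(y_treat_pre)) - (
        np.mean(y_ctrl_post) - np.mean(y_ctrl_pre))
```

Here the control group's change over time stands in for the counterfactual trend of the treated group; any common shock affecting both groups cancels out.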
In order to overcome the assumption of a fixed α i over time, some approaches have been developed, such as the synthetic control methods, first introduced in Abadie and Gardeazabal (2003) and later developed in Abadie et al. (2010). The core idea of synthetic methods is to re-weight the units that were not treated so that the parallel trend assumption becomes more plausible. A more recent advancement in this direction is the synthetic difference-in-differences method, presented in Arkhangelsky et al. (2019), which also provides a unified perspective of DID and synthetic control methods.

| Software tools
Many software tools for estimating causal effects have been developed in recent years. We review a non-exhaustive list (Table 12), discussing which approaches they implement and their most relevant features. DoWhy (Sharma et al., 2019) is one of the most complete tools. Written in Python, it provides a unifying framework for several methodologies, covering virtually the whole process of causal inference. DoWhy covers four tasks: modeling the causal problem through a causal graph, identifying the causal estimand of interest, estimating the causal effect, and validating the obtained results. The following identification strategies are currently implemented: backdoor criterion, front-door criterion, instrumental variables, and mediation analysis. For each of these, a few methods for estimating causal effects are provided. Moreover, DoWhy natively connects to the wide range of machine-learning-based estimators from EconML (Microsoft Research, 2019).
Propensity score matching is implemented in the following three tools. The popular R package Matching (Sekhon, 2011) offers three main functions: Match, MatchBalance, and GenMatch. The first one performs propensity score matching. The second one checks whether the matching method balances the data. The last function generates optimal weights for each covariate through a genetic search algorithm to automatically balance the data. MatchIt (Ho et al., 2011) is another popular choice. The package offers the function matchit, which generates matched data through different methods. After the matching is performed, standard regression techniques can be applied to the obtained data to retrieve the causal effects of interest. Another interesting option for large datasets with discrete covariates is the R package R-FLAME (Orlandi et al., 2020) and its Python version dame-flame. The packages provide a framework that implements the Fast, Large-Scale Almost Matching Exactly (FLAME; Wang et al., 2021) and Dynamic Almost Matching Exactly (DAME; Dieng et al., 2019) approaches.
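The exact-matching logic behind these packages can be sketched in a few lines of pure Python (toy data and names are ours; the real packages add balance checks, distance metrics, and diagnostics):

```python
import random
from collections import defaultdict

random.seed(2)

# Toy data with one discrete covariate X in {0, 1, 2}.
rows = []
for _ in range(30000):
    x = random.randrange(3)
    t = 1 if random.random() < 0.2 + 0.2 * x else 0
    y = 1.5 * t + 2.0 * x + random.gauss(0, 1)  # true effect is 1.5
    rows.append((x, t, y))

# Exact matching: subclasses defined by identical covariate values;
# subclasses lacking either treated or control units are discarded.
strata = defaultdict(lambda: {0: [], 1: []})
for x, t, y in rows:
    strata[x][t].append(y)

matched = [s for s in strata.values() if s[0] and s[1]]

# ATT: average, over treated units, of the within-stratum mean difference.
n_treated = sum(len(s[1]) for s in matched)
att = sum(
    len(s[1]) * (sum(s[1]) / len(s[1]) - sum(s[0]) / len(s[0]))
    for s in matched
) / n_treated
print(round(att, 2))
```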
Propensity score weighting is implemented in the following four tools. PSW (Mao & Li, 2018), one of the main reference libraries, provides several propensity-score-based techniques through the function psw. This package allows the user to visually check the propensity score distribution in both treatment groups, evaluate covariate balance, and test the specification of the propensity score model. CBPS (Fong et al., 2021), available in R, implements several methods presented in Imai and Ratkovic (2013), for both cross-sectional and longitudinal data. CBPS maximizes covariate balance and the prediction of treatment assignment at the same time, whereas typical propensity score algorithms first predict the treatment assignment and then check whether the covariates are balanced among the treatment groups. This makes the method more robust to misspecification. CBPS includes matching, weighting, and doubly robust methods based on the estimated propensity scores. The package ipw (van der Wal & Geskus, 2011) implements the inverse probability of treatment weighting, for both time-fixed and time-varying frameworks. Similarly, PSweight (Zhou et al., 2021) covers propensity-score-based estimators through propensity score weights, exact-matching weights, entropy weights, ATT weights, and overlap weights.
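A minimal sketch of the inverse probability of treatment weighting estimator implemented by these packages, with the propensity score estimated by simple stratum frequencies (toy data and values are ours):

```python
import random

random.seed(3)

rows = []
for _ in range(50000):
    z = 1 if random.random() < 0.4 else 0                  # confounder
    t = 1 if random.random() < (0.7 if z else 0.3) else 0  # treatment
    y = 2.5 * t + 4.0 * z + random.gauss(0, 1)             # true ATE is 2.5
    rows.append((z, t, y))

# Estimate the propensity score e(z) = P(T=1 | Z=z) by stratum frequencies.
e = {}
for z in (0, 1):
    ts = [t for zz, t, _ in rows if zz == z]
    e[z] = sum(ts) / len(ts)

# IPW estimator of the ATE: weight each observation by the inverse
# probability of the treatment it actually received.
n = len(rows)
ate = sum(t * y / e[z] - (1 - t) * y / (1 - e[z]) for z, t, y in rows) / n
print(round(ate, 2))
```

In practice the propensity score is fitted with a regression model over many covariates; the stratum-frequency estimate above is the special case of a single discrete confounder.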
RISCA is a viable option for users interested in marginal structural models, as it provides functions for G-estimation and Inverse Probability Weighting.
Doubly robust estimators are also offered by CausalGAM (Glynn & Quinn, 2017), an R library that implements both standard estimators and the AIPW estimator, and by the tmle (Gruber & van der Laan, 2012) package, which provides an R implementation of Targeted Maximum Likelihood Estimation.
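The AIPW estimator combines an outcome-regression estimate with inverse-probability-weighted residuals, so it remains consistent if either the outcome model or the propensity model is correct. A minimal sketch (toy data and values are ours; real packages fit both models with flexible regressions):

```python
import random

random.seed(4)

rows = []
for _ in range(50000):
    z = 1 if random.random() < 0.5 else 0
    t = 1 if random.random() < (0.6 if z else 0.35) else 0
    y = 3.0 * t + 2.0 * z + random.gauss(0, 1)  # true ATE is 3.0
    rows.append((z, t, y))

# Outcome models m[(t, z)] = E[Y | T=t, Z=z] and propensity e[z] = P(T=1 | Z=z),
# both estimated here by simple stratum averages.
m, e = {}, {}
for z in (0, 1):
    ts = [t for zz, t, _ in rows if zz == z]
    e[z] = sum(ts) / len(ts)
    for t in (0, 1):
        ys = [y for zz, tt, y in rows if zz == z and tt == t]
        m[(t, z)] = sum(ys) / len(ys)

# AIPW: regression estimate augmented with inverse-probability-weighted residuals.
n = len(rows)
ate = sum(
    m[(1, z)] - m[(0, z)]
    + t * (y - m[(1, z)]) / e[z]
    - (1 - t) * (y - m[(0, z)]) / (1 - e[z])
    for z, t, y in rows
) / n
print(round(ate, 2))
```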
Several tools rely on machine learning for estimating causal effects. BART (Sparapani et al., 2021) provides a package for the estimation of Bayesian Additive Regression Trees. grf (Tibshirani et al., 2020) is an R package that implements tree-based methodologies for CATE estimation. Its main function is causal_forest, which trains a causal forest to retrieve heterogeneous treatment effects. For Python, we refer to the EconML and Causal ML packages. The former provides several implementations of state-of-the-art methodologies for retrieving heterogeneous causal effects, such as Double Machine Learning (Chernozhukov et al., 2018), causal trees/forests (Athey & Imbens, 2016; Wager & Athey, 2018), Doubly Robust Learning (Foster & Syrgkanis, 2020), and Meta-Learners (Künzel et al., 2019). The latter (Chen et al., 2020) covers similar approaches, yet less extensively. Further Python packages are available to estimate ITE, such as CEVAE (Shalit et al., 2017) and SITE (Yao et al., 2018).
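As an illustration of the meta-learner idea, the following pure-Python sketch implements a T-learner: one outcome model per treatment arm (here, plain stratum means standing in for the ML regressions these packages use), with their difference as the CATE estimate (toy data and values are ours):

```python
import random
from collections import defaultdict

random.seed(7)

rows = []
for _ in range(40000):
    x = random.randrange(2)                 # binary covariate driving heterogeneity
    t = 1 if random.random() < 0.5 else 0   # randomized treatment
    tau_x = 1.0 + 2.0 * x                   # true CATE: 1.0 if x=0, 3.0 if x=1
    y = tau_x * t + 0.5 * x + random.gauss(0, 1)
    rows.append((x, t, y))

# T-learner: fit one outcome model per arm, then difference the predictions.
sums = defaultdict(lambda: [0.0, 0])
for x, t, y in rows:
    s = sums[(t, x)]
    s[0] += y
    s[1] += 1

def m(t, x):
    total, count = sums[(t, x)]
    return total / count

cate = {x: m(1, x) - m(0, x) for x in (0, 1)}
print({x: round(v, 2) for x, v in cate.items()})
```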
Instrumental variables estimation is offered by the R package ivreg (Fox et al., 2021). The function ivreg fits different estimators, including the Two-Stage Least Squares (2SLS) and the Method of Moments (MM).
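With a single binary instrument, the 2SLS estimator collapses to a ratio of covariances (the Wald estimator): project T on Z in the first stage, then regress Y on the fitted values. A minimal sketch on simulated data with an unobserved confounder (data-generating values and names are ours):

```python
import random

random.seed(5)

rows = []
for _ in range(50000):
    u = random.gauss(0, 1)                 # unobserved confounder
    z = 1 if random.random() < 0.5 else 0  # instrument: affects T, not Y directly
    t = 1 if (0.5 * z + 0.5 * u + random.gauss(0, 1)) > 0.5 else 0
    y = 2.0 * t + 1.5 * u + random.gauss(0, 1)  # true effect is 2.0
    rows.append((z, t, y))

def cov(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((x - ma) * (w - mb) for x, w in zip(a, b)) / len(a)

zs = [z for z, _, _ in rows]
ts = [t for _, t, _ in rows]
ys = [y for _, _, y in rows]

# First stage: project T on Z; second stage: regress Y on fitted T.
# With a single instrument this collapses to cov(Z, Y) / cov(Z, T).
iv_estimate = cov(zs, ys) / cov(zs, ts)
print(round(iv_estimate, 2))
```

A plain regression of Y on T would be biased here because of U; the instrument isolates the variation in T that is unrelated to the confounder.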
rdrobust (Calonico et al., 2021) is one of the main tools offering an RDD implementation. It covers all the required steps through different functions: rdplot deals with the graphical exploration of the setting, rdbwselect picks the optimal bandwidth, and rdrobust computes the RDD estimator under different assumptions. Other packages for RDD estimation include rddtools (Stigler & Quast, 2016) and rdd (Dimmery, 2016).
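The core RDD computation can be sketched as two local linear regressions, one on each side of the cutoff, whose intercepts are compared at the threshold. Here the bandwidth is fixed by hand, whereas a tool like rdbwselect would choose it in a data-driven way (toy data and values are ours):

```python
import random

random.seed(6)

rows = []
for _ in range(20000):
    x = random.uniform(-1, 1)                      # running variable, cutoff at 0
    t = 1 if x >= 0 else 0                         # sharp design
    y = 1.0 * x + 3.0 * t + random.gauss(0, 0.5)   # jump of 3.0 at the cutoff
    rows.append((x, t, y))

def ols(pairs):
    """Simple univariate OLS: returns (intercept, slope)."""
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((a - mx) * (b - my) for a, b in pairs) / sum((a - mx) ** 2 for a in xs)
    return my - slope * mx, slope

h = 0.3  # bandwidth (chosen by hand for this sketch)
left = [(x, y) for x, t, y in rows if -h <= x < 0]
right = [(x, y) for x, t, y in rows if 0 <= x <= h]

# RDD estimate: difference of the two local-linear intercepts at the cutoff.
rdd = ols(right)[0] - ols(left)[0]
print(round(rdd, 2))
```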
A canonical implementation of difference in differences can be obtained by simply instantiating a common panel linear regression [plm (Croissant & Millo, 2008) for R users, linearmodels (Sheppard et al., 2021) for Python users]. More advanced techniques, such as synthetic control, can be implemented through the Synth (Abadie et al., 2011) package and CausalImpact (Brodersen et al., 2014). The latter is available in both R and Python.

| Datasets
We present a few open-access datasets widely used in papers about causal inference.
IHDP is based on the RCT developed by the Infant Health and Development Program (Brooks-Gunn et al., 1992). There are 25 covariates about infants and their mothers. The treatment is access to comprehensive early intervention. The goal of the study was to understand how this extra care could help reduce the developmental and health problems of low-birth-weight premature infants.
Lalonde is a well-known observational dataset used to evaluate propensity score matching in Dehejia and Wahba (1999). The variables refer to workers' characteristics, such as age, education, ethnicity, marital status, exposure to the training program (the treatment), and salary.
Several datasets can be obtained from the wooldridge package in R. This library includes all the 114 datasets of Wooldridge (2015). For instance, the mathpnl dataset (Papke, 2005) concerns student performance at schools. Some data can be obtained through the popular RDD packages previously described. In particular, the house dataset from Lee (2008), also used in Imbens and Kalyanaraman (2012), is included in the rddtools library. The dataset refers to observations on elections, and it was used to estimate the effect of incumbency, exploiting the percentage of votes as a running variable.
The Card-Krueger 1994 dataset is a well-known example for DID estimation, from Card and Krueger (1994; see also the version from Ropponen (2011)). It contains data about workers in the fast-food industry.
Finally, additional datasets from the econometrics domain are included in the R package AER (Kleiber & Zeileis, 2008). These datasets can be used to experiment with several causal inference techniques. For instance, the Cigarettes dataset is suited for instrumental variable approaches.

| Running example
A toy example is provided here to compare the performance of several methodologies under the unconfoundedness assumption. In this example, we generate data simulating patients who undergo a diet. The code used for the whole experiment is available online.1 The structure of the data is illustrated in Figure 11.
For each instance, seven variables are simulated. The binary variable man is equal to one if the patient is a man, and it is generated through a Bernoulli(0.5). The binary variable white is equal to one if the patient is white, and it is distributed as a Bernoulli(0.65). The binary variable overweight is equal to one if the unit was overweight in the past; it follows a Bernoulli(0.32) if white is equal to one, and a Bernoulli(0.4) otherwise. The variables N_T and N_Y are noise variables distributed as Normal(4, 2) and Normal(15, 1), respectively. The binary variable treat refers to whether the unit has been on a diet: it is equal to one if N_T + white + 3·(overweight − man) > 6, and zero otherwise. Finally, the outcome variable weight, measuring the patient's weight, is drawn from a Normal distribution whose mean depends on the covariates and on the treatment (e.g., a baseline Normal(200, 30) if the unit is a man). In such a setting, τ captures the causal effect of treat on weight. Let us notice that the relationship between treat and weight is assumed to be linear and that unconfoundedness holds given man, white, and overweight.
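The covariate and treatment-assignment part of this data-generating process can be sketched as follows (we read the treatment rule as N_T + white + 3·(overweight − man) > 6, as described above; the check on the treated share is ours):

```python
import random

random.seed(8)

def simulate_unit():
    man = 1 if random.random() < 0.5 else 0
    white = 1 if random.random() < 0.65 else 0
    p_over = 0.32 if white else 0.4
    overweight = 1 if random.random() < p_over else 0
    n_t = random.gauss(4, 2)  # noise term entering treatment assignment
    treat = 1 if n_t + white + 3 * (overweight - man) > 6 else 0
    return man, white, overweight, treat

units = [simulate_unit() for _ in range(10000)]
treated_share = sum(u[3] for u in units) / len(units)
print(round(treated_share, 3))
```

Note that treatment assignment depends only on the three observed covariates plus independent noise, which is exactly why unconfoundedness holds given man, white, and overweight.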
Therefore, the goal of the whole example is to estimate τ. Three different scenarios are simulated: one where the impact is high (τ = −30), one where it is smaller (τ = −10), and one where there is no effect (τ = 0).
For each scenario, 1000 simulations were run. In each simulation, a dataset containing 10,000 observations is generated and an estimate of the ATE is computed using the following methods: OLS linear regression, exact (one-to-one) matching estimator, full matching, stabilized IPW, and doubly robust AIPW.
The OLS regression coefficient is retrieved by running a linear regression of weight over treat, man, white, overweight, and the interactions of the centered-around-the-mean versions of man, white, and overweight with treat. For exact and full matching, the MatchIt package is used. In exact matching, a complete cross of the covariates is used to form subclasses defined by each combination of the covariate levels. If a subclass does not contain both treated and control units, it is discarded. In this way, the remaining subclasses contain treatment and control units that are exactly equal in the included covariates. On the other hand, in full matching, both treatment and control units (i.e., the "full" sample) are assigned to a subclass and receive at least one match. In this case, the sum of the absolute distances between the treated and control units in each subclass is as small as possible. Stabilized IPW estimates are found using the ipw package, while the CausalGAM implementation is used to retrieve the doubly robust AIPW estimate. Table 13 shows the mean of the estimated ATEs for each methodology over the 1000 runs and the corresponding standard deviations. Comparing the estimated ATEs with the ground-truth coefficient, all the methodologies are quite accurate in terms of mean values. Full matching shows the highest standard deviation (one order higher than the others), while OLS regression has the smallest standard deviation. This is indeed expected, as the relationship between weight and treat is actually linear.
To conclude, it is worth noting that OLS regression is on average the fastest, as it requires just 0.005 s to run. Exact matching and IPW are also quite fast, taking around 0.05 s to compute the parameters. On the other hand, AIPW takes around 30 s to estimate the parameter, while full matching requires almost 1 min per run.

| CURRENT TRENDS AND POTENTIAL FUTURE RESEARCH
In recent years, the demand for trustworthy AI systems has fostered the introduction of causality approaches in machine learning (ML) research. As highlighted by Pearl (2018), causal reasoning is crucial to overcome current ML limitations. For instance, the widespread usage of black-box models in socially sensitive decision making requires explanations of the logic involved (Guidotti et al., 2019). In fact, traditional ML algorithms build on the correlation among variables rather than on proper causal structures, with the risk of making wrong, biased, or harmful decisions.
In the eXplainable AI (XAI) branch, several works have started exploiting causal frameworks to investigate black-box decisions (Moraffah et al., 2020). The CXPlain system, developed by Schwab and Karlen (2019), exploits Granger causality to determine the importance of features for a black-box model. Causal concepts, such as counterfactuals, are used in XAI (Verma et al., 2020) for post-hoc explanations that answer the question: "which changes in an instance's features would have changed the ML model prediction?" Early approaches generate counterfactuals by solving an optimization problem (Wachter et al., 2017). A number of refinements of the optimization constraints cover efficiency, robustness, diversity, actionability, and plausibility. Zhao and Hastie (2021) exploit the connection between partial dependence plots, an XAI tool, and the backdoor criterion to extract causal information from black-box models. Beyond post-hoc explanations, causal discovery algorithms can provide inherently interpretable methods (Xu et al., 2020).
Causal reasoning in AI also supports the enforcement of ethical requirements. In fact, spurious correlations and other forms of bias can lead to discriminatory decisions against legally protected social groups. Fair ML aims at the design of models that do not discriminate. The approach of Kusner et al. (2017) defines fair decisions using counterfactual reasoning. Counterfactual fairness asks what would have happened if membership to the protected group had been different while the other features had stayed the same, exploiting DAGs and do-operations. For a comprehensive review of fairness and causality, see Makhlouf et al. (2020).
Another concern about ML algorithms is the assumption that training and test set data come from the same distribution. In real environments, such an assumption is often not met due, for instance, to distribution shifts (Quiñonero-Candela et al., 2009). Domain adaptation studies how to extend ML models that are trained in certain domains to others. This branch can benefit from the usage of causal tools, as going beyond simple correlations is crucial to achieve robustness. Research bridging causality with domain adaptation includes Zhang et al. (2015), where the authors study which knowledge is transferable from one domain to another and find the optimal target-domain hypothesis. In Pearl and Bareinboim (2014), the authors present conditions for transferring the causal effects learned in experimental studies to a new population where only observational studies are possible. This problem is known as transportability. Transportability requires full knowledge of the causal graph; an extension to cases with only partial knowledge of the causal structure is considered in Magliacane et al. (2018), where the authors build on JCI (Section 3.1).
Reinforcement learning (RL) is another branch of AI that can benefit from causal reasoning. In RL, an agent interacts with an environment aiming to maximize its cumulative reward within a certain time horizon.
Causality is a natural addition to model-based RL, where the agent has access to the state-transition probabilities, as it helps overcome some concerns, such as confounding factors. For instance, Bareinboim et al. (2015) explore the relationship between causal models with unobserved confounders and the popular sequential decision-making setting of Multi-Armed Bandits (MAB), where latent variables influence both the reward distribution and the agent's intuition. Similar results for Markov Decision Processes (MDP), another popular setting for modeling sequential decision making, are provided by Zhang and Bareinboim (2016). In Lee and Bareinboim (2018), the authors study how to identify the best action in MABs where an action corresponds to interventions on an arbitrary causal graph, also taking latent factors into account. Unconfoundedness is also studied in Bruns-Smith (2021), where cases of persistent confounding variables are investigated. Another work worth mentioning for causal RL is Lattimore et al. (2016), where noninterventional observations are used to improve the rate at which high-reward actions can be identified.
Let us now turn to the challenges posed by the usage of causality in AI and to a few open research directions. First, in the majority of application contexts there is a general lack of knowledge regarding the causal model that underlies the data generating process. This concern also limits the validation of causal discovery techniques, as few datasets are documented with a proper causal structure. This is a main challenge to be addressed, and it requires a multi-disciplinary effort to collect domain expertise. Legal initiatives that demand trustworthy AI development, such as the European Union draft regulation on AI,10 can be a stimulus for the development of domain knowledge in the form of a causal graph in specific high-risk application domains.
Second, causal discovery research needs a boost to deal with non-tabular data, such as images and videos. Some examples of works in this area can be found in Lopez-Paz et al. (2017) and Li et al. (2020). The former work deals with retrieving causal signals in the context of images, while the latter studies the problem of learning causal structures from videos. In this context, a repository with publicly available datasets and tools is currently missing, and it would be a cornerstone for research in the area. The study of techniques designed for longitudinal data is also worth exploring, as few methodologies are available to retrieve causal structures (Section 3.2). Dealing with high-dimensional data and missing values is another critical open issue worth being addressed.
Third, some lines of future research can be outlined for causal inference. The assumption of unconfoundedness is often too strong, which makes the tools surveyed in Section 4.1.2 inapplicable. Additionally, the overlap condition rarely holds in contexts with high-dimensional data. Also, the SUTVA assumption can quickly fail when network effects take place. Finally, methodological guidelines should be developed for specific application contexts to inform and guide practitioners in using the growing range of causal inference techniques.

| CONCLUSIONS
Causality is a dynamic and multidisciplinary field, with both a long history and major developments to come. Several new techniques have been proposed in the last decades, and new software has become available for practitioners to perform causality-related tasks.
In light of this, this survey collected the principal methodologies, tools, datasets, and metrics to perform and evaluate both causal discovery and causal inference. An overview of current trends in using causality in the realm of (trustworthy) Artificial Intelligence was also provided, including open issues and potential new directions. A companion website (https://tinyurl.com/Causal-Discovery-and-Inference) is also available with additional resource lists and software scripts.

CONFLICT OF INTEREST
The authors have declared no conflicts of interest for this article.

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are openly available in "Methods and Tools for Causal Discovery and Causal Inference" at https://github.com/AnaRitaNogueira/Methods-and-Tools-for-Causal-Discovery-and-Causal-Inference.git.