On nonparametric conditional independence tests for continuous variables

Testing conditional independence (CI) for continuous variables is a fundamental but challenging task in statistics. Many tests for this task have been developed and are used increasingly widely by data analysts. This article reviews the current status of the nonparametric branch of these tests, which assumes no parametric form for the joint continuous density function. The different ways to characterize CI are summarized, and tests are grouped according to their data assumptions and method types. A numerical comparison of representative tests is also conducted.


| INTRODUCTION
Independence and conditional independence (CI) are fundamental concepts in statistics. Dawid (1979) and Dawid (1980) discuss important properties of CI for exploring queries in statistical inference, such as sufficiency, parameter identification, adequacy, and ancillarity. Blitzstein (2011) states that conditioning is the soul of statistics. With the rise of related application fields including causal inference, graphical model learning, and feature selection (Koller & Sahami, 1996; Xing, Jordan, & Karp, 2001), CI testing has attracted increasing attention from both the theoretical and the application side.
The high demand for CI tests is related to the common desire to identify the relationships among different objects, including how they are connected, the mechanism by which they interact, and how information flows among them (Gerstein et al., 2012). This desire underlies the appeal of probabilistic graphical models, which express the relations among variables in terms of independence and CI. They are now extremely widely used in various disciplines, for example, information extraction, speech recognition, computer vision, modeling of gene regulatory networks, gene finding, and diagnosis of diseases. As the building blocks of graphical models, CI relations are key to answering the scientific questions of these areas.
Formally, the problem of interest is testing the CI between two random variables X and Y given a third random vector Z based on a finite sample. In statistical notation, the null hypothesis is written as H0: X ⊥ Y | Z, where ⊥ denotes "independent of." The alternative hypothesis is written as H1: X ⊥̸ Y | Z, where ⊥̸ denotes "dependent on." Conditional independence tests for the case in which the variables are categorical are abundant in the literature; the most well-known include Pearson's χ² test and the likelihood ratio test (Edwards, 2000; Pearson, 1992; Tsamardinos & Borboudakis, 2010). If Z is discrete, the CI test reduces to testing the unconditional independence X ⊥ Y | Z = z_i for every possible value z_i that Z takes (Tsamardinos & Borboudakis, 2010).
Testing CI of continuous random variables is a much more difficult case (Bergsma, 2004). That prompts the use of simplifying assumptions. In the early days, one commonly used method was the test based on the partial correlation coefficient, which requires the assumption of Gaussian variables (Fisher, 1915, 1921). Milder assumptions such as linearity and additivity are well studied in the literature (Daudin, 1980; Spirtes, Glymour, & Scheines, 1993; Geiger & Heckerman, 1994; Peters, Mooij, Janzing, & Schölkopf, 2014). However, real-life situations where linear and additive conditions are not met are still plentiful (e.g., physiological signals, stock market prices, weather status; Margaritis, 2005), so recent research has focused on nonparametric CI tests, in which no assumption on the functional form between the variables or on the data distributions is imposed. Nonparametric CI testing, however, faces many challenges. In general, current nonparametric CI tests either suffer from the "curse of dimensionality" with respect to the dimensionality of the conditioning vector Z (Bergsma, 2004), or are limited in scalability, that is, they cost too much time in large sample size settings. Moreover, Shah and Peters (2019) prove that no CI test can control Type-I error for all CI cases. More specifically, if the joint distribution of (X, Y, Z) is continuous, a CI test that both (a) controls Type-I error under the significance level α over the entire class of distributions satisfying CI, and (b) has power better than the significance level α against any alternative hypothesis, cannot exist.
Although the dream of nonparametric CI tests that are both uniformly valid and powerful may not be achievable (Shah & Peters, 2019), their validity for a wide range of CI cases and their robust high power against a wide range of alternative hypotheses still make them the popular choice for many CI testing problems. But since there is no nontrivial CI test that is valid for all CI distributions, it is crucial to choose the right test. It is thus extremely important to understand these nonparametric CI tests, as well as the specific problem at hand. For the latter task, domain knowledge of the specific problem is needed. For the former task, we write this article to provide a brief review of current nonparametric CI tests for continuous variables, with emphasis on the CI characterizations they are based on and the methodological lines they follow. We also provide an empirical comparison to show their performance on a range of popular distributions.

| CI CHARACTERIZATION APPROACHES
Let X, Y, and Z be three random vectors with domains 𝒳, 𝒴, and 𝒵, respectively, and assume that their joint distribution is continuous with density p_{X,Y,Z}(x, y, z). By definition, X and Y are conditionally independent given Z if

p_{X,Y|Z}(x, y | z) = p_{X|Z}(x | z) p_{Y|Z}(y | z)

for every value of (x, y, z). Obviously, it is impossible to check this equality for every value of (x, y, z) based on a finite sample. Thus, researchers have come up with different approaches to characterize CI. Some CI characterizations are in a weaker sense (Daudin, 1980) than the above definition; we therefore call the characterizations equivalent to the above definition "strong" CI. The word "strong" is only added to distinguish them from the definition of weak CI (Daudin, 1980). CI, or "strong" CI, implies weak CI, but the converse is not true.

| "Strong" CI
There are at least the following main approaches to characterizing CI.
(A) We say that X is conditionally independent of Y given Z, that is, X ⊥ Y | Z, if the joint density factorizes as p_{X,Y,Z}(x, y, z) = p_{X|Z}(x | z) p_{Y|Z}(y | z) p_Z(z) (or, equivalently, p_{X,Y|Z}(x, y | z) = p_{X|Z}(x | z) p_{Y|Z}(y | z), p_{X|Y,Z}(x | y, z) = p_{X|Z}(x | z), p_{Y|X,Z}(y | x, z) = p_{Y|Z}(y | z), or p_{X,Y,Z}(x, y, z) p_Z(z) = p_{X,Z}(x, z) p_{Y,Z}(y, z); see, e.g., Dawid (1979)).

(B) Daudin (1980) characterizes CI through the uncorrelatedness of functions in certain L² spaces. Let L²_{X,Z} denote the space of all functions f such that E[f(X, Z)²] < ∞, that is, L²_{X,Z} = {f : E[f(X, Z)²] < ∞}, and define L²_{Y,Z}, L²_X, L²_Y, and L²_Z analogously. Consider the constrained L² spaces E²_{X,Z} = {f̃ ∈ L²_{X,Z} : E[f̃ | Z] = 0} and E²_{Y,Z} = {g̃ ∈ L²_{Y,Z} : E[g̃ | Z] = 0}. Then the following conditions are equivalent to X ⊥ Y | Z: (i) E[f̃(X, Z) g̃(Y, Z)] = 0 for all f̃ ∈ E²_{X,Z} and g̃ ∈ E²_{Y,Z}; (ii) E[f̃(X, Z) g(Y, Z)] = 0 for all f̃ ∈ E²_{X,Z} and g ∈ L²_{Y,Z}.

Note: The above result can be viewed as an extension of the partial correlation-based characterization of CI for Gaussian variables. For one-dimensional variables X and Y, suppose that (X, Y, Z) is jointly Gaussian; then the partial correlation coefficient ρ_{X,Y|Z} (the correlation between the residuals e_X and e_Y resulting from the linear regressions of X on Z and of Y on Z, respectively) equals 0 if and only if X ⊥ Y | Z. As in characterization (B), if we construct functions f̃ and g̃ from the corresponding L² spaces through nonlinear regression, for example, f̃(X, Z) = f(X) − E[f(X) | Z] for any function f ∈ L²_X, then claim (ii) implies that X ⊥ Y | Z is equivalent to the vanishing of the correlation between any "residual" function of (X, Z) given Z and that of (Y, Z) given Z.
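As a concrete illustration of the Gaussian special case in the note above, the partial correlation test can be sketched in a few lines of Python. This is a minimal sketch, not a reference to any particular package:

```python
import numpy as np
from scipy import stats

def partial_correlation_test(x, y, Z):
    """Test X independent of Y given Z for jointly Gaussian data via the
    partial correlation of the residuals from linearly regressing X on Z
    and Y on Z."""
    n = len(x)
    # Add an intercept column and regress X and Y on Z by least squares.
    Zc = np.column_stack([np.ones(n), Z])
    e_x = x - Zc @ np.linalg.lstsq(Zc, x, rcond=None)[0]
    e_y = y - Zc @ np.linalg.lstsq(Zc, y, rcond=None)[0]
    r = np.corrcoef(e_x, e_y)[0, 1]          # partial correlation rho_{X,Y|Z}
    # Fisher z-transform; under H0 the statistic is approximately N(0, 1).
    z_stat = np.arctanh(r) * np.sqrt(n - Z.shape[1] - 3)
    p = 2 * stats.norm.sf(abs(z_stat))
    return r, p
```

Under the Gaussian assumption the null distribution is known exactly, which is precisely what the nonparametric characterizations above try to achieve without that assumption.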
(C) Characterization of CI can also be given in terms of cross-covariance operators on reproducing kernel Hilbert spaces (RKHSs; Fukumizu, Gretton, Sun, & Schölkopf, 2008). Let k_X be a positive definite kernel on 𝒳, and H_X be its corresponding RKHS. Define k_Y, H_Y, k_Z, and H_Z similarly. The cross-covariance operator Σ_XY : H_Y → H_X is defined by

⟨f, Σ_XY g⟩ = E[f(X) g(Y)] − E[f(X)] E[g(Y)]

for all f ∈ H_X and g ∈ H_Y. Thereby the conditional cross-covariance operator of (X, Y) given Z can be presented as Σ_{XY|Z} = Σ_XY − Σ_XZ Σ_ZZ^{−1} Σ_ZY. With characteristic kernels (kernels whose mean embeddings separate any two probability distributions P_X and Q_X of X), the vanishing of the conditional cross-covariance operator characterizes CI. Note: In K. Zhang, Peters, Janzing, and Schölkopf (2012), the authors point out that though characterization (B) is intuitively more appealing, characterization (C) is more useful, since in (B) one has to consider all functions in L², whereas (C) exploits the RKHSs corresponding to some characteristic kernels, which usually are much smaller.
In fact, if we restrict the functions in claim (ii) of (B) to the RKHSs H_Ẍ (with the extended variable Ẍ = (X, Z)) and H_Z, respectively, then (B) reduces to (C).
(D) Conditional mutual information (CMI) can also be applied to characterize CI. The CMI of random variables X and Y given Z is defined as

I(X; Y | Z) = ∫∫∫ p_{X,Y,Z}(x, y, z) log [ p_{X,Y|Z}(x, y | z) / ( p_{X|Z}(x | z) p_{Y|Z}(y | z) ) ] dx dy dz.

From the definition of CMI it is immediately clear that X ⊥ Y | Z if and only if I(X; Y | Z) = 0. Daudin (1980) also defined weak CI, which can be considered a mean property of CI.
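The defining property I(X; Y | Z) = 0 under CI can be checked directly on discrete toy distributions with a plug-in computation; the following illustrative Python sketch evaluates the CMI sum for a fully specified joint probability mass function (the function name is ours, not from any package):

```python
import numpy as np

def cmi_plugin(p):
    """Plug-in conditional mutual information I(X;Y|Z) for a joint pmf
    p[x, y, z] given as a 3-d array that sums to 1."""
    pz = p.sum(axis=(0, 1))    # p(z)
    pxz = p.sum(axis=1)        # p(x, z)
    pyz = p.sum(axis=0)        # p(y, z)
    I = 0.0
    for x in range(p.shape[0]):
        for y in range(p.shape[1]):
            for z in range(p.shape[2]):
                if p[x, y, z] > 0:
                    # p(x,y,z) p(z) / (p(x,z) p(y,z)) = p(x,y|z)/(p(x|z)p(y|z))
                    I += p[x, y, z] * np.log(
                        p[x, y, z] * pz[z] / (pxz[x, z] * pyz[y, z]))
    return I
```

Building p as p(x | z) p(y | z) p(z) yields a CMI of exactly zero, while tying X to Y given Z yields a strictly positive value.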

| Weak CI
(E) Following the expression used in Q. , the weak CI between X and Y given Z is characterized by zero expected conditional covariances of square integrable functions, that is,

E[ cov( f(X), g(Y) | Z ) ] = 0 for all f ∈ L²_X and g ∈ L²_Y,

where ñ_f(X, Z) = f(X) − E[f(X) | Z] and ñ_g(Y, Z) = g(Y) − E[g(Y) | Z] are the regression residuals of f(X) and g(Y), respectively. We can see from the above that weak CI is equivalent to the uncorrelatedness of the residuals ñ_f(X, Z) and ñ_g(Y, Z) for all f ∈ L²_X and g ∈ L²_Y. Obviously, CI implies weak CI but the converse is not true. Though tests based on weak CI cannot fully characterize CI, they potentially benefit from a simpler procedure and improved power.

| SPECIFIC CI TEST GROUPS
In this section, we give a brief overview of some state-of-the-art nonparametric CI testing methods proposed in the literature, grouped according to their data assumptions and methodologies.

| Discretization-based tests
This type of test discretizes the domain of Z into {z_1, z_2, …, z_k} in some optimal fashion, resulting in a partition of the data set into k disjoint subsets, thereby transforming the CI test into unconditional ones, X ⊥ Y | Z = z_i for all i. This idea resembles the characterization of CI when Z is discrete. For instance, Margaritis (2005) obtains a posterior probability of independence from each subset; the test in T. M. Huang (2010) is constructed using an estimator of a weighted average of ρ²(X, Y | Z = z_i), where ρ(X, Y | Z = z) is the maximal nonlinear conditional correlation of X and Y given Z = z. Inevitably, discretization suffers from the curse of dimensionality, as a high-dimensional conditioning set Z requires a dramatically large sample size (Muandet, Fukumizu, Sriperumbudur, & Schölkopf, 2017). On the other hand, this type of CI test is expected to work well if the joint distribution of (X, Y, Z) can be approximated well by a certain discretization of Z.
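The discretize-then-test recipe can be sketched as follows. This toy version, which is illustrative only and not the procedure of any specific paper, uses equal-frequency bins, a within-bin Pearson correlation test (so it only detects linear within-stratum dependence), and Fisher's method to combine the per-bin p-values:

```python
import numpy as np
from scipy import stats

def discretized_ci_test(x, y, z, k=5):
    """Toy discretization-based CI test for scalar X, Y, Z: partition Z
    into k equal-frequency bins, test X-Y correlation within each bin,
    and combine the per-bin p-values with Fisher's method."""
    edges = np.quantile(z, np.linspace(0, 1, k + 1))
    bins = np.clip(np.searchsorted(edges, z, side="right") - 1, 0, k - 1)
    pvals = []
    for b in range(k):
        m = bins == b
        if m.sum() > 3:
            pvals.append(stats.pearsonr(x[m], y[m])[1])
    # Fisher's combination: -2 sum(log p) ~ chi-square with 2*len(pvals) df.
    stat = -2 * np.sum(np.log(pvals))
    return stats.chi2.sf(stat, 2 * len(pvals))
```

The sketch makes the dimensionality problem concrete: with a d-dimensional Z and k bins per coordinate, k^d cells would be needed, each requiring enough samples for a stable within-cell test.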

| Metric-based tests
The intuition behind this category is the observation in (A): extending measures of marginal independence to the conditional setting. For example, Su and White (2008) and M. Huang, Sun, and White (2016) use the weighted Hellinger distance and a family of generically comprehensively revealing functions, respectively, to measure the distance between p_{X,Y,Z} p_Z and p_{X,Z} p_{Y,Z}; Su and White (2007) and Su and White (2014) measure the distance between p_{Y|X,Z} and p_{Y|Z} by applying conditional characteristic functions and the smoothed empirical likelihood ratio, respectively. The conditional distance independence test (CDIT) proposed by Wang, Pan, Hu, Tian, and Zhang (2015) is also based on conditional characteristic functions, like the method in Su and White (2007), but CDIT measures the distance between p_{X,Y|Z} and p_{X|Z} p_{Y|Z}. Compared with Su and White (2007), the definition of CI used by CDIT can be handled and computed more easily. Runge (2018) derives a CMI test based on the result in (D), which is in fact a property of the Kullback-Leibler divergence from p_{X,Y|Z} to p_{X|Z} p_{Y|Z}. Through a nearest-neighbor approach, the test also adapts efficiently to the non-smooth distributions arising from strongly nonlinear dependencies. However, this category involves the explicit estimation of the conditional densities or their associated variants, which is usually not an easy task; hence the power of these tests also deteriorates rapidly under the curse of dimensionality (K. Zhang et al., 2012).

| Permutation-based two-sample tests
Tests in this category also resort to the observation in (A), but instead of measuring the distance between p_{X,Y,Z} and p^{CI}_{X,Y,Z} ≜ p_{X|Z} p_{Y|Z} p_Z, they generate samples from p^{CI}_{X,Y,Z} and compare them with the original data set, which is sampled from the true joint distribution p_{X,Y,Z}. For instance, Doran, Muandet, Zhang, and Schölkopf (2014) propose a kernel CI permutation test (KCIPT). They use a specific permutation of the samples to simulate data from p^{CI}_{X,Y,Z}, which unfortunately requires solving a time-consuming linear program, and then perform a kernel-based two-sample test (Gretton, Borgwardt, Rasch, Schölkopf, & Smola, 2012). Lee and Honavar (2017) observe that as the number of bootstraps increases, KCIPT obtains higher power but suffers a loss of calibratedness. The authors then propose the self-discrepancy CI test, with a parameter-free (for a given choice of kernel and distortion measure) test statistic based on a modified unbiased estimate of the maximum mean discrepancy (Gretton et al., 2012), which is the largest difference, over the unit ball of an RKHS, in empirical expectations over features of the given sample and its permuted counterpart. Sen, Suresh, Shanmugam, Dimakis, and Shakkottai (2017) first use a nearest-neighbor bootstrap to generate data from p^{CI}_{X,Y,Z}, and then apply binary classifiers to achieve the CI testing goal. As these tests essentially have the flavor of nearest neighbors, using distance metrics that weight each variable equally, the required sample size increases dramatically when the conditioning set becomes larger (Aggarwal, Hinneburg, & Keim, 2001).
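The local-permutation idea underlying this category can be illustrated with a simple sketch: permuting Y only among samples with similar Z values approximately simulates draws from p_{X|Z} p_{Y|Z} p_Z. The fixed-block construction below is a crude stand-in for the nearest-neighbor and kernel-weighted schemes used by the actual tests, and the statistic is left generic:

```python
import numpy as np

def local_permutation_pvalue(x, y, z, stat_fn, n_perm=200, k=10, seed=0):
    """Sketch of a permutation-based CI test: permute Y within blocks of
    k samples that are adjacent in Z-order, so the permuted data mimic
    p_X|Z p_Y|Z p_Z, then compare an independence statistic on the
    original data against its permutation distribution."""
    rng = np.random.default_rng(seed)
    order = np.argsort(z)
    t_obs = stat_fn(x, y)
    count = 0
    for _ in range(n_perm):
        y_perm = y.copy()
        for start in range(0, len(z), k):
            blk = order[start:start + k]
            y_perm[blk] = y[rng.permutation(blk)]
        count += stat_fn(x, y_perm) >= t_obs
    # Add-one correction keeps the p-value strictly positive.
    return (1 + count) / (1 + n_perm)
```

Any bivariate dependence statistic can be plugged in as `stat_fn` (absolute correlation below, an MMD or kernel statistic in the published tests).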

| Kernel-based tests
A Hilbert space embedding of a distribution, that is, a kernel mean embedding, has recently emerged as a powerful tool for statistical inference; see Muandet et al. (2017) for a review. Kernel-based methods generally show strong empirical performance in the presence of the curse of dimensionality, and estimating the kernel mean embedding is easier than estimating the distribution itself (Muandet et al., 2017).
The Hilbert-Schmidt independence criterion (HSIC) was initially proposed for testing unconditional independence. It can be interpreted in two ways: (a) as a measure of the distance in the RKHS between the embedding of the joint distribution of X and Y, p_{X,Y}, and the embedding of the product of their respective marginal distributions, p_X p_Y; (b) as the Hilbert-Schmidt norm of the cross-covariance operator of (X, Y). Fukumizu et al. (2008) extend HSIC to the CI setting by utilizing the Hilbert-Schmidt norm of the conditional cross-covariance operator; the corresponding CI characterization is summarized in (C). However, the null distribution of the test statistic in Fukumizu et al. (2008) is unknown, and its approximation relies on a computationally expensive resampling scheme.
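For intuition, the (biased) empirical HSIC statistic for unconditional independence, interpretation (b) above, can be computed as tr(KHLH)/n² from centered Gram matrices; a minimal sketch with Gaussian kernels:

```python
import numpy as np

def rbf_gram(a, sigma):
    """Gaussian kernel Gram matrix for a 1-d sample."""
    d2 = (a[:, None] - a[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma ** 2))

def hsic(x, y, sigma=1.0):
    """Biased empirical HSIC for scalar samples: tr(KHLH)/n^2, the squared
    Hilbert-Schmidt norm of the empirical cross-covariance operator."""
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    K, L = rbf_gram(x, sigma), rbf_gram(y, sigma)
    return np.trace(K @ H @ L @ H) / n ** 2
```

The statistic is nonnegative, near zero for independent samples, and larger under dependence; a full test additionally needs the null distribution (permutation or a gamma approximation), which is omitted here.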
An excellent representative of this test category is the kernel CI test (KCIT) proposed by K. Zhang et al. (2012). KCIT is motivated by the characterization of CI in (B). As depicted in the previous section, (C) can be viewed as a relaxation of (B) obtained by using a smaller yet sufficiently rich class of functions from certain RKHSs. Intuitively, KCIT works by testing for vanishing correlation between residual functions in RKHSs. A prominent advantage of KCIT is that the null distribution of the test statistic is derived and can be estimated efficiently. Nonetheless, as KCIT involves computing matrix inverses, which scale strictly worse than O(N²), it suffers from a lack of scalability, placing it out of range of large-sample-size analysis and, therefore, of high-dimensional settings.
To remedy the aforementioned issue, Strobl, Zhang, and Visweswaran (2019) utilize random Fourier features to approximate KCIT and thereby propose two tests, namely the randomized conditional independence test (RCIT) and the randomized conditional correlation test (RCoT). RCIT explores the partial cross-covariance matrix of (X, Z) and Y, and RCoT that of X and Y, after subjecting the variable sets to nonlinear transformations and then nonlinearly regressing out the effect of Z. Simulations show that both RCIT and RCoT perform similarly to KCIT but are orders of magnitude faster in large-scale settings. RCoT is also related to the kernelized two-step CI test in H. Zhang, Zhou, et al. (2017), which we will discuss later.
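The key ingredient here, random Fourier features, and the resulting residual cross-covariance statistic can be sketched as follows. This is a simplified illustration, not the published implementation: RCIT and RCoT additionally approximate the null distribution of the statistic (a weighted sum of chi-squares), which is omitted:

```python
import numpy as np

def rff(a, n_feat, sigma, rng):
    """Random Fourier features approximating a Gaussian kernel:
    phi(a) = sqrt(2/m) cos(a W + b), W ~ N(0, 1/sigma^2), b ~ U[0, 2*pi]."""
    a = np.atleast_2d(a.T).T                      # ensure shape (n, d)
    W = rng.normal(0, 1 / sigma, (a.shape[1], n_feat))
    b = rng.uniform(0, 2 * np.pi, n_feat)
    return np.sqrt(2 / n_feat) * np.cos(a @ W + b)

def rcot_style_stat(x, y, z, m=25, sigma=1.0, seed=0):
    """RCoT-flavored statistic sketch: map X, Y, Z to random Fourier
    features, linearly regress the X- and Y-features on the Z-features
    (a nonlinear regression on Z), and take the Frobenius norm of the
    residual cross-covariance matrix."""
    rng = np.random.default_rng(seed)
    fx, fy, fz = (rff(v, m, sigma, rng) for v in (x, y, z))
    fz = np.column_stack([np.ones(len(x)), fz])
    rx = fx - fz @ np.linalg.lstsq(fz, fx, rcond=None)[0]
    ry = fy - fz @ np.linalg.lstsq(fz, fy, rcond=None)[0]
    C = rx.T @ ry / len(x)
    return np.linalg.norm(C)
```

Because the feature maps are finite (m features rather than an N × N Gram matrix), the whole computation is linear in the sample size, which is the source of the reported speedups.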

| Regression-based tests
In causal discovery, the observed data are often assumed to be generated under the additive noise model (ANM): Y = f(X) + ε, where ε is a random noise term that is independent of X (Hoyer, Janzing, Mooij, Peters, & Schölkopf, 2009; J. Peters, Janzing, & Schölkopf, 2011; Shimizu, Hoyer, Hyvärinen, & Kerminen, 2006). CI under the ANM can then be characterized in the following way (H. Zhang, Zhou, & Guan, 2018): assume X and Y are deterministic functions of Z plus additive noise terms, that is, X = f(Z) + n_x and Y = g(Z) + n_y, where n_x and n_y are zero-mean random variables independent of Z; then X ⊥ Y | Z if and only if n_x ⊥ n_y. Therefore, tests in this category (Hoyer et al., 2009; Jonas Peters et al., 2014; H. Zhang, Zhou, & Guan, 2018) proceed mainly in two steps: (a) regressing X on Z and Y on Z, and then (b) testing for the unconditional independence between the residuals. However, if the data depart from the ANM assumption, which is not straightforward to verify, the two-step procedures can lead to a substantial inflation of false discoveries (H. Zhang et al., 2018). To address this problem, H. Zhang et al. (2018) propose the KRESIT test, in which feature representations of X and Y are constructed before the regressions in the first step. They explain that the feature representation of the responses in regression is their key strategy, as it greatly relaxes the modeling assumptions. The authors also point out that KRESIT is closely related to RCoT (Strobl et al., 2019), as the two tests essentially compute the same test statistic but use different estimation procedures for the asymptotic null distribution. Note that the two-step tests are effectively testing for weak CI as defined in characterization (E), since they focus on the individual effects of the conditioning variable Z on X and on Y. With the ability to incorporate nonlinearities, this category can also be considered an extension of partial correlation.
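The two-step procedure can be sketched as below, with Nadaraya-Watson kernel regression for step (a) and, for brevity, only a linear correlation test of the residuals in step (b); a full implementation would use an omnibus independence test such as HSIC in step (b), and the function names here are illustrative, not from any package:

```python
import numpy as np
from scipy import stats

def nw_residuals(v, z, bw=0.3):
    """Nadaraya-Watson kernel regression residuals of v on scalar z."""
    w = np.exp(-(z[:, None] - z[None, :]) ** 2 / (2 * bw ** 2))
    fitted = (w @ v) / w.sum(axis=1)
    return v - fitted

def anm_two_step_test(x, y, z, bw=0.3):
    """Two-step regression-based CI test sketch under the ANM
    X = f(Z) + n_x, Y = g(Z) + n_y: nonparametrically regress X and Y
    on Z, then test the residuals for (here, only linear) dependence."""
    rx, ry = nw_residuals(x, z, bw), nw_residuals(y, z, bw)
    return stats.pearsonr(rx, ry)[1]
```

The sketch also makes the failure mode visible: if the data do not follow an ANM in Z, the residuals can be dependent even under CI, which is the false-discovery inflation noted above.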
H. Zhang, Zhou, et al. (2017) propose a CI test that relaxes the test of X ⊥ Y | Z to simpler unconditional independence tests under the ANM assumption. The residual prediction test in Heinze-Deml, Peters, and Meinshausen (2018) performs a regression of X on Z using an appropriate basis expansion (if one is not available, random features are used as an approximation), and then tests for dependence between the scaled residuals and (Y, Z).
Recently, Shah and Peters (2019) propose the generalized covariance measure (GCM) test. For univariate X and Y, instead of testing for (unconditional) independence between the residuals from regressing X and Y on Z, the GCM tests for vanishing correlation, that is, a normalized covariance, between the residuals. The GCM test can also be extended to handle settings where X and Y are potentially high dimensional. Empirical comparisons show that the GCM test performs well when the ANM assumption is violated.
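For univariate X and Y, the GCM statistic is simple to compute once the two regressions have been carried out; the following sketch implements the normalized-covariance form described above, taking the residuals as given (they may come from any sufficiently accurate regression method):

```python
import numpy as np
from scipy import stats

def gcm_test(resid_x, resid_y):
    """Generalized covariance measure sketch (univariate case): given
    residuals from regressing X on Z and Y on Z, the normalized sample
    covariance of the residual products is approximately N(0, 1) under
    H0, giving a two-sided p-value."""
    R = resid_x * resid_y
    n = len(R)
    T = np.sqrt(n) * R.mean() / np.sqrt((R ** 2).mean() - R.mean() ** 2)
    return T, 2 * stats.norm.sf(abs(T))
```

The appeal is that the null distribution is a standard normal, so no resampling is needed; the burden is shifted entirely onto the quality of the two regressions.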
Recall the aforementioned argument of Shah and Peters (2019) that there is no universally valid CI test, so domain knowledge should be considered when seeking an appropriate test for given data. However, obtaining knowledge of the entire joint distribution in order to find the suitable test is usually challenging. In this context, regression-based tests have an extra advantage: they convert the problem of guessing the whole joint distribution into that of selecting regression procedures, a more familiar task for both practitioners and researchers. Note that the validity of the GCM is proven to rely almost entirely on the weak requirement that the regression methods can estimate the conditional means of X given Z and of Y given Z at a slow rate (Shah & Peters, 2019).

| Other tests
There surely exist many other CI tests based on IID data. For example, Song (2009) provides a slightly weaker test, which tests the hypothesis X ⊥ Y | λ_θ(Z), where λ_θ(·) is a scalar-valued function known up to an estimable parameter θ. Bergsma (2010) derives a CI test based on a new concept, the partial copula.
| Tests with non-IID data

Granger causality, proposed in Granger (1969), has been widely used in economics since the 1960s and has gained growing interest in neuroscience over recent years. The Granger causality test is a statistical hypothesis test for determining whether one time series is useful in forecasting another, which is a form of CI (Florens & Fougere, 1996). Most of the existing CI tests are derived under an IID assumption; when encountering non-IID data, these tests will be biased. There are a few nonparametric CI tests designed for time series data, including the ones in Su and White (2007), the copula-based test in Bouezmarni, Rombouts, and Taamouti (2012), and the characteristic function-based test in Wang and Hong (2018). Flaxman, Neill, and Smola (2015) highlight the advantage of incorporating non-IID observations in causal inference. The authors state that if spatial and temporal variation is properly accounted for, the resulting causal graphs can be more reasonable. A regression-based CI test is proposed in Flaxman et al. (2015) to deal with non-IID data.

| EMPIRICAL COMPARISON
We next examine the empirical performances of some representative CI tests, including KCIT by K. Zhang et al. (2012), CDIT by Wang et al. (2015), RCIT and RCoT by Strobl et al. (2019), CCIT by Sen et al. (2017), KRESIT by H. Zhang et al. (2018), CMIknn by Runge (2018), and GCM by Shah and Peters (2019). Codes of all these tests are freely available online. For CMIknn, the option "transform = 'ranks'" is used as in Runge (2018). CCIT is run with 30 bootstrap iterations, in compliance with the example in its readme file. As the default setting of KRESIT is too time consuming, we adjust the parameters to "num_shuffles=300, max_iter=2000, learning_rate=0.05." For other packages, we use their default settings.

FIGURE 1 Test performance comparisons with the post-nonlinear noise model and a setup similar to that in Strobl et al. (2019). Shown are KS (left column), AUPC (center column), and runtime (right column) for an experiment with d_Z = 1 (top row), d_Z = 5 (middle row), and d_Z = 10 (bottom row). The original CPU time t is transformed as log(t + 1) for calculating runtime. The horizontal axis corresponds to the sample size. Error bars give the bootstrapped standard errors. AUPC, area under the power curve; KS, Kolmogorov-Smirnov.
To evaluate the statistical power, we add the same small error ε_b ~ N(0, 0.8²) to both X and Y as follows: X = g_1(ε_b + ε_1), Y = g_2(ε_b + ε_2), Z ~ N(0, I_{d_Z}), where I_{d_Z} is a d_Z-dimensional identity matrix. In this case, we have X ⊥̸ Y | Z. All data are normalized.
The comparisons are done for sample sizes 50, 100, 200, 400, 700, and 1,000, with the conditioning dimension d_Z taking values in {1, 5, 10}. As in Strobl et al. (2019), we calculate the Kolmogorov-Smirnov (KS) statistic as a metric to quantify how uniform the distribution of p-values is under the null hypothesis; a smaller KS value indicates a better-calibrated test. We evaluate the test power by computing the area under the power curve (AUPC), which is equivalent to the area under the empirical CDF of the p-values returned by a given CI test when the null does not hold. Thus, tests with AUPC closer to 1 have higher power.
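Both evaluation metrics can be computed directly from simulated p-values; a minimal sketch (function names are ours):

```python
import numpy as np
from scipy import stats

def ks_uniform(pvals_null):
    """KS distance between the empirical distribution of null p-values
    and Uniform(0, 1); smaller means better calibration."""
    return stats.kstest(pvals_null, "uniform").statistic

def aupc(pvals_alt):
    """Area under the power curve: the exact area under the step ECDF of
    p-values obtained when H0 is false; closer to 1 means higher power."""
    p = np.sort(pvals_alt)
    n = len(p)
    ecdf_vals = np.arange(1, n + 1) / n
    # Integrate the step function F over [0, 1]: segment widths between
    # consecutive sorted p-values (and the endpoints 0 and 1) times the
    # ECDF height on each segment.
    widths = np.diff(np.concatenate(([0.0], p, [1.0])))
    heights = np.concatenate(([0.0], ecdf_vals))
    return float(np.sum(widths * heights))
```

A perfectly powerful test (all alternative p-values at 0) gives AUPC 1, while p-values uniform on [0, 1] give AUPC 0.5 and a near-zero KS statistic.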
The simulation results are shown in Figure 1. All metrics were evaluated over 100 realizations of the model, and error bars give the bootstrapped standard errors. KCIT performs well in most cases except at small sample sizes. RCIT and RCoT tend to give similar results, but RCoT outperforms RCIT as d_Z increases. As RCIT and RCoT are approximation based, they are unreliable for sample sizes under 100. CDIT seems inapplicable to high-dimensional cases: in Wang et al. (2015), simulations are conducted only for d_Z ≤ 2, and in our experiment CDIT has AUPC below 0.1 for all sample sizes when d_Z = 10, so the CDIT curve is not shown in Figure 1h. The performance of CCIT might be improved by increasing the number of bootstrap iterations. And since CCIT involves splitting the input dataset into two groups to construct data of two classes (after which one can choose a classification method for performing the CI test), it has low AUPC for small sample sizes. The strength of CCIT lies in its scalability to high-dimensional data with d_Z > 20, which is out of the scope of our experiments. CMIknn and GCM are comparatively more stable, in the sense that they are well calibrated in almost all circumstances. KRESIT is found to be very time consuming; it takes nearly 2 hr to test one dataset of sample size 1,000, so we only provide simulation results of KRESIT for sample sizes 50, 100, 200, and 400. In these experiments, KRESIT exhibits impressively high performance. Tests with low KS statistics may also have low AUPCs, and tests with high AUPCs may have high KS statistics. For a more comprehensive criterion, Table 1 lists the top three Matthews correlation coefficients (MCCs) under different settings (with significance level α = .05).
Since a main application of CI tests is graphical model structure learning, where a selected CI test would be used for testing data with different d_Z's at the same time, it is reasonable to pay more attention to the bottom row of Table 1, where MCCs are calculated by pooling together the p-values of a given sample size across all d_Z's. The results indicate that, for n around 50, CMIknn and GCM outperform the others. For n ranging from 100 to 400, KRESIT is not only the top test in the last row; it is the top test, or very close to it, in all other cases. But if runtime is of prime concern, CMIknn and RCoT are preferable. For data of larger sizes, such as n equal to 700 or 1,000, RCoT could be the best choice. Note that KRESIT is not compared in these cases because of its computational burden, yet we can expect it to perform well: as mentioned in Section 3.1.5, KRESIT and RCoT essentially estimate the same test statistic, and RCoT can be viewed as a large-scale approximation of KRESIT (H. Zhang et al., 2018).

| CONCLUSION
Since no nontrivial CI test can control the Type-I error over the entire class of distributions satisfying CI, users of nonparametric CI tests need both an understanding of the available CI tests and knowledge of the problem at hand in order to choose the right test. As an effort to serve this need, we provide this brief review of nonparametric CI tests. The review begins with a brief introduction to classical applications and the development of CI tests, then provides a short review of the theoretical background of CI and discusses different angles for constructing nonparametric CI tests. Finally, we conduct a simulation study comparing the performance of several popular tests.
From the empirical comparison, we see that CI testing is still an open challenge. Driven by the growing interest in probabilistic graphical models and causal inference, CI tests with lower computational complexity and the ability to solve higher dimensional problems continue to attract researchers' interest.
Recently, some new ideas have emerged. Burkart and Király (2017) address CI testing by establishing a theoretical link between CI testing and supervised learning, which enables extensively studied supervised learning workflows to be transferred directly to independence testing workflows. Sen, Shanmugam, Asnani, Rahimzamani, and Kannan (2018) cast the problem under a so-called meta-algorithm, "Mimic and Classify," which (a) mimics the CI distribution closely enough to recover its support, and (b) classifies to distinguish the joint and the CI distributions. By doing so, CI testing can be handled by methods from modern advances in deep learning. For further reading on CI testing and related domains, we direct the reader to Pearl (1998), Friedman (2009), Scutari (2010), and Murphy (2012).

CONFLICT OF INTEREST
The authors have declared no conflicts of interest for this article.