qad: An R‐package to detect asymmetric and directed dependence in bivariate samples

Correlations belong to the standard repertoire of ecologists for quantifying the strength of dependence between two random variables. Classical dependence measures are usually not capable of detecting non‐monotonic or non‐functional dependencies. Furthermore, they completely fail to detect asymmetry and direction in dependence, which exist in many situations and should not be ignored. In this paper, we present qad (short for quantification of asymmetric dependence), a nonparametric statistical method to quantify directed and asymmetric dependence of bivariate samples. Qad is applicable in general (e.g. linear, non‐linear, or non‐monotonic) situations, is sensitive to noise in data, exhibits a good small sample performance, detects asymmetry in dependence, shows high power in testing for independence, requires no assumptions regarding the underlying distribution of the data and reliably quantifies the information gain/predictability of quantity Y given knowledge of quantity X, and vice versa (i.e. q(X,Y) ≠ q(Y,X)). Here, we briefly recall the methodology underlying qad and introduce the functions of the R‐package qad, which returns estimates for the measures q(X, Y), denoting the directed dependence of Y on X (or, equivalently, the influence of X on Y), q(Y, X), denoting the directed dependence of X on Y, and a(X, Y) ≔ q(X, Y) − q(Y, X), denoting the asymmetry in dependence. Furthermore, qad can be used to predict Y given knowledge of X, and vice versa. Additionally, we compare the empirical performance of qad with that of seven other well-established measures and demonstrate the applicability of qad on ecological datasets. We illustrate that direction and asymmetry in dependence are universal properties of bivariate associations. Qad thus provides additional information gain, avoids model bias and will therefore advance and facilitate the understanding of ecological systems.

be used for continuous data, whereas Spearman's ρ is advised for data on the ordinal scale. The two measures, however, capture different aspects of bivariate distributions: Pearson's r quantifies how linear a relationship is, whereas Spearman's ρ measures the extent of monotonicity.
Additional insight may be gained by considering other less frequently applied or less well-known dependence measures. Examples are distance correlation (dCor; Székely et al., 2007), which is implemented in the R-package energy, and the information-theoretic maximal information coefficient (MIC; Reshef et al., 2011; R-package minerva). Very recent developments are the asymmetric dependence measures xicor (Chatterjee, 2021) and quantification of asymmetric dependence (qad; Junker et al., 2021).
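The difference between the two classical measures can be illustrated with a minimal sketch. It is written in plain Python rather than R, the helper names (`ranks`, `pearson`, `spearman`) are hypothetical, and the rank helper assumes a sample without ties. For a strictly increasing but strongly non-linear relationship, the rank-based coefficient reaches 1 while Pearson's r stays clearly below 1.

```python
import math

def ranks(values):
    """Return the ranks 1..n of the values (sketch: assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result

def pearson(x, y):
    """Pearson's r: covariance divided by the product of standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman's rho computed as Pearson's r of the ranks."""
    return pearson(ranks(x), ranks(y))

# Strictly increasing, strongly non-linear relationship y = exp(x).
x = [0.5 * i for i in range(1, 21)]
y = [math.exp(v) for v in x]

r_value = pearson(x, y)      # linearity: clearly below 1
rho_value = spearman(x, y)   # monotonicity: equal to 1 up to rounding
print(f"Pearson r = {r_value:.3f}, Spearman rho = {rho_value:.3f}")
```

Both coefficients are symmetric in their arguments, so neither can distinguish the direction X to Y from the direction Y to X.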
In recent years, the usefulness of symmetric dependence measures for inferring the structure of complex systems or causality in bivariate associations has been debated and potential biases have been discussed (see, for instance, Zhang et al. (2015), Wang and Huang (2014), Okimoto (2008), Hirano and Takemoto (2019)). Thus, the concept of asymmetry/direction in dependence, which exists in most situations, should not be ignored in data analysis. In a linear setting, the dependence between two variables X and Y is indeed symmetric (Figure 1a) in the sense that Y can be predicted from X as well as X can be predicted from Y; in more complex relationships, however, the situation is different. For instance, for a two-dimensional sample in the form of a parabola (Figure 1b) or a sinusoidal curve (Figure 1c), the dependence structure is clearly asymmetric.
In these cases, knowing the value of the variable X strongly improves the predictability of Y, whereas in the other direction, the information gain is significantly smaller. As an example, consider the year of deglaciation along a glacier forefield and plant diversity (Junker et al., 2020). Naturally, the year of deglaciation has a strong influence on plant diversity (not vice versa), and this directed dependence structure is clearly captured by qad (Figure 1d). Especially in cases where no a priori knowledge about the causal relationship is available, directional dependence is a useful measure for exploring and estimating the association between two random variables in a more detailed and more realistic way than classical (symmetric) dependence measures. Moreover, qad will provide more detailed insights into the structure of communities and functional linkages between organisms or individuals and may thus assist network inference. The limitations of standard methods (e.g. Spearman's correlation coefficient) in network inference have recently been pointed out (Coenen & Weitz, 2018) and directed and asymmetric approaches have been demanded (Amblard & Michel, 2011; Carr et al., 2019; Karmon & Pilpel, 2016).
Here, we present the method qad, a nonparametric and directed, hence asymmetric, measure of dependence, which is publicly available in the free software environment R (Junker et al., 2021). qad returns estimates for the measures q(X, Y), denoting the directed dependence of Y on X (or, equivalently, the influence of X on Y), q(Y, X), denoting the directed dependence of X on Y, and a(X, Y) ≔ q(X, Y) − q(Y, X), denoting the asymmetry in dependence. The measure a(X, Y) for asymmetry in dependence can be interpreted as the difference between the predictability of Y given knowledge of X and the predictability of X given knowledge of Y. In this paper, we first describe the methodology of qad and demonstrate the application of the R-package qad. Furthermore, we compare the empirical performance of qad with existing publicly available dependence measures and highlight the information gain obtained by considering asymmetry and direction in dependence. A complementary R-shiny app is available as Supporting Information (https://r-qad.shinyapps.io/quantification_of_dependence/), facilitating the interpretation and comparison of the results and performance returned by qad and other dependence measures. An application of qad to real-world data concludes the paper. We hope that this introduction to qad, the comparative analyses and the resources provided will be helpful for ecologists and researchers from other disciplines.
FIGURE 1 (a-c) Samples of size n = 50 drawn from (a) symmetric/undirected as well as (b, c) asymmetric/directed dependence structures. (d) Real-world data representing plant diversity as a function of the estimated year of deglaciation at n = 140 studied plots. In the symmetric setting (a), the knowledge of X provides roughly as much information on Y as vice versa, whereas in the asymmetric and the real-world data settings (b-d), knowing the value of X allows one to predict the value of Y much better than vice versa. Asymmetry in dependence is detected by the dependence measure qad (q(X, Y) and q(Y, X)), whereas Pearson's r and Spearman's ρ are not capable of taking asymmetry in dependence into account.

| BRIEF METHODOLOGICAL DESCRIPTION OF THE COPULA-BASED DEPENDENCE MEASURE qad
Commonly used approaches to quantify the strength of associations between two variables such as correlation or regression capture only a fraction of the information that is contained in the data. In contrast, copulas contain full information about associations and are therefore frequently applied in finance and other disciplines (Ghosh et al., 2020). In the bivariate case, copulas are two-dimensional distribution functions restricted to the unit square with uniformly distributed univariate marginals. The theorem of Sklar (see Nelsen (2007)) allows one to split the joint distribution function H of the random vector (X, Y) into the dependence structure C and the marginal distributions F and G, that is, H(x, y) = C(F(x), G(y)) for every (x, y) ∈ ℝ². The aforementioned dependence structure C is exactly the copula. Since copulas are scale-invariant (see again Nelsen (2007)), it is natural to study scale-free dependence measures on a copula basis. For more background on copulas and their application in dependence modelling, we refer to the books of Nelsen (2007) and Durante and Sempi (2015). The copula-based dependence measure qad, originally introduced as ζ_1 in Trutschnig (2011), is defined as a type of distance between the conditional distribution functions of the copula C underlying the random vector (X, Y) and the uniform distribution representing independence of X and Y. In other words, qad measures how much the dependence structure of (X, Y) differs from independence. Contrary to many other approaches, qad is able to detect both complete dependence (i.e. Y is a function of X) and independence. The method works as follows: Given a two-dimensional sample (x_1, y_1), …, (x_n, y_n) of size n from the random vector (X, Y) (see Figure 2a), the normalized ranks of the sample are calculated first (i.e. we get values of the form (i/n, j/n) for i, j ∈ {1, …, n}). Then the so-called empirical copula Ê_n is computed (see Figure 2b).
As a next step, the empirical copula is aggregated to the empirical checkerboard copula (a two-dimensional histogram in the copula setting). More precisely, the masses of the small squares (empirical copula) are summed up over the larger squares of an N × N grid, whereby the resolution N depends on the sample size n (see Figure 2c,d). Note that by default the resolution of the empirical checkerboard copula is proportional to the square root of the sample size; thus, as for any statistical method, qad results become more reliable as the sample size increases. We recommend a sample size of no smaller than n = 16, resulting in a resolution of N = 4. Finally, the conditional distribution functions of the checkerboard copula are compared with the distribution function of the uniform distribution on the unit interval (in the sense that the area between the graphs is calculated). This step is conducted both for the vertical strips (to calculate the influence of X on Y) and for the horizontal strips (see, for instance, Figure 2e,f). It yields the two directed qad-values q(X, Y) ∈ [0, 1], quantifying the influence of X on Y, and q(Y, X) ∈ [0, 1], quantifying the influence of Y on X.
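The steps described above (normalized ranks, aggregation to a checkerboard grid, area between conditional distribution functions and the uniform distribution) can be sketched as follows. This is an illustrative Python sketch, not the implementation of the R-package qad. It assumes a sample without ties, takes the default resolution as the integer part of the square root of n (consistent with n = 16 giving N = 4), and normalizes the averaged area by the factor 3 so that complete dependence corresponds to the value 1 in the limit, as in the definition of ζ_1 in Trutschnig (2011). All function names are hypothetical.

```python
import math

def ranks(values):
    """Return the ranks 1..n of the values (sketch: assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result

def checkerboard_masses(x, y, resolution):
    """Aggregate the empirical copula (mass 1/n, uniform on a small square
    per observation) to an N x N grid; masses[a][b] belongs to the a-th
    vertical strip and the b-th horizontal strip."""
    n = len(x)

    def share(rank, cell):
        # Fraction of [(rank-1)/n, rank/n] inside [(cell-1)/N, cell/N],
        # computed in integer arithmetic after scaling by n * N.
        low = max((rank - 1) * resolution, (cell - 1) * n)
        high = min(rank * resolution, cell * n)
        return max(0, high - low) / resolution

    masses = [[0.0] * resolution for _ in range(resolution)]
    for rx, ry in zip(ranks(x), ranks(y)):
        for a in range(1, resolution + 1):
            sx = share(rx, a)
            if sx == 0:
                continue
            for b in range(1, resolution + 1):
                masses[a - 1][b - 1] += sx * share(ry, b) / n
    return masses

def segment_area(a, b, width):
    """Integral of |f| for a linear f running from a to b over `width`."""
    if a * b >= 0:
        return width * (abs(a) + abs(b)) / 2
    return width * (a * a + b * b) / (2 * (abs(a) + abs(b)))

def directed_dependence(masses):
    """3 x average area between the strip-wise conditional distribution
    functions (piecewise linear) and the uniform distribution function."""
    resolution = len(masses)
    width = 1 / resolution
    total = 0.0
    for strip in masses:
        strip_mass = sum(strip)
        if strip_mass == 0:
            continue
        cdf, previous_gap, area = 0.0, 0.0, 0.0
        for j, cell_mass in enumerate(strip, start=1):
            cdf += cell_mass / strip_mass
            gap = cdf - j * width
            area += segment_area(previous_gap, gap, width)
            previous_gap = gap
        total += strip_mass * area
    return 3 * total

def qad_sketch(x, y, resolution=None):
    """Return (q(X,Y), q(Y,X), a(X,Y)) for a tie-free sample."""
    if resolution is None:
        resolution = math.isqrt(len(x))  # assumed default: floor(sqrt(n))
    masses = checkerboard_masses(x, y, resolution)
    transposed = [list(column) for column in zip(*masses)]
    q_xy = directed_dependence(masses)      # vertical strips: X on Y
    q_yx = directed_dependence(transposed)  # horizontal strips: Y on X
    return q_xy, q_yx, q_xy - q_yx

# Noise-free parabola with n = 16 (shifted so that y has no ties).
x = [i - 7.3 for i in range(16)]
y = [v * v for v in x]
q_xy, q_yx, asymmetry = qad_sketch(x, y)
print(q_xy, q_yx, asymmetry)
```

For this small parabola the sketch gives q(X, Y) = 0.75 and q(Y, X) = 0.375, so X is more informative about Y than Y is about X. For a strictly increasing sample of the same size it gives 0.875 in both directions, which illustrates how the coarse grid keeps the value below 1 even for a noise-free functional relationship. Values produced by the R-package may differ, since the package handles ties and further details that are omitted here.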
High values indicate strong associations, whereas low values describe weakly dependent random variables. Note that for dependence measures which are strictly positive (e.g. qad), a deviation of the estimate from 0 is to be expected even in the case of independence. As an example, a value of q(X, Y) = 0.2 is common for independent random variables X and Y. Thus, the value of q(X, Y) alone is clearly insufficient for deciding whether or not the sample is likely to come from independent random variables. To overcome this problem, a permutation test is implemented in the R package qad to obtain a p-value for q(X, Y) and q(Y, X) in testing for independence, that is, testing the hypothesis H_0: q(X, Y) = 0 = q(Y, X). Non-significant qad values (p-value > 0.05) therefore provide no evidence against independence. This allows one to interpret the obtained values and puts them into perspective.
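The logic of such a permutation test can be sketched generically. The following Python sketch is not the code of the R-package qad: the function `permutation_p_value` is hypothetical, the "+1" correction in the p-value is one common convention and an assumption here, and the absolute Pearson correlation only serves as a simple stand-in for a dependence statistic such as q(X, Y).

```python
import math
import random

def abs_pearson(x, y):
    """Stand-in dependence statistic (assumption: any statistic that is
    large under dependence could be plugged in instead)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return abs(cov / math.sqrt(var_x * var_y))

def permutation_p_value(x, y, statistic, nperm=200, seed=0):
    """Share of permuted samples whose statistic is at least as large as
    the observed one. Shuffling y breaks any dependence on x, so the
    permuted values mimic the behaviour under independence."""
    rng = random.Random(seed)
    observed = statistic(x, y)
    shuffled = list(y)
    exceed = 0
    for _ in range(nperm):
        rng.shuffle(shuffled)
        if statistic(x, shuffled) >= observed:
            exceed += 1
    return (exceed + 1) / (nperm + 1)

# Perfectly dependent sample: the observed statistic is extreme.
x = list(range(1, 31))
y = list(range(1, 31))
p_dependent = permutation_p_value(x, y, abs_pearson)
print(p_dependent)
```

Because the reference distribution is generated from the sample itself, the p-value accounts for the positive values that the estimator attains under independence at the given sample size.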
Furthermore, if we have q(X, Y) > q(Y, X), then the qad estimator informs us that the variable X provides more information about Y than vice versa. The same holds for the reverse direction. This information is also gathered in the measure for asymmetry, which is computed as a(X, Y) ≔ q(X, Y) − q(Y, X) and can therefore attain values within the interval (−1, 1). Additionally, as a rank-based quantity, qad is robust to outliers and invariant with respect to monotone transformations, for instance, log-transformations.

| APPLICATION OF THE R PACKAGE qad
The package qad is implemented in the software R (R Development Core Team, 2020) and is publicly available on CRAN (https://cran.r-project.org/web/packages/qad/index.html). The development version of qad is accessible via GitHub (https://github.com/griefl/qad).
In the following, we briefly sketch the main functions of the package.
Additionally, each function contains examples in the description, which are called via the R-help function (e.g. ?qad). The following code snippets, which are applied to the data depicted in Figure 1d, sketch the application of qad.

| Calculating the directed dependence measure q
Given bivariate observations (x_1, y_1), …, (x_n, y_n) of size n, the function qad(…) computes the dependence values q(X, Y) and q(Y, X), the maximum dependence (i.e. max(c(q(X,Y), q(Y,X)))), and the asymmetry in dependence a(X, Y). The implemented method qad(…) requires two numeric vectors containing the observations of the sample, or, alternatively, accepts a numeric data frame of the form data.frame(sample_X, sample_Y). The optional argument p.value (default is TRUE) allows one to calculate p-values (based on permutations with nperm runs) for q(X,Y) and q(Y,X). A p-value below 0.05 provides evidence against the hypothesis that X and Y are independent. The output of qad shows the dependence values and their respective p-values as well as further descriptive statistics, for example, the sample size and the number of unique ranks, which are essential in calculating the resolution of the underlying empirical checkerboard copula. The checkerboard resolution is adjustable through the parameter resolution; however, since the output strongly depends on the resolution, we highly recommend using the default setting (resolution = NULL), which uses the optimal choice (optimal in the sense that the estimator performs well independent of the underlying dependence structure; Junker et al., 2021).
For the data depicted in Figure 1d, the resulting values evidently indicate an asymmetric setting (a = 0.157). The additional plots underline the findings and suggest a slightly inverted U-shaped pattern.
FIGURE 2 Illustration of the methodology of qad. (a) Sample of size n = 40 drawn from a slightly noisy U-shaped function. (b) Empirical copula and normalized ranks (points). Note that the masses are uniform on each square and that, by construction of the empirical copula, the upper right corners of the squares are the normalized ranks. (c) Empirical copula and the checkerboard grid with resolution N = 6. (d) Checkerboard aggregation. (e) Distance between the conditional distribution functions of the checkerboard copula and the uniform distribution representing independence, for vertical strips (magenta area depicting the distance for one strip) and (f) for horizontal strips.

| Using qad as a prediction tool
As a by-product of the checkerboard approach, the random variable Y given X = x and the random variable X given Y = y can be predicted for every x in the range of X and every y in the range of Y. This additional feature is implemented in the R-function predict.qad(…). Note that prediction is possible only within the range of measured X and Y values; since qad is calculated independently of a parametric regression function, no extrapolation is possible. In contrast to regression methods and many machine learning algorithms, qad does not return point estimates, but probabilities that values of Y fall in a given range given X (or vice versa). The function predict.qad(…) requires three arguments: a 'qad' object, the conditioning variable and a vector of x-values. The function then returns the probabilities of the event that Y falls into the interval I_j given X = x, or vice versa. The intervals I_j are calculated as the retransformed intervals defining the checkerboard grid, that is, for every j ∈ {1, …, N} the interval I_j is defined as I_j ≔ G_n […] year around 2000 the probability is obviously higher (probability of 0.357). Each of the functions provides several parameters that enable specific adjustments and modifications. For this purpose, we refer to the R documentation or the vignette.
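How conditional probabilities arise from a checkerboard grid can be sketched in a few lines. The following Python sketch is illustrative only and does not reproduce predict.qad(…): the helper names are hypothetical, the checkerboard masses are given by hand for a noise-free parabola with n = 16 and N = 4, and the retransformation of the grid cells to intervals I_j on the original scale of Y is omitted.

```python
def strip_index(sample_x, x_value, resolution):
    """Vertical strip (0-based) that contains the normalized rank of
    x_value; values outside the observed range are rejected because no
    extrapolation is possible."""
    if not min(sample_x) <= x_value <= max(sample_x):
        raise ValueError("x_value lies outside the range of the sample")
    n = len(sample_x)
    count = sum(1 for v in sample_x if v <= x_value)  # n * normalized rank
    # Ceiling of count * resolution / n in integer arithmetic, then 0-based.
    return (count * resolution + n - 1) // n - 1

def conditional_probabilities(masses, strip):
    """Probabilities that Y falls into each horizontal cell, given that X
    lies in the chosen vertical strip (row of masses, renormalized)."""
    row = masses[strip]
    total = sum(row)
    return [mass / total for mass in row]

# Assumed checkerboard masses (rows: strips of X, columns: cells of Y).
masses = [
    [0.0, 0.0, 0.125, 0.125],
    [0.125, 0.125, 0.0, 0.0],
    [0.125, 0.125, 0.0, 0.0],
    [0.0, 0.0, 0.125, 0.125],
]
sample_x = [i - 7.3 for i in range(16)]

probs_low = conditional_probabilities(masses, strip_index(sample_x, -6.0, 4))
probs_mid = conditional_probabilities(masses, strip_index(sample_x, 0.0, 4))
print(probs_low)  # Y in the upper half of its rank range for small x
print(probs_mid)  # Y in the lower half of its rank range for x near 0
```

The output is a probability vector per conditioning value rather than a point estimate, which matches the way predictions are described above.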

| PERFORMANCE AND COMPARISON OF qad WITH OTHER DEPENDENCE MEASURES
The main features of qad compared with seven other well-established dependence measures available in R are summarized in Table 1 and also discussed in Supplementary Information 3. For each measure, we provide information on whether it allows for linear, monotonic or general dependence estimation, whether it is scale-invariant, whether the estimator returns a value in [0,1] and whether it captures asymmetry in dependence. Dependence measures that capture the dependence in nonlinear situations should assign similar scores of dependence to equally noisy data in a manner independent of the concrete functional relationship (Reshef et al., 2011). Accordingly, the measure qad decreases with increasing noise irrespective of the functional relationship between X and Y (see Figure 3a,b,d-f). Note that qad returned dependence values slightly smaller than 1 in functional settings without noise (see Figure 3a,b,d-f), which is directly caused by the checkerboard binning. It is guaranteed, however, that asymptotically qad attains the maximum value 1 in these settings.

| APPLYING qad ON ECOLOGICAL DATA
We tested the qad-package on a dataset of microbiota and additional environmental metadata publicly available at http://ocean-microbiome.embl.de/companion.html (Albanese et al., 2018; de Vargas et al., 2015; Sunagawa et al., 2015; Villar et al., 2015). More precisely, we used […]
Considering non-monotonic and non-functional relationships naturally expands our ability to detect more complex and potentially asymmetric relationships between organisms and their environment. We demonstrated that none of the methods discussed here outperforms all other methods in full generality; every statistical tool exhibits limitations in specific settings. If it is known in advance that the data originate from a linear or a monotonic setting, we recommend classical measures of association such as Pearson's r, Spearman's ρ or dCor. These measures are well established and show greater power in these settings than other methods. In most situations, however, wrongly imposing linearity/monotonicity without prior knowledge may lead to wrong conclusions. We therefore recommend the use of qad for quantifying pairwise dependencies in the general case. We showed that qad is powerful in detecting dependence and provides reliable and easily interpretable results.
Another important property of bivariate associations is asymmetry and direction in dependence in the sense that the predictability of quantity Y given knowledge of quantity X is not the same as vice versa. Considering direction and asymmetry in dependence facilitates the detection and extraction of patterns from ecological datasets and the testing of refined hypotheses. For instance, correlation analysis testing for relationships between the abundance of pairs of taxa is usually performed as a basis for network inference, which, in turn, facilitates the interpretation of, for example, microbiome structure. Ecological relationships between organisms may be reciprocal in the sense that taxa mutually affect each other, either positively (mutualism) or negatively (competition). They may, however, also be directed in such a way that a given taxon facilitates or inhibits the growth of another taxon without being affected itself by the other taxon (e.g. commensalism, amensalism). As shown before, conventional correlation analysis neither detects directed relationships nor discriminates between directed and mutual relationships and is therefore of limited value for the interpretation of community dynamics. We are aware of only two methods that are able to quantify directed dependence, namely qad and xicor (Chatterjee, 2021). We have shown that qad has a higher overall power in detecting deviation from independence; especially in very noisy datasets, qad performs better than xicor. The power deficiency of xicor is also discussed in Shi et al. (2022). Furthermore, the implemented estimator in qad always attains positive values, whereas xicor can attain negative values, which are hard to interpret.

FIGURE 4
Power analysis for different dependence measures. (a-c) Three (noisy) relationships considered in statistical power analysis testing for independence (further results are given in Supplementary Information 3). Empirical power is illustrated for a relationship with vertically added noise following a normal distribution with mean 0 and standard deviation 1. The underlying noise-free relationship is depicted in the top left corner. In each setting, the sample size increases from n = 10 (left) to n = 500 (right) in increments of 10.

FIGURE 5
Application of Pearson's r, Spearman's ρ and qad to a subset of the Tara Oceans dataset. (a, f) Hex bin plot of qad versus Pearson's r (Spearman's ρ) for all pairwise relationships; colour code corresponds to count numbers per hexagonal bin. (b, g) Venn diagrams depicting the number of significant relationships across all pairwise associations. (c) Scatterplot of selected pairwise relationships significant w.r.t. Pearson's r but not w.r.t. qad. The outlier in the top right corner strongly determines the high correlation value, whereas qad is robust to outliers. (d, e, i, j) Scatterplots of selected pairwise relationships (highly asymmetric) which are not significant w.r.t. Pearson's r (or Spearman's ρ) but highly significant w.r.t. the measure qad. In some cases, transforming the axis reveals the underlying dependence structure (e.g. […]
In very large datasets, however, xicor is more efficient with respect to runtime because it uses a p-value based on asymptotic theory, whereas qad runs a permutation test.
In addition, the R-package qad provides user-friendly outputs and several features that facilitate the interpretation of the results, as well as functions to use qad as a prediction tool.
We conclude that the interpretation of ecological data may be strongly biased by the choice of statistical approaches quantifying dependence between two random variables. The acknowledgement and adequate handling of asymmetry, a universal property of bivariate associations, is an important step towards additional information gain and the avoidance of model bias for small, medium and large datasets, and will advance and allow for a deeper understanding of ecological systems.

CONFLICT OF INTEREST
The authors declare no conflict of interest.

PEER REVIEW
The peer review history for this article is available at https://publons.com/publon/10.1111/2041-210X.13951.

DATA AVAILABILITY STATEMENT
All data and supplementary code used in the study can be found at other sources (mentioned at the corresponding paragraphs). The qad package is available for the R programming language and can be downloaded at https://cran.r-project.org/web/packages/qad/index.html. This paper describes the latest CRAN-version of qad (v.1.0.2).