Systems involving many variables are important in population and quantitative genetics, for example, in multi-trait prediction of breeding values and in exploration of multi-locus associations. We studied departures of the joint distribution of sets of genetic variables from independence. New measures of association based on notions of statistical distance between distributions are presented. These are more general than correlations, which are pairwise measures, and lack a clear interpretation beyond the bivariate normal distribution. Our measures are based on logarithmic (Kullback-Leibler) and on relative ‘distances’ between distributions. Indexes of association are developed and illustrated for quantitative genetics settings in which the joint distribution of the variables is either multivariate normal or multivariate-t, and we show how the indexes can be used to study linkage disequilibrium in a two-locus system with multiple alleles and present applications to systems of correlated beta distributions. Two multivariate beta and multivariate beta-binomial processes are examined, and new distributions are introduced: the GMS-Sarmanov multivariate beta and its beta-binomial counterpart.
The analysis of systems involving several variables is of great importance in population and quantitative genetics, for example, in multi-trait prediction of breeding values or in the study of networks using multi-locus data. The problem addressed here is that of measuring departures of the joint distribution of sets of genetic variables from stochastic independence. The target variables could be, for instance, pleiotropic additive effects of many loci under the infinitesimal model of Fisher (1918), correlated allelic frequencies of loci within bins of physical distance defined by numbers of base pairs, or sets of allelic frequencies that vary at random over clusters of individuals or sub-populations. Likewise, a study may involve systems of alleles whose joint distribution departs from independence, due either to proximity within a chromosome (linkage) or to forces such as random drift or selection favoring epistatic combinations, which may also create associations among loci on different chromosomes. When the problem is that of the study of the statistical association between allele states at different loci, this usually is referred to as gametic or ‘linkage disequilibrium’ (LD), with pairwise correlation measures typically employed for characterizing such association. If alleles at two loci are associated due to physical linkage, equilibrium (independence) is attained asymptotically as generations of random mating accrue. LD has received enormous attention in population genetics (e.g., Hill 1974; Hedrick 1987; Lewontin 1988; McVean 2007) and has become an increasingly important topic of research in the light of emerging genomic data; for example, single nucleotide polymorphism markers and sequence data enable the study of the joint evolution of genomic blocks as well as of genome-wide association with complex disease or quantitative traits.
This last area is one in which Professor Morris Soller has made important contributions, for example, a pioneering paper by Soller & Genizi (1978) in marker-assisted selection, and his 1990 work with Weller and Kashi on designs for inferring linkage between markers and quantitative trait loci in dairy cattle. The present paper is in his honor.
The objective of this work is to introduce new measures of association involving a system of variables based on notions of statistical distance between distributions. The resulting metrics are more general than correlations, which are pairwise measures only and lack a parametric interpretation beyond the bivariate normal distribution (e.g., Lewontin 1988), especially when associations are non-linear or when pairs of variables are neither independent nor identically distributed. For example, in the study of gametic disequilibrium, the distribution of correlation-based estimators is frequency dependent (Hedrick 1987), thus complicating comparisons among populations and meta-analyses. Also, there are situations such as in a multivariate t-distribution with a diagonal covariance matrix in which variables are uncorrelated and, yet, stochastically dependent. This illustrates that lack of correlation does not provide definitive evidence of independence: in this example, variable X, say, informs about Y even though uncorrelated. A related question is the inability of correlation to measure association between, say, 10 loci or 20 pleiotropic effects. The measures of association developed in this study are based on departure between the distributions of the variables under independence and association assumptions. Indexes measuring association between random variables, irrespective of their number or of the form of the joint distribution, are suggested and illustrated.
The article is organized as follows. Following this introduction, measures based on logarithmic and on relative ‘distances’ between distributions, as well as indexes of association, are presented. Subsequently, these indexes are developed and illustrated for quantitative genetics settings in which the joint distribution of the variables under consideration is either multivariate normal or multivariate-t. The fourth section shows how these indexes can be used to study LD in a two-locus system with multiple alleles; here, a correlation would not have much meaning, if at all. The fifth section presents applications to systems of correlated beta distributions; all these are generalizations of the univariate beta distribution, which has been used extensively in population genetics to study evolution of gene frequencies. In particular, two multivariate beta and multivariate beta-binomial processes are discussed, and new distributions are introduced: the GMS-Sarmanov multivariate beta and its beta-binomial counterpart. The paper concludes with a discussion of the concepts and of other procedures that have been proposed for study of multi-locus LD. Mathematical details are relegated to an appendix.
Measuring association among variables
To motivate the approach used here, consider a pair X, Y of random variables with realized values x, y; the ideas generalize readily to distributions of higher dimension simply by replacing scalars by vectors. Let p(x,y) and p(x) × p(y) be the densities of the joint distributions under association and independence, respectively; p(x) and p(y) are the marginal densities of X and Y respectively. Association occurs whenever p(y|x) (or p(x|y)) differs from p(y) (or p(x)). Measures of dependence (association) can be derived from the density ratio
$$\alpha_{xy} = \frac{p(x,y)}{p(x)\,p(y)} = \frac{p(y|x)}{p(y)} = \frac{p(x|y)}{p(x)}. \tag{1}$$
If αxy = 1 for all pairs of values, the ‘distance’ between distributions is 0 and independence holds. Values of αxy larger or smaller than 1 suggest statistical dependence. For example, αxy larger than 1 indicates ‘coupling’, i.e. the density of certain pairs of values is higher than expected under independence; values smaller than 1 would be indicative of ‘repulsion’. Likewise, strong departures of log αxy from 0 would be indicative of an association between x and y.
We discuss two measures of association based on either expected logarithmic distance between distributions or on expected relative distance, as conveyed by the average values of log αxy and of αxy respectively. These measures are used subsequently to develop indexes of association, irrespective of the number of variables in a system.
Association based on Kullback-Leibler distance
The Kullback-Leibler (KL) logarithmic distance (Kullback 1968) is defined as the sum of two KL discrepancies between the distributions: one under association and the other under independence. The first discrepancy measures the departure of the independence model from the association model when association is true, and the second measures the departure of the association model from the independence model when independence holds. The two discrepancies are non-negative by construction and are defined as
$$D_{AI} = E_{p(x,y)}\left[\log \alpha_{xy}\right] = \iint p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}\,dx\,dy,$$
measuring discrepancy when association holds, and
$$D_{IA} = E_{p(x)p(y)}\left[\log \alpha_{xy}^{-1}\right] = \iint p(x)\,p(y)\,\log\frac{p(x)\,p(y)}{p(x,y)}\,dx\,dy,$$
measuring discrepancy when independence holds; the range of integration depends on the random processes involved. If the two distributions are identical, both discrepancies are null; on the other hand, large discrepancies reflect stochastic dependence. Above, $E_{p(x,y)}$ and $E_{p(x)p(y)}$ denote expectations taken under the assumptions that (X,Y) follow either a bivariate distribution (association) or that these variables are independently distributed.
A connection with the likelihood ratio statistic is fairly direct. Suppose that the parameters of the distributions under independence and association are estimated by maximum likelihood (ML), and that these estimates are regarded as if they were ‘true’ values. Then, log αxy, with the densities evaluated at the corresponding ML estimates, would be the log of a ratio between maximized likelihoods. In DAI one would be taking expectations of the log-likelihood ratio over all possible realizations expected if the ‘full model’ (association) were true; on the other hand, in DIA the expectation of the log of the reciprocal likelihood ratio would be taken under the assumption of independence. In this context, DAI and DIA are averages of log-likelihood ratios, as opposed to an evaluation at a single realization (in ML estimation parameters vary whereas observations are taken as fixed). Note that DAI and DIA do not involve any requirement of ‘nesting’ of models, contrary to likelihood ratio tests under regularity conditions, but this is a different issue.
The discrepancies are such that DAI ≠ DIA, and a unique, symmetric KL distance is defined (Kullback 1968) as
$$KL = D_{AI} + D_{IA}. \tag{2}$$
Then, an index of association (taking values between 0 and 1) is obtained by normalizing DIA as
$$\theta = \frac{D_{IA}}{D_{AI} + D_{IA}}, \tag{3}$$
with
$$1 - \theta = \frac{D_{AI}}{D_{AI} + D_{IA}},$$
provided that KL is not null. A large value of DIA relative to DAI (or the opposite) is indicative of situations in which there is more association than what would be expected under independence so that θ or 1 − θ would be close to 1. Thus, values of θ or of 1 − θ away from 1/2 provide evidence of discrepancy between the hypotheses of random ensembles of (x,y) pairs vs. associated ensembles.
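For a discrete joint distribution the two discrepancies and θ in (3) reduce to finite sums, which can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `kl_association_index` and the 2 × 2 table of haplotype frequencies are hypothetical, and all cell probabilities are assumed to be strictly positive.

```python
import math

def kl_association_index(joint):
    """Return (D_AI, D_IA, theta) for a discrete joint distribution.

    `joint` is a list of rows of strictly positive probabilities summing to 1.
    """
    row = [sum(r) for r in joint]            # marginal p(x)
    col = [sum(c) for c in zip(*joint)]      # marginal p(y)
    d_ai = 0.0                               # E under association of log(alpha)
    d_ia = 0.0                               # E under independence of log(1/alpha)
    for i, r in enumerate(joint):
        for j, p_xy in enumerate(r):
            indep = row[i] * col[j]          # p(x) p(y)
            d_ai += p_xy * math.log(p_xy / indep)
            d_ia += indep * math.log(indep / p_xy)
    kl = d_ai + d_ia
    theta = d_ia / kl if kl > 0 else float("nan")   # theta undefined when KL = 0
    return d_ai, d_ia, theta

# Hypothetical two-locus, two-allele haplotype frequencies (illustrative only).
joint = [[0.40, 0.10],
         [0.10, 0.40]]
d_ai, d_ia, theta = kl_association_index(joint)
print(d_ai, d_ia, theta)
```

When the table factorizes exactly into its margins, both discrepancies are 0 and the sketch returns `nan` for θ, mirroring the proviso that KL must not be null.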
There is no closed form for DIA or DAI for many distributions, in which case these metrics must be calculated numerically or estimated using Monte Carlo methods, given some parameter values. If association is weak, the two logarithmic discrepancies and KL are nearly null, and this can lead to very unstable, even absurd, Monte Carlo estimates of θ. Also, the KL distance and the two discrepancies are logarithmic, and this perhaps provides a less intuitive notion of ‘distance’ than one based directly on αxy. An alternative ‘distance’ measure is presented immediately below.
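This measure replaces the expected logarithm of the density ratio by the expectation of the ratio itself. The two relative discrepancies are
$$D'_{AI} = E_{p(x,y)}\left[\alpha_{xy}\right], \qquad D'_{IA} = E_{p(x)p(y)}\left[\alpha_{xy}^{-1}\right],$$
and the corresponding index of association is
$$\theta' = \frac{D'_{IA}}{D'_{AI} + D'_{IA}}. \tag{6}$$
The metric D′AI gives the expected value of αxy under association and D′IA gives the expectation of $\alpha_{xy}^{-1}$ under the assumption of independence. These expectations are both positive, by construction, and values of θ′ away from 1/2 can be construed as reflecting departure from independence. For example, θ′ = 1/5 or 4/5 would suggest association. The connection with likelihood ratios surfaces again; for example, D′IA is the average likelihood ratio over all possible realizations of the data under the independence assumption.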
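The relative-distance index can be computed exactly for a discrete joint distribution by replacing the logarithms with the ratios themselves. The sketch below is only an illustration under that assumption; `relative_association_index` and the 2 × 2 frequencies are hypothetical names and values, and cell probabilities must be strictly positive.

```python
def relative_association_index(joint):
    """Return (D'_AI, D'_IA, theta') for a discrete joint distribution."""
    row = [sum(r) for r in joint]            # marginal p(x)
    col = [sum(c) for c in zip(*joint)]      # marginal p(y)
    d_ai = 0.0                               # E under association of alpha
    d_ia = 0.0                               # E under independence of 1/alpha
    for i, r in enumerate(joint):
        for j, p_xy in enumerate(r):
            indep = row[i] * col[j]
            d_ai += p_xy * (p_xy / indep)
            d_ia += indep * (indep / p_xy)
    return d_ai, d_ia, d_ia / (d_ai + d_ia)

# Hypothetical haplotype frequencies (illustrative only).
associated = [[0.40, 0.10],
              [0.10, 0.40]]
independent = [[0.25, 0.25],
               [0.25, 0.25]]
print(relative_association_index(associated))
print(relative_association_index(independent))
```

Under exact independence both expectations equal 1, so the index is 1/2, which is the reference value used in the text.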
In what follows, the proposed indexes are examined for some distributions that are used often in quantitative and population genetics research.
Multivariate Gaussian and t-distributed variables
In quantitative genetics, the analysis often focuses on a set of continuous correlated variables (X1,X2,…,XK) that follow some multivariate distribution; these variates could be, for example, additive genetic effects for K traits. Suppose that a matrix of estimates of additive genetic correlations is available (and taken as a ‘true’ matrix), and that we seek to arrive at a single measure of association among the K genetic values, as opposed to the K(K−1)/2 pairwise measures provided by the matrix of genetic correlations. Below, examples of multivariate normal distributions with K = 2, 3, 4 and of a bivariate t-distribution are considered to illustrate how θ behaves.
Bivariate normal distribution
Let X1 and X2 be two standardized normally distributed genetic values with correlation ρ. For this distribution $D_{AI} = -\log\sqrt{1-\rho^2}$ and $D_{IA} = \rho^2(1-\rho^2)^{-1} + \log\sqrt{1-\rho^2}$ (e.g. Sorensen & Gianola 2002). When the correlation is null, the two discrepancies are 0; when the correlation is perfect, the discrepancies are both ∞. Thus, $KL = \rho^2(1-\rho^2)^{-1}$, which is 0 when the two random variables are independent and ∞ under a perfect correlation. Then
$$\theta = \frac{D_{IA}}{D_{AI} + D_{IA}} = 1 + \frac{(1-\rho^2)\log\sqrt{1-\rho^2}}{\rho^2}, \tag{7}$$
measures discrepancy away from the independence situation when the latter holds, relative to the KL distance between the two distributions. It can be shown that $\lim_{\rho^2\to 0}\theta = 1/2$ and $\lim_{\rho^2\to 1}\theta = 1$, providing the lower and upper bounds of θ respectively.
The values of θ and of 1 − θ are plotted against ρ in Fig. 1. The dotted line (‘holds water’) gives the relative discrepancy (θ) away from the independence model when this is true, and the dashed line (‘spills water’) depicts the trajectory of 1 − θ, i.e. the relative discrepancy under association; note that a given curve is a mirror image of the other. As the absolute value of the correlation increases, θ → 1, as one would expect. Since an association index with values ranging between 0 and 1 may be easier to interpret, an alternative measure could be
$$\gamma = 2\theta - 1.$$
Figure 1. Measures of association of two bivariate Gaussian variables as a function of their correlation (ρ). The straight lines give the strength of the association as measured by the absolute value of ρ. The dotted (‘holds water’: θ) and dashed (‘spills water’: 1 − θ) lines depict the relative contributions to the Kullback-Leibler distance due to discrepancies under independence and dependence models, respectively. Values of the association measure γ = 2θ − 1 are represented by the dark solid line.
However, γ takes values in [0,1] provided that θ ≥ 1/2, which holds for a bivariate Gaussian distribution, but not so in general. The relationship between γ and ρ for this Gaussian model is also shown in Fig. 1 (solid line), and the association is suggested as weaker when measured by γ than when assessed with the correlation ρ. For example, if |ρ| = 1/2, γ is smaller than 0.15. In multivariate systems or in other distributions, θ can be much smaller than 1/2 so that γ takes negative values, as illustrated below, so it turns out that θ often provides an easier to interpret measure of association than does γ.
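The behaviour of θ and γ = 2θ − 1 in the bivariate Gaussian case can be checked numerically from the closed forms of the two discrepancies. The following is a small sketch under those formulas; the function name `theta_bivariate_normal` is an illustrative assumption.

```python
import math

def theta_bivariate_normal(rho):
    """theta of equation (7) for a standardized bivariate normal, 0 < |rho| < 1."""
    r2 = rho * rho
    d_ai = -0.5 * math.log(1.0 - r2)         # -log sqrt(1 - rho^2)
    d_ia = r2 / (1.0 - r2) - d_ai            # rho^2/(1 - rho^2) + log sqrt(1 - rho^2)
    return d_ia / (d_ai + d_ia)

for rho in (0.1, 0.5, 0.9):
    theta = theta_bivariate_normal(rho)
    gamma = 2.0 * theta - 1.0
    print(rho, round(theta, 4), round(gamma, 4))
```

At |ρ| = 1/2 the sketch gives θ slightly above 0.56, so γ stays below 0.15 even though the correlation is moderate.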
Multivariate normal distribution
The metrics under consideration can be applied to distributions with any number of dimensions, leading to measures of departure from independence in more general situations, such as when alleles in a multiple-locus system are studied. To illustrate, consider the K-dimensional multivariate Gaussian distribution x ∼ N(0, R), where variates are in standard deviation units so that R is a correlation matrix; under independence, x ∼ N(0, I). It can be shown (Penny & Roberts 2002) that
$$KL = \frac{1}{2}\,\mathrm{tr}\left(R^{-1} + R\right) - K, \tag{8}$$
where tr(.) denotes the sum of the diagonal elements of a square matrix such as R. Further,
$$D_{AI} = \frac{1}{2}\left[\mathrm{tr}(R) - \log|R| - K\right], \tag{9}$$
and
$$D_{IA} = \frac{1}{2}\left[\log|R| + \mathrm{tr}\left(R^{-1}\right) - K\right], \tag{10}$$
so that
$$\theta = \frac{\frac{1}{2}\left[\log|R| + \mathrm{tr}\left(R^{-1}\right) - K\right]}{\frac{1}{2}\,\mathrm{tr}\left(R^{-1} + R\right) - K}. \tag{11}$$
Examples of several multivariate normal distributions are given next.
Equi-correlated trivariate normal distribution
Let three variables be equi-correlated with correlation ρ, so that
$$R = \begin{bmatrix} 1 & \rho & \rho \\ \rho & 1 & \rho \\ \rho & \rho & 1 \end{bmatrix},$$
with $|R| = 2\rho^3 - 3\rho^2 + 1$, $\mathrm{tr}(R) = 3$ and $\mathrm{tr}(R^{-1}) = 3(\rho+1)/(1+\rho-2\rho^2)$. Using (8)-(11) yields
$$D_{AI} = -\log\sqrt{(\rho-1)^2(2\rho+1)},$$
$$D_{IA} = \frac{3\rho^2}{1+\rho-2\rho^2} - D_{AI},$$
$$KL = \frac{3\rho^2}{1+\rho-2\rho^2},$$
and
$$\theta = 1 - \frac{(\rho-1)(1+2\rho)\log\sqrt{(\rho-1)^2(1+2\rho)}}{3\rho^2}, \tag{12}$$
with 0 < θ < 1. For example, for ρ = 0.25, 0.50 and 0.75, the index of association θ takes values 0.49, 0.54 and 0.66 respectively. Figure 2 depicts θ, 1 − θ and γ as a function of the coefficient of correlation. At ρ = 0, θ = 0.5, but then it decreases slightly as the correlation increases such that γ does not attain a positive value until ρ reaches a value of about 0.3. Note that θ is not symmetric with respect to ρ (the same is true of |R|); for instance, when ρ = −1/4, θ = 0.59 so that association would be viewed as stronger than when ρ = 1/4, since θ = 0.49 then. This is a consequence of the values that the discrepancies DIA and DAI take at varying values of ρ.
Figure 2. Indexes of association as a function of the coefficient of correlation in an equi-correlated trivariate normal distribution. The thick solid line gives the relative Kullback-Leibler discrepancy between distributions when independence holds (θ), and the dashed line gives the relative discrepancy when association is true (1 − θ). The thin line gives the trajectory of γ = 2θ − 1.
Tetra-variate normal distributions
Alternative measures can be derived from eigenvalues of a correlation matrix; however, this is sensible only for distributions in which linear relationships between variables are meaningful. The index θ, derived from notions of statistical distance, does not have this limitation because KL discrepancies have a general meaning and can be calculated for any distribution, at least numerically, given values of the parameters. Naturally, any single measure of association in a multivariate distribution becomes less transparent as the system of variables increases in dimension.
Let K = 4 and suppose the variables are equi-correlated with coefficient ρ. Then, $|R| = -(\rho-1)^3(3\rho+1)$, $\mathrm{tr}(R^{-1}) = -(8\rho+4)/[(\rho-1)(3\rho+1)]$ and
$$\theta = 1 - \frac{(\rho-1)(1+3\rho)\log\sqrt{-(\rho-1)^3(1+3\rho)}}{6\rho^2}. \tag{13}$$
Figure 3 displays how θ and 1 − θ vary against ρ. At ρ = 0 one has θ = 1 − θ = 1/2; θ decreases somewhat subsequently and then increases, attaining 0.5 again at about ρ = 0.45. The values of the index of association suggest that the independence and association models are not ‘too distant’ unless the correlation is below −0.10, if negative, or above ρ = 0.50, if positive.
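Equations (9)-(11) apply to any correlation matrix, so the closed forms (12) and (13) can be cross-checked by evaluating log|R| and tr(R⁻¹) numerically. The sketch below does this with a plain Gauss-Jordan elimination, which is an implementation choice made here for self-containment and not a method taken from the paper; the function names are hypothetical and R is assumed to be positive definite.

```python
import math

def logdet_and_trace_inverse(r):
    """Return (log|R|, tr(R^-1)) for a positive-definite matrix via Gauss-Jordan."""
    k = len(r)
    # Augment R with the identity: [R | I] is reduced to [I | R^-1].
    a = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(k)]
         for i, row in enumerate(r)]
    log_det = 0.0
    for c in range(k):
        pivot = a[c][c]                      # positive when R is positive definite
        log_det += math.log(pivot)
        a[c] = [v / pivot for v in a[c]]
        for i in range(k):
            if i != c:
                f = a[i][c]
                a[i] = [vi - f * vc for vi, vc in zip(a[i], a[c])]
    tr_inv = sum(a[i][k + i] for i in range(k))
    return log_det, tr_inv

def theta_normal(r):
    """theta of equation (11) for x ~ N(0, R)."""
    k = len(r)
    log_det, tr_inv = logdet_and_trace_inverse(r)
    tr_r = sum(r[i][i] for i in range(k))
    d_ai = 0.5 * (tr_r - log_det - k)        # equation (9)
    d_ia = 0.5 * (log_det + tr_inv - k)      # equation (10)
    return d_ia / (d_ai + d_ia)

def equicorrelated(k, rho):
    """k x k correlation matrix with common off-diagonal correlation rho."""
    return [[1.0 if i == j else rho for j in range(k)] for i in range(k)]

for rho in (0.25, 0.50, 0.75, -0.25):
    print(3, rho, round(theta_normal(equicorrelated(3, rho)), 2))
print(4, 0.45, round(theta_normal(equicorrelated(4, 0.45)), 2))
```

The same functions accept structured, non-equi-correlated matrices, provided they are valid correlation matrices.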
Figure 3. Discrepancies from independence to association (θ, solid line), and from association to independence (1 − θ, dashed line) as a fraction of the Kullback-Leibler distance between two tetravariate normal distributions. The straight lines give the absolute values of the correlation ρ.
where 0 ≤ τ ≤ 1, so that when τ = 0 the four variables are independent. For τ = 1, log |R| = −2.5459 and tr(R−1) = 24.8980, yielding θ = 0.8782. Likewise, for τ = 1/2, log |R| = −0.2960 and tr(R−1) = 4.6540, so that θ = 0.5473. This illustrates that the strength of the association increases with the strength of inter-correlation, and that θ provides a metric for measuring departure from independence.
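The two values of θ quoted above follow directly from (9)-(11) once log|R| and tr(R⁻¹) are known. A short arithmetic check, with the hypothetical helper name `theta_from_summaries` and assuming tr(R) = K because R is a correlation matrix:

```python
def theta_from_summaries(log_det, tr_inv, k):
    """theta of (11) from log|R| and tr(R^-1), assuming tr(R) = k."""
    d_ai = -0.5 * log_det                    # (9) with tr(R) = k
    d_ia = 0.5 * (log_det + tr_inv - k)      # (10)
    return d_ia / (d_ai + d_ia)

print(round(theta_from_summaries(-2.5459, 24.8980, 4), 4))   # tau = 1
print(round(theta_from_summaries(-0.2960, 4.6540, 4), 4))    # tau = 1/2
```

Small differences in the last digit relative to the reported values are expected, since the inputs are themselves rounded.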
We finish this section by posing the following question: what does θ say that is not informed by all possible pairwise correlations? Again, let K = 4 and take as correlation matrices
The salient point is that the correlation structure is such that while pairs of variables (1,3), (2,4) and (3,4) are correlated, (1,2) are independent.
What is the overall strength of the association in the system? Calculations yield θ1 = 0.54 and θ2 = 0.91 for the networks of variables characterized by ρ = 1/4 and ρ = 6/10, respectively. While the correlations measure association between pairs of nodes, θ gives an indication of the overall degree of association.
Bivariate t-distribution
As mentioned, there are situations in which random variables are jointly dependent, but yet, their correlation is 0, which would incorrectly suggest that X does not inform about Y. This highlights limitations of the correlation parameter as a metric for statistical dependence between sets of variables. One such case is the multivariate-t distribution, in which variables can be statistically dependent, yet uncorrelated. The univariate and multivariate t-distributions appear in robust linear regression models for quantitative genetic analysis based on pedigrees (e.g., Strandén & Gianola 1998; Sorensen & Gianola 2002; Rosa et al. 2003). The univariate-t distribution has also been used as a prior in Bayesian linear regression models for genome-enabled prediction of single traits (Meuwissen et al. 2001; Gianola et al. 2009).
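That uncorrelated t-variates remain dependent can be seen in a quick simulation. The sketch below is an informal illustration and not the derivation given in the Appendix: it draws pairs from a bivariate t with identity scale matrix using the standard normal/chi-square mixture construction, and compares the sample correlation of (X, Y) with that of (X², Y²). The choices ν = 10, the sample size and the seed are arbitrary assumptions.

```python
import math
import random

def sample_bivariate_t(nu, n, seed=1):
    """Draw n pairs from a zero-mean bivariate t with identity scale matrix.

    Two independent N(0, 1) draws are divided by a shared sqrt(chi-square_nu / nu).
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        w = rng.gammavariate(nu / 2.0, 2.0)  # chi-square with nu degrees of freedom
        s = math.sqrt(w / nu)
        pairs.append((rng.gauss(0.0, 1.0) / s, rng.gauss(0.0, 1.0) / s))
    return pairs

def corr(u, v):
    """Sample Pearson correlation of two equally long sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

pairs = sample_bivariate_t(nu=10, n=200_000)
x = [p[0] for p in pairs]
y = [p[1] for p in pairs]
r_xy = corr(x, y)                                    # close to 0: uncorrelated
r_sq = corr([a * a for a in x], [b * b for b in y])  # clearly positive: dependent
print(r_xy, r_sq)
```

The shared scaling variable is what links the two coordinates: a large value of |X| makes a large value of |Y| more likely, even though the linear correlation is null.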
Suppose two variables, e.g. SNP effects on a pair of traits, follow a bivariate t-distribution with a null mean vector, ν degrees of freedom and a scale matrix that is equal to a 2 × 2 identity matrix; here, the two effects are uncorrelated but not independent, as shown in the Appendix. For this situation, the density ratio (1) is
The expectations above do not have closed forms but can be evaluated using Monte Carlo methods: given the value of the degrees of freedom parameter, random numbers can be drawn from a bivariate t-distribution and from two independent univariate t-distributions, to approximate D′AI and D′IA.
The index of association θ′ in (6) was computed for t-distributions with ν = 1, 2, 4, 6, 8, 20 and 100 000. The last value produces an approximation of the joint distribution of two independent normal variates, whereas the setting with ν = 1 evaluates distance between a ‘bivariate Cauchy’ distribution and the distribution of two identically and independently distributed Cauchy random variables. The t-processes with ν = 1, 2 do not have a finite variance, but the distributions are proper (i.e., the density integrates to 1). As shown in Table 1, with sets of 1 million random numbers used for evaluating D′AI, D′IA and θ′, the ‘distance’ D′AI from association to independence decreases as ν increases, approaching 1, whereas the distance from independence to association D′IA increases, approaching 1 as well. With ν = 100 000 the random variables are essentially independent since θ′ = 1 − θ′ = 1/2, leading to γ′ = 0; the setting ν = 1 produces the strongest possible association, with θ′ = 0 and both 1 − θ′ and γ′ equal to 1.
Table 1. Departures of zero-mean bivariate t-distributions with a 2 × 2 identity matrix as scale parameter (so their correlation is null) from independence as measured by D′AI, D′IA and by the indexes of association θ′ = D′IA/(D′AI + D′IA), 1 − θ′ and γ′ = 2(1 − θ′) − 1 (when θ′ = 1/2 the random variables are independent)

| Degrees of freedom | D′AI | D′IA | D′ = D′AI + D′IA | θ′ | 1 − θ′ | γ′ |
|---|---|---|---|---|---|---|
| 1 | 7.28 × 10^9 | 0.4376 | 7.28 × 10^9 | 0.0000 | 1.0000 | 1.0000 |
| 2 | 63438837 | 0.6424 | 6348838 | 0.0000 | 1.0000 | 1.0000 |
| 4 | 20.7443 | 0.7947 | 21.5390 | 0.0369 | 0.9631 | 0.9262 |
| 6 | 2.6299 | 0.8560 | 3.4859 | 0.2456 | 0.7544 | 0.5088 |
| 8 | 1.2715 | 0.8889 | 2.1604 | 0.4115 | 0.5885 | 0.1770 |
| 20 | 1.0584 | 0.9528 | 2.0112 | 0.4738 | 0.5262 | 0.0524 |
| 100 000 | 1.0000 | 1.0000 | 2.0000 | 0.5000 | 0.5000 | 0.0000 |
Linkage disequilibrium
Suppose there are C and R alleles at locus A and locus B, respectively, such that data on gametic types can be arranged into an R×C contingency table. Let pij be the probability of observing a haplotype with configuration ‘ij’. If n gametes are screened and the observed number having such configuration is nij, the assumption of multinomial sampling leads to
$$\alpha_{xy} = \prod_{i=1}^{R}\prod_{j=1}^{C}\left(\frac{p_{ij}}{p_{i.}\,p_{.j}}\right)^{n_{ij}},$$
where $p_{i.} = \sum_{j=1}^{C} p_{ij}$ and $p_{.j} = \sum_{i=1}^{R} p_{ij}$ are the marginal probabilities of alleles ‘i’ and ‘j’ at loci B and A respectively. Define parameters $D_{ij} = p_{ij} - p_{i.}p_{.j}$, measuring departure from independence (i.e. disequilibrium between alleles i and j). In a two-locus model with two alleles per locus a single D parameter is needed, and by using $p_{ij} = D + p_{i.}p_{.j}$, then αxy would be the ratio between two likelihoods: the ‘unrestricted’ likelihood (indexed by parameters D, $p_{i.}$, $p_{.j}$) and the ‘restricted’ likelihood (indexed by the two allelic frequencies). The asymptotic connection between the standard log-likelihood ratio statistic and the χ2 metrics used for multi-locus analysis of LD by Hill (1975) and more recently by Zhao et al. (2005) is given, for example, in Agresti (2002).
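Under this sampling scheme, the discrepancies become
$$D_{AI} = n\sum_{i=1}^{R}\sum_{j=1}^{C} p_{ij}\log\left(\frac{p_{ij}}{p_{i.}\,p_{.j}}\right)$$
and
$$D_{IA} = n\sum_{i=1}^{R}\sum_{j=1}^{C} p_{i.}\,p_{.j}\log\left(\frac{p_{i.}\,p_{.j}}{p_{ij}}\right),$$
with θ = DIA/(DAI + DIA) as in (3).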
Naturally, θ is not defined when all Dij = 0 because then the two distributions would be identical and the KL distance is 0 in that case. Note that DAI is the expected value of the log-likelihood ratio, but under the assumption of association; on the other hand, when the sampling distribution of the log-likelihood statistic is considered, the reference distribution is the null model, i.e., linkage equilibrium. This illustrates an important conceptual difference, apart from the fact that DIA and DAI involve integration over the sampling space of allelic outcomes at the two loci in question.
To illustrate, consider a hypothetical example, patterned after data in Spiess (1989) reproduced partially by Frankham et al. (2010). The data pertain to three and four alleles at the major histocompatibility loci HLA-A and HLA-B respectively; the data, after a modification explained subsequently, are shown in Table 2. Frankham et al. (2010) presented only the data for these alleles, yet there are many more variants at these loci. Relative frequencies in Frankham et al. (2010) were ‘normalized’ such that the 12 joint frequencies would add up to one.
Table 2. Hypothetical data for major histocompatibility loci HLA-A and HLA-B, with three and four alleles respectively. The pij are haplotype frequencies; $p_{i.} = \sum_{j=1}^{3} p_{ij}$ and $p_{.j} = \sum_{i=1}^{4} p_{ij}$ give the allelic frequencies. The hypothetical number of observed haplotypes is assumed to be 108

| Locus | A1 | A2 | A3 | Marginal |
|---|---|---|---|---|
| B7 | p11 = 0.0270 | p12 = 0.0950 | p13 = 0.1743 | p1. = 0.2964 |
| B8 | p21 = 0.2456 | p22 = 0.0402 | p23 = 0.0070 | p2. = 0.2928 |
| B35 | p31 = 0.0106 | p32 = 0.0651 | p33 = 0.0939 | p3. = 0.1696 |
| B44 | p41 = 0.0325 | p42 = 0.1838 | p43 = 0.0249 | p4. = 0.2412 |
| Marginal | p.1 = 0.3158 | p.2 = 0.3841 | p.3 = 0.3001 | 1.0 (n = 108) |
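Taking haplotype frequencies pij as true values, the discrepancy of the independence model away from association is
$$D_{AI} = n\sum_{i=1}^{R}\sum_{j=1}^{C} p_{ij}\log\left(\frac{p_{ij}}{p_{i.}\,p_{.j}}\right) = 108\left[\ldots + 0.0249\log\frac{0.0249}{0.2412\times 0.3001}\right] = 108\times 0.2928 = 31.6235.$$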
The relative contributions of each of the alleles at the B locus to the 0.2928 in DAI above are (after rounding) 22.71%, 61.77%, 13.76% and 1.76% for B7, B8, B35 and B44 respectively, implying that associations with allele B8 are responsible for most of the departure of the independence model away from association, assuming this last one is ‘true’; note that the nij do not enter into the calculations, since the frequencies are viewed as true ones. For example, the relative contribution of B8 to DAI is arrived at as follows:
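$$n\sum_{j=1}^{C} p_{2j}\log\left(\frac{p_{2j}}{p_{2.}\,p_{.j}}\right) = 108\left[0.2456\log\frac{0.2456}{0.2928\times 0.3158} + 0.0402\log\frac{0.0402}{0.2928\times 0.3841} + 0.0070\log\frac{0.0070}{0.2928\times 0.3001}\right] = 108\times 0.1809,$$
and then 108 × 0.1809/(108 × 0.2928) = 0.6178. Likewise,
$$D_{IA} = n\sum_{i=1}^{R}\sum_{j=1}^{C} p_{i.}\,p_{.j}\log\left(\frac{p_{i.}\,p_{.j}}{p_{ij}}\right) = 49.5987,$$
and the relative contributions of the four B-locus alleles to DIA are 16.80%, 53.93%, 12.11% and 17.16% for B7, B8, B35 and B44 respectively, with B8 playing a major role again. Similar calculations can be done for the A locus. The KL distance is 31.6235 + 49.5987 = 81.2222, and the KL distance per haplotype scored (n = 108) is 81.2222/108 = 0.7521. The indexes of association are θ = 49.5987/81.2222 ≈ 0.61 and 1 − θ ≈ 0.39.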
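The allele-by-allele decomposition of DIA can be reproduced from the haplotype frequencies in Table 2. The sketch below is a check under one assumption: allelic frequencies are taken as the row and column sums of the rounded table entries, so agreement with the reported figures is only up to rounding. Variable names are illustrative.

```python
import math

# Haplotype frequencies from Table 2 (rows: B7, B8, B35, B44; columns: A1, A2, A3).
p = [[0.0270, 0.0950, 0.1743],
     [0.2456, 0.0402, 0.0070],
     [0.0106, 0.0651, 0.0939],
     [0.0325, 0.1838, 0.0249]]
n = 108                                       # hypothetical number of haplotypes

row = [sum(r) for r in p]                     # B-allele frequencies p_i.
col = [sum(c) for c in zip(*p)]               # A-allele frequencies p_.j

# Per-haplotype contribution of each B allele to D_IA.
contrib = [
    sum(row[i] * col[j] * math.log(row[i] * col[j] / p[i][j]) for j in range(3))
    for i in range(4)
]
d_ia = n * sum(contrib)
shares = [c / sum(contrib) for c in contrib]

print(round(d_ia, 2))
for name, s in zip(("B7", "B8", "B35", "B44"), shares):
    print(name, round(100 * s, 1))
```

Looping over columns instead of rows gives the analogous decomposition for the A locus.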
In an R×C table of alleles, a correlation has little meaning and is not invariant with respect to how alleles are scored (e.g. 0, 1, 2) at each of the intervening loci. In contrast, the chi-square statistics of Hill (1975) and Zhao et al. (2005) are score invariant and have a simple interpretation in terms of deviations expected under the null distribution. Our parameter θ is also score invariant, does not depend on the order of rows and columns, and applies to any distribution. For instance, if the simple multinomial sampling model is violated due to, e.g., over-dispersion, an appropriate θ could be developed under such a model, whereas a ‘normalized’ χ2 would provide a crude, albeit probably useful, approximation.
We evaluated D′AI, D′IA and θ′ = D′IA/(D′IA + D′AI) using Monte Carlo sampling. In the calculation of D′AI, one million random trials of size 108 each, with {nij} varying at random, were sampled from a multinomial distribution with probabilities pij; then, $\prod_{i=1}^{R}\prod_{j=1}^{C}\left(p_{ij}/(p_{i.}p_{.j})\right)^{n_{ij}}$ was evaluated for each realization and averaged over trials. Similarly, one million trials of size 108 were sampled from a multinomial distribution with probabilities $p_{i.}p_{.j}$ and the realizations of $\prod_{i=1}^{R}\prod_{j=1}^{C}\left(p_{i.}p_{.j}/p_{ij}\right)^{n_{ij}}$ were averaged out. The index θ′ was estimated using the mean and median of the draws of θ′, yielding 0.9953 and 0.9998. The association (disequilibrium) between alleles at the A and B loci is patent, both when logarithmic (θ) and relative (θ′) distances are employed to measure it.
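As a possible cross-check on such Monte Carlo averages, the expectation of a product of the form ∏ a_k^{n_k} under multinomial sampling has a closed form, (Σ p_k a_k)^n, by the multinomial theorem. This identity is a standard result brought in here for illustration and is not part of the authors' procedure. The sketch verifies it by brute-force enumeration on a hypothetical 2 × 2 haplotype table with a deliberately small n; all names and values are illustrative.

```python
import math
from itertools import product

def expected_weight_product(probs, weights, n):
    """E[prod_k weights[k]**n_k] for counts ~ Multinomial(n, probs), by enumeration."""
    total = 0.0
    for counts in product(range(n + 1), repeat=len(probs)):
        if sum(counts) != n:
            continue
        coef = math.factorial(n)
        for c in counts:
            coef //= math.factorial(c)       # multinomial coefficient, exact integers
        term = float(coef)
        for c, p, w in zip(counts, probs, weights):
            term *= (p * w) ** c
        total += term
    return total

# Hypothetical 2 x 2 haplotype frequencies, flattened (illustrative only).
p = [0.40, 0.10, 0.10, 0.40]                 # association model
e = [0.25, 0.25, 0.25, 0.25]                 # independence model (products of margins)
n = 3                                        # small n keeps the enumeration tiny

alpha = [pi / ei for pi, ei in zip(p, e)]
inv_alpha = [ei / pi for pi, ei in zip(p, e)]

d_ai_enum = expected_weight_product(p, alpha, n)
d_ai_closed = sum(pi * a for pi, a in zip(p, alpha)) ** n
d_ia_enum = expected_weight_product(e, inv_alpha, n)
d_ia_closed = sum(ei * a for ei, a in zip(e, inv_alpha)) ** n
print(d_ai_enum, d_ai_closed, d_ia_enum, d_ia_closed)
```

Because both expectations grow geometrically with n, working with their logarithms would be advisable for samples as large as the 108 haplotypes considered above.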
Systems of multivariate beta processes
Bivariate beta distributions
A common finding in studies of population differentiation using Wright's F-statistics (Wright 1931; Cockerham 1969) with single nucleotide polymorphisms (SNPs) is that estimates of allelic frequencies within a bin of adjacent SNPs are correlated (e.g. Akey et al. 2002; Weir et al. 2005; Akey 2009). This is typically due to stochastic dependence between alleles at linked loci (producing LD), but evolutionary processes causing dependence among the true frequencies themselves have been suggested; see Ohta (1982) for a population genetics model based on this concept. For example, a correlation between alleles at randomly drawn pairs of loci arises if alleles are conditionally (given the allelic frequencies) independent but with allelic frequencies varying at random according to a beta distribution.
Suppose that pairs of loci are sampled at random from some conceptual population of equi-correlated loci and that all pairs of ‘true’ frequencies define a probability distribution. Wright (1937) found that a beta distribution arose from a diffusion equation used to study changes in allele frequencies in finite populations, so the beta process is well grounded in population genetics. Beta distributions have been used in studies of population differentiation as mixing processes to create randomness of allelic frequencies, leading to beta-binomial likelihoods or to unrecognized posterior distributions that must be resolved using sampling procedures, e.g., Holsinger (1999), Balding (2003), Beaumont & Balding (2004) and Gianola et al. (2010).
Similarly, it would seem sensible to assume that allelic frequencies of pairs of loci within the hypothetical population have an association that can be modelled with bivariate beta distributions to represent random variation of true, albeit unknown, allelic frequencies. For example, suppose that there are S half-sib families that can be viewed as drawn at random from some population and that we are interested in modeling association between pairs of allelic frequencies stemming from such a sampling scheme. Here, distributions proposed by Sarmanov (1966) and Olkin & Liu (2003) are discussed and generalized as candidate processes that could be used to model this type of association.
Olkin-Liu bivariate beta distribution
First, we review how a specific bivariate beta distribution arises. As in Olkin & Liu (2003), let U, V and W be random variables following independent standard Gamma distributions with parameters a, b and c respectively. By construction, X = U/(U + W) and Y = V/(V + W) possess the beta distributions X ∼ Beta(a,c) and Y ∼ Beta(b,c) respectively. Clearly, X and Y are positively correlated (through W), with a strength that depends on the values of a, b, c. Olkin & Liu (2003) show that X and Y have a bivariate beta distribution with density function
where 0 < x < 1 and 0 < y < 1. Moments $E(X^k Y^l)$ cannot be written in closed form but can be approximated numerically or using sampling methods. Large c and small a, b produce correlations close to 0, whereas large a, b or small c produce correlations close to 1 (Olkin & Liu 2003). For example, a correlation equal to 0.002 is obtained for a = b = 0.01 and c = 5, whereas the correlation is 0.91 for a = 2.5, b = 4 and c = 0.1. Figure 4 displays scatter plots of 5000 samples obtained from each of four bivariate beta distributions. Plot 1 represents a distribution in which the correlation is very low, and yet, there is considerable association between pairs of values near 0, illustrating inadequacy of correlation to reveal association. In plot 2 (resembling a ‘meteorite’) clustering takes place primarily at large values of X and Y. The two bottom plots depict bivariate beta distributions with similar correlations but with a completely different pattern of association. Clearly, correlation often fails as a measure of statistical association.
Figure 4. Scatter diagrams of 5000 samples from each of four Olkin-Liu bivariate beta distributions. Plot 1 illustrates a strong association with essentially no correlation. Plot 2 (‘meteorite’) depicts a limitation of the correlation as a parameter for describing association. Plot 3 suggests association clearly. Plot 4 shows a bivariate distribution that is not trivial: the true correlation (0.46) arises primarily due to weaker association in the ‘middle’ of the bivariate sampling space.
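In this bivariate beta model the density ratio, that is, the joint density divided by the product of the Beta(a,c) and Beta(b,c) marginal densities, takes the form
$$\alpha_{xy} = \frac{\Gamma(a+b+c)\,\Gamma(c)}{\Gamma(a+c)\,\Gamma(b+c)}\, \frac{(1-x)^{b}\,(1-y)^{a}}{(1-xy)^{a+b+c}}.$$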
The corresponding D′AI, D′IA and KL′ can be calculated by taking expectations of density ratios as opposed to expectations of their logarithms.
Indexes of association based on D′AI, D′IA and KL′ were calculated for the following combinations of parameters, where ρ denotes the expected correlation: 1) a = 0.01, b = 0.01, c = 5 (ρ = 0.002); 2) a = 0.1, b = 0.5, c = 0.6 (ρ = 0.251); 3) a = 1, b = 1, c = 0.9 (ρ = 0.496); and 4) a = 2, b = 3, c = 0.4 (ρ = 0.750). Expectations under the independence model were approximated by drawing 1 million random numbers from independent X ∼ Beta(a,c) and Y ∼ Beta(b,c) distributions, and then averaging the evaluations of the appropriate expressions over the draws. In the association model, 1 million random triplets (U,V,W) were drawn from independent standard gamma variables with parameters a, b and c, to then form bivariate realizations as x = u/(u + w) and y = v/(v + w). Table 3 gives the results, and all quantities (save for the true ρXY) are Monte Carlo estimates. The indexes of association θ′ and 1 − θ′ departed from 0.5 and approached 0 and 1, respectively, as the correlation in the bivariate beta distribution became stronger. Likewise, γ′ and γ′′ drifted away from 0, approaching 1 in absolute value. In settings (3) and (4), θ′ and 1 − θ′ suggest a stronger association than what would be indicated by ρ.
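The Monte Carlo scheme just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used for Table 3: the density ratio is taken to be the Olkin-Liu joint density divided by the product of the Beta(a,c) and Beta(b,c) marginals, D′AI is estimated as the average of the ratio over draws from the association model, and D′IA as the average of its reciprocal over draws from the independence model. The names `log_alpha`, `relative_indexes` and the guard `EPS` are hypothetical. Because the ratio can be heavy tailed, estimates from short runs may be unstable.

```python
import math
import random

EPS = 1e-12  # keep draws below 1 so that log(1 - x) stays finite


def log_alpha(x, y, a, b, c):
    """Log density ratio: Olkin-Liu joint over Beta(a, c) times Beta(b, c)."""
    log_const = (math.lgamma(a + b + c) + math.lgamma(c)
                 - math.lgamma(a + c) - math.lgamma(b + c))
    return (log_const + b * math.log1p(-x) + a * math.log1p(-y)
            - (a + b + c) * math.log1p(-x * y))


def relative_indexes(n, a, b, c, rng):
    """Monte Carlo estimates of D'_AI, D'_IA, KL', theta' and the gammas."""
    sum_ai = 0.0
    sum_ia = 0.0
    for _ in range(n):
        # Association model: x and y share the gamma variate w.
        u = rng.gammavariate(a, 1.0)
        v = rng.gammavariate(b, 1.0)
        w = rng.gammavariate(c, 1.0)
        x = min(u / (u + w), 1.0 - EPS)
        y = min(v / (v + w), 1.0 - EPS)
        sum_ai += math.exp(log_alpha(x, y, a, b, c))
        # Independence model: separate beta draws.
        xi = min(rng.betavariate(a, c), 1.0 - EPS)
        yi = min(rng.betavariate(b, c), 1.0 - EPS)
        sum_ia += math.exp(-log_alpha(xi, yi, a, b, c))
    d_ai = sum_ai / n
    d_ia = sum_ia / n
    theta = d_ia / (d_ai + d_ia)
    return {"D_AI": d_ai, "D_IA": d_ia, "KL": d_ai + d_ia,
            "theta": theta, "gamma1": 2.0 * theta - 1.0,
            "gamma2": 1.0 - 2.0 * theta}


if __name__ == "__main__":
    print(relative_indexes(20000, 1.0, 1.0, 0.9, random.Random(1)))
```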
Table 3. Departures of Olkin-Liu bivariate beta distributions from independence as measured by D′AI, D′IA and by the indexes of association θ′ and 1 − θ′ (independence holds when θ′ = 1/2), for four combinations of parameters a, b and c. ρ̂XY is the Monte Carlo estimate (1 million samples) of the true correlation ρXY. Means of variables X and Y are estimates from samples from the association (A) or independence (I) models. The D′AI and D′IA parameters are estimates under the appropriate assumptions; KL′ = D′AI + D′IA; θ′ = D′IA/(D′AI + D′IA); γ′ = 2θ′ − 1; γ′′ = 1 − 2θ′
| Item | a = 0.01, b = 0.01, c = 5 | a = 0.1, b = 0.5, c = 0.6 | a = 1, b = 1, c = 0.9 | a = 2, b = 3, c = 0.4 |
| --- | --- | --- | --- | --- |
| ρXY | 0.002 | 0.2510 | 0.4960 | 0.7500 |
| ρ̂XY | 0.002 | 0.2499 | 0.4951 | 0.7304 |
| ave. (X) | 2.01 × 10⁻³ (A); 2.00 × 10⁻³ (I) | 0.1429 (A); 0.1431 (I) | 0.5262 (A); 0.5366 (I) | 0.8335 (A); 0.8333 (I) |
| ave. (Y) | 1.98 × 10⁻³ (A); 2.00 × 10⁻³ (I) | 0.4548 (A); 0.4542 (I) | 0.5264 (A); 0.5259 (I) | 0.8825 (A); 0.8823 (I) |
| D′AI | 0.9999 | 0.8709 | 0.4292 | 0.0745 |
| D′IA | 0.9999 | 0.8009 | 0.0779 | 1.944 × 10⁻⁵ |
| KL′ | 1.9999 | 1.5781 | 0.5071 | 0.0745 |
| θ′ | 0.5000 | 0.4481 | 0.1537 | 0.0003 |
| 1 − θ′ | 0.5000 | 0.5519 | 0.8463 | 0.9997 |
| γ′ | 0.0000 | −0.1037 | −0.6926 | −0.9995 |
| γ′′ | 0.0000 | 0.1037 | 0.6926 | 0.9995 |
Olkin-Liu bivariate beta-binomial distribution
In studies of LD, focus is typically on correlation between alleles rather than between frequencies. We discuss how a correlation between alleles can arise when the association stems from frequencies, as could happen when the population consists of clusters of individuals resulting from some family structure. Consider bi-allelic locus $l$ with allelic frequencies $\Pr(A_l) = p_l$ and $\Pr(a_l) = 1 - p_l$; $A_l$ and $a_l$ denote the two alleles at locus $l$. Suppose that a sample of N individuals is scored, and that the observed numbers of copies of $A_l$ and $a_l$ are $n_{A_l}$ and $n_{a_l}$, respectively, with $n_{A_l} + n_{a_l} = 2N$. Given $p_l$, the distribution of $n_{A_l}$ is Binomial$(2N, p_l)$. Now, let the allelic frequency $p_l$ vary at random among clusters (e.g. sub-populations) according to a beta distribution with parameters (a,c). Then, the marginal distribution of the observed number of alleles is beta-binomial (e.g., Casella & George 1992; Sorensen & Gianola 2002), that is, the probability of observing $n_{A_l}$ copies among the 2N alleles scored is
$$\Pr(n_{A_l} \mid a, c) = \binom{2N}{n_{A_l}} \frac{\Gamma(a+c)}{\Gamma(a)\Gamma(c)}\, \frac{\Gamma(n_{A_l}+a)\,\Gamma(n_{a_l}+c)}{\Gamma(2N+a+c)}.$$
Consider now a pair of loci, and let the observed numbers of copies of the four alleles be represented as DATA $= (n_{A_1}, n_{a_1}, n_{A_2}, n_{a_2})$; haplotype frequencies are not relevant in the model that follows. If, given the allelic frequencies $\mathbf{p} = \{p_l\}$, the numbers of copies of $A_1$ and $A_2$ are independently distributed, i.e., the two loci are in linkage equilibrium, the distribution of the observed numbers of alleles is
$$l(\mathrm{DATA} \mid \mathbf{p}) = \prod_{l=1}^{2} \binom{2N}{n_{A_l}}\, p_l^{\,n_{A_l}} (1-p_l)^{n_{a_l}}. \tag{18}$$
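For a single locus, the beta-binomial marginal mentioned above has a standard closed form that is convenient to evaluate on the log scale. The sketch below assumes that standard form, with allelic frequency distributed as Beta(a, c) and 2N alleles scored; the function names `log_beta` and `beta_binomial_pmf` are illustrative.

```python
import math


def log_beta(p, q):
    """Logarithm of the beta function B(p, q)."""
    return math.lgamma(p) + math.lgamma(q) - math.lgamma(p + q)


def beta_binomial_pmf(n_A, N, a, c):
    """Pr(n_A copies of allele A among 2N) when the frequency is Beta(a, c)."""
    m = 2 * N
    log_choose = (math.lgamma(m + 1) - math.lgamma(n_A + 1)
                  - math.lgamma(m - n_A + 1))
    return math.exp(log_choose + log_beta(n_A + a, m - n_A + c)
                    - log_beta(a, c))


if __name__ == "__main__":
    # Distribution of allele counts for N = 5 individuals (10 alleles).
    for k in range(11):
        print(k, round(beta_binomial_pmf(k, 5, 0.5, 2.0), 4))
```

Working with `math.lgamma` avoids overflow of the gamma functions when 2N is large.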
Suppose now that this pair is a realization from a stochastic process where loci are sampled over a conceptual population of pairs, e.g. the two loci are drawn at random from a physical ‘bin’ of a certain length in kilobases. This process generates a covariance structure between allelic frequencies of pairs of loci within the ‘bin’. The alleles would be in LD, marginally, even though being in conditional (given the allelic frequencies) equilibrium. If the unobserved, ‘true’, allelic frequencies follow some bivariate distribution with density g(p1,p2|θ), where θ denotes its parameters, the density of the marginal distribution of the observed number of allelic counts under this association model would be
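$$p_A(\mathrm{DATA} \mid \theta) = \int_0^1\!\!\int_0^1 l(\mathrm{DATA} \mid \mathbf{p})\; g(p_1,p_2 \mid \theta)\; dp_1\, dp_2. \tag{19}$$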
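If a bivariate Olkin-Liu beta distribution with parameters θ = (a,b,c) is assumed, the joint density of allelic frequencies p1 and p2 is as in (14). Letting
$$C(a,b,c) = \frac{\Gamma(a+b+c)}{\Gamma(a)\Gamma(b)\Gamma(c)},$$
the marginal distribution (19) takes the form
$$p_A(\mathrm{DATA}) = \left[\prod_{l=1}^{2}\binom{2N}{n_{A_l}}\right] C(a,b,c) \int_0^1\!\!\int_0^1 \frac{p_1^{\,n_{A_1}+a-1}(1-p_1)^{n_{a_1}+b+c-1}\; p_2^{\,n_{A_2}+b-1}(1-p_2)^{n_{a_2}+a+c-1}}{(1-p_1 p_2)^{a+b+c}}\; dp_1\, dp_2. \tag{20}$$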
This integral cannot be written in closed form, but it can be evaluated using Monte Carlo methods such as importance sampling (e.g. Sorensen & Gianola 2002). One possible scheme is outlined in the Appendix.
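A simpler, if less efficient, alternative to importance sampling is plain Monte Carlo integration: the marginal probability of the allelic counts is the expectation of the conditional likelihood (18) over the bivariate beta density, so it can be approximated by averaging (18) over draws of the allelic frequencies. The sketch below does this for all pairs of counts at once, using the gamma construction to draw the frequencies. It is not the scheme outlined in the Appendix, and the names `binom_pmf` and `marginal_counts_table` are hypothetical.

```python
import math
import random


def binom_pmf(k, m, p):
    """Binomial probability of k successes in m trials."""
    return math.comb(m, k) * p ** k * (1.0 - p) ** (m - k)


def marginal_counts_table(N, a, b, c, n_draws, rng):
    """table[i][j] approximates p_A(n_A1 = i, n_A2 = j) by plain Monte Carlo."""
    m = 2 * N
    table = [[0.0] * (m + 1) for _ in range(m + 1)]
    for _ in range(n_draws):
        # Draw correlated allelic frequencies from the Olkin-Liu model.
        u = rng.gammavariate(a, 1.0)
        v = rng.gammavariate(b, 1.0)
        w = rng.gammavariate(c, 1.0)
        p1, p2 = u / (u + w), v / (v + w)
        # Given (p1, p2) the two counts are independent binomials.
        row = [binom_pmf(k, m, p1) for k in range(m + 1)]
        col = [binom_pmf(k, m, p2) for k in range(m + 1)]
        for i in range(m + 1):
            for j in range(m + 1):
                table[i][j] += row[i] * col[j]
    return [[cell / n_draws for cell in r] for r in table]


if __name__ == "__main__":
    tab = marginal_counts_table(2, 1.0, 1.0, 0.9, 5000, random.Random(1))
    for r in tab:
        print([round(cell, 4) for cell in r])
```

Because every cell is computed from the same draws, the estimated table sums to one up to rounding error, whatever the number of draws.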
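Under this distribution
$$\alpha_{n_{A_1} n_{A_2}} = \frac{p_A(\mathrm{DATA})}{p_I(\mathrm{DATA})},$$
where $p_I(\mathrm{DATA})$ is the distribution under independence between observed allele numbers, given by the product of two beta-binomial distributions, that is
$$p_I(\mathrm{DATA}) = \left[\prod_{l=1}^{2}\binom{2N}{n_{A_l}}\right] \frac{\Gamma(a+c)}{\Gamma(a)\Gamma(c)}\, \frac{\Gamma(n_{A_1}+a)\,\Gamma(n_{a_1}+c)}{\Gamma(2N+a+c)}\; \frac{\Gamma(b+c)}{\Gamma(b)\Gamma(c)}\, \frac{\Gamma(n_{A_2}+b)\,\Gamma(n_{a_2}+c)}{\Gamma(2N+b+c)}.$$
The ratio can then be expressed as the product of three terms and of an expectation, $E_{B_1,B_2}[h(p_1,p_2)]$, which is defined in the Appendix. Preceding the expectation, the first term involves the parameters of the bivariate beta distribution, the second involves sample size (N) as well, and the third one depends on observed allelic counts.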
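The logarithmic and relative distances between distributions are calculated as
$$D_{AI} = \sum_{n_{A_1}=0}^{2N} \sum_{n_{A_2}=0}^{2N} \log\left(\alpha_{n_{A_1} n_{A_2}}\right) p_A(\mathrm{DATA}),$$
$$D_{IA} = -\sum_{n_{A_1}=0}^{2N} \sum_{n_{A_2}=0}^{2N} \log\left(\alpha_{n_{A_1} n_{A_2}}\right) p_I(\mathrm{DATA}),$$
$$D'_{AI} = \sum_{n_{A_1}=0}^{2N} \sum_{n_{A_2}=0}^{2N} \alpha_{n_{A_1} n_{A_2}}\, p_A(\mathrm{DATA}),$$
and
$$D'_{IA} = \sum_{n_{A_1}=0}^{2N} \sum_{n_{A_2}=0}^{2N} \alpha_{n_{A_1} n_{A_2}}^{-1}\, p_I(\mathrm{DATA}),$$
with θ and θ′ calculated as before. Calculations proceed on an entirely numerical basis.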
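Once the two probability tables are available, the four sums above reduce to a single pass over the cells. The sketch below takes the association and independence distributions as flat lists of cell probabilities (all assumed strictly positive) and returns the distances together with θ = DIA/(DAI + DIA) and θ′ = D′IA/(D′AI + D′IA). The function name and the dictionary keys are illustrative; returning NaN for θ when both logarithmic distances vanish is a choice made here, since the ratio is then undefined.

```python
import math


def distance_indexes(p_assoc, p_indep):
    """Distances and indexes from two pmfs given as equal-length lists.

    Every cell must have positive probability under both models.
    """
    d_ai = d_ia = d_ai_rel = d_ia_rel = 0.0
    for p_a, p_i in zip(p_assoc, p_indep):
        alpha = p_a / p_i               # ratio association / independence
        d_ai += math.log(alpha) * p_a   # logarithmic distance D_AI
        d_ia -= math.log(alpha) * p_i   # logarithmic distance D_IA
        d_ai_rel += alpha * p_a         # relative distance D'_AI
        d_ia_rel += p_i / alpha         # relative distance D'_IA
    kl = d_ai + d_ia
    theta = d_ia / kl if kl > 0.0 else float("nan")
    theta_rel = d_ia_rel / (d_ai_rel + d_ia_rel)
    return {"D_AI": d_ai, "D_IA": d_ia, "KL": kl, "theta": theta,
            "D_AI_rel": d_ai_rel, "D_IA_rel": d_ia_rel,
            "theta_rel": theta_rel}


if __name__ == "__main__":
    # Toy 2 x 2 example: associated cells versus uniform independence.
    print(distance_indexes([0.4, 0.1, 0.1, 0.4], [0.25] * 4))
```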
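A generalization of the beta process to a situation with more than two loci is in Olkin & Liu (2003). For K loci, with parameters $a_1, \ldots, a_K$ and c, the joint density of the allelic frequencies takes the form
$$f(p_1,\ldots,p_K) = \frac{\Gamma\left(\sum_{k=1}^{K} a_k + c\right)}{\Gamma(c)\prod_{k=1}^{K}\Gamma(a_k)}\; \frac{\prod_{k=1}^{K} p_k^{\,a_k-1}(1-p_k)^{-(a_k+1)}}{\left[1+\sum_{k=1}^{K}\dfrac{p_k}{1-p_k}\right]^{\sum_{k} a_k + c}}.$$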
The marginal distribution of allelic counts is obtained by mixing a multinomial distribution over the multivariate beta density above, but results are not available in closed form, so a numerical solution is needed again.
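Although the multi-locus marginal distribution requires numerical work, drawing allelic frequencies for K loci is straightforward if one assumes, as in the bivariate case, that every locus shares the same gamma variate W. The sketch below makes that assumption explicit; the function name `draw_multivariate_beta` and the parameter values are hypothetical.

```python
import random


def draw_multivariate_beta(shape_params, c, rng):
    """One draw of K correlated frequencies, p_k = u_k / (u_k + w)."""
    w = rng.gammavariate(c, 1.0)  # component shared by all loci
    draws = []
    for a_k in shape_params:
        u_k = rng.gammavariate(a_k, 1.0)  # locus-specific component
        draws.append(u_k / (u_k + w))
    return draws


if __name__ == "__main__":
    rng = random.Random(1)
    sample = [draw_multivariate_beta([1.0, 2.0, 3.0], 2.0, rng)
              for _ in range(20000)]
    # Each marginal is Beta(a_k, c), so the means should be near a_k/(a_k + c).
    for k in range(3):
        print(k, round(sum(s[k] for s in sample) / len(sample), 3))
```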
Lee-Sarmanov bivariate beta distribution
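Lee (1996) proposed a different bivariate beta distribution following Sarmanov (1966); see Danaher & Hardie (2005). A Lee-Sarmanov bivariate beta distribution for allelic frequencies p1 and p2 has density
$$g(p_1,p_2 \mid \theta^{*}) = B_1(p_1)\, B_2(p_2)\left[1 + w\,(p_1-\mu_1)(p_2-\mu_2)\right], \tag{23}$$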
where B1(.) and B2(.) are beta densities as before and w is a parameter (w = 0 produces independence); θ* = (a1,b1,a2,b2,w). Above, $\mu_1 = a_1/(a_1+b_1)$ and $\mu_2 = a_2/(a_2+b_2)$ are the expected values of p1 and p2, respectively. The marginal densities of p1 and p2 under this model are B1(.) and B2(.). Further,
$$E(p_1 p_2) = \mu_1\mu_2 + w\,\mathrm{Var}(p_1)\,\mathrm{Var}(p_2),$$
$$\mathrm{Cov}(p_1,p_2) = w\,\mathrm{Var}(p_1)\,\mathrm{Var}(p_2),$$
so that
$$\mathrm{Corr}(p_1,p_2) = w\sqrt{\mathrm{Var}(p_1)\,\mathrm{Var}(p_2)},$$
with
$$\mathrm{Var}(p_i) = \frac{\mu_i(1-\mu_i)}{a_i+b_i+1}, \quad i = 1,2.$$
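The moment relations above can be inverted to find the value of w that yields a target correlation, and the smallest value of the factor 1 + w(p1 − μ1)(p2 − μ2) over the unit square shows whether that factor stays positive everywhere. The following sketch does both; the function names are illustrative. Since the factor is bilinear in (p1, p2), its minimum is attained at one of the four corners of the unit square.

```python
import math


def beta_mean_var(a, b):
    """Mean and variance of a Beta(a, b) random variable."""
    mu = a / (a + b)
    return mu, mu * (1.0 - mu) / (a + b + 1.0)


def sarmanov_w(rho, a1, b1, a2, b2):
    """Value of w giving correlation rho between p1 and p2."""
    _, var1 = beta_mean_var(a1, b1)
    _, var2 = beta_mean_var(a2, b2)
    return rho / math.sqrt(var1 * var2)


def min_density_ratio(w, a1, b1, a2, b2):
    """Minimum of 1 + w (p1 - mu1)(p2 - mu2) over the unit square."""
    mu1, _ = beta_mean_var(a1, b1)
    mu2, _ = beta_mean_var(a2, b2)
    return min(1.0 + w * (p1 - mu1) * (p2 - mu2)
               for p1 in (0.0, 1.0) for p2 in (0.0, 1.0))


if __name__ == "__main__":
    for rho in (0.25, 0.75, 0.9):
        w = sarmanov_w(rho, 2, 2, 2, 2)
        print(rho, round(w, 2), round(min_density_ratio(w, 2, 2, 2, 2), 3))
```

A negative minimum indicates that the factor becomes negative somewhere in the unit square for that combination of w and beta parameters.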
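In the Lee-Sarmanov bivariate beta distribution, a null correlation implies independence. Under this model the density ratio becomes
$$\alpha_{p_1 p_2} = 1 + w\,(p_1-\mu_1)(p_2-\mu_2).$$
Taking its expectation under the association model gives $D'_{AI} = 1 + w\,\mathrm{Cov}(p_1,p_2) = 1 + \mathrm{Corr}^2(p_1,p_2)$.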
The index θ′ was evaluated for three sets of values of the a,b parameters for each of the two intervening beta distributions while setting the correlation at 0, 1/4, 3/4 or 9/10. Metric D′AI in (24) follows directly from the correlation, while D′IA was estimated using 1 million random samples from each of the two beta distributions. For each set of a,b parameters w was calibrated via
$$w = \frac{\mathrm{Corr}(p_1,p_2)}{\sqrt{\mathrm{Var}(p_1)\,\mathrm{Var}(p_2)}}.$$
Note that D′AI is solely a function of the correlation, so it was the same for the three pairs of beta distributions examined. Using (25), D′IA was estimated by averaging $\alpha_{p_1 p_2}^{-1} = [1 + w(p_1-\mu_1)(p_2-\mu_2)]^{-1}$ over draws. Because $\alpha_{p_1 p_2}^{-1}$ is a ratio between density functions, negative realizations were not used when forming the Monte Carlo estimate (this produces an upward bias but a more precise estimate). As shown in Table 4, the KL distance between the bivariate beta distribution and the independence process increased monotonically with the strength of the correlation. However, the indexes of association θ′ and 1 − θ′ drifted away monotonically from 1/2 only in setting 2. For instance, in setting 1 θ′ increased from 0.50 to 0.66 as the correlation increased from 0 to 3/4, but decreased to 0.60 when the correlation was 9/10. The shape of the bivariate beta distribution was examined for this setting by plotting the density at 20 000 random points drawn from each of the marginal Beta(2,2) distributions. This is shown in Fig. 5: the Lee-Sarmanov density ‘evolves’ towards a more complex topography as correlation increases. When correlation grows from 0 to 1/4, the shape of the joint distribution is not too different from that under independence, as also suggested by KL and θ′. However, when the correlation increases by a factor of 3, from 1/4 to 3/4, the KL distance grows only about two times, while θ′ and 1 − θ′ move away from 1/2 at an even slower rate; this is expected because θ has an upper bound at 1, whereas KL can take any positive value on the real line. The bimodal shape of the density also suggests that more intensive sampling is needed to obtain reliable estimates of the indexes of association.
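The estimation procedure described in this paragraph can be sketched as follows. Two details are assumptions made for the illustration rather than statements about the original computations: non-positive realizations of the density ratio are discarded and the average is taken over the retained draws only, and D′AI is computed as one plus the squared correlation. The function name `lee_sarmanov_indexes` and the dictionary keys are hypothetical.

```python
import math
import random


def lee_sarmanov_indexes(rho, a1, b1, a2, b2, n_draws, rng):
    """Monte Carlo estimate of D'_IA and theta' for a target correlation."""
    mu1, mu2 = a1 / (a1 + b1), a2 / (a2 + b2)
    var1 = mu1 * (1.0 - mu1) / (a1 + b1 + 1.0)
    var2 = mu2 * (1.0 - mu2) / (a2 + b2 + 1.0)
    w = rho / math.sqrt(var1 * var2)  # calibrate w to the target correlation
    d_ai = 1.0 + rho ** 2             # expectation of the ratio under association
    total, kept, negative = 0.0, 0, 0
    for _ in range(n_draws):
        # Independence model: separate draws from the two beta marginals.
        p1 = rng.betavariate(a1, b1)
        p2 = rng.betavariate(a2, b2)
        alpha = 1.0 + w * (p1 - mu1) * (p2 - mu2)
        if alpha <= 0.0:              # discard non-positive ratios
            negative += 1
            continue
        total += 1.0 / alpha
        kept += 1
    d_ia = total / kept
    return {"w": w, "D_AI": d_ai, "D_IA": d_ia, "KL": d_ai + d_ia,
            "theta": d_ia / (d_ai + d_ia),
            "pct_negative": 100.0 * negative / n_draws}


if __name__ == "__main__":
    rng = random.Random(1)
    for rho in (0.0, 0.25, 0.75, 0.9):
        print(rho, lee_sarmanov_indexes(rho, 2, 2, 2, 2, 50000, rng))
```

Because the reciprocal of the ratio can be very large when the ratio is close to zero, estimates from modest numbers of draws can vary considerably between runs.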
Figure 5. Plots of the densities of four Lee-Sarmanov bivariate beta distributions having the same Beta(2,2) marginal distributions but differing in the strength of association: as the correlation increases, multi-modality emerges.
Table 4. Departures of Lee-Sarmanov bivariate beta distributions with varying correlation from independence, as measured by D′AI, D′IA and by indexes of association θ′ and 1 − θ′ (independence holds when θ′ = 1/2). The four correlation levels depend on the value of parameter w of the bivariate distribution, and on parameters a,b of the two ‘parental’ beta distributions. D′AI has closed form and D′IA was estimated using 1 million samples from the beta distributions. KL′ = D′AI + D′IA; θ′ = D′IA/(D′AI + D′IA)
| Setting / Correlation (w) | D′AI | D′IA (% negative) | KL′ | θ′ | 1 − θ′ |
| --- | --- | --- | --- | --- | --- |
| (1) a1 = b1 = a2 = b2 = 2 | | | | | |
| 0 (0) | 1 | 1 | 2 | 0.5000 | 0.5000 |
| 1/4 (5) | 1.0625 | 1.2681 (0.0) | 2.3307 | 0.5441 | 0.4559 |
| 3/4 (15) | 1.5625 | 3.0985 (8.5) | 4.6610 | 0.6648 | 0.3352 |
| 9/10 (18) | 1.8073 | 2.7369 (11.2) | 4.5468 | 0.6019 | 0.3981 |
| (2) a1 = 10; b1 = 90; a2 = 90; b2 = 10 | | | | | |
| 0 (0) | 1 | 1 | 2 | 0.5000 | 0.5000 |
| 1/4 (281) | 1.0625 | 1.1466 (0.4) | 2.2091 | 0.5190 | 0.4810 |
| 3/4 (842) | 1.5625 | 2.6824 (6.3) | 4.2449 | 0.6319 | 0.3681 |
| 9/10 (1010) | 1.8073 | 4.4937 (8.7) | 6.3037 | 0.7128 | 0.2871 |
| (3) a1 = .25; b1 = .75; a2 = .75; b2 = .25 | | | | | |
| 0 (0) | 1 | 1 | 2 | 0.5000 | 0.5000 |
| 1/4 (135) | 1.0625 | 1.3766 (0.9) | 2.4391 | 0.5644 | 0.4356 |
| 3/4 (404) | 1.5625 | 1.5094 (4.9) | 3.0719 | 0.4914 | 0.5086 |
| 9/10 (485) | 1.8073 | 1.5876 (5.6) | 3.3976 | 0.4673 | 0.5327 |
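As shown in Danaher & Hardie (2005), the marginal distribution of the observed allelic counts DATA $= (n_{A_1}, n_{a_1}, n_{A_2}, n_{a_2})$ is obtained by mixing (18) over the Lee-Sarmanov density (23), leading to
$$p_A(\mathrm{DATA}) = BB_1(n_{A_1})\, BB_2(n_{A_2}) \left[1 + w\left(\frac{n_{A_1}+a_1}{2N+a_1+b_1}-\mu_1\right)\left(\frac{n_{A_2}+a_2}{2N+a_2+b_2}-\mu_2\right)\right], \tag{26}$$
where $BB_l(n_{A_l})$ is the beta-binomial probability of observing $n_{A_l}$ copies among 2N when the allelic frequency follows a Beta$(a_l, b_l)$ distribution.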
A multivariate generalization of (26) is direct, but it appears not to have been reported elsewhere. The joint density of the allelic frequencies at L loci takes the form
denote the beta-binomial distribution of the observed number of copies of allele Al, out of 2N copies. The marginal distribution of allelic counts over all loci can be shown to be, after algebra
We propose (for obvious reasons) to name (27) and (28) as multivariate beta GMS-Sarmanov and multivariate beta-binomial GMS-Sarmanov distributions, respectively.
Discussion
The problem addressed in this study was that of measuring departures of the joint distribution of genetic variables from independence. New measures of association based on notions of statistical distance were proposed and evaluated under several scenarios, spanning from multivariate Gaussian distributions modeling, say, pleiotropic effects, to systems of beta distributions describing association between allelic frequencies. Two hitherto seemingly unreported probability distributions were derived, and termed GMS-Sarmanov multivariate beta and GMS-Sarmanov multivariate beta-binomial distributions. The standard LD problem was also dealt with to illustrate the generality of the approaches proposed.
Linkage disequilibrium analysis has been the subject of an enormous amount of research, re-energized with the availability of massive molecular marker and sequence data. For example, in the context of coalescent theory, an important issue is the decoupling of ancestries of sites at different regions of the chromosome, and this is done by studying association between alleles at different loci (Wakeley 2009). Most of the standard measures of LD employed are pairwise statistics such as correlations (e.g. Hill & Robertson 1968; Hill 1974; Hedrick 1987; Lewontin 1988; Morton et al. 2001; Pritchard & Przeworski 2001; McVean 2007), because of ease of calculation and, perhaps, ease of interpretation. However, as pointed out by Lewontin (1988), although a correlation is a meaningful parameter (e.g., in terms of the amount of variability of Y explained when X is observed) in a bivariate normal distribution, it is arguably less so when applied to discrete data, a problem that is well known in quantitative genetic analysis of discrete phenotypes (Dempster & Lerner 1950; Gianola 1982). Further, pairwise measures do not characterize association well if a given genetic system involves many correlated random variables, as in multi-locus measures of LD. Actually, such measures have been reported much less often, e.g. Weir (1996) gives formulae for up to four loci, and Sabatti and Risch (2002) give a measure based on excess of heterozygosity or of homozygosity.
Nothnagel et al. (2002) presented an entropy-based index of LD that is related to our approach, but that differs in some respects. These authors calculate the entropy of the allelic distributions under linkage equilibrium and under disequilibrium, and express the difference in entropy as a fraction of the equilibrium entropy. This produces a normalized entropy that takes values between 0 and 1 (0 indicating no association between alleles at different loci). A formal objection is that, while entropy measures uncertainty in a distribution, its relationship to statistical distance between distributions (Kullback 1968) requires more elaboration. Their method is for bi-allelic loci only, and entropy does not always behave well in continuous distributions (Bernardo and Smith, 1994; Sorensen & Gianola 2002), whereas relative entropy measures such as the KL distance are well defined.
Closer to the spirit of our procedures, Liu & Lin (2005) suggested a measure of LD based on a relative KL discrepancy, but it differs from the ones we propose. This is mainly because they use only one of the two components (termed DAI in our paper) of the invariant KL distance, and express it relative to the maximum value it can take. Using their procedures with DIA would produce a different value of association, and a different qualitative picture might emerge from analysis of genetic data. However, their ideas can be embedded in our approach, and expressing association relative to a maximum distance is a well-taken point. It turns out that these results can also be adapted to the continuous domain, with some care. To illustrate this, it suffices to consider two random vectors, x and y, as generalization to a higher-dimensional system is straightforward. Let H(x), H(y), and H(x,y) be the entropies (non-negative, by construction, although pathological examples may arise in continuous distributions) of the distributions of x, y, and (x,y), respectively, with
H(x)=−∫[logp(x)]p(x)dx,
H(y)=−∫[logp(y)]p(y)dy,
and
H(x,y)=−∫[logp(x,y)]p(x,y)dxdy.
The discrepancy away from the association model is
$$D_{AI} = H(x) + H(y) - H(x,y), \qquad H(x,y) = H(x) + H(y \mid x), \tag{29}$$
where H(y|x) is the entropy of the conditional distribution of y given x. This implies that H(x, y) ≥ H(x), where x can also be any partition of x, e.g. its ith coordinate $x_i$, say, and similarly H(x, y) ≥ H(y). Thus, H(x, y) ≥ $\max_i H(x_i)$. Using this in (29)
$$D_{AI} \le H(x) + H(y) - \max_i H(x_i) = D_{AI}^{\max}.$$
This leads to additional indexes of association, such as
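$$\theta^{*} = \frac{D_{IA}}{D_{IA}^{\max} + D_{AI}^{\max}}, \tag{31}$$
and
$$\theta^{**} = \frac{D_{IA} + D_{AI}}{D_{IA}^{\max} + D_{AI}^{\max}} = \frac{KL}{KL^{\max}}, \tag{32}$$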
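both taking values between 0 and 1. These indexes were evaluated for a bivariate normal distribution with correlation ρ = 6/10 and marginal distributions $x \sim N_1(0,1)$ and $y \sim N_2(0,1)$. One obtains
$$H(x) = H(y) = \tfrac{1}{2}\left(1 + \log 2\pi\right) \approx 1.4189,$$
$$D_{IA} = \rho^2 (1-\rho^2)^{-1} + \log\sqrt{1-\rho^2} \approx 0.3394,$$
$$D_{AI} = -\log\sqrt{1-\rho^2} \approx 0.2231,$$
$$KL = 0.3394 + 0.2231 = 0.5625,$$
$$\theta = \frac{0.3394}{0.5625} = 0.6033,$$
$$D_{AI}^{\max} = 2 \times 1.4189 - 1.4189 = 1.4189,$$
$$C(x,y) = -\log\frac{1}{2\pi\sqrt{1-\rho^2}} + \frac{1}{1-\rho^2} = 3.1772,$$
$$D_{IA}^{\max} = 3.1772 - 1.4189 = 1.7583,$$
$$\theta^{*} = \frac{0.3394}{1.4189 + 1.7583} = 0.1068,$$
and
$$\theta^{**} = \frac{0.5625}{1.4189 + 1.7583} = 0.1770.$$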
The four indexes, ρ, θ, θ* and θ**, produce different values of the strength of association between random variables in a distribution. This suggests that there is probably no such thing as a universal measure, although it is clear that the coefficient of correlation lacks generality.
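The arithmetic of this bivariate normal example can be checked with a short script. The sketch below simply evaluates the closed-form expressions given above for a chosen correlation; the function name `bivariate_normal_indexes` and the dictionary keys are illustrative.

```python
import math


def bivariate_normal_indexes(rho):
    """Indexes of association for a standard bivariate normal distribution."""
    h_marginal = 0.5 * (1.0 + math.log(2.0 * math.pi))  # H(x) = H(y)
    log_root = 0.5 * math.log(1.0 - rho ** 2)           # log sqrt(1 - rho^2)
    d_ia = rho ** 2 / (1.0 - rho ** 2) + log_root
    d_ai = -log_root
    kl = d_ia + d_ai
    d_ai_max = 2.0 * h_marginal - h_marginal
    cross = math.log(2.0 * math.pi) + log_root + 1.0 / (1.0 - rho ** 2)
    d_ia_max = cross - h_marginal
    return {"H": h_marginal, "D_IA": d_ia, "D_AI": d_ai, "KL": kl,
            "theta": d_ia / kl, "D_AI_max": d_ai_max, "C": cross,
            "D_IA_max": d_ia_max,
            "theta_star": d_ia / (d_ia_max + d_ai_max),
            "theta_2star": kl / (d_ia_max + d_ai_max)}


if __name__ == "__main__":
    for name, value in bivariate_normal_indexes(0.6).items():
        print(name, round(value, 4))
```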
In a nutshell, this paper presents measures of association for systems of genetic variables that go beyond standard two-dimensional statistics. The procedures apply to either continuous or discrete data, and typically require numerical implementation, because many of the expressions are not available in closed form, depending on the distribution assumed. Our procedures (like any other method) require knowledge of the parameters of the distributions under independence or association, and estimation of such parameters is not the objective of this paper. Computer-intensive approaches for parameter inference, such as Bayesian Markov chain Monte Carlo (Sorensen & Gianola 2002; Gelman et al. 2004) or approximate Bayesian computation (Beaumont et al. 2002), can be implemented effectively in today's high-throughput systems (Wu et al. 2011).
Finally, since this volume is in honor of the contributions of Professor Moshe Soller, a relationship between this paper and his work should be established. As noted at the outset of this document, among many other accomplishments, he pioneered marker-assisted selection in animals and plants via exploitation of linkage and LD relationships between markers and unknown quantitative trait loci. Examples of his papers in this area include Soller & Genizi (1978) and Lipkin et al. (2009), and these connect with some of the developments presented here. We look forward to many more scientific accomplishments by Moshe!
Acknowledgements
Part of this work was carried out while the senior author was a Visiting Professor at Georg-August-Universität, Göttingen (Alexander von Humboldt Foundation Senior Researcher Award) and Visiting Scientist at the Station d'Amélioration Génétique des Animaux, Centre de Recherche de Toulouse (Chaire D'Excellence Pierre de Fermat, Agence Innovation, Midi-Pyrénées). Support by the Wisconsin Agriculture Experiment as well as by the Deutsche Forschungsgemeinschaft (GRK 1644/1) is acknowledged. W.G. Hill is thanked for useful comments.
Conflicts of interest
The authors have no conflicts to declare.
Appendix
Bivariate and univariate t-distributions
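The density of a K-variate random vector following a multivariate t-distribution is
$$p(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}, \nu) = \frac{\Gamma\left(\frac{\nu+K}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)(\nu\pi)^{K/2}\,|\boldsymbol{\Sigma}|^{1/2}} \left[1 + \frac{(\mathbf{x}-\boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})}{\nu}\right]^{-\frac{\nu+K}{2}},$$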
where μ = E(x), Σ is the scale matrix, ν > 0 is the degrees of freedom parameter and Γ(.) is the gamma function; the covariance matrix of this distribution is Σν/(ν−2). The specification K = 2, μ = 0 and Σ = I2 yields a bivariate t-distribution with null mean and covariance matrix Var(x) = I2ν/(ν−2); the two marginal distributions are univariate-t, each having a null mean, variance ν/(ν−2) and ν degrees of freedom. Although uncorrelated, these two random variables are statistically dependent because the joint density does not factor into the product of the two marginal t densities, that is, $p(x_1,x_2) \ne p(x_1)\,p(x_2)$.
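The draws from the bivariate t-distribution needed for evaluating D′AI and D′IA are obtained as
$$\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} z_1 \\ z_2 \end{bmatrix} \sqrt{\frac{\nu}{\chi^2_{\nu}}}.$$
Under independence, the sampling from two univariate t-distributions is done as
$$\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} z_1 \sqrt{\nu/\chi^2_{\nu,1}} \\ z_2 \sqrt{\nu/\chi^2_{\nu,2}} \end{bmatrix}.$$
Above, the z’s are independent draws from N(0,1), $\chi^2_{\nu}$ is a draw from a central chi-square distribution on ν degrees of freedom, and $\chi^2_{\nu,1}$ and $\chi^2_{\nu,2}$ are two independent draws from the same distribution.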
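The two sampling schemes differ only in whether the chi-square variate is shared. The following sketch implements both with the Python standard library, drawing the chi-square variate as a gamma variate with shape ν/2 and scale 2. The correlation between absolute values, used here as a simple symptom of dependence, is an illustrative diagnostic chosen for this sketch and is not one of the indexes of the paper; all function names are hypothetical.

```python
import math
import random


def draw_bivariate_t(nu, rng):
    """One draw from the bivariate t: both coordinates share one chi-square."""
    chi2 = rng.gammavariate(nu / 2.0, 2.0)  # chi-square on nu d.f.
    scale = math.sqrt(nu / chi2)
    return rng.gauss(0.0, 1.0) * scale, rng.gauss(0.0, 1.0) * scale


def draw_independent_t(nu, rng):
    """Two independent univariate t draws, each with its own chi-square."""
    x1 = rng.gauss(0.0, 1.0) * math.sqrt(nu / rng.gammavariate(nu / 2.0, 2.0))
    x2 = rng.gauss(0.0, 1.0) * math.sqrt(nu / rng.gammavariate(nu / 2.0, 2.0))
    return x1, x2


def abs_correlation(pairs):
    """Pearson correlation between |x1| and |x2|."""
    n = len(pairs)
    a = [abs(x) for x, _ in pairs]
    b = [abs(y) for _, y in pairs]
    ma, mb = sum(a) / n, sum(b) / n
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    return sab / math.sqrt(saa * sbb)


if __name__ == "__main__":
    rng = random.Random(1)
    joint = [draw_bivariate_t(5.0, rng) for _ in range(40000)]
    indep = [draw_independent_t(5.0, rng) for _ in range(40000)]
    print(round(abs_correlation(joint), 3), round(abs_correlation(indep), 3))
```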
When allelic frequencies are independent, their joint density is the product of the densities of the beta random variables $B_1(n_{A_1}+a,\, n_{a_1}+b+c)$ and $B_2(n_{A_2}+b,\, n_{a_2}+a+c)$. However, this does not provide a suitable importance sampling distribution, because dependency would not be recognized at all, with the sample space not visited appropriately. One way of introducing dependency is via an importance sampling density with two dependent beta distributions, such as
(34)
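Distribution $B_2'$ depends on $p_1$, and has expectation
$$\frac{n_{A_2}+b}{n_{A_2}+b+(n_{a_2}+a+c)\,\dfrac{1-p_1}{p_1}},$$
which would be equal to the mean value under independence only if $p_1 = 0.5$. The double integral in (20) can be represented as
$$\int_0^1\!\!\int_0^1 \frac{p_1^{\,n_{A_1}+a-1}(1-p_1)^{n_{a_1}+b+c-1}\; p_2^{\,n_{A_2}+b-1}(1-p_2)^{n_{a_2}+a+c-1}}{(1-p_1 p_2)^{a+b+c}}\; dp_1\, dp_2 = \frac{\Gamma(n_{A_1}+a)\,\Gamma(n_{a_1}+b+c)}{\Gamma(2N+a+b+c)}\; E_{B_1 B_2'}\left[h(p_1,p_2)\right],$$
where $h(p_1,p_2)$ collects the remaining factors of the integrand and $E_{B_1 B_2'}[\cdot]$ denotes expectation with respect to the joint distribution with density $B_1 B_2'$. Using this in (20), the marginal distribution of the data under the association model is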
(35)
Expression (35) is a seemingly unreported discrete probability distribution which we term ‘Olkin-Liu bivariate beta-binomial distribution’.