Keywords:

  • association indexes;
  • entropy;
  • gene frequencies;
  • Kullback-Leibler distance;
  • linkage disequilibrium;
  • multivariate beta distributions;
  • pleiotropy

Summary

  1. Top of page
  2. Summary
  3. Introduction
  4. Measuring association among variables
  5. Multivariate Gaussian and t-distributed variables
  6. Linkage disequilibrium
  7. Systems of multivariate beta processes
  8. Discussion
  9. Acknowledgements
  10. Conflicts of interest
  11. References
  12. Appendix

Systems involving many variables are important in population and quantitative genetics, for example, in multi-trait prediction of breeding values and in exploration of multi-locus associations. We studied departures of the joint distribution of sets of genetic variables from independence. New measures of association based on notions of statistical distance between distributions are presented. These are more general than correlations, which are pairwise measures, and lack a clear interpretation beyond the bivariate normal distribution. Our measures are based on logarithmic (Kullback-Leibler) and on relative ‘distances’ between distributions. Indexes of association are developed and illustrated for quantitative genetics settings in which the joint distribution of the variables is either multivariate normal or multivariate-t, and we show how the indexes can be used to study linkage disequilibrium in a two-locus system with multiple alleles and present applications to systems of correlated beta distributions. Two multivariate beta and multivariate beta-binomial processes are examined, and new distributions are introduced: the GMS-Sarmanov multivariate beta and its beta-binomial counterpart.


Introduction

The analysis of systems involving several variables is of great importance in population and quantitative genetics, for example, in multi-trait prediction of breeding values or in the study of networks using multi-locus data. The problem addressed here is that of measuring departures of the joint distribution of sets of genetic variables from stochastic independence. The target variables could be, for instance, pleiotropic additive effects of many loci under the infinitesimal model of Fisher (1918), correlated allelic frequencies of loci within bins of physical distance defined by numbers of base pairs, or sets of allelic frequencies that vary at random over clusters of individuals or sub-populations. Likewise, a study may involve systems of alleles whose joint distribution departs from independence, due either to proximity within a chromosome (linkage) or to forces such as random drift or selection favoring epistatic combinations, which may also create associations among loci on different chromosomes. The statistical association between allele states at different loci is usually referred to as gametic or ‘linkage disequilibrium’ (LD), and pairwise correlation measures are typically employed to characterize it. If alleles at two loci are associated due to physical linkage, equilibrium (independence) is attained asymptotically as generations of random mating accrue. LD has received enormous attention in population genetics (e.g., Hill 1974; Hedrick 1987; Lewontin 1988; McVean 2007) and has become an increasingly important topic of research in the light of emerging genomic data; for example, single nucleotide polymorphism markers and sequence data enable the study of the joint evolution of genomic blocks as well as of genome-wide association with complex disease or quantitative traits.
This last area is one in which Professor Morris Soller has made important contributions, for example, a pioneering paper by Soller & Genizi (1978) in marker-assisted selection, and his 1990 work with Weller and Kashi on designs for inferring linkage between markers and quantitative trait loci in dairy cattle. The present paper is in his honor.

The objective of this work is to introduce new measures of association involving a system of variables based on notions of statistical distance between distributions. The resulting metrics are more general than correlations, which are pairwise measures only and lack a parametric interpretation beyond the bivariate normal distribution (e.g., Lewontin 1988), especially when associations are non-linear or when pairs of variables are neither independent nor identically distributed. For example, in the study of gametic disequilibrium, the distribution of correlation-based estimators is frequency dependent (Hedrick 1987), thus complicating comparisons among populations and meta-analyses. Also, there are situations such as in a multivariate t-distribution with a diagonal covariance matrix in which variables are uncorrelated and, yet, stochastically dependent. This illustrates that lack of correlation does not provide definitive evidence of independence: in this example, variable X, say, informs about Y even though uncorrelated. A related question is the inability of correlation to measure association between, say, 10 loci or 20 pleiotropic effects. The measures of association developed in this study are based on departure between the distributions of the variables under independence and association assumptions. Indexes measuring association between random variables, irrespective of their number or of the form of the joint distribution, are suggested and illustrated.

The article is organized as follows. Following this introduction, measures based on logarithmic and on relative ‘distances’ between distributions, as well as indexes of association, are presented. Subsequently, these indexes are developed and illustrated for quantitative genetics settings in which the joint distribution of the variables under consideration is either multivariate normal or multivariate-t. The fourth section shows how these indexes can be used to study LD in a two-locus system with multiple alleles; here, a correlation would not have much meaning, if at all. The fifth section presents applications to systems of correlated beta distributions; all these are generalizations of the univariate beta distribution, which has been used extensively in population genetics to study evolution of gene frequencies. In particular, two multivariate beta and multivariate beta-binomial processes are discussed, and new distributions are introduced: the GMS-Sarmanov multivariate beta and its beta-binomial counterpart. The paper concludes with a discussion of the concepts and of other procedures that have been proposed for study of multi-locus LD. Mathematical details are relegated to an appendix.

Measuring association among variables

To motivate the approach used here, consider a pair X, Y of random variables with realized values x, y; the ideas generalize readily to distributions of higher dimension simply by replacing scalars by vectors. Let p(x,y) and p(x) × p(y) be the densities of the joint distributions under association and independence, respectively; p(x) and p(y) are the marginal densities of X and Y respectively. Association occurs whenever p(y|x) (or p(x|y)) differs from p(y) (or p(x)). Measures of dependence (association) can be derived from the density ratio

  $$\alpha_{xy} = \frac{p(x,y)}{p(x)\,p(y)} = \frac{p(y \mid x)}{p(y)} = \frac{p(x \mid y)}{p(x)} \qquad (1)$$

If αxy = 1 for all pairs of values, the ‘distance’ between distributions is 0 and independence holds. Values of αxy larger or smaller than 1 suggest statistical dependence. For example, αxy larger than 1 indicates ‘coupling’, i.e. the density of certain pairs of values is higher than expected under independence; values smaller than 1 would be indicative of ‘repulsion’. Likewise, strong departures of log αxy from 0 would be indicative of an association between x and y.

We discuss two measures of association based on either expected logarithmic distance between distributions or on expected relative distance, as conveyed by the average values of log αxy and of αxy respectively. These measures are used subsequently to develop indexes of association, irrespective of the number of variables in a system.

Association based on Kullback-Leibler distance

The Kullback-Leibler (KL) logarithmic distance (Kullback 1968) is defined as the sum of two KL discrepancies between the distributions: one under association and the other under independence. The first discrepancy measures departure from association when the latter is true, and the other measures departure from independence when this attribute holds. The two discrepancies are positive by construction and are defined as

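  $$D_{AI} = \int\!\!\int \log\left[\frac{p(x,y)}{p(x)\,p(y)}\right] p(x,y)\,dx\,dy = E_{p(x,y)}\left(\log \alpha_{xy}\right)$$

which measures discrepancy when it is true that X and Y are associated, and

  $$D_{IA} = \int\!\!\int \log\left[\frac{p(x)\,p(y)}{p(x,y)}\right] p(x)\,p(y)\,dx\,dy = -E_{p(x)p(y)}\left(\log \alpha_{xy}\right)$$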

measuring discrepancy when independence holds; the range of integration depends on the random processes involved. If the two distributions are identical, both discrepancies are null; on the other hand, large discrepancies reflect stochastic dependence. Above, Ep(x,y) and Ep(x)p(y) denote expectations taken under the assumptions that (X,Y) follow either a bivariate distribution (association) or that these variables are independently distributed.

A connection with the likelihood ratio statistic is fairly direct. Suppose that the parameters of the distributions under independence and association are estimated by maximum likelihood (ML), and that these estimates are regarded as if they were ‘true’ values. Then, log αxy, with the densities evaluated at the corresponding ML estimates, would be the log of a ratio between maximized likelihoods. In DAI one would be taking expectations of the log-likelihood ratio over all possible realizations expected if the ‘full model’ (association) were true; on the other hand, in DIA the expectation of the log of the reciprocal likelihood ratio would be taken under the assumption of independence. In this context, DAI and DIA are averages of log-likelihood ratios, as opposed to an evaluation at a single realization (in ML estimation parameters vary whereas observations are taken as fixed). Note that DAI and DIA do not involve any requirement of ‘nesting’ of models, contrary to likelihood ratio tests under regularity conditions, but this is a different issue.

The discrepancies are such that DAI ≠ DIA, and a unique, symmetric KL distance is defined (Kullback 1968) as

  $$KL = D_{AI} + D_{IA} \qquad (2)$$

Then, an index of association (taking values between 0 and 1) is obtained by normalizing DIA as

  $$\theta = \frac{D_{IA}}{KL} = \frac{D_{IA}}{D_{AI} + D_{IA}} \qquad (3)$$

with

  $$1 - \theta = \frac{D_{AI}}{D_{AI} + D_{IA}}$$

provided that KL is not null. A large value of DIA relative to DAI (or the opposite) is indicative of situations in which there is more association than what would be expected under independence so that θ or 1 − θ would be close to 1. Thus, values of θ or of 1 − θ away from 1/2 provide evidence of discrepancy between the hypotheses of random ensembles of (x,y) pairs vs. associated ensembles.
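For a discrete pair of variables the integrals become sums, so the two discrepancies and the index θ can be computed directly from a table of joint probabilities. The following is a minimal sketch under that assumption; the function name `kl_association_index` and the 2 × 2 table are illustrative and do not come from the original article.

```python
import math

def kl_association_index(joint):
    """Return (D_AI, D_IA, KL, theta) for a discrete joint probability table.

    `joint` is a list of rows with strictly positive entries that sum to 1.
    """
    row = [sum(r) for r in joint]                # marginal p(x)
    col = [sum(c) for c in zip(*joint)]          # marginal p(y)
    d_ai = 0.0  # expectation of log(alpha_xy) under association
    d_ia = 0.0  # expectation of -log(alpha_xy) under independence
    for i, r in enumerate(joint):
        for j, p_xy in enumerate(r):
            indep = row[i] * col[j]
            log_alpha = math.log(p_xy / indep)
            d_ai += p_xy * log_alpha
            d_ia -= indep * log_alpha
    kl = d_ai + d_ia
    # theta is undefined when the two distributions coincide (KL = 0).
    theta = d_ia / kl if kl > 0 else float("nan")
    return d_ai, d_ia, kl, theta

# Hypothetical table with 'coupling' on the diagonal.
table = [[0.4, 0.1],
         [0.1, 0.4]]
d_ai, d_ia, kl, theta = kl_association_index(table)
```

For a table in which every cell equals the product of its margins the function returns a KL distance of 0 and an undefined (NaN) index, mirroring the proviso that KL must not be null.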

There is no closed form for DIA or DAI for many distributions, in which case these metrics must be calculated numerically or estimated using Monte Carlo methods, given some parameter values. If association is weak, the two logarithmic discrepancies and KL are nearly null, and this can lead to very unstable, even absurd, Monte Carlo estimates of θ. Also, the KL distance and the two discrepancies are logarithmic, and this perhaps provides a less intuitive notion of ‘distance’ than one based directly on αxy. An alternative ‘distance’ measure is presented immediately below.

Association based on relative distance

Define the relative discrepancies

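  $$D'_{AI} = E_{p(x,y)}\left(\alpha_{xy}\right) = \int\!\!\int \frac{p(x,y)}{p(x)\,p(y)}\, p(x,y)\,dx\,dy \qquad (4)$$
  $$D'_{IA} = E_{p(x)p(y)}\left(\alpha_{xy}^{-1}\right) = \int\!\!\int \frac{p(x)\,p(y)}{p(x,y)}\, p(x)\,p(y)\,dx\,dy \qquad (5)$$

and

  $$\theta' = \frac{D'_{IA}}{D'_{AI} + D'_{IA}} \qquad (6)$$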

The metric D′AI gives the expected value of αxy under association and D′IA gives the expectation of 1/αxy under the assumption of independence. These expectations are both positive, by construction, and values of θ′ away from 1/2 can be construed as reflecting departure from independence. For example, values of θ′ close to 0 or to 1 would suggest association. The connection with likelihood ratios surfaces again; for example, D′IA is the average (reciprocal) likelihood ratio over all possible realizations of the data under the independence assumption.
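The relative discrepancies can be evaluated exactly for a discrete joint distribution. The sketch below assumes, as described above, that D′AI is the expectation of αxy under association and that D′IA is the expectation of 1/αxy under independence; the function name `relative_association_index` and the example table are hypothetical.

```python
def relative_association_index(joint):
    """Return (D'_AI, D'_IA, theta') for a discrete joint probability table.

    `joint` is a list of rows with strictly positive entries that sum to 1.
    """
    row = [sum(r) for r in joint]                # marginal p(x)
    col = [sum(c) for c in zip(*joint)]          # marginal p(y)
    d_ai = 0.0  # E[alpha_xy] with weights p(x, y)
    d_ia = 0.0  # E[1 / alpha_xy] with weights p(x) p(y)
    for i, r in enumerate(joint):
        for j, p_xy in enumerate(r):
            indep = row[i] * col[j]
            d_ai += p_xy * (p_xy / indep)
            d_ia += indep * (indep / p_xy)
    return d_ai, d_ia, d_ia / (d_ai + d_ia)

# Hypothetical table with 'coupling' on the diagonal.
rel_ai, rel_ia, theta_prime = relative_association_index([[0.4, 0.1],
                                                          [0.1, 0.4]])
```

Under exact independence both expectations equal 1, so θ′ equals 1/2.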

In what follows, the proposed indexes are examined for some distributions that are used often in quantitative and population genetics research.

Multivariate Gaussian and t-distributed variables

In quantitative genetics, the analysis often focuses on a set of continuous correlated variables X1, X2, …, XK that follow some multivariate distribution; these variates could be, for example, additive genetic effects for K traits. Suppose that a matrix of estimates of additive genetic correlations is available (and taken as a ‘true’ matrix), and that we seek to arrive at a single measure of association among the K genetic values, as opposed to the K(K − 1)/2 pairwise measures provided by the matrix of genetic correlations. Below, examples of multivariate normal distributions with K = 2, 3, 4 and of a bivariate t-distribution are considered to illustrate how θ behaves.

Bivariate normal distribution

Let X1 and X2 be two standardized normally distributed genetic values with correlation ρ. For this distribution DAI = −(1/2) log(1 − ρ²) and DIA = ρ²/(1 − ρ²) + (1/2) log(1 − ρ²) (e.g. Sorensen & Gianola 2002). When the correlation is null, the two discrepancies are 0; when the correlation is perfect, the discrepancies are both ∞. Thus, KL = ρ²(1 − ρ²)⁻¹, which is 0 when the two random variables are independent and ∞ under a perfect correlation. Then

  $$\theta = \frac{D_{IA}}{KL} = 1 + \frac{(1-\rho^{2})\,\log(1-\rho^{2})}{2\rho^{2}} \qquad (7)$$

measures discrepancy away from the independence situation when the latter holds, relative to the KL distance between the two distributions. It can be shown that θ → 1/2 as ρ → 0 and θ → 1 as |ρ| → 1, providing the lower and upper bounds of θ respectively.

The values of θ and of 1 − θ are plotted against ρ in Fig. 1. The dotted line (‘holds water’) gives the relative discrepancy (θ) away from the independence model when this is true, and the dashed line (‘spills water’) depicts the trajectory of 1 − θ, i.e. the relative discrepancy under association; note that a given curve is a mirror image of the other. As the absolute value of the correlation increases θ → 1, as one would expect. Since an association index with values ranging between 0 and 1 may be easier to interpret, an alternative measure could be

  $$\gamma = 2\theta - 1$$

Figure 1. Measures of association of two bivariate Gaussian variables as a function of their correlation (ρ). The straight lines give the strength of the association as measured by the absolute value of ρ. The dotted (‘holds water’: θ) and dashed (‘spills water’: 1 − θ) lines depict the relative contributions to the Kullback-Leibler distance due to discrepancies under independence and dependence models, respectively. Values of the association measure γ=2θ − 1 are represented by the dark solid line.

However, γ takes values in [0,1] provided that θ ≥ 1/2, which holds for a bivariate Gaussian distribution, but not so in general. The relationship between γ and ρ for this Gaussian model is also shown in Fig. 1 (solid line), and the association is suggested as weaker when measured by γ than when assessed with the correlation ρ. For example, if |ρ| = 0.5, γ is smaller than 0.15. In multivariate systems or in other distributions, θ can be much smaller than 1/2 so that γ takes negative values, as illustrated below. Hence, θ often provides an easier to interpret measure of association than does γ.
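The behaviour of θ and γ for the bivariate Gaussian case can be checked numerically. The sketch below assumes the standard expressions for two standardized normal variables, DAI = −(1/2) log(1 − ρ²) and DIA = ρ²/(1 − ρ²) + (1/2) log(1 − ρ²), together with γ = 2θ − 1; the function name `theta_bivariate_normal` is illustrative.

```python
import math

def theta_bivariate_normal(rho):
    """Index theta = D_IA / (D_AI + D_IA) for two standardized normal variables."""
    r2 = rho * rho
    if r2 == 0.0:
        return 0.5      # limiting value as rho -> 0
    if r2 >= 1.0:
        return 1.0      # limiting value as |rho| -> 1
    half_log = 0.5 * math.log1p(-r2)   # (1/2) log(1 - rho^2), a negative number
    d_ai = -half_log
    d_ia = r2 / (1.0 - r2) + half_log
    return d_ia / (d_ai + d_ia)

# theta and gamma = 2 * theta - 1 over a grid of correlations.
grid = [(rho, theta_bivariate_normal(rho)) for rho in (0.1, 0.3, 0.5, 0.7, 0.9)]
gammas = [(rho, 2.0 * theta - 1.0) for rho, theta in grid]
```

Because θ depends on ρ only through ρ², the index is the same for positive and negative correlations in the bivariate case.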

Multivariate normal distribution

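The metrics under consideration can be applied to distributions with any number of dimensions, leading to measures of departure from independence in more general situations, such as when alleles in a multiple-locus system are studied. To illustrate, consider the K-dimensional multivariate Gaussian distribution x ~ N(0, R), where variates are in standard deviation units so that R is a correlation matrix; under independence x ~ N(0, I). It can be shown (Penny & Roberts 2002) that

  $$D_{AI} = \frac{1}{2}\left[\operatorname{tr}(\mathbf{R}) - \log|\mathbf{R}| - K\right] = -\frac{1}{2}\log|\mathbf{R}| \qquad (8)$$

where tr(.) denotes the sum of the diagonal elements of a square matrix such as R. Further,

  $$D_{IA} = \frac{1}{2}\left[\operatorname{tr}(\mathbf{R}^{-1}) + \log|\mathbf{R}| - K\right] \qquad (9)$$

and

  $$KL = D_{AI} + D_{IA} = \frac{1}{2}\left[\operatorname{tr}(\mathbf{R}^{-1}) - K\right] \qquad (10)$$

so that

  $$\theta = \frac{\operatorname{tr}(\mathbf{R}^{-1}) + \log|\mathbf{R}| - K}{\operatorname{tr}(\mathbf{R}^{-1}) - K} \qquad (11)$$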

Examples of several multivariate normal distributions are given next.

Equi-correlated trivariate normal distribution

Let three variables be equi-correlated with correlation ρ, so that

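  $$\mathbf{R} = \begin{bmatrix} 1 & \rho & \rho \\ \rho & 1 & \rho \\ \rho & \rho & 1 \end{bmatrix}$$

with −1/2 < ρ < 1, |R| = (1 − ρ)²(1 + 2ρ) and tr(R⁻¹) = 3(1 + ρ)/[(1 − ρ)(1 + 2ρ)]. Using (8)-(11) yields

  $$D_{AI} = -\frac{1}{2}\log\left[(1-\rho)^{2}(1+2\rho)\right]$$
  $$D_{IA} = \frac{1}{2}\left\{\frac{3(1+\rho)}{(1-\rho)(1+2\rho)} + \log\left[(1-\rho)^{2}(1+2\rho)\right] - 3\right\}$$
  $$KL = \frac{3\rho^{2}}{(1-\rho)(1+2\rho)}$$

and

  $$\theta = 1 + \frac{(1-\rho)(1+2\rho)\,\log\left[(1-\rho)^{2}(1+2\rho)\right]}{6\rho^{2}} \qquad (12)$$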

with 0 < θ < 1. For example, for ρ = 0.25, 0.50 and 0.75, the index of association θ takes values 0.49, 0.54 and 0.66 respectively. Figure 2 depicts θ, 1 − θ and γ as a function of the coefficient of correlation. At ρ = 0, θ = 0.5, but then it decreases slightly as the correlation increases such that γ does not attain a positive value until ρ reaches a value of about 0.3. Note that θ is not symmetric with respect to ρ (the same is true of |R|); for instance, when ρ = −0.25, θ = 0.59 so that association would be viewed as stronger than when ρ = 0.25, since θ = 0.49 then. This is a consequence of the values that the discrepancies DIA and DAI take at varying values of ρ.
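The index for a K-variate Gaussian system can be evaluated for any positive definite correlation matrix. The sketch below assumes the Gaussian expression θ = [tr(R⁻¹) + log|R| − K]/[tr(R⁻¹) − K] and obtains both log|R| and tr(R⁻¹) from a Cholesky factor; the helper names `cholesky`, `theta_mvn` and `equicorrelated` are illustrative, and a pure-Python factorization is used only to keep the example self-contained.

```python
import math

def cholesky(a):
    """Lower-triangular factor L with a = L L^T (a must be positive definite)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("matrix is not positive definite")
                low[i][j] = math.sqrt(s)
            else:
                low[i][j] = s / low[j][j]
    return low

def theta_mvn(r):
    """theta = D_IA / KL for x ~ N(0, R) versus x ~ N(0, I)."""
    k = len(r)
    low = cholesky(r)
    log_det = 2.0 * sum(math.log(low[i][i]) for i in range(k))
    # tr(R^-1) is the squared Frobenius norm of L^-1; each column of L^-1
    # is obtained by forward substitution on a unit vector.
    trace_inv = 0.0
    for col in range(k):
        x = [0.0] * k
        for i in range(col, k):
            rhs = (1.0 if i == col else 0.0) - sum(low[i][m] * x[m] for m in range(col, i))
            x[i] = rhs / low[i][i]
        trace_inv += sum(v * v for v in x)
    d_ia = 0.5 * (trace_inv + log_det - k)
    kl = 0.5 * (trace_inv - k)
    # theta is undefined when R is the identity matrix (KL = 0).
    return d_ia / kl if kl > 0.0 else float("nan")

def equicorrelated(k, rho):
    """k x k correlation matrix with a common off-diagonal correlation rho."""
    return [[1.0 if i == j else rho for j in range(k)] for i in range(k)]

# Equi-correlated trivariate case at the correlations discussed in the text.
trivariate = {rho: theta_mvn(equicorrelated(3, rho)) for rho in (-0.25, 0.25, 0.50, 0.75)}
```

The same function applies unchanged to matrices with unequal correlations, provided they are positive definite.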

Figure 2. Indexes of association as a function of the coefficient of correlation in an equi-correlated trivariate normal distribution. The thick solid line gives the relative Kullback-Leibler discrepancy between distributions when independence holds (θ), and the dashed line gives the relative discrepancy when association is true (1 − θ). The thin line gives the trajectory of γ=2θ − 1.

Tetra-variate normal distributions

Alternative measures can be derived from eigenvalues of a correlation matrix; however, this is sensible only for distributions in which linear relationships between variables are meaningful. The index θ, derived from notions of statistical distance, does not have this limitation because KL discrepancies have a general meaning and can be calculated for any distribution, at least numerically, given values of the parameters. Naturally, any single measure of association in a multivariate distribution becomes less transparent as the system of variables increases in dimension.

Let K = 4 and suppose the variables are equi-correlated with coefficient ρ. Then, |R| = (1 − ρ)³(1 + 3ρ), tr(R⁻¹) = 4(1 + 2ρ)/[(1 − ρ)(1 + 3ρ)] and

  $$\theta = 1 + \frac{(1-\rho)(1+3\rho)\,\log\left[(1-\rho)^{3}(1+3\rho)\right]}{12\rho^{2}} \qquad (13)$$

Figure 3 displays how θ and 1 − θ vary against ρ. At ρ = 0 one has θ = 1/2; θ decreases somewhat subsequently and then increases, attaining 0.5 again at about ρ = 0.45. The values of the index of association suggest that the independence and association models are not ‘too distant’ unless the correlation is below −0.10, if negative, or above ρ = 0.50, if positive.

Figure 3. Discrepancies from independence to association (θ, solid line), and from association to independence (1 − θ, dashed line) as a fraction of the Kullback-Leibler distance between two tetravariate normal distributions. The straight lines give the absolute values of the correlation ρ.

Now let K = 4 and take as correlation matrix

  • display math

where 0 ≤ τ ≤ 1, so that when τ = 0 the four variables are independent. For τ = 1, log |R| = −2.5459 and tr(R⁻¹) = 24.8980, yielding θ = 0.8782. Likewise, for a smaller value of τ, log |R| = −0.2960 and tr(R⁻¹) = 4.6540, so that θ = 0.5473. This illustrates that the strength of the association increases with the strength of inter-correlation, and that θ provides a metric for measuring departure from independence.

We finish this section by posing the following question: what does θ say that is not informed by all possible pairwise correlations? Again, let K = 4 and take as correlation matrices

  • display math

The salient point is that the correlation structure is such that while pairs of variables (1,3), (2,4) and (3,4) are correlated, (1,2) are independent.

What is the overall strength of the association in the system? Calculations yield θ1 = 0.54 and θ2 = 0.91 for the networks of variables characterized by R1 and R2 respectively. While the correlations measure association between pairs of nodes, θ gives an indication of the overall degree of association.

Bivariate t-distribution

As mentioned, there are situations in which random variables are jointly dependent, but yet, their correlation is 0, which would incorrectly suggest that X does not inform about Y. This highlights limitations of the correlation parameter as a metric for statistical dependence between sets of variables. One such case is the multivariate-t distribution, in which variables can be statistically dependent, yet uncorrelated. The univariate and multivariate t-distributions appear in robust linear regression models for quantitative genetic analysis based on pedigrees (e.g., Strandén & Gianola 1998; Sorensen & Gianola 2002; Rosa et al. 2003). The univariate-t distribution has also been used as a prior in Bayesian linear regression models for genome-enabled prediction of single traits (Meuwissen et al. 2001; Gianola et al. 2009).

Suppose two variables, e.g. SNP effects on a pair of traits, follow a bivariate t-distribution with a null mean vector, ν degrees of freedom and a scale matrix that is equal to a 2 × 2 identity matrix; here, the two effects are uncorrelated but not independent, as shown in the Appendix. For this situation, the density ratio (1) is

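  $$\alpha_{xy} = c_{\nu}\, \frac{\left[\left(1+\frac{x^{2}}{\nu}\right)\left(1+\frac{y^{2}}{\nu}\right)\right]^{(\nu+1)/2}}{\left(1+\frac{x^{2}+y^{2}}{\nu}\right)^{(\nu+2)/2}}, \qquad c_{\nu} = \frac{\Gamma\left(\frac{\nu}{2}+1\right)\Gamma\left(\frac{\nu}{2}\right)}{\left[\Gamma\left(\frac{\nu+1}{2}\right)\right]^{2}}$$

so that the relative discrepancies in (4) and (5) become

  $$D'_{AI} = c_{\nu}\, E_{p(x,y)}\left\{\frac{\left[\left(1+\frac{x^{2}}{\nu}\right)\left(1+\frac{y^{2}}{\nu}\right)\right]^{(\nu+1)/2}}{\left(1+\frac{x^{2}+y^{2}}{\nu}\right)^{(\nu+2)/2}}\right\}$$

and

  $$D'_{IA} = c_{\nu}^{-1}\, E_{p(x)p(y)}\left\{\frac{\left(1+\frac{x^{2}+y^{2}}{\nu}\right)^{(\nu+2)/2}}{\left[\left(1+\frac{x^{2}}{\nu}\right)\left(1+\frac{y^{2}}{\nu}\right)\right]^{(\nu+1)/2}}\right\}$$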

The expectations above do not have closed forms but can be evaluated using Monte Carlo methods: given the value of the degrees of freedom parameter, random numbers can be drawn from a bivariate t-distribution and from two independent univariate t-distributions, to approximate D′AI and D′IA.
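The Monte Carlo evaluation just described can be sketched as follows. The code assumes a bivariate t-distribution with identity scale matrix, generated by dividing two independent standard normal draws by a shared chi-square mixing variable, and it assumes that D′AI and D′IA are the averages of αxy and of 1/αxy under association and independence, respectively. All function names, the seed and the default number of draws are illustrative choices. Averages of heavy-tailed ratios can be unstable, so estimates for small degrees of freedom should be read with caution.

```python
import math
import random

def log_alpha_t(x, y, nu):
    """log of p(x, y) / [p(x) p(y)] for a bivariate t with identity scale matrix."""
    log_const = (math.lgamma(nu / 2.0 + 1.0) + math.lgamma(nu / 2.0)
                 - 2.0 * math.lgamma((nu + 1.0) / 2.0))
    log_joint = -(nu + 2.0) / 2.0 * math.log1p((x * x + y * y) / nu)
    log_marg = -(nu + 1.0) / 2.0 * (math.log1p(x * x / nu) + math.log1p(y * y / nu))
    return log_const + log_joint - log_marg

def draw_t_scale(nu, rng):
    """Scale factor sqrt(nu / chi2_nu) that turns normal draws into t draws."""
    chi2 = 2.0 * rng.gammavariate(nu / 2.0, 1.0)
    return math.sqrt(nu / chi2)

def relative_discrepancies_t(nu, n_draws=100_000, seed=1):
    """Monte Carlo estimates of (D'_AI, D'_IA, theta') for a given nu."""
    rng = random.Random(seed)
    sum_ai = 0.0
    sum_ia = 0.0
    for _ in range(n_draws):
        # Association: both coordinates share one chi-square mixing variable.
        s = draw_t_scale(nu, rng)
        x, y = s * rng.gauss(0.0, 1.0), s * rng.gauss(0.0, 1.0)
        sum_ai += math.exp(log_alpha_t(x, y, nu))
        # Independence: each coordinate has its own mixing variable.
        x = draw_t_scale(nu, rng) * rng.gauss(0.0, 1.0)
        y = draw_t_scale(nu, rng) * rng.gauss(0.0, 1.0)
        sum_ia += math.exp(-log_alpha_t(x, y, nu))
    d_ai, d_ia = sum_ai / n_draws, sum_ia / n_draws
    return d_ai, d_ia, d_ia / (d_ai + d_ia)
```

A call such as `relative_discrepancies_t(8.0)` returns one set of estimates; repeating it with different seeds gives a rough sense of the Monte Carlo error.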

The index of association θ′ in (6) was computed for t-distributions with ν = 1, 2, 4, 6, 8, 20 and 100 000. The last value produces an approximation of the joint distribution of two independent normal variates, whereas the setting with ν = 1 evaluates distance between a ‘bivariate Cauchy’ distribution and the distribution of two identically and independently distributed Cauchy random variables. The t-processes with ν = 1 and 2 do not have a finite variance, but the distributions are proper (i.e., the density integrates to 1). As shown in Table 1, with sets of 1 million random numbers used for evaluating D′AI, D′IA and θ′, the ‘distance’ D′AI from association to independence decreases as ν increases, approaching 1, whereas the distance from independence to association D′IA increases, approaching 1 as well. With ν = 100 000 the random variables are essentially independent since θ′ = 1/2, leading to γ′ = 0; the setting ν = 1 produces the strongest possible association, with θ′ = 0 and both 1 − θ′ and γ′ equal to 1.

Table 1. Departures of zero-mean bivariate t-distributions with a 2 × 2 identity matrix as scale parameter (so their correlation is null) from independence as measured by D′AI, D′IA and by the indexes of association θ′ = D′IA/(D′AI + D′IA), 1 − θ′ and γ′ = 2(1 − θ′) − 1 (when θ′ = 1/2 the random variables are independent)

| Degrees of freedom | D′AI | D′IA | D′AI + D′IA | θ′ | 1 − θ′ | γ′ |
|---|---|---|---|---|---|---|
| 1 | 7.28 × 10^9 | 0.4376 | 7.28 × 10^9 | 0.0000 | 1.0000 | 1.0000 |
| 2 | 63438837 | 0.6424 | 6348838 | 0.0000 | 1.0000 | 1.0000 |
| 4 | 20.7443 | 0.7947 | 21.5390 | 0.0369 | 0.9631 | 0.9262 |
| 6 | 2.6299 | 0.8560 | 3.4859 | 0.2456 | 0.7544 | 0.5088 |
| 8 | 1.2715 | 0.8889 | 2.1604 | 0.4115 | 0.5885 | 0.1770 |
| 20 | 1.0584 | 0.9528 | 2.0112 | 0.4738 | 0.5262 | 0.0524 |
| 100 000 | 1.0000 | 1.0000 | 2.0000 | 0.5000 | 0.5000 | 0.0000 |

Linkage disequilibrium

Suppose there are C and R alleles at locus A and locus B, respectively, such that data on gametic types can be arranged into an R×C contingency table. Let pij be the probability of observing a haplotype with configuration ‘ij’. If n gametes are screened and the observed number having such configuration is nij, the assumption of multinomial sampling leads to

  $$\alpha_{xy} = \frac{\dfrac{n!}{\prod_{i,j} n_{ij}!}\prod_{i=1}^{R}\prod_{j=1}^{C} p_{ij}^{\,n_{ij}}}{\dfrac{n!}{\prod_{i,j} n_{ij}!}\prod_{i=1}^{R}\prod_{j=1}^{C} \left(p_{i.}\,p_{.j}\right)^{n_{ij}}} = \prod_{i=1}^{R}\prod_{j=1}^{C}\left(\frac{p_{ij}}{p_{i.}\,p_{.j}}\right)^{n_{ij}}$$

where pi. = Σj pij and p.j = Σi pij are the marginal probabilities of alleles ‘i’ and ‘j’ at loci A and B respectively. Define parameters Dij = pij − pi.p.j, measuring departure from independence (i.e. disequilibrium between alleles i and j). In a two-locus model with two alleles per locus a single D parameter is needed, and by using pij = pi.p.j + Dij, αxy would be the ratio between two likelihoods: the ‘unrestricted’ likelihood (indexed by parameters D, pi., p.j) and the ‘restricted’ likelihood (indexed by the two allelic frequencies). The asymptotic connection between the standard log-likelihood ratio statistic and the χ² metrics used for multi-locus analysis of LD by Hill (1975) and more recently by Zhao et al. (2005) is given, for example, in Agresti (2002).

Under this sampling scheme, the discrepancies become

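  $$D_{AI} = n \sum_{i=1}^{R}\sum_{j=1}^{C} p_{ij}\,\log\left(\frac{p_{ij}}{p_{i.}\,p_{.j}}\right)$$

and

  $$D_{IA} = n \sum_{i=1}^{R}\sum_{j=1}^{C} p_{i.}\,p_{.j}\,\log\left(\frac{p_{i.}\,p_{.j}}{p_{ij}}\right)$$

Hence

  $$\theta = \frac{D_{IA}}{D_{AI} + D_{IA}} = \frac{\sum_{i}\sum_{j} p_{i.}\,p_{.j}\,\log\left(\dfrac{p_{i.}\,p_{.j}}{p_{ij}}\right)}{\sum_{i}\sum_{j} D_{ij}\,\log\left(\dfrac{p_{ij}}{p_{i.}\,p_{.j}}\right)}$$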

Naturally, θ is not defined when all Dij = 0 because then the two distributions would be identical and the KL distance is 0 in that case. Note that DAI is the expected value of the log-likelihood ratio, but under the assumption of association; on the other hand, when the sampling distribution of the log-likelihood statistic is considered, the reference distribution is the null model, i.e., linkage equilibrium. This illustrates an important conceptual difference, apart from the fact that DIA and DAI involve integration over the sampling space of allelic outcomes at the two loci in question.

To illustrate, consider a hypothetical example, patterned after data in Spiess (1989) reproduced partially by Frankham et al. (2010). The data pertain to three and four alleles at the major histocompatibility loci HLA-A and HLA-B respectively; the data, after a modification explained subsequently, are shown in Table 2. Frankham et al. (2010) presented only the data for these alleles, yet there are many more variants at these loci. Relative frequencies in Frankham et al. (2010) were ‘normalized’ such that the 12 joint frequencies would add up to one.

Table 2. Hypothetical data for major histocompatibility loci HLA-A and HLA-B, with three and four alleles respectively. The pij are haplotype frequencies; pi. and p.j give the allelic frequencies. The hypothetical number of observed haplotypes is assumed to be 108

| Locus | A1 | A2 | A3 | Marginal |
|---|---|---|---|---|
| B7 | p11 = 0.0270 | p12 = 0.0950 | inline image | inline image |
| B8 | inline image | inline image | inline image | inline image |
| B35 | inline image | inline image | inline image | inline image |
| B44 | inline image | inline image | inline image | inline image |
| Marginal | inline image | p.2 = 0.3841 | p.3 = 0.3001 | 1.0 inline image |

Taking haplotype frequencies pij as true values, the discrepancy of the independence model away from association is

  • display math
  • display math
  • display math
  • display math
  • display math

The relative contributions of each of the alleles at the B locus to the 0.2928 in DAI above are (after rounding) 22.71%, 61.77%, 13.76% and 1.76% for B7, B8, B35 and B44 respectively, implying that associations with allele B8 are responsible for most of the departure of the independence model away from association, assuming this last one is ‘true’; note that the nij do not enter into the calculations, since the frequencies are viewed as true ones. For example, the relative contribution of B8 to DAI is arrived at as follows:

  • display math
  • display math

and then inline image Likewise,

  • display math

and the relative contributions of the four B-locus alleles to DIA are 16.80%, 53.93%, 12.11% and 17.16% for B7, B8, B35 and B44 respectively, with B8 playing a major role again. Similar calculations can be done for the A locus. The KL distance is 31.6235 + 49.5987 = 81.2222, and the KL distance per haplotype scored (N = 108) is 81.2222/108 = 0.7521. The indexes of association are θ = 49.5987/81.2222 ≈ 0.61 and 1 − θ ≈ 0.39.
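The arithmetic of this example can be reproduced from any table of haplotype frequencies. The sketch below is illustrative only: the function name `ld_discrepancies` and the 2 × 3 frequency table are hypothetical (the Table 2 frequencies are not used), and all frequencies are assumed to be strictly positive. It returns DAI, DIA, θ and the share of DAI attributable to each row allele, and then recomputes θ from the totals reported above.

```python
import math

def ld_discrepancies(p, n):
    """D_AI, D_IA, theta and per-row shares of D_AI for a haplotype table.

    `p[i][j]` is the frequency of the haplotype carrying row allele i and
    column allele j; entries are strictly positive and sum to 1. `n` is the
    number of haplotypes scored under multinomial sampling.
    """
    row = [sum(r) for r in p]                    # allelic frequencies, row locus
    col = [sum(c) for c in zip(*p)]              # allelic frequencies, column locus
    per_row_ai = []
    d_ia = 0.0
    for i, r in enumerate(p):
        contrib = 0.0
        for j, p_ij in enumerate(r):
            indep = row[i] * col[j]
            contrib += p_ij * math.log(p_ij / indep)
            d_ia += indep * math.log(indep / p_ij)
        per_row_ai.append(n * contrib)
    d_ai = sum(per_row_ai)
    d_ia *= n
    theta = d_ia / (d_ai + d_ia)
    shares = [c / d_ai for c in per_row_ai]      # relative contribution of each row allele
    return d_ai, d_ia, theta, shares

# Hypothetical 2 x 3 table of haplotype frequencies (not the Table 2 data).
freqs = [[0.20, 0.05, 0.05],
         [0.10, 0.30, 0.30]]
ld_ai, ld_ia, ld_theta, ld_shares = ld_discrepancies(freqs, n=108)

# theta recomputed from the totals reported in the text.
theta_reported = 49.5987 / (31.6235 + 49.5987)
```

Because θ is a ratio of two quantities that are both proportional to n, it does not depend on the number of haplotypes scored, whereas DAI, DIA and KL do.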

In an R×C table of alleles, a correlation has little meaning and is not invariant with respect to how alleles are scored (e.g. 0, 1, 2) at each of the intervening loci. On the other hand, the chi-square statistics of Hill (1975) and Zhao et al. (2005) are invariant and have a simple interpretation in terms of deviations expected under the null distribution. Our parameter θ is also score invariant, does not depend on the order of rows and columns, and applies to any distribution. For instance, if the simple multinomial sampling model is violated due to, e.g. over-dispersion, an appropriate θ could be developed under such a model, whereas a ‘normalized’ χ² would provide a crude, albeit probably useful, approximation.

The ‘relative distance’ metric takes the form

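  $$\theta' = \frac{D'_{IA}}{D'_{AI} + D'_{IA}}$$

with

  $$D'_{AI} = E_{\{n_{ij}\} \mid \{p_{ij}\}}\left[\prod_{i=1}^{R}\prod_{j=1}^{C}\left(\frac{p_{ij}}{p_{i.}\,p_{.j}}\right)^{n_{ij}}\right]$$

and

  $$D'_{IA} = E_{\{n_{ij}\} \mid \{p_{i.}p_{.j}\}}\left[\prod_{i=1}^{R}\prod_{j=1}^{C}\left(\frac{p_{i.}\,p_{.j}}{p_{ij}}\right)^{n_{ij}}\right]$$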

We evaluated D′AI, D′IA and θ′ = D′IA/(D′IA + D′AI) using Monte Carlo sampling. In the calculation of D′AI one million random trials of size 108 each, with {nij} varying at random, were sampled from a multinomial distribution with probabilities pij; then, αxy was evaluated for each realization and averaged over trials. Similarly, one million trials of size 108 were sampled from a multinomial distribution with probabilities pi.p.j and the realizations of 1/αxy were averaged out. The index θ′ was estimated using the mean and median of the draws of θ′, yielding 0.9953 and 0.9998. The association (disequilibrium) between alleles at the A and B loci is patent, both when logarithmic (θ) and relative (θ′) distances are employed to measure it.
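Monte Carlo averages of this kind can be cross-checked against a closed form, because the expectation of a product of powers under multinomial sampling follows from the multinomial theorem. The identity below is a standard result and is offered only as a check, assuming that D′AI and D′IA are the expectations of αxy and of 1/αxy under association and independence, respectively; the weights t_ij are an illustrative symbol that does not appear in the original.

```latex
E\Big[\prod_{i,j} t_{ij}^{\,n_{ij}}\Big]
  = \sum_{\{n_{ij}\}} \frac{n!}{\prod_{i,j} n_{ij}!} \prod_{i,j} \big(p_{ij}\, t_{ij}\big)^{n_{ij}}
  = \Big(\sum_{i,j} p_{ij}\, t_{ij}\Big)^{n},
\qquad \text{so that} \qquad
D'_{AI} = \Big(\sum_{i,j} \frac{p_{ij}^{2}}{p_{i.}\, p_{.j}}\Big)^{n},
\quad
D'_{IA} = \Big(\sum_{i,j} \frac{(p_{i.}\, p_{.j})^{2}}{p_{ij}}\Big)^{n}.
```

In the first expression for D′AI the sampling probabilities are p_ij with t_ij = p_ij/(p_i. p_.j); for D′IA the sampling probabilities are p_i. p_.j with t_ij = p_i. p_.j/p_ij.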

Systems of multivariate beta processes

Bivariate beta distributions

A common finding in studies of population differentiation using Wright's F-statistics (Wright 1931; Cockerham 1969) with single nucleotide polymorphisms (SNPs) is that estimates of allelic frequencies within a bin of adjacent SNPs are correlated (e.g. Akey et al. 2002; Weir et al. 2005; Akey 2009). This is typically due to stochastic dependence between alleles at linked loci (producing LD), but evolutionary processes causing dependence among the true frequencies themselves have been suggested; see Ohta (1982) for a population genetics model based on this concept. For example, a correlation between alleles at randomly drawn pairs of loci arises if alleles are conditionally (given the allelic frequencies) independent but with allelic frequencies varying at random according to a beta distribution.

Suppose that pairs of loci are sampled at random from some conceptual population of equi-correlated loci and that all pairs of ‘true’ frequencies define a probability distribution. Wright (1937) found that a beta distribution arose from a diffusion equation used to study changes in allele frequencies in finite populations, so the beta process is well grounded in population genetics. Beta distributions have been used in studies of population differentiation as mixing processes to create randomness of allelic frequencies, leading to beta-binomial likelihoods or to unrecognized posterior distributions that must be resolved using sampling procedures, e.g., Holsinger (1999), Balding (2003), Beaumont & Balding (2004) and Gianola et al. (2010).

Similarly, it would seem sensible to assume that allelic frequencies of pairs of loci within the hypothetical population have an association that can be modelled with bivariate beta distributions to represent random variation of true, albeit unknown, allelic frequencies. For example, suppose that there are S half-sib families that can be viewed as drawn at random from some population and that we are interested in modelling association between pairs of allelic frequencies stemming from such a sampling scheme. Here, distributions proposed by Sarmanov (1966) and Olkin & Liu (2003) are discussed and generalized as candidate processes that could be used to model this type of association.

Olkin-Liu bivariate beta distribution

First, we review how a specific bivariate beta distribution arises. As in Olkin & Liu (2003), let U, V and W be random variables following independent standard Gamma distributions with parameters a, b and c respectively. By construction, X = U/(U + W) and Y = V/(V + W) possess the beta distributions X: Beta(a,c) and Y: Beta(b,c) respectively. Clearly, X and Y are positively correlated (through W), with a strength that depends on the values of a, b, c. Olkin & Liu (2003) show that X and Y have a bivariate beta distribution with density function

  • display math(14)

where 0 < x < 1 and 0 < y < 1. Moments E(XkYl) cannot be written in closed form but can be approximated numerically or using sampling methods. Large c and small a, b produce correlations close to 0, whereas large a, b or small c produce correlations close to 1 (Olkin & Liu 2003). For example, a correlation equal to 0.002 is obtained for a = b = 0.01 and c = 5, whereas the correlation is 0.91 for a = 2.5, b = 4 and c = 0.1. Figure 4 displays scatter plots of 5000 samples obtained from each of four bivariate beta distributions. Plot 1 represents a distribution in which the correlation is very low, and yet, there is considerable association between pairs of values near 0, illustrating the inadequacy of correlation to reveal association. In plot 2 (resembling a ‘meteorite’) clustering takes place primarily at large values of X and Y. The two bottom plots depict bivariate beta distributions with similar correlations but with a completely different pattern of association. Clearly, correlation often fails as a measure of statistical association.
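The gamma construction lends itself to a quick numerical check. The Python sketch below (standard library only; all function names are ours and purely illustrative) evaluates the density in the form usually attributed to Olkin & Liu (2003), f(x, y) = x^(a−1) y^(b−1) (1 − x)^(b+c−1) (1 − y)^(a+c−1) / [B(a,b,c) (1 − xy)^(a+b+c)] with B(a,b,c) = Γ(a)Γ(b)Γ(c)/Γ(a + b + c), and draws pairs through X = U/(U + W), Y = V/(V + W). We assume that this is the form intended in (14); the seed and sample size are arbitrary.

```python
import math
import random


def olkin_liu_pdf(x, y, a, b, c):
    """Olkin-Liu bivariate beta density at (x, y), with 0 < x, y < 1.

    Assumed form (taken to match expression (14)):
    x^(a-1) y^(b-1) (1-x)^(b+c-1) (1-y)^(a+c-1) / [B(a,b,c) (1-xy)^(a+b+c)].
    """
    # log B(a, b, c), where B(a, b, c) = Γ(a)Γ(b)Γ(c) / Γ(a + b + c)
    log_norm = (math.lgamma(a) + math.lgamma(b) + math.lgamma(c)
                - math.lgamma(a + b + c))
    log_f = ((a - 1) * math.log(x) + (b - 1) * math.log(y)
             + (b + c - 1) * math.log1p(-x)
             + (a + c - 1) * math.log1p(-y)
             - (a + b + c) * math.log1p(-x * y)
             - log_norm)
    return math.exp(log_f)


def sample_olkin_liu(n, a, b, c, rng):
    """Draw n pairs (X, Y) through X = U/(U + W), Y = V/(V + W)."""
    pairs = []
    for _ in range(n):
        u = rng.gammavariate(a, 1.0)  # standard gamma with shape a
        v = rng.gammavariate(b, 1.0)  # standard gamma with shape b
        w = rng.gammavariate(c, 1.0)  # shared term that induces dependence
        pairs.append((u / (u + w), v / (v + w)))
    return pairs


def correlation(pairs):
    """Pearson correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    rng = random.Random(2011)  # arbitrary seed
    draws = sample_olkin_liu(100_000, a=1.0, b=1.0, c=0.9, rng=rng)
    print("Monte Carlo correlation:", round(correlation(draws), 3))
```

Changing a, b and c in the call to sample_olkin_liu shows how the Monte Carlo correlation moves with the parameters; very small shape parameters can produce draws that underflow to 0, so the sketch is best used with moderate values.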


Figure 4. Scatter diagrams of 5000 samples from each of four Olkin-Liu bivariate beta distributions. Plot 1 illustrates a strong association with essentially no correlation. Plot 2 (‘meteorite’) depicts a limitation of the correlation as a parameter for describing association. Plot 3 suggests association clearly. Plot 4 shows a bivariate distribution that is not trivial: the true correlation (0.46) arises primarily due to weaker association in the ‘middle’ of the bivariate sampling space.


In this bivariate beta model the density ratio takes the form

  • display math(15)

where p(x) and p(y) are the beta densities of X: Beta(a,c) and Y: Beta(b,c), respectively. Then

  • display math
  • display math

so that

  • display math
  • display math(16)

The corresponding D′AI, D′IA and KL can be calculated by taking expectations of density ratios as opposed to expectations of their logarithms.

Indexes of association based on D′AI, D′IA and KL were calculated for the following combinations of parameters, where ρ denotes the expected correlation: 1) a = 0.01, b = 0.01, c = 5 (ρ = 0.002); 2) a = 1, b = 0.5, c = 0.6 (ρ = 0.251); 3) a = 1, b = 1, c = 0.9 (ρ = 0.496) and 4) a = 2, b = 3, c = 0.4 (ρ = 0.750). Expectations under the independence model were approximated by drawing 1 million random numbers from independent X ∼ Beta(a,c) and Y ∼ Beta(b,c) distributions, and then averaging the evaluations of the appropriate expressions over the draws. In the association model, 1 million random triplets (U,V,W) were drawn from independent standard gamma variables with parameters a, b and c, to then form bivariate realizations as X = U/(U + W) and Y = V/(V + W). Table 3 gives the results, and all quantities (save for the true ρXY) are Monte Carlo estimates. The indexes of association θ′ and 1 − θ′ departed from 0.5 and approached 0 and 1, respectively, as the correlation in the bivariate beta distribution became stronger. Likewise, γ′ and γ′′ drifted away from 0, approaching 1 in absolute value. In settings (3) and (4), θ′ and 1 − θ′ suggest a stronger association than what would be indicated by ρ.
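The two sampling schemes just described can be sketched for the logarithmic distances as follows. This is a minimal illustration under stated assumptions, not the code used for Table 3: the density ratio is taken to be Γ(c)Γ(a + b + c)/[Γ(a + c)Γ(b + c)] × (1 − x)^b (1 − y)^a / (1 − xy)^(a+b+c), which follows from dividing the Olkin-Liu density by the product of its beta margins; DAI is taken as the expected log-ratio under association and DIA as minus the expected log-ratio under independence; and θ = DIA/(DAI + DIA) is assumed by analogy with θ′. Function names, seed and sample size are ours.

```python
import math
import random

EPS = 1e-15  # guard against draws that round to exactly 0 or 1


def log_ratio(x, y, a, b, c):
    """log of f(x, y) / [p(x) p(y)] for the Olkin-Liu model (assumed form)."""
    x = min(max(x, EPS), 1.0 - EPS)
    y = min(max(y, EPS), 1.0 - EPS)
    log_const = (math.lgamma(c) + math.lgamma(a + b + c)
                 - math.lgamma(a + c) - math.lgamma(b + c))
    return (log_const + b * math.log1p(-x) + a * math.log1p(-y)
            - (a + b + c) * math.log1p(-x * y))


def log_distances(a, b, c, n, rng):
    """Monte Carlo estimates of D_AI, D_IA, KL = D_AI + D_IA and theta."""
    d_ai = 0.0
    d_ia = 0.0
    for _ in range(n):
        # Association model: gamma triplet (U, V, W) with shared W.
        u = rng.gammavariate(a, 1.0)
        v = rng.gammavariate(b, 1.0)
        w = rng.gammavariate(c, 1.0)
        d_ai += log_ratio(u / (u + w), v / (v + w), a, b, c)
        # Independence model: separate Beta(a, c) and Beta(b, c) draws.
        x = rng.betavariate(a, c)
        y = rng.betavariate(b, c)
        d_ia -= log_ratio(x, y, a, b, c)
    d_ai /= n
    d_ia /= n
    kl = d_ai + d_ia
    theta = d_ia / kl  # assumed definition, by analogy with theta'
    return d_ai, d_ia, kl, theta


if __name__ == "__main__":
    rng = random.Random(3)  # arbitrary seed
    print(log_distances(2.0, 2.0, 2.0, 100_000, rng))
```

The relative distances would replace the logarithm by the ratio itself; because such ratios can be heavy-tailed, their Monte Carlo averages may converge slowly, which is one reason the sketch is restricted to the logarithmic case.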

Table 3. Departures of Olkin-Liu bivariate beta distributions from independence as measured by D′AI, D′IA and by the indexes of association θ′ and 1 − θ′ (when θ′ = 1/2 independence holds), for four combinations of parameters a, b and c. Columns (1) to (4) correspond to the four parameter combinations listed in the text. inline image is the Monte Carlo estimate (1 million samples) of the true correlation ρXY. Means of variables X and Y are estimates from samples from association (A) or independence (I) models. The D′AI and D′IA parameters are estimates under the appropriate assumptions; KL = D′AI + D′IA; θ′ = D′IA/(D′AI + D′IA); γ′ = 2θ′ − 1; γ′′ = 1 − 2θ′

Item            (1)             (2)             (3)             (4)
ρXY             0.002           0.2510          0.4960          0.7500
inline image    0.002           0.2499          0.4951          0.7304
ave. (X)        inline image    inline image    inline image    inline image
ave. (Y)        inline image    inline image    inline image    inline image
D′AI            0.9999          0.8709          0.4292          0.0745
D′IA            0.9999          0.8009          0.0779          1.944 × 10−5
KL              1.9999          1.5781          0.5071          0.0745
θ′              0.5000          0.4481          0.1537          0.0003
1 − θ′          0.5000          0.5519          0.8463          0.9997
γ′              0.0000          −0.1037         −0.6926         −0.9995
γ′′             0.0000          0.1037          0.6926          0.9995

Olkin-Liu bivariate beta-binomial distribution

In studies of LD, focus is typically on correlation between alleles rather than between frequencies. We discuss how a correlation between alleles can arise when the association stems from frequencies, as could happen when the population consists of clusters of individuals resulting from some family structure. Consider bi-allelic locus l with allelic frequencies Pr(Al) = pl and Pr(al) = 1 − pl; Al and al denote the two alleles at locus l. Suppose that a sample of N individuals is scored, and that the observed number of copies of Al and al are inline image and inline image respectively, with inline image. Given pl, the distribution of inline image is Binomial(2N,pl). Now, let the allelic frequency pl vary at random among clusters (e.g. sub-populations) according to a beta distribution with parameters (a,c). Then, the marginal distribution of the observed number of alleles is beta-binomial (e.g., Casella & George 1992; Sorensen & Gianola 2002), that is, the probability of observing inline image copies among the 2N alleles scored is

  • display math(17)
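For concreteness, the beta-binomial probability can be evaluated as in the short Python sketch below, which uses the standard form C(2N, n) B(n + a, 2N − n + c)/B(a, c), where B(.,.) is the beta function; we assume this matches (17). The function name and the example values are ours.

```python
import math


def beta_binomial_pmf(n, two_n, a, c):
    """Pr(n copies among two_n alleles) when p ~ Beta(a, c).

    Standard beta-binomial form, computed on the log scale for stability:
    C(two_n, n) * B(n + a, two_n - n + c) / B(a, c).
    """
    log_choose = (math.lgamma(two_n + 1) - math.lgamma(n + 1)
                  - math.lgamma(two_n - n + 1))
    log_beta_num = (math.lgamma(n + a) + math.lgamma(two_n - n + c)
                    - math.lgamma(two_n + a + c))
    log_beta_den = math.lgamma(a) + math.lgamma(c) - math.lgamma(a + c)
    return math.exp(log_choose + log_beta_num - log_beta_den)


if __name__ == "__main__":
    two_n = 20          # hypothetical sample of N = 10 diploid individuals
    a, c = 2.0, 3.0     # hypothetical beta parameters
    probs = [beta_binomial_pmf(n, two_n, a, c) for n in range(two_n + 1)]
    print("total probability:", sum(probs))
    print("mean count:", sum(n * p for n, p in enumerate(probs)))
```

The mean count equals 2N a/(a + c), the binomial mean evaluated at the expected allelic frequency, while the variance exceeds the binomial variance because of the extra variation in pl.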

Consider now a pair of loci, and let the observed number of copies of the four alleles be represented as inline image; haplotype frequencies are not relevant in the model that follows. If, given the allelic frequencies p = {pl}, the number of copies of A1 and A2 are independently distributed, i.e., the two loci are in linkage equilibrium, the distribution of the observed number of alleles is

  • display math(18)

Suppose now that this pair is a realization from a stochastic process where loci are sampled over a conceptual population of pairs, e.g. the two loci are drawn at random from a physical ‘bin’ of a certain length in kilobases. This process generates a covariance structure between allelic frequencies of pairs of loci within the ‘bin’. The alleles would be in LD, marginally, even though being in conditional (given the allelic frequencies) equilibrium. If the unobserved, ‘true’, allelic frequencies follow some bivariate distribution with density g(p1,p2|θ), where θ denotes its parameters, the density of the marginal distribution of the observed number of allelic counts under this association model would be

  • display math(19)

If a bivariate Olkin-Liu beta distribution with parameters θ = (a,b,c) is assumed, the joint density of allelic frequencies p1 and p2 is as in (14). Letting

  • display math

the marginal distribution (19) takes the form

  • display math
  • display math(20)

This integral cannot be written in closed form, but it can be evaluated using Monte Carlo methods such as importance sampling (e.g. Sorensen & Gianola 2002). One possible scheme is outlined in the Appendix.
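As a simpler alternative to importance sampling, the mixture can also be approximated by plain Monte Carlo: draw (p1, p2) from the bivariate beta through the gamma construction and average the product of binomial probabilities. The Python sketch below does this for every pair of counts at once. It is not the Appendix scheme; function names, parameter values and the number of draws are ours.

```python
import math
import random


def binomial_pmf_vector(m, p):
    """Probabilities of 0..m successes in m trials with success rate p."""
    return [math.comb(m, k) * p ** k * (1.0 - p) ** (m - k)
            for k in range(m + 1)]


def joint_count_table(two_n, a, b, c, draws, rng):
    """Monte Carlo estimate of Pr(n1, n2) with (p1, p2) ~ Olkin-Liu(a, b, c)."""
    size = two_n + 1
    table = [[0.0] * size for _ in range(size)]
    for _ in range(draws):
        u = rng.gammavariate(a, 1.0)
        v = rng.gammavariate(b, 1.0)
        w = rng.gammavariate(c, 1.0)
        p1, p2 = u / (u + w), v / (v + w)
        row = binomial_pmf_vector(two_n, p1)  # locus 1, given p1
        col = binomial_pmf_vector(two_n, p2)  # locus 2, given p2
        for i in range(size):
            for j in range(size):
                table[i][j] += row[i] * col[j] / draws
    return table


def count_covariance(table):
    """Covariance between the two allele counts implied by the table."""
    size = len(table)
    m1 = sum(i * table[i][j] for i in range(size) for j in range(size))
    m2 = sum(j * table[i][j] for i in range(size) for j in range(size))
    m12 = sum(i * j * table[i][j] for i in range(size) for j in range(size))
    return m12 - m1 * m2


if __name__ == "__main__":
    rng = random.Random(5)  # arbitrary seed
    table = joint_count_table(6, 1.0, 1.0, 1.0, 20_000, rng)
    print("covariance between counts:", round(count_covariance(table), 3))
```

A positive covariance between counts illustrates the point made above: alleles are in conditional equilibrium given the frequencies, yet associated marginally. Plain Monte Carlo is adequate for small 2N; for large samples the binomial product becomes sharply peaked and an importance sampling density is likely to be more efficient.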

Under this distribution

  • display math

where pI(DATA) is the distribution under independence between observed allele numbers, and given by the product of two beta-binomial distributions, that is

  • display math
  • display math(21)

Then

  • display math
  • display math(22)

where inline image is defined in the Appendix. Preceding the expectation, the first term involves the parameters of the bivariate beta distribution, the second involves sample size (N) as well, and the third one depends on observed allelic counts.

The logarithmic and relative distances between distributions are calculated as

  • display math
  • display math
  • display math

and

  • display math

with θ and θ′ calculated as before. Calculations proceed on an entirely numerical basis.

A generalization of the beta process to a situation with more than two loci is in Olkin & Liu (2003). For K loci the joint density of the allelic frequencies takes the form

  • display math

The marginal distribution of allelic counts is obtained by mixing a multinomial distribution over the multivariate beta density above, but results are not available in closed form, so a numerical solution is needed again.

Lee-Sarmanov bivariate beta distribution

Lee (1996) proposed a different bivariate beta distribution following Sarmanov (1966); see Danaher & Hardie (2005). A Lee-Sarmanov bivariate beta distribution for allelic frequencies p1 and p2 has density

  • display math(23)

where B1(.) and B2(.) are beta densities as before and w is a parameter (w = 0 produces independence); θ* = (a1,b1,a2,b2,w). Above, inline image and inline image are the expected values of p1 and p2 respectively. The marginal densities of p1 and p2 under this model are B1(.) and B2(.). Further,

  • display math
  • display math

so that

  • display math

with

  • display math

In the Lee-Sarmanov bivariate beta distribution, a null correlation implies independence. Under this model the density ratio becomes

  • display math

so

  • display math

where Eg denotes expectation under association and

  • display math

is the corresponding expectation under independence of the distributions. Further

  • display math
  • display math(24)

is in closed form, while

  • display math(25)

can be estimated using Monte Carlo sampling, assuming this expectation exists. Hence

  • display math

The index θ′ was evaluated for three sets of values of the a,b parameters for each of the two intervening beta distributions while setting the correlation at inline image or inline image. Metric D′AI in (24) follows directly from the correlation, while D′IA was estimated using 1 million random samples from each of the two beta distributions. For each set of a,b parameters w was calibrated via

  • display math

Note that D′AI is solely a function of the correlation, so it was the same for the three pairs of beta distributions examined. Using (25), D′IA was estimated by averaging inline image over draws. Because inline image is a ratio between density functions, negative realizations were not used when forming the Monte Carlo estimate (this produces an upward bias but a more precise estimate). As shown in Table 4, the KL distance between the bivariate beta distribution and the independence process increased monotonically with the strength of the correlation. The indexes of association θ′ and 1 − θ′ also drifted away monotonically from inline image only in setting 2. For instance, in setting 1 θ′ increased from 0.50 to 0.66 as the correlation increased from 0 to inline image, but decreased to 0.60 when the correlation was inline image. The shape of the bivariate beta distribution was examined for this setting by plotting the density at 20 000 random points drawn from each of the marginal Beta(2,2) distributions. This is shown in Fig. 5: the Lee-Sarmanov density ‘evolves’ towards a more complex topography as correlation increases. When correlation grows from 0 to inline image, the shape of the joint distribution is not too different from that under independence, as also suggested by KL and θ′. However, when the correlation increases by a factor of 3 from inline image to inline image, the KL distance grows only about two times, while θ′ and 1 − θ′ move away from inline image at an even slower rate; this is expected because θ has an upper bound at 1, whereas KL can take any positive value on the real line. The bimodal shape of the density also suggests that more intensive sampling is needed to obtain reliable estimates of the indexes of association.
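The calculation just described can be sketched in Python under explicit assumptions. We take the Lee-Sarmanov density to have the usual Sarmanov product form B1(p1)B2(p2)[1 + w(p1 − μ1)(p2 − μ2)], so that the correlation is ρ = wσ1σ2 (σ1, σ2 being the standard deviations of the two beta margins), the density ratio is 1 + w(p1 − μ1)(p2 − μ2), and D′AI = 1 + ρ². With assumed correlations of 0.25 and 0.75 this closed form gives 1.0625 and 1.5625, matching the D′AI column of Table 4. Function names are ours.

```python
import math
import random


def beta_mean_sd(a, b):
    """Mean and standard deviation of a Beta(a, b) variable."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1.0))
    return mean, math.sqrt(var)


def calibrate_w(rho, a1, b1, a2, b2):
    """Value of w giving correlation rho (assumes rho = w * sd1 * sd2)."""
    _, sd1 = beta_mean_sd(a1, b1)
    _, sd2 = beta_mean_sd(a2, b2)
    return rho / (sd1 * sd2)


def d_ai(rho):
    """Closed form of D'_AI under the assumed Sarmanov form."""
    return 1.0 + rho ** 2


def d_ia_monte_carlo(w, a1, b1, a2, b2, draws, rng):
    """Average of 1/ratio over independent beta draws.

    Non-positive ratios are discarded, as described in the text.
    Returns the estimate and the fraction of discarded draws.
    """
    mu1, _ = beta_mean_sd(a1, b1)
    mu2, _ = beta_mean_sd(a2, b2)
    total, kept, discarded = 0.0, 0, 0
    for _ in range(draws):
        p1 = rng.betavariate(a1, b1)
        p2 = rng.betavariate(a2, b2)
        ratio = 1.0 + w * (p1 - mu1) * (p2 - mu2)
        if ratio <= 0.0:
            discarded += 1
            continue
        total += 1.0 / ratio
        kept += 1
    return total / kept, discarded / draws


if __name__ == "__main__":
    rng = random.Random(4)  # arbitrary seed
    rho = 0.10              # hypothetical correlation
    w = calibrate_w(rho, 2, 2, 2, 2)
    dia, frac_neg = d_ia_monte_carlo(w, 2, 2, 2, 2, 100_000, rng)
    dai = d_ai(rho)
    print("w:", w, "D'AI:", dai, "D'IA:", dia, "theta':", dia / (dai + dia))
```

Under the assumed form, Beta(2,2) margins keep the density non-negative everywhere only if |w| ≤ 4, that is |ρ| ≤ 0.2; this may be why negative realizations appear at the larger correlations, and why the estimate of D′IA becomes unstable there.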


Figure 5. Plots of the densities of four Lee-Sarmanov bivariate beta distributions having the same Beta (2,2) marginal distributions but differing in the strength of association: as the correlation increases multi-modality emerges.


Table 4. Departures of Lee-Sarmanov bivariate beta distributions with varying correlation from independence, as measured by D′AI, D′IA and by indexes of association θ′ and 1 − θ′ (when θ′ = 1/2 independence holds). The four correlation levels depend on the value of parameter w of the bivariate distribution, and on parameters a,b of the two ‘parental’ beta distributions. D′AI has closed form and D′IA was estimated using 1 million samples from the beta distributions. KL = D′AI + D′IA; θ′ = D′IA/(D′AI + D′IA)

Setting/Item       D′AI      D′IA (% negative)    KL        θ′        1 − θ′
(1) a1 = b1 = a2 = b2 = 2
Correlation (w)
inline image       1         1                    2         0.5000    0.5000
inline image       1.0625    1.2681 (0.0)         2.3307    0.5441    0.4559
inline image       1.5625    3.0985 (8.5)         4.6610    0.6648    0.3352
inline image       1.8073    2.7369 (11.2)        4.5468    0.6019    0.3981
(2) a1 = 10; b1 = 90; a2 = 90; b2 = 10
Correlation (w)
inline image       1         1                    2         0.5000    0.5000
inline image       1.0625    1.1466 (0.4)         2.2091    0.5190    0.4810
inline image       1.5625    2.6824 (6.3)         4.2449    0.6319    0.3681
inline image       1.8073    4.4937 (8.7)         6.3037    0.7128    0.2871
(3) a1 = .25; b1 = .75; a2 = .75; b2 = .25
Correlation (w)
inline image       1         1                    2         0.5000    0.5000
inline image       1.0625    1.3766 (0.9)         2.4391    0.5644    0.4356
inline image       1.5625    1.5094 (4.9)         3.0719    0.4914    0.5086
inline image       1.8073    1.5876 (5.6)         3.3976    0.4673    0.5327

As shown in Danaher & Hardie (2005), the marginal distribution of the observed allelic counts inline image is obtained by mixing (18) over the Lee-Sarmanov density (23), leading to

  • display math
  • display math(26)

Using the fact that under independence the joint distribution is product beta-binomial as in (21), it turns out that

  • display math

so that closed forms are available for

  • display math
  • display math

and

  • display math
  • display math

The correlation between number of copies of the A alleles at the two loci is (Danaher & Hardie 2005)

  • display math
  • display math

so that for large N

  • display math

A multivariate generalization of (26) is direct, but it appears not to have been reported elsewhere. The joint density of the allelic frequencies at L loci takes the form

  • display math(27)

It is straightforward to verify that this density integrates to 1, and that all marginal distributions are beta. Now, let

  • display math
  • display math

denote the beta-binomial distribution of the observed number of copies of allele Al, out of 2N copies. The marginal distribution of allelic counts over all loci can be shown to be, after algebra

  • display math(28)

We propose (for obvious reasons) to name (27) and (28) as multivariate beta GMS-Sarmanov and multivariate beta-binomial GMS-Sarmanov distributions, respectively.

Discussion


The problem addressed in this study was that of measuring departures of the joint distribution of genetic variables from independence. New measures of association based on notions of statistical distance were proposed and evaluated under several scenarios, spanning from multivariate Gaussian distributions modeling, say, pleiotropic effects, to systems of beta distributions describing association between allelic frequencies. Two hitherto seemingly unreported probability distributions were derived, and termed GMS-Sarmanov multivariate beta and GMS-Sarmanov multivariate beta-binomial distributions. The standard LD problem was also dealt with to illustrate the generality of the approaches proposed.

Linkage disequilibrium analysis has been the subject of an enormous amount of research, re-energized with the availability of massive molecular marker and sequence data. For example, in the context of coalescent theory, an important issue is the decoupling of ancestries of sites at different regions of the chromosome, and this is done by studying association between alleles at different loci (Wakeley 2009). Most of the standard measures of LD employed are pairwise statistics such as correlations (e.g. Hill & Robertson 1968; Hill 1974; Hedrick 1987; Lewontin 1988; Morton et al. 2001; Pritchard & Przeworski 2001; McVean 2007), because of ease of calculation and, perhaps, ease of interpretation. However, as pointed out by Lewontin (1988), although a correlation is a meaningful parameter (e.g., in terms of the amount of variability of Y explained when X is observed) in a bivariate normal distribution, it is arguably less so when applied to discrete data, a problem that is well known in quantitative genetic analysis of discrete phenotypes (Dempster & Lerner 1950; Gianola 1982). Further, pairwise measures do not characterize association well if a given genetic system involves many correlated random variables, as in multi-locus measures of LD. Actually, such measures have been reported much less often, e.g. Weir (1996) gives formulae for up to four loci, and Sabatti and Risch (2002) give a measure based on excess of heterozygosity or of homozygosity.

Nothnagel et al. (2002) presented an entropy-based index of LD that is related to our approach, but that differs in some respects. These authors calculate the entropy of the allelic distributions under linkage equilibrium and under disequilibrium, and express the difference in entropy as a fraction of the equilibrium entropy. This produces a normalized entropy that takes values between 0 and 1 (0 indicating no association between alleles at different loci). A formal objection is that, while entropy measures uncertainty in a distribution, its relationship to statistical distance between distributions (Kullback 1968) requires more elaboration. Their method is for bi-allelic loci only, and entropy does not always behave well in continuous distributions (Bernardo & Smith 1994; Sorensen & Gianola 2002), whereas relative entropy measures such as the KL distance are well defined.

Closer to the spirit of our procedures, Liu & Lin (2005) suggested a measure of LD based on a relative KL discrepancy, but it differs from the ones we propose. This is mainly because they use only one of the two components (termed DAI in our paper) of the invariant KL distance, and express it relative to the maximum value it can take. Using their procedures with DIA would produce a different value of association, and a different qualitative picture might emerge from analysis of genetic data. However, their ideas can be embedded in our approach, and expressing association relative to a maximum distance is a well-taken point. It turns out that these results can also be adapted to the continuous domain, with some care. To illustrate this, it suffices to consider two random vectors, x and y, as generalization to a higher-dimensional system is straightforward. Let H(x), H(y), and H(x,y) be the entropies (non-negative, by construction, although pathological examples may arise in continuous distributions) of the marginal distributions of x, y, and x,y respectively, with

  • display math
  • display math

and

  • display math

The discrepancy away from the association model is

  • display math(29)

Note that

  • display math
  • display math

where H(y|x) is the entropy of the conditional distribution of y given x. This implies that H(x, y) ≧ H(x), where x can also be any partition of x, e.g. its ith coordinate xi, say, and similarly H(x, y) ≧ H(y). Thus, H(x, y) ≧ maxi H(xi). Using this in (29)

  • display math

More generally, for a vector z with n elements, the KL discrepancy away from the independence model is

  • display math(30)

Likewise

  • display math

where

  • display math

is the cross-entropy between distributions. From the expressions for DAI above, H(x) + H(y) ≧ maxi[H(xi),H(yi)], so that

  • display math

This leads to additional indexes of association, such as

  • display math(31)

and

  • display math(32)

both taking values between 0 and 1. These indexes were evaluated for a bivariate normal distribution with correlation inline image and marginal distributions x: N1(0,1) and y: N2(0,1). One obtains

  • display math
  • display math
  • display math
  • display math
  • display math
  • display math
  • display math
  • display math
  • display math

and

  • display math

The four indexes, ρ, θ, θ** and q**, produce different values for the strength of association between random variables in a distribution. This suggests that there is probably no such thing as a universal measure, although it is clear that the coefficient of correlation lacks generality.
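The entropy quantities for the bivariate normal case are available in closed form, and a short Python sketch may help to fix ideas. It uses standard results for unit-variance normal variables: H(x) = H(y) = ½ log(2πe), H(x, y) = log(2πe) + ½ log(1 − ρ²), DAI = H(x) + H(y) − H(x, y) = −½ log(1 − ρ²) and DIA = ½ log(1 − ρ²) + ρ²/(1 − ρ²). The value of ρ below is hypothetical, the function names are ours, and θ = DIA/(DAI + DIA) is assumed by analogy with θ′.

```python
import math


def marginal_entropy():
    """Differential entropy of a standard normal variable."""
    return 0.5 * math.log(2.0 * math.pi * math.e)


def joint_entropy(rho):
    """Differential entropy of a standard bivariate normal, correlation rho."""
    return math.log(2.0 * math.pi * math.e) + 0.5 * math.log(1.0 - rho ** 2)


def d_ai(rho):
    """KL divergence from association to independence (mutual information)."""
    return 2.0 * marginal_entropy() - joint_entropy(rho)


def d_ia(rho):
    """KL divergence from independence to association."""
    return 0.5 * math.log(1.0 - rho ** 2) + rho ** 2 / (1.0 - rho ** 2)


if __name__ == "__main__":
    rho = 0.5  # hypothetical correlation
    kl = d_ai(rho) + d_ia(rho)
    print("H(x):", marginal_entropy(), "H(x, y):", joint_entropy(rho))
    print("D_AI:", d_ai(rho), "D_IA:", d_ia(rho), "KL:", kl)
    print("theta (assumed definition):", d_ia(rho) / kl)
```

The symmetric distance reduces to ρ²/(1 − ρ²). The sketch also shows why care is needed in the continuous domain: for ρ close to 1 the joint differential entropy falls below H(x), so the inequality H(x, y) ≧ H(x) used above need not hold for continuous variables.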

In a nutshell, this paper presents measures of association for systems of genetic variables that go beyond standard two-dimensional statistics. The procedures apply to either continuous or discrete data, and typically require numerical implementation, because many of the expressions are not available in closed form, depending on the distribution assumed. Our procedures (like any other method) require knowledge of the parameters of the distributions under independence or association, and estimation of such parameters is not the objective of this paper. Computer-intensive approaches for parameter inference, such as Bayesian Markov chain Monte Carlo (Sorensen & Gianola 2002; Gelman et al. 2004) or approximate Bayesian computation (Beaumont et al. 2002), can now be implemented effectively in high-throughput systems (Wu et al. 2011).

Finally, since this volume is in honor of the contributions of Professor Moshe Soller, a relationship between this paper and his work should be established. As noted at the outset of this document, among many other accomplishments, he pioneered marker-assisted selection in animals and plants via exploitation of linkage and LD relationships between markers and unknown quantitative trait loci. Examples of his papers in this area include Soller & Genizi (1978) and Lipkin et al. (2009), and these connect with some of the developments presented here. We look forward to many more scientific accomplishments by Moshe!

Acknowledgements


Part of this work was carried out while the senior author was a Visiting Professor at Georg-August-Universität, Göttingen (Alexander von Humboldt Foundation Senior Researcher Award) and Visiting Scientist at the Station d'Amélioration Génétique des Animaux, Centre de Recherche de Toulouse (Chaire D'Excellence Pierre de Fermat, Agence Innovation, Midi-Pyrénées). Support by the Wisconsin Agriculture Experiment Station as well as by the Deutsche Forschungsgemeinschaft (GRK 1644/1) is acknowledged. W.G. Hill is thanked for useful comments.

References

  • Agresti A. (2002) Categorical Data Analysis. Wiley, New York.
  • Akey J.M. (2009) Constructing genomic maps of positive selection in humans: where do we go from here? Genome Research 19, 711–22.
  • Akey J.M., Zhang G., Zhang K., Jin L. & Shriver M.D. (2002) Interrogating a high-density SNP map for signatures of natural selection. Genome Research 12, 1805–14.
  • Balding D.J. (2003) Likelihood-based inference for genetic correlations. Theoretical Population Biology 63, 221–30.
  • Beaumont M.A. & Balding D.J. (2004) Identifying adaptive genetic divergence among populations from genome scans. Molecular Ecology 13, 969–80.
  • Beaumont M.A., Zhang W. & Balding D.J. (2002) Approximate Bayesian computation in population genetics. Genetics 162, 2025–35.
  • Bernardo J.M. & Smith A.F.M. (1994) Bayesian Theory. Wiley, Chichester.
  • Casella G. & George E. (1992) Explaining the Gibbs sampler. The American Statistician 46, 167–74.
  • Cockerham C.C. (1969) Variance of gene frequencies. Evolution 23, 72–84.
  • Danaher P.J. & Hardie B.G.S. (2005) Bacon with your eggs? Application of a new bivariate beta-binomial distribution. The American Statistician 59, 282–6.
  • Dempster E.R. & Lerner I.M. (1950) Heritability of threshold characters. Genetics 35, 212–36.
  • Fisher R.A. (1918) On the correlation between relatives on the supposition of Mendelian inheritance. Transactions of the Royal Society of Edinburgh 52, 399–433.
  • Frankham R., Ballou J.D. & Briscoe D.A. (2010) Introduction to Conservation Genetics. Cambridge University Press, Cambridge.
  • Gelman A., Carlin J.B., Stern H.S. & Rubin D.B. (2004) Bayesian Data Analysis. Chapman & Hall/CRC, Boca Raton.
  • Gianola D. (1982) Theory and analysis of threshold characters. Journal of Animal Science 54, 1079–96.
  • Gianola D., de los Campos G., Hill W.G., Manfredi E. & Fernando R.L. (2009) Additive genetic variability and the Bayesian alphabet. Genetics 183, 347–63.
  • Gianola D., Simianer H. & Qanbari S. (2010) A two-step method for detecting selection signatures using genetic markers. Genetics Research 92, 141–55.
  • Hedrick P.W. (1987) Gametic disequilibrium measures: proceed with caution. Genetics 117, 331–41.
  • Hill W.G. (1974) Estimation of linkage disequilibrium in randomly mating populations. Heredity 33, 229–39.
  • Hill W.G. (1975) Linkage disequilibrium among multiple neutral alleles produced by mutation in finite population. Theoretical Population Biology 8, 117–26.
  • Hill W.G. & Robertson A. (1968) Linkage disequilibrium in finite populations. Theoretical & Applied Genetics 38, 226–31.
  • Holsinger K.E. (1999) Analysis of genetic diversity in geographically structured populations: a Bayesian perspective. Hereditas 130, 245–55.
  • Kullback S. (1968) Information Theory and Statistics. Dover, Mineola.
  • Lee M.L.T. (1996) Properties and applications of the Sarmanov family of bivariate distributions. Communications in Statistics - Theory and Methods 25, 1207–22.
  • Lewontin R.C. (1988) On measures of gametic disequilibrium. Genetics 120, 849–52.
  • Lipkin E., Straus K., Tal Stein R. et al. (2009) Extensive long-range and nonsyntenic linkage disequilibrium in livestock populations: deconstruction of a conundrum. Genetics 181, 691–9.
  • Liu Z. & Lin S. (2005) Multilocus LD measure and tagging SNP selection with generalized mutual information. Genetic Epidemiology 29, 353–64.
  • McVean G. (2007) Linkage disequilibrium, recombination and selection. In: Handbook of Statistical Genetics, 3rd edn (Ed. by D.J. Balding, M. Bishop & C. Cannings), pp. 909–44. Wiley, New York.
  • Meuwissen T.H., Hayes B.J. & Goddard M.E. (2001) Prediction of total genetic value using genome-wide dense marker maps. Genetics 157, 1819–29.
  • Morton N.E., Zhang W., Taillon-Miller P., Ennis P., Kwok P.Y. & Collins A. (2001) The optimal measure of allelic association. Proceedings of the National Academy of Sciences 98, 5217–21.
  • Nothnagel M., Fürst R. & Rohde K. (2002) Entropy as a measure for linkage disequilibrium over multilocus haplotype blocks. Human Heredity 54, 186–98.
  • Ohta T. (1982) Linkage disequilibrium with the island model. Genetics 101, 139–55.
  • Olkin I. & Liu R. (2003) A bivariate beta distribution. Statistics & Probability Letters 62, 407–12.
  • Penny W. & Roberts S. (2002) Bayesian multivariate autoregressive models with structured priors. IEE Proceedings on Vision, Signal & Image Processing 149, 333–41.
  • Pritchard J.K. & Przeworski M. (2001) Linkage disequilibrium in humans: models and data. American Journal of Human Genetics 69, 1–14.
  • Rosa G.J.M., Padovani C.R. & Gianola D. (2003) Robust linear mixed models with normal/independent distributions and Bayesian MCMC implementation. Biometrical Journal 45, 573–90.
  • Sabatti C. & Risch N. (2002) Homozygosity and linkage disequilibrium. Genetics 160, 1707–19.
  • Sarmanov O.V. (1966) Generalized normal correlation and two-dimensional Frechet classes. Doklady (Soviet Mathematics) 168, 596–9.
  • Soller M. & Genizi A. (1978) The efficiency of experimental designs for the detection of linkage between a marker locus and a locus affecting a quantitative trait in segregating populations. Biometrics 34, 47–55.
  • Sorensen D. & Gianola D. (2002) Likelihood, Bayesian, and MCMC Methods in Quantitative Genetics. Springer, New York.
  • Spiess E.B. (1989) Genes in Populations. Wiley, New York.
  • Strandén I. & Gianola D. (1998) Attenuating effects of preferential treatment with Student-t mixed linear models: a simulation study. Genetics, Selection, Evolution 30, 565–83.
  • Wakeley J. (2009) Coalescent Theory: An Introduction. Roberts & Company, Greenwood Village.
  • Weir B.S. (1996) Genetic Data Analysis. Sinauer, Sunderland.
  • Weir B.S., Cardon L.R., Anderson A.D., Nielsen D.M. & Hill W.G. (2005) Measures of human population structure show heterogeneity among genomic regions. Genome Research 15, 1468–76.
  • Weller J.I., Kashi Y. & Soller M. (1990) Power of daughter and granddaughter designs for determining linkage between marker loci and quantitative trait loci in dairy cattle. Journal of Dairy Science 73, 2525–37.
  • Wright S. (1931) Evolution in Mendelian populations. Genetics 16, 97–159.
  • Wright S. (1937) The distribution of gene frequencies in populations. Proceedings of the National Academy of Sciences 23, 307–20.
  • Wu X.L., Beissinger T.M., Bauck S., Woodward B., Rosa G.J.M., Weigel K.A., de Leon-Gatti N. & Gianola D. (2011) A primer on high-throughput computing for genomic selection. Frontiers in Livestock Genomics 2, 4. doi: 10.3389/fgene.2011.00004.
  • Zhao H., Nettleton D., Soller M. & Dekkers J.C.M. (2005) Evaluation of linkage disequilibrium measures between multi-allelic markers as predictors of linkage disequilibrium between markers and QTL. Genetics Research 86, 77–87.

Appendix


Bivariate and univariate t-distributions

The density of a K-variate random vector following a multivariate t-distribution is

  • display math(33)

where inline image Σ is the scale matrix, ν > 0 is the degrees of freedom parameter and inline image is the Gamma function; the covariance matrix of this distribution is inline image. The specification K = 2, μ = 0 and Σ = I2 yields a bivariate t-distribution with null mean and covariance matrix inline image inline image; the two marginal distributions are univariate-t, each having a null mean, variance inline image and ν degrees of freedom. Although uncorrelated, these two random variables are statistically dependent because

  • display math
  • display math
  • display math

The draws from the bivariate t-distribution needed for evaluating inline image and inline image are obtained as

  • display math

Under independence, the sampling from 2 univariate t-distributions is done as

  • display math

Above, the zs are independent draws from inline image; inline image is a draw from a central chi-square distribution on ν degrees of freedom, and inline image and inline image are two independent draws from the same distribution.
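The two sampling schemes can be sketched in Python as follows, assuming the usual scale-mixture representation of the t-distribution: each coordinate is a standard normal draw multiplied by the square root of ν divided by a chi-square draw, with the chi-square draw shared between coordinates in the bivariate case and drawn separately under independence. Function names, the value of ν and the seed are ours.

```python
import math
import random


def chi_square(nu, rng):
    """Central chi-square draw on nu degrees of freedom (gamma with scale 2)."""
    return rng.gammavariate(nu / 2.0, 2.0)


def draw_bivariate_t(nu, rng):
    """One draw from the bivariate t with null mean and identity scale matrix."""
    scale = math.sqrt(nu / chi_square(nu, rng))  # shared by both coordinates
    return rng.gauss(0.0, 1.0) * scale, rng.gauss(0.0, 1.0) * scale


def draw_independent_t(nu, rng):
    """Two independent univariate t draws (separate chi-square variables)."""
    x1 = rng.gauss(0.0, 1.0) * math.sqrt(nu / chi_square(nu, rng))
    x2 = rng.gauss(0.0, 1.0) * math.sqrt(nu / chi_square(nu, rng))
    return x1, x2


def correlation(xs, ys):
    """Pearson correlation between two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    rng = random.Random(8)  # arbitrary seed
    nu = 6.0                # hypothetical degrees of freedom
    joint = [draw_bivariate_t(nu, rng) for _ in range(100_000)]
    x1 = [p[0] for p in joint]
    x2 = [p[1] for p in joint]
    print("corr(x1, x2):", correlation(x1, x2))
    print("corr(|x1|, |x2|):",
          correlation([abs(v) for v in x1], [abs(v) for v in x2]))
```

Under the shared-scale scheme the correlation between x1 and x2 is close to zero while the correlation between their absolute values is clearly positive, which illustrates that the two coordinates are uncorrelated yet dependent; with separate chi-square draws both correlations are close to zero.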

Olkin-Liu bivariate beta-binomial distribution: sampling scheme

When allelic frequencies are independent, their joint density is the product of the densities of the beta random variables inline image and inline image. However, this does not provide a suitable importance sampling distribution, because dependency would not be recognized at all, with the sample space not visited appropriately. One way of introducing dependency is via an importance sampling density with two dependent beta distributions, such as

  • display math(34)

Distribution inline image depends on p1, and has expectation

  • display math

which would be equal to the mean value under independence only if p1 = 0.5. The double integral in (20) can be represented as

  • display math
  • display math

where

  • display math

denotes expectation with respect to the joint distribution with density inline image. Using this in (20), the marginal distribution of the data under the association model is

  • display math(35)

Expression (35) is a seemingly unreported discrete probability distribution which we term ‘Olkin-Liu bivariate beta-binomial distribution’.