A Likelihood Ratio Test for Genome-Wide Association under Genetic Heterogeneity
Corresponding authors: Yongzhao Shao, Ph.D. Departments of Population Health and Environmental Medicine, New York University School of Medicine, 650 First Avenue, Room 538, New York, NY 10016, USA. Tel: 1-212-263-0324, Fax: 1-212-263-8570; E-mail: email@example.com
Most existing association tests for genome-wide association studies (GWASs) fail to account for genetic heterogeneity. Zhou and Pan proposed a binomial-mixture-model-based association test to account for the possible genetic heterogeneity in case-control studies. The idea is elegant, however, the proposed test requires an expectation-maximization (EM)-type iterative algorithm to identify the penalised maximum likelihood estimates and a permutation method to assess p-values. The intensive computational burden induced by the EM-algorithm and the permutation becomes prohibitive for direct applications to GWASs. This paper develops a likelihood ratio test (LRT) for GWASs under genetic heterogeneity based on a more general alternative mixture model. In particular, a closed-form formula for the LRT statistic is derived to avoid the EM-type iterative numerical evaluation. Moreover, an explicit asymptotic null distribution is also obtained, which avoids using the permutation to obtain p-values. Thus, the proposed LRT is easy to implement for GWASs. Furthermore, numerical studies demonstrate that the LRT has power advantages over the commonly used Armitage trend test and other existing association tests under genetic heterogeneity. A breast cancer GWAS dataset is used to illustrate the newly proposed LRT.
Common and complex diseases (or traits) are often genetically heterogeneous in aetiologies (Lander & Schork, 1994; Zhou & Pan, 2009). Some well-known complex diseases with genetic heterogeneity include asthma, breast cancer (Hall et al., 1990; Wooster et al., 1994; Turnbull et al., 2010), and diabetes (Hattersley, 1998; Sladek et al. 2010). As in Zhou & Pan (2009), this paper considers the situation when a complex disease (or trait) is caused by mutations in multiple unlinked loci, commonly referred to as locus heterogeneity (Ott, 1999; Abreu et al., 2002; Fu et al., 2006). As a consequence of genetic heterogeneity, the population of individuals with disease may be decomposed into various latent sub-populations, each with disease caused by mutations at different loci (or their combinations). Most of the existing association tests for population-based case-control studies, e.g. GWAS, are based on comparing the mean genotype scores (e.g. the Armitage trend test (ATT)) between the case and control groups, which are not efficient in the presence of genetic heterogeneity. Zhou & Pan (2009) showed that it can be beneficial to use methods that account for genetic heterogeneity for testing association in a case-control study.
Similar to admixture mapping in linkage analysis (Smith, 1963; Abreu et al., 2002; Fu et al., 2006), Zhou & Pan (2009) proposed a binomial mixture model to account for genetic heterogeneity and developed a modified likelihood ratio test (MLRT) for a single locus (Fu et al., 2006). They also consider two methods to combine single-locus-based MLRTs across multiple loci in linkage disequilibrium to boost power when causal single nucleotide polymorphisms (SNPs) are not genotyped (Zhou & Pan, 2009). They illustrated, with a wide spectrum of numerical examples, that the proposed MLRT tests are more powerful than some commonly used association tests under genetic heterogeneity. Following Zhou and Pan, we define the genetic score X as the number of the minor alleles at a single locus for a subject. Zhou and Pan (2009) assumed that the genetic score in a healthy control subject follows a binomial distribution, that is
where and represents the minor allele frequency (MAF) on that specific locus of the control subject. On the other hand, under genetic heterogeneity, the genetic score for a diseased subject, , follows a simple two-component mixture binomial distribution,
where θ represents the probability of having the minor allele on one chromosome for a sub-group of cases with disease associated with the minor allele. They adopt a two-step procedure for parameter estimation. First, a maximum likelihood estimate (MLE) of is obtained based only on the control sample. Then, fixing the estimated at its MLE derived from the control-group data, maximum penalised likelihood estimates of other parameters in the mixture model are obtained using an expectation-maximization (EM)-type algorithm (Li et al., 2009). Subsequently, the penalised MLEs from the EM-step are plugged into a likelihood ratio to form a test statistic to detect the association between the marker genotypes and the disease status. Finally, they proposed a permutation procedure to obtain the p-value of the association test.
Zhou and Pan's idea is applicable to an association study for a limited number of candidate markers; however, there are several challenges in applying their proposed method to genome-wide association studies (GWASs). First, the computation of their proposed MLRT for a vast number of SNPs in a typical GWAS would be very intensive. Since the penalised MLEs are obtained by an EM algorithm for maximization of the penalised mixture likelihood, there are known complexities and caveats associated with the EM or other iterative methods for identifying MLEs and penalised MLEs in mixture models including the challenges in selecting multiple starting points for parameter estimation. Moreover, the p-value of the MLRT is proposed to be attained by permutation, which is also difficult to apply directly to detect the SNP-disease association in GWAS with a vast number of SNPs, where the significance level is usually set to be less than 10−6. In addition, it is widely believed that complex diseases and traits are caused by interplays of a large number of genetic loci and environmental risk factors. The simple binomial mixture model with two components in Equation (2) may be too simple to capture the complex heterogeneity for many complex diseases. A more general form of binomial mixture model can be written as follows:
where , , and if and only if . In particular, for many of the complex diseases with genetic heterogeneity, it is likely that J is quite large. Since it is hard to know the number of the sub-populations J under genetic heterogeneity, it is desirable to have a new test that is applicable without the need to know the exact value of J while allowing .
In this paper, we developed a LRT for GWASs based on the more flexible binomial mixture models in (3). It is widely believed that complex diseases and traits are caused by interplays of a large number of genetic loci and environmental risk factors. Thus, we assume that the genetic score in the case group, , follows a general binomial mixture distribution in (3) which allows the possibility of a large and unknown J. The proposed LRT overcomes the above-mentioned challenges of using Zhou and Pan's method for testing association of a vast number of SNPs in a typical GWAS. In particular, we derived the closed-form formula for the LRT statistic even though the MLEs of parameters in the binomial mixture model are non-regular with loss of identifiability (Liu & Shao, 2003). We further derived the simple closed-form asymptotic null distribution of the LRT which avoids the intensive numerical calculations, such as the EM-based iterations for identification of MLEs and the permutations for evaluation of p-values. Additionally, the LRT can be implemented without the requirement of knowing the number of components J in the mixture model (3). We conducted extensive simulation studies to show that the LRT has power advantages over Armitage trend test (ATT) and some other association tests under genetic heterogeneity. We applied our test to a real dataset from a breast cancer GWAS to illustrate that it can achieve a much smaller p-value than some commonly used tests when there is evidence of genetic heterogeneity. Thus, the proposed LRT might be used to scan SNPs in GWAS to make novel discoveries by taking account of genetic heterogeneity.
Notation and Setup
We focus on detecting marker-disease association at a single locus with two alleles A and a, such as a SNP in a case-control GWAS. Suppose controls and cases are sampled from the population. For each SNP, the genotype frequencies in a case-control study can be summarised as in the following 2 × 3 table (Table 1).
Table 1. The genotype frequencies for case-control data of a SNP
Let the genetic score and denote the number of minor alleles, say a, at a single locus for a healthy control and a diseased case, respectively. It is clear that Similar to Zhou and Pan's setup, we assume that under the null hypothesis, both and have the same binomial distribution as described in Equation (1). As in Zhou & Pan (2009), is assumed to have a binomial distribution under H1. Under the alternative hypothesis of genetic heterogeneity, we assume that has a mixture distribution as described in Equation (3). This last assumption is worthy of further comments. On the one hand, it is possible to have in Equation (3) under H1 both in practice and in theory; thus, it is conceptually desirable to allow in Equation (3). On the other hand, for likelihood inference, it is not necessary to have in the model in order to achieve the maximum of the likelihood because the model is actually saturated with In other words, for a given dataset, posing a model (3) with or with , the testing results from the LRT are not going to be different. In fact, as will be seen in the next section, our proposed LRT actually has the “non-parametric” nature because it has a closed-form formula, with a simple null distribution shown to be valid,; thus, it will be valid for testing any alternative models including the common models and those under heterogeneity. In this paper, we will establish that the test is actually a LRT under the specified setup motivated by the elegant work of Zhou & Pan (2009) and by the fact that the LRT has well-known optimalities in terms of statistical power and efficiency.
Mixture Binomial and Maximum Likelihood
Assuming the setup in the previous sub-section, under H0, using the notation in Table 1 and denoting the true value of as P0, the MLE of P0 for the overall combined case-control data in Table 1 is
Thus, the binomial likelihood function for the overall combined case-control data evaluated at , L0, is given by
where and is defined in (4). Following Zhou & Pan (2009), in the control group, the genetic score is assumed to follow a binomial distribution under the alternative hypothesis, say
Using the notation in Table 1, the MLE of within the healthy control group only is given by
The binomial likelihood function of the healthy controls data evaluated at , , is given by
Similarly, in the case group, if the genetic score has the distribution , the MLE of within the diseased case group only would be
However, as in Zhou & Pan (2009), we assume that under genetic heterogeneity, the cases can be divided into multiple latent sub-populations. Thus, under the alternative hypothesis of genetic heterogeneity, we assume that has a mixture distribution as described in Equation (3). It can be shown that (see Appendix Appendix), using the above notation, the maximum of the mixture likelihood for has an explicit formula:
The derivation of the above equation can be found in Appendix Appendix. It is also clear from the derivation in Appendix Appendix that the mixture likelihood function of the parameter vector in the mixture model (3) can have many local maxima due to the lack of identifiability in parameters (Liu & Shao, 2003). Nevertheless, the supremum of the mixture likelihood for has a single unique value for each dataset and can be obtained from the explicit formula in Equation (10). In the typical case-control study design, is independent of .
Using the maximum of the likelihood L0, and in Equations (5), (8) and (10), respectively, we can write down the explicit formula of the log-LRT statistic as follows:
No iterative numerical maximization of the mixture likelihood function is needed for the evaluation of the LRT statistic in (11). Thus, the LRT statistic is easy to compute even for GWAS. It is known that the LRT statistics for testing homogeneity in mixture models often have complicated asymptotic distributions that typically lack closed-form representations. However, the above statistic can be shown to have an explicit form of asymptotic distribution under the null hypothesis. More specifically, under H0, as → ∞ and → ∞, we have
where denotes a chi-squared distribution with d degrees of freedom, 1, 2. Although the above asymptotic null distribution can be derived from general results such as those in Chernoff & Lander (1995), Chiano & Yates (1995) or Liu & Shao (2003), an elementary and detailed direct derivation of the above asymptotic null distribution is given in Appendix Appendix for readers who are interested in a direct derivation based on first principles.
It is worth pointing out that our extensive numerical simulations discussed in the next section indicate that the simple asymptotic null distribution in (12) approximates the exact finite sample null distribution very well. The asymptotic formula is only slightly conservative. Therefore, the p-values of the LRT can be easily read off from the above simple closed-form asymptotic null distribution. For example, given any observed data in Table 1, one can first evaluate the value of in (11) and then can obtain the p-value using the following simple command in the widely used R-platform:
Last but not least, it is well known that the LRT generally has better power than other ad hoc tests. Thus, it should not be a surprise to see that the LRT can be more powerful than other commonly used tests which ignore the genetic heterogeneity that exists for many common complex diseases such as breast cancer. Finally, to implement the LRT, there is no need to identify the exact number of mixture components J in (3), which is desirable because J is hard to determine in practice.
Type I Errors
The LRT has an explicit asymptotic distribution under H0. Consequently, it is convenient to evaluate the p-value and type I errors. We conducted comprehensive simulations to compare the empirical type I error of the LRT to the nominal significance level ranging from 10−2 to 10−8. In the Monte Carlo simulations, the genotype data for both the control group and the case group were generated from the same binomial distribution ), where takes some fixed value P0, which represents the MAF. A number of simulation setups, which varied over a range of MAF and sample size, were selected. The control and case sample sizes are set to be equal. The nominal significance levels were taken to be 10−2, 10−3, 10−4, 10−5, 10−6, 10−7 and 10−8, respectively. For each setup, 1011 samples were generated. We found that the empirical type I error is slightly smaller than the nominal level, but they are extremely close to each other. Thus, using the asymptotic null distribution for the LRT is valid. For illustration, an example with 0.4 and sample size 1000 is shown in Table 2.
Table 2. Empirical type I error and nominal significance level at = 0.4 and
The significance level of the association test is usually set very small for GWASs. For example, the genome-wide significance level of 5×10−8 is being increasingly used for arrays that contain one million SNPs. The most commonly used association tests for GWAS include ATT and the test, both applicable for testing association in a 2 × 3 table between the case-control status and the three genotypes, as illustrated in Table 1. Accordingly, we designed simulation studies to evaluate and compare the powers of the LRT, the ATT and the test when the significance level is set to be 5×10−8. Note that, the MLRT of Zhou & Pan (2009) was not included for comparison due to its severe computational challenge when the significance level is very small. In the first set of Monte Carlo simulations, the control sample was generated from a binomial distribution ); the case sample was generated from a two-component mixture binomial distribution as described in (3) with :
Twenty thousand replicate datasets of controls and cases were simulated for each of the eight simulations setup and the empirical power for each test is shown in Table 3. The simulation results indicate that the LRT has power advantage over the ATT and the test under genetic heterogeneity.
Table 3. Empirical power when has a mixture distribution with
|Power|| || || || || || || || |
Similar power advantages of the LRT over other tests are also observed when the alternative mixture model has three components (), as demonstrated in Table 4, where θ3 for the cases is set as equal to for the control group.
Table 4. Empirical power when has a mixture distribution with
|Power|| || || || || || || || |
Note that the ATT, also called Cochran–Armitage trend test (CATT) by many researchers, has good power only when the disease risk of the genotypes and is monotone increasing or decreasing under the alternative hypothesis (Armitage, 1955; Freidlin et al., 2002). Thus, ATT can have very low power when there is a violation of a linear trend in the disease risk across the ordered genotypes and , as in the case of both setups #3 and #5 in Table 3.
It is clear from the power simulations across the multiple simulation setups that the LRT can be much more powerful than the commonly used ATT and the test in GWAS in the presence of genetic heterogeneity.
A Breast Cancer GWAS
Breast cancer is the most common cancer among women. Many genes on different chromosomes that underlie breast cancer have been identified including many well-known studies conducted two decades ago (Hall et al., 1990; Wooster et al., 1994). Many more genetic variants underlying breast cancer are still being discovered nowadays; thus, there is little doubt about the existence of genetic heterogeneity in the case of breast cancer. For illustration, we applied the newly proposed LRT to a breast cancer GWAS dataset. In particular, Turnbull et al. (2010) conducted a GWAS to identify breast cancer susceptibility alleles. They studied 582,886 SNPs in 3659 breast cancer cases and 4897 controls in the first stage, and evaluated promising SNPs that were identified in stage 1 in a second stage with 12,576 cases and 12,223 controls. In the paper, they reported five new susceptibility SNPs with summary genotype data of the five SNPs made publicly available. A literature search indicates that four of the five SNPs (rs1011970, rs10995190, rs704010 and rs614367) have been independently confirmed by other studies since the publication of their GWAS results in 2010 (Peng et al., 2011; Lambrechts et al., 2012). We evaluated the p-values of the LRT, ATT and test for these four SNPs for comparison. The results are summarised in Table 5.
Table 5. Comparison of p-values for the four SNPs reported in Turnbull et al. (2010)
|rs10995190|| Stage 1|
| || Stage 2||10−8|
| ||Fisher p-value|
|rs614367|| Stage 1|
| || Stage 2||10−8|
| ||Fisher p-value||10−16|
|rs704010|| Stage 1|
| || Stage 2|
| ||Fisher p-value|
|rs1011970|| Stage 1|
| || Stage 2|
| ||Fisher p-value|
Note that for the SNP rs10995190 and SNP rs614367, the p-values are smaller than the genome-wide significance level for the newly proposed LRT and the ATT, and for each of the two stages. The performance of the LRT is as good as or better than the other two tests. In particular, the LRT has an extremely small p-value for stage 2 data of SNP rs614367 showing statistical significance at even lower levels. It is thus not surprising that these SNPs are independently replicated by other GWAS. For the SNP rs704010 and SNP 1011970, a simple combined p-value (for combining the two stages), e.g. Fisher's meta p-value, indicates that both SNPs are significant even using the genome-wide significance level for all three tests. The newly proposed LRT is also very competitive for these two SNPs. For example, for the stage 1 data of SNP rs704010, only the p-value of the LRT is smaller than the genome-wide significance level . As an indication of overall strength of the test, Fisher's meta p-value of the LRT from the combined stages 1 and 2 is smaller than those of the other two tests, and the LRT is clearly the most competitive test among the three competitors. This example indicates the potential value of the proposed LRT for GWAS data to detect association of complex diseases where the presence of genetic heterogeneity is always a possibility.
In the analysis of GWAS data, potential latent genetic heterogeneity has largely been ignored by researchers. Zhou & Pan (2009) first proposed mixture models to account for genetic heterogeneity. However, for the analysis of a vast number of SNPs in GWAS, the MLRT of Zhou and Pan has major computational challenges. In this paper, using a more general binomial mixture model, we have derived a LRT for case-control association studies that improve the MLRT by Zhou and Pan on computational efficiency and multiple other aspects. In particular, the LRT statistic has a simple closed-form formula, which could avoid intensive computation, such as the EM algorithm for penalised MLEs. Additionally, we have derived an explicit asymptotic null distribution for the proposed LRT, which is convenient to obtain p-values even at a small significance level. Moreover, to perform the LRT, there is no need to decide the exact number of mixture components, which is convenient in practice. Therefore, the new LRT has computational advantages over the MLRT proposed by Zhou and Pan and is suitable for scanning SNPs in GWAS data.
As demonstrated by our numerical studies, in the presence of genetic heterogeneity, the LRT can be much more powerful than either Armitage's trend test or the test, both of which are among the most widely used tests in GWAS. Given that most complex diseases are widely believed to be polygenic and have environmental components, genetic heterogeneity is a hallmark of complex diseases. As illustrated using the GWAS data for breast cancer, newly proposed LRT can be easily used for any GWAS data; thus, researchers can use the simple algorithm to scan their SNPs as a cost-effective way to potentially make novel and important discoveries using existing data already collected in the large number of GWAS. Given that there are already about 1000 published GWAS, and many more genome-wide studies are being planned and conducted, the new LRT has the potential to become one of the useful tests to scan the SNPs in these GWAS, maybe as a secondary analysis to account for genetic heterogeneity. Thus, the new user-friendly LRT can potentially be used to increase the impact of existing and future GWASs.
This research is partially supported by the NIH Cancer Center Supporting Grant to NYU (2P30 CA16087), and the NIEHS Center Grant to NYU (5P30 ES00260), as well as a Stony Wold-Herbert Foundation grant to YS. The authors have no conflict of interest. The authors would like to thank the reviewers for insightful suggestions that lead to improvement of the paper.
Derivation of the Test Statistic of the LRT
To prove our proposed association test is indeed a LRT under the given setup, we just need to establish Equation (10), that is, when follows the mixture distribution in (3),
where is defined as in Equation (9). First, we want to show that when , the MLEs of the η in (3) satisfy
A simple application of Jensen's inequality yields that, for any η,
The right-hand side of the above inequality is an upper bound which may not be achievable in general. However, when , we can show that the equality in (13) is achievable. In fact, when , there are infinitely many values of the MLE that can make (13) an equality. It is straightforward and elementary to verify that one set of solutions for MLE is given as follows:
As indicated above, can take any values in an interval; thus, there are infinitely many sets of solutions for the MLE. Thus, Equation (15) is proved. Next, we show that when ,
First, we show that for any fixed η,
Using the inequality , we get
It is straightforward to verify that
Therefore, when , and for any η
Finally, it is obvious that
This finishes the proof (14), thus also (10).
The Asymptotic Null Distribution of the LRT
Under H0, both and have the same binomial distribution . We denote the true null value for as P0. Without loss of generality, we assume to avoid appearing in any denominator. First, we may consider testing against , using only the healthy controls. This is a classic problem, the LRT statistic is well known to have a distribution. It is well known that under , the LRT statistic can be written as
where is the MLE of using only the healthy controls. Similarly, we may consider testing against , using only the diseased cases. Then, under , the LRT statistic has distribution and can be written as
where is the MLE of using only the diseased cases. Similarly, we may consider testing against , using the overall sample combining both the diseased cases and health controls. Then, the MLE for from the combined sample is as defined in (4). The LRT statistic can be written as
From the above three equations, and the Equations (5), (8) and (10), we have, when 4n0n2 <,
Denote . Then, it is straightforward to verify that
Note that and and and are independent. Thus,
Therefore, when , we have .
On the other hand, under H0, when 4n0, we can first consider testing goodness-of-fit of using only the diseased cases. The LRT statistic has a asymptotic distribution and can be written as
The first term at the right-hand side of the last equality is equivalent to the Pearson's classic χ2 statistic (via comparing observed to expected cell frequencies) for testing Hardy–Weinberg equilibrium which is well known to have the distribution (Emigh, 1980). Using the above equations, when , we have
By Equations (23) and (24), from the above equation, we have
Note that the two terms in the right-hand side of (18) are well known to be asymptotically independent which, in turn, implies asymptotic independence of the two terms at the right-hand side of (19). Therefore, when , we have
Finally, it suffices to show that as Note that under H0, follow a multinomial distribution π0, π1, π2), where for . Let be the random vector ( , , ). Then, we have (Bickel & Docksum, 2000)
Let denote and denote . Under H0, then
By the central limit theorem and the multivariate delta method, has an asymptotic normal distribution with mean 0. That is
Thus, under H0, as ∞,
This finishes the proof of the following convergence in distribution, under H0,