## Introduction

Common and complex diseases (or traits) are often genetically heterogeneous in aetiology (Lander & Schork, 1994; Zhou & Pan, 2009). Well-known complex diseases with genetic heterogeneity include asthma, breast cancer (Hall et al., 1990; Wooster et al., 1994; Turnbull et al., 2010), and diabetes (Hattersley, 1998; Sladek et al., 2010). As in Zhou & Pan (2009), this paper considers the situation in which a complex disease (or trait) is caused by mutations at multiple unlinked loci, commonly referred to as locus heterogeneity (Ott, 1999; Abreu et al., 2002; Fu et al., 2006). As a consequence of genetic heterogeneity, the population of individuals with disease may be decomposed into latent sub-populations, each with disease caused by mutations at different loci (or combinations of loci). Most existing association tests for population-based case-control studies, e.g. in genome-wide association studies (GWAS), compare the mean genotype scores of the case and control groups (e.g. the Armitage trend test (ATT)) and are not efficient in the presence of genetic heterogeneity. Zhou & Pan (2009) showed that it can be beneficial to use methods that account for genetic heterogeneity when testing for association in a case-control study.

Similar to admixture mapping in linkage analysis (Smith, 1963; Abreu et al., 2002; Fu et al., 2006), Zhou & Pan (2009) proposed a binomial mixture model to account for genetic heterogeneity and developed a modified likelihood ratio test (MLRT) for a single locus (Fu et al., 2006). They also considered two methods for combining single-locus MLRTs across multiple loci in linkage disequilibrium to boost power when causal single nucleotide polymorphisms (SNPs) are not genotyped (Zhou & Pan, 2009). They illustrated, with a wide spectrum of numerical examples, that the proposed MLRTs are more powerful than some commonly used association tests under genetic heterogeneity. Following Zhou & Pan (2009), we define the genetic score *X* as the number of minor alleles at a single locus for a subject. Zhou & Pan (2009) assumed that the genetic score of a healthy control subject follows a binomial distribution, that is

$$X_0 \sim \mathrm{Bin}(2, \theta_0), \tag{1}$$

where $X_0$ denotes the genetic score of a control subject and $\theta_0 \in (0, 1)$ represents the minor allele frequency (MAF) at that locus in controls. On the other hand, under genetic heterogeneity, the genetic score for a diseased subject, $X_1$, follows a simple two-component binomial mixture distribution,

$$X_1 \sim (1 - \alpha)\,\mathrm{Bin}(2, \theta_0) + \alpha\,\mathrm{Bin}(2, \theta), \tag{2}$$

where $\alpha \in [0, 1]$ is the proportion of cases in the sub-group whose disease is associated with the minor allele, and $\theta$ represents the probability of having the minor allele on one chromosome for that sub-group. They adopted a two-step procedure for parameter estimation. First, a maximum likelihood estimate (MLE) of $\theta_0$ is obtained based only on the control sample. Then, with $\theta_0$ fixed at this estimate, maximum penalised likelihood estimates of the other parameters in the mixture model are obtained using an expectation-maximization (EM)-type algorithm (Li et al., 2009). Subsequently, the penalised MLEs from the EM step are plugged into a likelihood ratio to form a test statistic for the association between the marker genotypes and disease status. Finally, they proposed a permutation procedure to obtain the *p*-value of the association test.
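The two-step estimation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Zhou & Pan's implementation: it uses a plain EM without the penalty term of Li et al. (2009), a single starting point, and simulated data. The function names, starting values, sample sizes, and simulation parameters (`0.2`, `0.6`, `0.3`) are all hypothetical. Because the genetic score takes only the values 0, 1, and 2, the sketch works with genotype counts rather than individual scores.

```python
import math
import random

COEF = (1, 2, 1)  # binomial coefficients C(2, k) for k = 0, 1, 2


def bin2_pmf(k, theta):
    """P(X = k) for X ~ Bin(2, theta)."""
    return COEF[k] * theta**k * (1 - theta) ** (2 - k)


def genotype_counts(scores):
    """Tabulate scores in {0, 1, 2} into counts [n0, n1, n2]."""
    counts = [0, 0, 0]
    for s in scores:
        counts[s] += 1
    return counts


def control_mle(counts):
    """Step 1: MLE of theta0 from controls = minor-allele count / (2 * subjects)."""
    return (counts[1] + 2 * counts[2]) / (2 * sum(counts))


def mixture_loglik(counts, theta0, alpha, theta):
    """Log-likelihood of counts under (1 - alpha) Bin(2, theta0) + alpha Bin(2, theta)."""
    return sum(
        n * math.log((1 - alpha) * bin2_pmf(k, theta0) + alpha * bin2_pmf(k, theta))
        for k, n in enumerate(counts)
        if n > 0
    )


def em_cases(counts, theta0, alpha=0.5, theta=0.5, n_iter=2000, tol=1e-12):
    """Step 2: plain EM for (alpha, theta) with theta0 held fixed (no penalty term)."""
    n = sum(counts)
    for _ in range(n_iter):
        # E-step: for each genotype k, posterior probability of the associated sub-group.
        w = []
        for k in range(3):
            assoc = alpha * bin2_pmf(k, theta)
            base = (1 - alpha) * bin2_pmf(k, theta0)
            w.append(assoc / (assoc + base))
        # M-step: weighted sub-group proportion and weighted allele frequency.
        weight = sum(counts[k] * w[k] for k in range(3))
        if weight == 0:  # the sub-group has vanished; nothing left to update
            break
        new_alpha = weight / n
        new_theta = sum(k * counts[k] * w[k] for k in range(3)) / (2 * weight)
        change = abs(new_alpha - alpha) + abs(new_theta - theta)
        alpha, theta = new_alpha, new_theta
        if change < tol:
            break
    return alpha, theta


rng = random.Random(0)  # fixed seed so the sketch is reproducible


def draw_score(theta):
    """Simulate one genetic score: two alleles, each minor with probability theta."""
    return int(rng.random() < theta) + int(rng.random() < theta)


# Hypothetical data: controls ~ Bin(2, 0.2); 30% of cases ~ Bin(2, 0.6), the rest ~ Bin(2, 0.2).
controls = [draw_score(0.2) for _ in range(5000)]
cases = [draw_score(0.6 if rng.random() < 0.3 else 0.2) for _ in range(5000)]

theta0_hat = control_mle(genotype_counts(controls))
case_counts = genotype_counts(cases)
alpha_hat, theta_hat = em_cases(case_counts, theta0_hat)

# Likelihood ratio statistic: fitted mixture versus the null alpha = 0.
lr = 2 * (
    mixture_loglik(case_counts, theta0_hat, alpha_hat, theta_hat)
    - mixture_loglik(case_counts, theta0_hat, 0.0, theta_hat)
)
print(theta0_hat, alpha_hat, theta_hat, lr)
```

The sketch stops at the statistic and does not attempt the permutation step. Even in this toy form, each marker requires an iterative fit whose result can depend on the starting values of `alpha` and `theta`.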

Zhou and Pan's idea is applicable to an association study of a limited number of candidate markers; however, there are several challenges in applying their method to genome-wide association studies (GWASs). First, computing the MLRT for the vast number of SNPs in a typical GWAS would be very intensive. The penalised MLEs are obtained by an EM algorithm that maximizes the penalised mixture likelihood, and EM and other iterative methods for finding MLEs and penalised MLEs in mixture models have known complexities and caveats, including the difficulty of selecting multiple starting points for parameter estimation. Moreover, the *p*-value of the MLRT is obtained by permutation, which is difficult to apply directly to SNP-disease association in a GWAS with a vast number of SNPs, where the significance level is usually set below $10^{-6}$. In addition, it is widely believed that complex diseases and traits are caused by interplays of a large number of genetic loci and environmental risk factors. The two-component binomial mixture model in Equation (2) may be too simple to capture the heterogeneity of many complex diseases. A more general binomial mixture model can be written as follows:

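$$X_1 \sim \sum_{j=1}^{J} \alpha_j\,\mathrm{Bin}(2, \theta_j), \tag{3}$$

where $X_1$ is the genetic score of a diseased subject, $\alpha_j \geq 0$, $\sum_{j=1}^{J} \alpha_j = 1$, and $\theta_i = \theta_j$ if and only if $i = j$. In particular, for many complex diseases with genetic heterogeneity, *J* is likely to be quite large. Since the number of sub-populations *J* is hard to know under genetic heterogeneity, it is desirable to have a new test that can be applied without knowing the exact value of *J* while allowing *J* to be large.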
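To make the general mixture model concrete, the following sketch computes the genotype probabilities of a mixture of Bin(2, θ_j) components with weights α_j and compares them with a single binomial that has the same mean. The weights and allele frequencies are hypothetical values chosen for illustration. The identities checked at the end are a standard property of binomial mixtures rather than a result stated in the text: relative to a single binomial with the weighted-average allele frequency, the mixture has fewer heterozygotes and more homozygotes of both kinds, by an amount governed by the spread of the θ_j. This is one way to see why a comparison of mean genotype scores alone cannot capture heterogeneity.

```python
COEF = (1, 2, 1)  # binomial coefficients C(2, k) for k = 0, 1, 2


def mixture_pmf(alphas, thetas):
    """Return [P(X=0), P(X=1), P(X=2)] under sum_j alpha_j * Bin(2, theta_j)."""
    assert all(a >= 0 for a in alphas) and abs(sum(alphas) - 1.0) < 1e-12
    return [
        sum(a * COEF[k] * t**k * (1 - t) ** (2 - k) for a, t in zip(alphas, thetas))
        for k in range(3)
    ]


# Hypothetical mixture with J = 3 components.
alphas = [0.7, 0.2, 0.1]
thetas = [0.2, 0.5, 0.8]

pmf = mixture_pmf(alphas, thetas)

# Weighted-average allele frequency and weighted spread of the theta_j.
theta_bar = sum(a * t for a, t in zip(alphas, thetas))
spread = sum(a * (t - theta_bar) ** 2 for a, t in zip(alphas, thetas))

# Single binomial with the same mean genotype score 2 * theta_bar.
single = mixture_pmf([1.0], [theta_bar])

# Both distributions have the same mean score ...
mean_mix = pmf[1] + 2 * pmf[2]
mean_single = single[1] + 2 * single[2]

# ... but differ in shape: each homozygote gains `spread`, heterozygotes lose 2 * spread.
print("mixture:", pmf)
print("single :", single)
print("spread :", spread)
```

With a larger *J* the same function applies unchanged; only the lengths of `alphas` and `thetas` grow.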

In this paper, we developed a likelihood ratio test (LRT) for GWASs based on the more flexible binomial mixture model in (3). We assume that the genetic score in the case group follows the general binomial mixture distribution in (3), which allows for a large and unknown *J*. The proposed LRT overcomes the above-mentioned challenges of using Zhou and Pan's method to test association for the vast number of SNPs in a typical GWAS. In particular, we derived a closed-form formula for the LRT statistic even though the MLEs of the parameters in the binomial mixture model are non-regular, with loss of identifiability (Liu & Shao, 2003). We further derived a simple closed-form asymptotic null distribution of the LRT, which avoids intensive numerical calculations such as EM iterations to find the MLEs and permutations to evaluate *p*-values. Additionally, the LRT can be implemented without knowing the number of components *J* in the mixture model (3). We conducted extensive simulation studies, which show that the LRT has power advantages over the Armitage trend test (ATT) and some other association tests under genetic heterogeneity. We applied our test to a real dataset from a breast cancer GWAS to illustrate that it can achieve a much smaller *p*-value than some commonly used tests when there is evidence of genetic heterogeneity. Thus, the proposed LRT might be used to scan SNPs in GWAS and make novel discoveries by taking account of genetic heterogeneity.