The Merits of Testing Hardy-Weinberg Equilibrium in the Analysis of Unmatched Case-Control Data: A Cautionary Note


  • Guang Yong Zou,

    Corresponding author
    1. Department of Epidemiology and Biostatistics, Schulich School of Medicine and Dentistry, University of Western Ontario, London, Ontario, Canada and Robarts Clinical Trials, Robarts Research Institute, London, Ontario, Canada
  • Allan Donner

    1. Department of Epidemiology and Biostatistics, Schulich School of Medicine and Dentistry, University of Western Ontario, London, Ontario, Canada and Robarts Clinical Trials, Robarts Research Institute, London, Ontario, Canada

*Corresponding author: Guang Yong Zou, Robarts Clinical Trials Robarts Research Institute, London, Ontario, Canada N6A 5K8 Tel: 519-663-3400 Ext 34092 Fax: 519-663-3807 Email:


Testing for departures from the assumption of Hardy-Weinberg equilibrium (HWE) has been widely recommended as a preliminary step in the analysis of genetic case-control studies. Some authors suggest using a two-stage procedure in which gene/disease associations are ultimately evaluated using either the Pearson chi-square procedure or the Cochran-Armitage test for trend. Other authors go further and encourage investigators to discard data that are in violation of HWE, essentially using the test as a tool for identifying genotyping errors. In this paper we show that 1) testing for HWE should not be used as a tool to identify genotyping errors; and 2) it is not necessary, and possibly even harmful, to test the HWE assumption before testing for association between alleles and disease. Instead one should inherently account for deviations from HWE with an adjusted chi-square test statistic, a procedure which in the present context is identical to the trend test. Examples from previous reports are used to illustrate the methodology.


Over the past century the so-called Hardy-Weinberg law or Hardy-Weinberg equilibrium (HWE) (Stern, 1943; Chakraborty, 2001) has become well entrenched in the genetics literature, after Hardy (1908) and Weinberg (1908) introduced the notion that alleles at a locus are independent. Many methods have subsequently been developed for testing the validity of HWE, as summarized and compared by Emigh (1980). Some computational refinements and extensions have also recently appeared in the literature (Guo & Thompson, 1992; Meulepas, 1999; Montoya-Delgado et al. 2001; Aoki, 2003; Bourgain et al. 2004; Wigginton et al. 2005; Chen et al. 2005).

Several possible explanations for deviations from HWE have been proposed, including admixture and selection bias. However, as stated by Shoemaker et al. (1998), “a population will never be exactly in HWE”. Even so, researchers commonly accept HWE as an underlying model and rely on it in their research. For example, Cheng & Chen (2005) and Cheng & Lin (2005) proposed methods that rely on the assumption of HWE. On the other hand, Lee (2003) assumed the general population is in HWE and suggested that testing for HWE could be a useful tool for searching for disease genes, while Whittke-Thompson et al. (2005) based their ‘rational’ inference on the assumption that ‘the susceptibility locus will be in HWE in the general population’.

It has been a particularly common misconception that testing for HWE can be used to identify genotype error, and thus ‘quality control provided by a Hardy-Weinberg test should be an essential part of any genome scan or other application of DNA typing, and the claim that a new typing method is better than an old one should not be published without evidence of acceptable fit to Hardy-Weinberg frequencies’ (Gomes et al. 1999, p 535). Similar statements have also appeared in a journal editorial (Weiss et al. 2001). Moreover, linking genotyping error to deviations from HWE has appeared repeatedly without justification in the literature (Xu et al. 2002; Hosking et al. 2004; Salanti et al. 2005).

In the context of genetic association studies, testing for departures from HWE has been repeatedly recommended as an essential step (Tiret & Cambien, 1995; Schaid & Jacobsen, 1999; Silverman & Palmer, 2000; Campbell & Rudan, 2002; Xu et al. 2002; Attia et al. 2003; Schulze et al. 2003; Kocsis et al. 2004; Thakkinstian et al. 2005a), with the current SAS/Genetics Manual (SAS, 2005, p. 49) stating:“… it may be desirable to run the ALLELE procedure on the data to perform the HWE test on each marker … prior to applying PROC CASECONTROL”. Some writers in fact have zero tolerance for departures from HWE, suggesting that such data be entirely excluded from a planned meta-analysis (e.g. Munafo & Flint, 2004; Freedman et al. 2004; Feinn et al. 2005). A somewhat softer position was taken by Thakkinstian et al. (2005b) who assigned higher quality scores to studies in which tests for HWE were conducted. Other authors have proposed a two-stage testing procedure. Thus, depending on the conclusions resulting from testing the HWE assumption, Millikan et al. (2003) and Relton et al. (2004) suggest that one should apply either the standard Pearson chi-square test to the allele counts, or the Cochran-Armitage test for trend to the genotype counts (Sasieni, 1997). Other researchers (Jahnes et al. 2002) suggest that results obtained from both tests should be presented.

One purpose of this paper is to point out that genotyping error is unlikely to be revealed by testing for deviations from HWE. As the assumptions underlying HWE, including random mating, lack of selection according to genotype, and absence of mutation or migration, are rarely met in human populations (Shoemaker et al. 1998; Ayres & Overall, 1999), a second purpose is to show that a test which adjusts for such deviations should be conducted a priori. Our recommendations are supported by exact numerical results showing that, aside from unnecessarily lengthening the computational effort, implementation of a two-stage procedure will, in general, provide a distorted type 1 error rate. The underlying reason for this distortion is that the acceptance of HWE using a significance test does not prove independence between two alleles (Wellek, 2004), i.e., the P-value cannot be interpreted as the probability of HWE being true, and thus a key assumption underlying the validity of the Pearson chi-square test may still be violated. Multiple testing with one data set is also a concern. As Swinscow & Campbell (2002) note (p. 74) ‘there is something illogical about using one significance test conditional on the results of another significance test.’


The fallacy of attributing deviations from HWE to genotyping errors

We assume that the status of a candidate gene at a certain locus in a given individual can be categorized into one of two classes, such as in the case of a single nucleotide polymorphism (SNP): let A denote a high risk variant and a the other allele. Let π be the frequency for A and 1 −π for a in the population of interest. Under HWE, the probability distribution for the three genotypes aa, Aa, and AA can then be modelled as $p_0 = (1-\pi)^2$, $p_1 = 2\pi(1-\pi)$, and $p_2 = \pi^2$, respectively. The intrasubject allelic correlation can then be defined as

$$\rho = 1 - \frac{p_1}{2\pi(1-\pi)},$$

which is 0.

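Now, let ɛ1 and ɛ2 be the misclassification rates for a → A and A → a, respectively. The probability distribution for the three genotypes then becomes (Gordon et al. 2001)

$$p_{0e} = \left[(1-\pi)(1-\varepsilon_1) + \pi\varepsilon_2\right]^2, \quad p_{1e} = 2\left[(1-\pi)(1-\varepsilon_1) + \pi\varepsilon_2\right]\left[\pi(1-\varepsilon_2) + (1-\pi)\varepsilon_1\right], \quad p_{2e} = \left[\pi(1-\varepsilon_2) + (1-\pi)\varepsilon_1\right]^2, \tag{1}$$

with the allelic correlation defined by

$$\rho_e = 1 - \frac{p_{1e}}{2\pi_e(1-\pi_e)}, \qquad \pi_e = p_{2e} + \tfrac{1}{2}p_{1e}.$$
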
As commented by Akey et al. (2001), the model in (1) is reasonable for genotyping methods that rely on hybridization for discrimination of SNP alleles, such as TaqMan and oligonucleotide arrays. It is also straightforward to show that $\rho_e = 0$ regardless of the values of ɛ1 and ɛ2. This result implies that the presence of allelic correlation is independent of the existence of genotyping error, and thus it would be futile to use a test of HWE to help identify genotyping errors. This is consistent with a previous observation that misclassification errors, unlike deviations from HWE, do not invalidate the test for association (Bross, 1954; Mote & Anderson, 1965), and also justifies the assertion that the presence of a di-allelic locus in HWE is not altered by the introduction of a genotyping error (Gordon et al. 2002). As a referee has pointed out, this conclusion is reached by assuming that genotyping errors in control and case groups are indistinguishable, i.e., the misclassification error rate is non-differential, a reasonable assumption provided the case-control status is blinded to the technician performing the genotyping.
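The claim that allelic misclassification leaves the allelic correlation at zero can be checked numerically. The following is a minimal sketch, assuming that each of the two alleles of a subject is misread independently (a read as A with probability `eps1`, A read as a with probability `eps2`); the function names and the chosen error rates are illustrative only and are not taken from the paper.

```python
def observed_genotype_probs(pi, eps1, eps2):
    """Observed (aa, Aa, AA) probabilities when the true genotypes are in HWE
    and each allele is misread independently (an assumed error model).

    eps1: probability that a is read as A; eps2: probability that A is read as a.
    """
    true_probs = ((1 - pi) ** 2, 2 * pi * (1 - pi), pi ** 2)
    # Rows: true genotype aa, Aa, AA. Columns: observed aa, Aa, AA.
    misread = (
        ((1 - eps1) ** 2, 2 * eps1 * (1 - eps1), eps1 ** 2),
        (eps2 * (1 - eps1), (1 - eps2) * (1 - eps1) + eps2 * eps1, (1 - eps2) * eps1),
        (eps2 ** 2, 2 * eps2 * (1 - eps2), (1 - eps2) ** 2),
    )
    return tuple(
        sum(true_probs[g] * misread[g][obs] for g in range(3)) for obs in range(3)
    )


def allelic_correlation(p0, p1, p2):
    """Intrasubject allelic correlation implied by genotype probabilities."""
    pi = p2 + p1 / 2  # frequency of allele A
    return 1 - p1 / (2 * pi * (1 - pi))


# Hypothetical error rates; the correlation stays at zero up to rounding error.
for eps1, eps2 in [(0.0, 0.0), (0.01, 0.05), (0.10, 0.02)]:
    rho_e = allelic_correlation(*observed_genotype_probs(0.3, eps1, eps2))
    print(f"eps1={eps1}, eps2={eps2}: rho_e={rho_e:.1e}")
```

Under this assumed model the errors change the apparent allele frequency but not the independence of the two alleles, which is why a test of HWE cannot flag them.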

An alternative genotyping error model parameterized in terms of possible misclassification error has also appeared (Kang et al. 2004). Without loss of generality, one can set the probability of misclassifying AA as aa or vice versa as zero in the model. The resulting formulation has been discussed by Douglas et al. (2002) in which the probability distribution for the three genotypes is given by


where γ is the probability of classifying AA or aa as Aa, and η is that of classifying Aa as AA or aa. We can define the allelic correlation ρ as


Although ρ, in general, is negative, its magnitude is rather small. Figure 1 demonstrates the relationship between ρ and the underlying error rate, where without loss of generality, we have assumed γ= 2 η. This relationship clearly indicates that the contribution of error to allelic-correlation is far less than the actual values of correlations (Figure 2) from 60 case-control studies (Whittke-Thompson et al. 2005), where the method developed by Donner & Eliasziw (1992) was used to construct confidence intervals. This result further suggests that testing for HWE has limited value in the detection of genotyping errors occurring in genotype classification, even with an error rate as high as 10% in classifying AA or aa as Aa.

Figure 1.

Intrasubject allelic correlation (ρ) as affected by genotyping error rate (η) and allele (a) frequency (1 −π) based on a genotype misclassification model by Douglas et al. (2002).

Figure 2.

Estimated intrasubject correlations ($\hat{\rho}$) with 95% confidence intervals based on a method by Donner & Eliasziw (1992) for 60 datasets (from Table 1 of Whittke-Thompson et al. 2005).

Genetic association testing

Instead of relying on the assumption of HWE, we adopt a model that inherently takes the presence of allelic correlation into account. The probability distribution for the three genotypes can then be modelled (Fyfe & Bailey, 1951) as

$$p_0 = (1-\pi)^2 + \rho\pi(1-\pi), \quad p_1 = 2\pi(1-\pi)(1-\rho), \quad p_2 = \pi^2 + \rho\pi(1-\pi). \tag{3}$$

The parameter ρ in (3) is commonly referred to as the inbreeding coefficient in the genetics literature and as the intraclass correlation coefficient in the statistical literature (Mak, 1988; Bloch & Kraemer, 1989; Donner & Eliasziw, 1992). As pointed out by Ayres & Balding (1998), this model is completely general in representing deviations from HWE for a di-allelic locus. Note that (i) HWE is satisfied only when ρ= 0, and (ii) ρ is bounded below by max [−π/(1 −π), − (1 −π)/π], the value at which one of the homozygote probabilities reaches zero.
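As an illustration of this model, the sketch below computes the three genotype probabilities for a given allele frequency and allelic correlation, assuming the usual inbreeding-coefficient parameterization in which the heterozygote probability is 2π(1 −π)(1 −ρ). The function names are hypothetical.

```python
def rho_lower_bound(pi):
    """Smallest admissible allelic correlation for allele frequency pi."""
    return max(-pi / (1 - pi), -(1 - pi) / pi)


def genotype_probs(pi, rho):
    """Probabilities of (aa, Aa, AA) given allele frequency pi and correlation rho."""
    if not 0 < pi < 1:
        raise ValueError("pi must lie strictly between 0 and 1")
    if rho > 1 or rho < rho_lower_bound(pi) - 1e-12:
        raise ValueError("rho is outside the admissible range for this pi")
    q = pi * (1 - pi)
    return ((1 - pi) ** 2 + rho * q, 2 * q * (1 - rho), pi ** 2 + rho * q)


print(genotype_probs(0.3, 0.0))                   # HWE proportions
print(genotype_probs(0.3, 0.2))                   # excess of homozygotes
print(genotype_probs(0.3, rho_lower_bound(0.3)))  # AA probability is driven to zero
```

Positive values of ρ shift probability from heterozygotes to both homozygotes, while negative values do the opposite until one homozygote class becomes empty.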

For a data set containing $n_0$ subjects with genotype aa, $n_1$ with Aa and $n_2$ with AA, the maximum likelihood estimator (MLE) for π has been derived (Fyfe & Bailey, 1951; Bloch & Kraemer, 1989) as

$$\hat{\pi} = \frac{2n_2 + n_1}{2n}, \tag{4}$$

with variance estimated by

$$\widehat{\mathrm{var}}(\hat{\pi}) = \frac{\hat{\pi}(1-\hat{\pi})(1+\hat{\rho})}{2n}, \tag{5}$$

where $n = n_0 + n_1 + n_2$. Moreover, the MLE of ρ is given by

$$\hat{\rho} = 1 - \frac{n_1}{2n\hat{\pi}(1-\hat{\pi})}. \tag{6}$$

The estimator $\hat{\rho}$ is also known as the intraclass kappa statistic (Bloch & Kraemer, 1989; Donner & Eliasziw, 1992), while the variance estimator in (5) has previously been obtained by Mendell & Simon (1984).
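These estimators are simple to compute from the three genotype counts. The sketch below assumes the standard closed forms, namely the allele-count proportion for the frequency, one minus the ratio of observed to expected heterozygotes for the correlation, and a binomial variance over 2n alleles inflated by one plus the estimated correlation. The function name is hypothetical; the counts are the control-group genotypes from the TPH1 example reported later in the paper.

```python
def allele_estimates(n0, n1, n2):
    """Return (pi_hat, rho_hat, var_hat) from counts of aa, Aa and AA subjects."""
    n = n0 + n1 + n2
    pi_hat = (2 * n2 + n1) / (2 * n)
    if pi_hat in (0.0, 1.0):
        raise ValueError("rho is not estimable when only one allele is observed")
    rho_hat = 1 - n1 / (2 * n * pi_hat * (1 - pi_hat))
    var_hat = pi_hat * (1 - pi_hat) * (1 + rho_hat) / (2 * n)
    return pi_hat, rho_hat, var_hat


# TPH1 controls: 15 CC, 58 AC, 6 AA, taking A as the allele of interest.
pi_hat, rho_hat, var_hat = allele_estimates(15, 58, 6)
print(f"pi_hat={pi_hat:.3f}, rho_hat={rho_hat:.2f}, var_hat={var_hat:.5f}")
```

The estimated correlation for these controls rounds to the value of −0.49 quoted in the first example.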

We now consider a case-control study involving n subjects, including m1 cases and m2 controls. The data may be conveniently summarized in a 2 × 3 table for genotypes, which can also be recast into a 2 × 2 table for alleles (see Table 1).

Table 1.  Data layout of a genetic case-control study

Since the mode of inheritance is usually unknown, we focus discussion on the additive effect of allele A. Note that the null hypothesis of no association between alleles and disease is equivalent to testing H0: π1 = π2, where π1 and π2 are the underlying proportions of allele A for the case and control populations respectively. Two asymptotically equivalent tests can be constructed. The first, based on a Wald statistic, is given by

$$W = \frac{(\hat{\pi}_1 - \hat{\pi}_2)^2}{\widehat{\mathrm{var}}(\hat{\pi}_1) + \widehat{\mathrm{var}}(\hat{\pi}_2)},$$

where, for i = 1, 2, $\hat{\pi}_i$ and $\widehat{\mathrm{var}}(\hat{\pi}_i)$ may be obtained using equations (4) and (5), respectively. Under H0, W is asymptotically distributed as chi-square with one degree of freedom.

Alternatively, one can estimate the variance of $\hat{\pi}_1 - \hat{\pi}_2$ under H0 and obtain an adjusted chi-square statistic given by

$$\chi^2_A = \frac{(\hat{\pi}_1 - \hat{\pi}_2)^2}{\hat{\pi}(1-\hat{\pi})(1+\hat{\rho})\left(\dfrac{1}{2m_1} + \dfrac{1}{2m_2}\right)} = \frac{\chi^2_P}{1+\hat{\rho}},$$

where $\hat{\pi}$ is the pooled estimate of allele frequency for A, $\hat{\rho}$ is the estimator in (6) computed from the pooled genotype counts, and $\chi^2_P$ is the standard Pearson chi-square statistic computed from the allele counts in Table 1. Note that the statistics W and $\chi^2_A$ are identical to the squares of z-statistics considered by Schaid & Jacobsen (1999). Moreover, as pointed out by Knapp (2001), $\chi^2_A$ is identical to the Cochran-Armitage trend test as applied to the genotype counts in Table 1 (Sasieni, 1997), and is also a special case of the test statistic considered by Donner et al. (1981) and Donner (1989). We also note that $\chi^2_A$ can be extended easily to handle a multiallelic locus, by applying the Pearson chi-square statistic to a 2 × k contingency table where k is the number of alleles, combined with the inbreeding coefficient estimator discussed by Ayres & Balding (1998). This would then provide an intuitive alternative to the statistic discussed by Czika & Weir (2004).
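The three statistics can be sketched as follows, assuming that the Pearson statistic is computed on allele counts, that the adjusted statistic divides it by one plus the allelic correlation estimated from the pooled genotype counts, and that the Wald statistic uses group-specific variance estimates. Function names are hypothetical. The data are those of the methamphetamine example (Lin et al. 2003) reported later in the paper, with A taken as the allele of interest.

```python
def allele_estimates(n0, n1, n2):
    """Return (pi_hat, rho_hat, var_hat) from counts of aa, Aa and AA subjects."""
    n = n0 + n1 + n2
    pi_hat = (2 * n2 + n1) / (2 * n)
    rho_hat = 1 - n1 / (2 * n * pi_hat * (1 - pi_hat))
    return pi_hat, rho_hat, pi_hat * (1 - pi_hat) * (1 + rho_hat) / (2 * n)


def association_tests(cases, controls):
    """Pearson (allele-based), adjusted and Wald statistics for two genotype triples.

    Each argument is (n_aa, n_Aa, n_AA); polymorphic data are assumed.
    """
    m1, m2 = sum(cases), sum(controls)
    pi1, _, var1 = allele_estimates(*cases)
    pi2, _, var2 = allele_estimates(*controls)
    pooled_counts = [c + d for c, d in zip(cases, controls)]
    pi, rho, _ = allele_estimates(*pooled_counts)
    diff_sq = (pi1 - pi2) ** 2
    pearson = diff_sq / (pi * (1 - pi) * (1 / (2 * m1) + 1 / (2 * m2)))
    return {
        "pearson": pearson,
        "adjusted": pearson / (1 + rho),
        "wald": diff_sq / (var1 + var2),
    }


# Cases: 39 GG, 42 AG, 24 AA; controls: 50 GG, 83 AG, 55 AA.
stats = association_tests((39, 42, 24), (50, 83, 55))
print({name: round(value, 3) for name, value in stats.items()})
```

With these counts the Pearson and adjusted statistics agree with the values of 3.874 and 3.381 quoted in the paper for this data set.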

Finally, one can also show that $\chi^2_A$ is virtually identical to a score test statistic arising from the generalized estimating equations approach (Liang & Zeger, 1986) in which each subject is treated as a cluster of size two. This insight suggests a general approach for incorporating covariates into the analysis as well as unifying most test statistics developed for both matched and unmatched case-control designs (Zou, 2006). We note that instead of treating disease status as a response variable, this model relies on the invariance property of the odds ratio (Cornfield, 1951) to allow allele status to be treated as the dependent variable.

Our results can also be seen to greatly simplify sample size calculations for the design of a genetic case-control study. In particular, the required number of subjects can be obtained by multiplying the standard sample size formula for comparing two proportions by an estimate of (1 +ρ)/2, which is straightforward to obtain from a previous study provided the authors report genotypes as suggested by Little et al. (2002). We note that a sample size formula for the linear trend test has been derived by Chapman & Nam (1968). In the context of genetic case-control studies Slager & Schaid (2001) also discussed sample size estimation for the linear trend test, based on prior estimates of prevalence rates and genotype risk information that may not be easily available.
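A rough sketch of this calculation is given below. It assumes one common textbook form of the two-proportion sample size formula (two-sided test, equal group sizes, pooled variance under the null hypothesis); the paper does not specify which version it has in mind, and the allele frequencies and correlation used in the example are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def observations_per_group(p1, p2, alpha=0.05, power=0.80):
    """Assumed standard formula: observations per group to compare two proportions."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return numerator / (p1 - p2) ** 2


def subjects_per_group(p1, p2, rho, alpha=0.05, power=0.80):
    """Scale the standard result by (1 + rho) / 2 to obtain subjects per group."""
    return ceil(observations_per_group(p1, p2, alpha, power) * (1 + rho) / 2)


# Hypothetical design: allele frequencies 0.30 vs 0.20, allelic correlation 0.1.
print(subjects_per_group(0.30, 0.20, rho=0.1))
```

The factor (1 +ρ)/2 reflects that each subject contributes two alleles: with ρ= 0 the two alleles count as two independent observations, while with ρ= 1 they carry the information of only one.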

Evaluation of hypothesis testing procedures

We now evaluate the empirical type 1 error rates for the statistics $\chi^2_P$, W and $\chi^2_A$ (all computed without performing a preliminary test for departures from HWE), as compared to that of a two-stage procedure ($\chi^2_{2S}$) which calls for computing $\chi^2_P$ if a test of HWE in the control group is not rejected, and for computing $\chi^2_A$ otherwise.

Let $\hat{\rho}_2$ denote the estimator defined in (6) as computed from control group data only. The test for departures from HWE can be performed using a goodness-of-fit procedure calculated by referring $S = m_2\hat{\rho}_2^2$ to tables of the chi-square distribution with one degree of freedom at the 5% level of significance (Emigh, 1980).
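The preliminary test and the resulting choice of statistic can be sketched as below, assuming the goodness-of-fit statistic equals the number of controls multiplied by the squared allelic correlation estimate. Function names are hypothetical, and the control-group counts are those of the three examples reported later in the paper.

```python
from statistics import NormalDist


def hwe_statistic(n0, n1, n2):
    """Goodness-of-fit statistic S = n * rho_hat**2 from counts of aa, Aa, AA."""
    n = n0 + n1 + n2
    pi_hat = (2 * n2 + n1) / (2 * n)
    rho_hat = 1 - n1 / (2 * n * pi_hat * (1 - pi_hat))
    return n * rho_hat ** 2


def two_stage_choice(controls, alpha=0.05):
    """Name the statistic that the two-stage procedure would go on to compute."""
    # The upper-alpha chi-square(1) point is the square of a standard normal quantile.
    critical = NormalDist().inv_cdf(1 - alpha / 2) ** 2
    return "adjusted" if hwe_statistic(*controls) > critical else "pearson"


for label, controls in [
    ("TPH1", (15, 58, 6)),
    ("CR1", (49, 17, 5)),
    ("GABAA alpha-1", (50, 83, 55)),
]:
    print(label, round(hwe_statistic(*controls), 3), two_stage_choice(controls))
```

Only the first data set rejects HWE in the controls, so the two-stage procedure switches to the adjusted statistic there and keeps the Pearson statistic for the other two.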

To avoid dealing with errors introduced by performing a simulation study we use exact evaluation. The type 1 error rate can be calculated as


where the $p_{ij}$, i, j= 0, 1, 2 are given by the model in (3), and I= 1 if the statistic T (= $\chi^2_P$, W, $\chi^2_{2S}$, $\chi^2_A$) for testing association is greater than the 100(1 −α)% critical value of the chi-square distribution with one degree of freedom, and I= 0 otherwise. For computational convenience we added or subtracted $1 \times 10^{-6}$ when $n_1$ and $n_2$ are zero.
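The exact evaluation amounts to enumerating every possible pair of genotype tables, weighting each by its multinomial probability under the null hypothesis, and summing the weights of the tables that reject. The sketch below illustrates the idea for the Pearson, two-stage and adjusted statistics (the Wald statistic is omitted for brevity). It is not the authors' program: the function names are hypothetical, and degenerate tables are treated as non-rejections instead of applying the small continuity offset described above, so the values need not match the published tables digit for digit.

```python
from math import comb
from statistics import NormalDist


def genotype_probs(pi, rho):
    """Probabilities of (aa, Aa, AA) under the allelic-correlation model."""
    q = pi * (1 - pi)
    probs = ((1 - pi) ** 2 + rho * q, 2 * q * (1 - rho), pi ** 2 + rho * q)
    return tuple(max(p, 0.0) for p in probs)  # guard against rounding at the bound


def group_tables(m, probs):
    """List (count of A alleles, heterozygotes, probability) for every table of m subjects."""
    p0, p1, p2 = probs
    tables = []
    for n2 in range(m + 1):
        for n1 in range(m - n2 + 1):
            n0 = m - n1 - n2
            prob = comb(m, n2) * comb(m - n2, n1) * p0 ** n0 * p1 ** n1 * p2 ** n2
            tables.append((2 * n2 + n1, n1, prob))
    return tables


def exact_type1_error(m1, m2, pi, rho_case, rho_control, alpha=0.05):
    """Exact rejection probabilities when both groups share allele frequency pi."""
    critical = NormalDist().inv_cdf(1 - alpha / 2) ** 2
    hwe_critical = NormalDist().inv_cdf(0.975) ** 2  # preliminary test at the 5% level
    n = m1 + m2
    scale = 1 / (2 * m1) + 1 / (2 * m2)
    rates = {"pearson": 0.0, "two_stage": 0.0, "adjusted": 0.0}
    control_tables = group_tables(m2, genotype_probs(pi, rho_control))
    for x1, h1, prob1 in group_tables(m1, genotype_probs(pi, rho_case)):
        for x2, h2, prob2 in control_tables:
            pooled = (x1 + x2) / (2 * n)
            q = pooled * (1 - pooled)
            if q == 0:
                continue  # only one allele observed: no statistic can reject
            pearson = (x1 / (2 * m1) - x2 / (2 * m2)) ** 2 / (q * scale)
            inflation = 2 - (h1 + h2) / (2 * n * q)  # equals 1 + pooled rho_hat
            adjusted = pearson / inflation if inflation > 0 else 0.0
            pi2 = x2 / (2 * m2)
            q2 = pi2 * (1 - pi2)
            s = m2 * (1 - h2 / (2 * m2 * q2)) ** 2 if q2 > 0 else 0.0
            two_stage = adjusted if s > hwe_critical else pearson
            weight = prob1 * prob2
            for name, value in (("pearson", pearson), ("two_stage", two_stage),
                                ("adjusted", adjusted)):
                if value > critical:
                    rates[name] += weight
    return rates


print(exact_type1_error(25, 25, pi=0.3, rho_case=0.2, rho_control=0.2))
```

Because the number of table pairs grows roughly with the fourth power of the group size, this pure-Python version is practical for groups of a few dozen subjects; larger designs would call for vectorized code.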

Table 2 provides true type 1 error rates corresponding to nominal significance levels α= 0.01 and 0.05 for parameter values π= 0.1, 0.3, correlation ρ given by max [−π/(1 −π), − (1 −π)/π], 0, 0.2 and 0.4, and sample sizes m1=m2= 25, 50, and 100.

Table 2.  Exact type 1 error rates for the 1-stage Pearson chi-square ($\chi^2_P$), 1-stage Wald test (W), two-stage test ($\chi^2_{2S}$) and 1-stage adjusted chi-square test ($\chi^2_A$), as applied to genetic case-control data with equal allelic correlation (ρ) in both case and control subjects.
m1 = m2    π    ρ    α = 1%    α = 5%

Consistent with results obtained by Schaid & Jacobsen (1999) it is clear that $\chi^2_P$ should not be used for hypothesis testing, as this procedure consistently leads to severely distorted type 1 error rates. The Wald test also cannot be regarded as satisfactory in terms of type 1 error rate. Although the two-stage procedure ($\chi^2_{2S}$) is similarly seen to distort the type 1 error rate, it does so to a much lesser extent than $\chi^2_P$. For example, when π= 0.3, ρ= 0.2, and m= 50, the type 1 error rate is 0.0661 for α= 0.05. On the other hand, the adjusted chi-square statistic $\chi^2_A$ yields type 1 error rates that are consistently very close to nominal.

Table 3 presents type 1 error rates for the same parameter combinations as in Table 2, except that the value of ρ which characterizes the control group subjects is set to zero, as has been suggested by several authors (Schaid & Jacobsen, 1999; Lee, 2003; Munafo & Flint, 2004; Freedman et al. 2004; Whittke-Thompson et al. 2005; Salanti et al. 2005). These results demonstrate that the distortion of type 1 error rates observed using $\chi^2_{2S}$ is even more severe than that seen in Table 2. For example, with π= 0.3, ρ= 0.4 and m= 100, the type 1 error rate reaches 0.0728 at the nominal level of 0.05.

Table 3.  Exact type 1 error rates of the Pearson chi-square ($\chi^2_P$), 1-stage Wald test (W), two-stage test ($\chi^2_{2S}$), and 1-stage adjusted chi-square test ($\chi^2_A$), applied to a genetic case-control study in which allelic correlation in the case group is assumed to be ρ and in the control group to be 0.
m1 = m2    π    ρ    α = 0.01    α = 0.05


As a first example we consider a study aimed at examining whether polymorphism of the TPH1 (tryptophan hydroxylase 1) gene is related to the etiology of major depression, anxiety and comorbid depression and anxiety (Sun et al. 2004). The sample of interest in this study arose from a population-based study of postpartum Taiwanese women. Among 79 control subjects, the T27224C genotypes are 6 AA, 58 AC, and 15 CC, respectively, while among 27 case subjects with comorbid depression and anxiety they are given by 2 AA, 12 AC, and 13 CC. The GOF test for departures from HWE as performed among the control group subjects yields S= 18.787 (P < 0.05), with the subsequent value of $\chi^2_A$ given by 5.70 (P= 0.017), in contrast to $\chi^2_P$ = 3.59 (P= 0.058). Thus in this example, the estimated allelic correlation of −0.49 in the controls results in the (correct) adoption of $\chi^2_A$.

As a second example, we consider a study reported by Zorzetto et al. (2002) who examined the association between the CR1 (complement receptor 1) gene Pro1827Arg (C5507G) polymorphism and sarcoidosis in a sample of Caucasian and Italian subjects. The genotype distribution for the 71 control subjects is 5 GG, 17 CG, and 49 CC, while that for the 91 sarcoidosis patients is 16 GG, 29 CG and 46 CC. Since S= 3.52 (P= 0.061), application of the two-stage procedure yields $\chi^2_P$ = 8.48 (P= 0.0036). In contrast, we note that $\chi^2_A$ = 6.61 (P= 0.0101).

As a final example we consider a study designed to test the hypothesis that GABAA subunit genes contribute to a condition known as methamphetamine use disorder (Lin et al. 2003). Among the 105 female cases the α1rs2279020 genotype distribution is given by 24 AA, 42 AG, and 39 GG, while the corresponding distribution in the control group is 55 AA, 83 AG, and 50 GG. After computing S= 2.547 (P= 0.111), we proceed as in Lin et al. (2003) to calculate $\chi^2_P$ = 3.874 (P= 0.0491). In contrast, we note that $\chi^2_A$ = 3.381 (P= 0.066) does not reach statistical significance at the 5% level.


This paper has identified a subtle but widespread potential pitfall in the analysis of genetic case-control studies. In particular, we have shown that the practice of testing HWE for the purpose of identifying genotyping error is unjustified, while the commonly used two-stage procedure has been shown to potentially lead to spurious statistical significance in the testing of marker-disease associations.

Concern that HWE may not hold has prompted the development of a variety of procedures that may be used to test this assumption. As pointed out by Wellek (2004), none of the methods can be used to establish HWE. Unfortunately, this message is still being ignored by some researchers in the analysis of case-control genetic association studies. The fundamental issue here is that a P-value larger than a pre-specified type 1 error rate cannot be used as a basis to prove HWE.

We support the view that more accurate genotyping platforms and assay performance monitoring should be implemented in order to reduce errors (Hattersley & McCarthy, 2005). Testing for HWE without taking such steps is no substitute, since a false sense of assurance may arise, suggesting erroneously that a study is of high quality. In the event that genotyping errors still remain we recommend that methods incorporating such errors be explicitly considered, such as those proposed by Gordon et al. (2001, 2002, 2004a,b), Rice & Holmans (2003) and Hao & Wang (2004). Note however that the deviations from HWE discussed here would still need to be taken into account.

In the analysis of genetic case-control studies with data arising from validated genotyping techniques, we believe it would be appropriate to assume a priori that departures from HWE exist and apply the statistic $\chi^2_A$ in a single step. It is important to note that if ρ is indeed zero, or very close to zero, this statistic will still provide valid type 1 error rates in samples of reasonable size.

An additional source of inflated type 1 error rate is introduced when a large number of tests are performed (Lancet, 2003; Khoury et al. 2004). To reduce the inevitable number of false-positive results that are reported several methods have been proposed, including reducing the required level of significance from 0.05 to 0.0005 or 0.00005 (Colhoun et al. 2003), or alternatively estimating the probability of a false positive result (Wacholder et al. 2004). However, whatever method one chooses to adopt, one must have an initially correct P-value.

As a final remark, we note that the existence of allelic correlation should not be confused with the notion of confounding. The latter should be adjusted for by a procedure such as an adjusted Cochran-Mantel-Haenszel test (Klar, 1996; Donner, 1998) or by applying generalized estimating equations (Liang & Zeger, 1986), rather than by the procedures discussed here (Deng & Chen, 2000). A related method which takes into account allelic correlation and may be applied to the performance of a meta-analysis (sometimes referred to as a HuGE review (Khoury & Little, 2000)) has been presented by Zou (2006).


This work was supported in part by the Natural Sciences and Engineering Research Council of Canada.