EMK: A Novel Program for Family-Based Allelic and Genotypic Association Tests on Quantitative Traits


  • Y. W. Li,

    1. Department of Statistics, North Carolina State University, Raleigh, NC 27695-8203, USA
    2. Center for Human Genetics, Department of Medicine, Duke University Medical Center, Durham, NC 27710, USA
  • E. R. Martin,

    1. Center for Genetic Epidemiology and Statistical Genetics, Miami Institute for Human Genomics, University of Miami Miller School of Medicine, Miami, FL 33101- 019132, USA
  • Y. J. Li

    Corresponding author
    1. Center for Human Genetics, Department of Medicine, Duke University Medical Center, Durham, NC 27710, USA

*Corresponding author: Yi-Ju Li, Ph.D., Center for Human Genetics, Department of Medicine, Duke University Medical Center, DUMC Box 3445, Durham, NC 27710, USA. Phone: (919) 684-0604, Fax: (919) 684-0921, E-mail: yiju.li@duke.edu


The QTDT program is a widely-used program for analyzing quantitative trait data, but the methods mainly test allelic association. Since the genotype of a marker is a direct observation for an individual, it is of interest to assess association at the genotypic level. In this study, we extended the allele-based association method developed by Monks and Kaplan (MK method) to genotype-based association tests for quantitative traits. We implemented a novel extended MK (EMK) program that can perform both allele- and genotype-based association tests in any pedigree structure. To evaluate the performance of EMK, we utilized simulated pedigree data and real data from our previous report of GSTO1 and GSTO2 genes in Alzheimer disease (AD). Both allele- and genotype-based EMK methods (allele-EMK and geno-EMK) showed correct type I error for various pedigree structures and admixture populations. The geno-EMK method showed comparable power to the allele-EMK test. By treating age-at-onset (AAO) as a quantitative trait, the EMK program was able to detect significant associations for rs4925 in GSTO1 (P= 0.006 for allele-EMK and P= 0.009 for geno-EMK), and rs2297235 in GSTO2 (P= 0.005 for allele-EMK and P= 0.009 for geno-EMK), which are consistent with our previous findings.


Since the development of the Transmission Disequilibrium Test (TDT) (Spielman et al. 1993), family-based association methods have played an important role in mapping genes for complex human diseases. The original TDT was applicable only to parent-offspring triads for qualitative traits. Since then, many extensions of TDT have been proposed to incorporate various pedigree structures (Martin et al. 1997; Curtis 1997; Boehnke & Langefeld 1998; Spielman & Ewens 1998; Martin et al. 2000) as well as quantitative traits (Allison 1997; Rabinowitz 1997; Schaid & Rowland 1999).

To date, QTDT and FBAT are the two primary programs that perform association tests for quantitative traits for family data. In particular, the QTDT implements five association methods (Abecasis et al. 2000; Allison 1997; Rabinowitz 1997; Monks & Kaplan 2000; Fulker et al. 1999), in which the likelihood-based orthogonal model (OM) (Abecasis et al. 2000) and the TDT-based Monks and Kaplan (MK) method (Monks & Kaplan 2000) are feasible in wider pedigree structures than the other methods. The OM method tests the additive genetic effect of the quantitative trait loci (QTL) in general pedigrees with or without parental data. It requires normally distributed quantitative trait data, which may not always reflect real data. On the other hand, the MK method applies to nuclear families with and without parental data but has fewer restrictions on the trait distribution. The MK method also has the advantage of providing information about the direction of the allelic effect; for instance, whether the allele increases or decreases the trait's value.

Most association methods for quantitative traits, including OM and MK, provide evidence of association for an allele rather than a genotype at a marker locus. The development of Genotype-PDT for qualitative traits (Martin et al. 2003) has set an example of the advantage of assessing the genotypic association. Testing for genotype-based association has more power than testing for allele-based association in some genetic models. For instance, Genotype-PDT was found to have higher power than allele-based PDT under recessive and dominant models (Martin et al. 2003). The results of genotype association tests may help us understand the underlying genetic model of the susceptibility gene or genetic modifier by evaluating genotype-specific effects.

Methods and programs for testing genotypic association between markers and quantitative traits and for family-based association analysis of quantitative traits are lacking. Nuclear family data are still the focus of the application for most methods. When extended pedigrees are available, not all data within the extended pedigree are used, which may result in some loss of statistical power. We built on the advantages of the MK method by extending it to general pedigree data and to test for genotypic association. Simulation studies were conducted to evaluate the validity of our proposed extensions of the MK method (EMK methods). We also developed a computer program that implements both allele-based and genotype-based EMK methods for real data analysis.


The MK method utilizes the following two types of informative nuclear families: families with known parental genotypes and at least one heterozygous parent; and families without parental genotypes but with at least two siblings with different genotypes. Our version of allele-based EMK (allele-EMK) and genotype-based EMK (geno-EMK) methods are applicable to a general pedigree that contains the same type of informative nuclear families. The test statistics described below are based on the extended general pedigree. They can be simplified for nuclear families if needed.

Allele-Based Association Test (Allele-EMK)

Considering a marker locus with two alleles $A_1$ and $A_2$, we assume $t_i$ siblings in the ith pedigree, where $i = 1, \ldots, F$. We used the notation of $Y_{ij}$ for the quantitative phenotype of the jth individual in the ith pedigree and $\bar{Y}_i$ for the mean trait value over all non-founders in the ith pedigree.

Several genotypic scores were defined in Monks and Kaplan (Monks & Kaplan 2000) including the following: X*iM (X*iF) = 1 if mother (father) is heterozygous at the marker; X*iM (X*iF) = 0 if mother (father) is homozygous; XijM (XijF) = 1 if allele A1 was transmitted to the jth offspring by the mother (father); and XijM (XijF) = 0 if otherwise. For a nuclear family with parental genotype information, a Ui statistic is defined for the ith family:


When parental genotype information is not available, an alternative Vi statistic is used:


where $\bar{X}_i$ is an average of genotype scores among siblings.

The sum of $X_{ijM}$ and $X_{ijF}$ is the number of $A_1$ alleles that the jth offspring carries. It should be noted that both Equations (1) and (2) are slightly different from the original definition of $U_i$ and $V_i$ in the MK method since we replaced $\bar{Y}$, the mean trait value over all non-founders in all pedigrees, with $\bar{Y}_i$, the family-specific mean trait value.

A general association test statistic TQPS was based on the sum of Ui and Vi from families with and without parental genotype data (Monks & Kaplan 2000). However, TQPS is based on the assumption that the two types of nuclear families are independent. If a general pedigree contains several nuclear families with both types of family structure, the assumption of independence between nuclear families is not valid. Thus, for pedigree i, we define a family-specific score that sums the Ui and Vi statistics over its nuclear families, where FiP and FiS are the numbers of nuclear families with and without parental-genotype information in pedigree i. The new allele-EMK test statistic is, therefore, written as


where TQPS follows an asymptotic standard normal distribution under the null hypothesis of no linkage disequilibrium.
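The construction above can be illustrated with a minimal Python sketch. The displayed equations are not reproduced in this copy, so the forms used here are assumptions: `u_score` and `v_score` follow the usual MK-style covariance scores with the pedigree-specific mean substituted for the overall mean, and `allele_emk` standardizes the sum of pedigree-level scores by the square root of the sum of their squares. All function names, and any normalization constants, are hypothetical and may differ from the published definitions.

```python
from math import sqrt

def u_score(y, ybar, mother_het, father_het, xm, xf):
    """Assumed score for one nuclear family with parental genotypes.

    y: offspring trait values; ybar: pedigree-specific mean trait value;
    mother_het/father_het: 1 if the parent is heterozygous, else 0;
    xm/xf: 1 if allele A1 was transmitted by the mother/father, else 0.
    """
    total = 0.0
    for yj, m, f in zip(y, xm, xf):
        total += (yj - ybar) * (mother_het * (m - 0.5) + father_het * (f - 0.5))
    return total

def v_score(y, ybar, n_a1):
    """Assumed score for a sibship without parental genotypes.

    n_a1: number of A1 alleles carried by each sibling.
    """
    xbar = sum(n_a1) / len(n_a1)  # average genotype score among siblings
    return sum((yj - ybar) * (x - xbar) for yj, x in zip(y, n_a1))

def allele_emk(pedigree_scores):
    """Standardized sum of pedigree-level scores (one entry per pedigree).

    Each entry is assumed to be the sum of the U and V scores of all
    informative nuclear families within that pedigree.
    """
    numerator = sum(pedigree_scores)
    denominator = sqrt(sum(w * w for w in pedigree_scores))
    return numerator / denominator
```

Summing scores within a pedigree before standardizing is what lets related nuclear families contribute without being treated as independent units; only the pedigrees themselves are assumed independent.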

Genotype-Based EMK (Geno-EMK) Method

In order to construct the genotype-based MK method, we define a random variable Xij to code the observed genotype. Assume that we test genotype A1A1. Let Xij= 1 if the observed genotype is A1A1, and Xij= 0 if otherwise.

For the case where parental genotypes are available, an estimate of the covariance between the marker genotype and the quantitative trait can be written as random variable Ui for the ith nuclear family:


where G*ij is the pool of all possible genotypes (X*ij) for an offspring based on parental genotype information and Si is the number of elements in G*ij.

For the case of no parental genotype information, the covariance for family i with ti siblings is defined as the following:


where inline image is the mean of the Xij from ti siblings.

Similarly, for the ith general pedigree with FiP and FiS nuclear families with and without parental genotypes, we define inline image. Under the null hypothesis, we were able to derive E(Ui) = 0 for all nuclear families with parental genotypes and E(Vi) = 0 for all nuclear families without parental genotypes (see Appendix A and B in supplementary materials). Therefore, for the F independent general pedigrees, we have inline image and inline image Hence, a test statistic can be written as:


Under the null hypothesis, this test statistic follows an asymptotic standard normal distribution. It can be easily modified for genotypes A1A2 and A2A2.
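For the case with parental genotypes, the genotype score can be sketched as follows. This is an illustration under stated assumptions rather than the published formula: the pool of possible offspring genotypes is taken to be the four equally likely combinations of one maternal and one paternal allele, the expected indicator is the mean over that pool, and the trait is centered on the pedigree-specific mean. The function names and the tuple encoding of genotypes are hypothetical.

```python
from itertools import product

def expected_indicator(mother, father, target):
    """Mean of the target-genotype indicator over the four equally likely
    offspring genotypes implied by the parental genotypes.

    Genotypes are sorted tuples of allele labels, e.g. (1, 2) for A1A2.
    """
    pool = [tuple(sorted((m, f))) for m, f in product(mother, father)]
    return sum(g == target for g in pool) / len(pool)

def geno_u_score(y, ybar, offspring, mother, father, target=(1, 1)):
    """Assumed covariance-type score for one nuclear family with parents.

    y: offspring trait values; ybar: pedigree-specific mean trait value;
    offspring: observed offspring genotypes as sorted tuples.
    """
    e = expected_indicator(mother, father, target)
    return sum((yj - ybar) * ((g == target) - e) for yj, g in zip(y, offspring))
```

For two heterozygous parents, `expected_indicator((1, 2), (1, 2), (1, 1))` evaluates to 0.25, the Mendelian probability of an A1A1 offspring.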

To obtain an overall assessment of significance at a marker, a global test can be computed as below:


where g is the number of genotypes at a marker. Under the null hypothesis, Tglobal follows an asymptotic χ2 distribution with g−1 degrees of freedom (Martin et al. 2003).
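A global statistic of this kind can be sketched in a few lines of Python. The displayed formula is not reproduced in this copy, so the sketch assumes the form commonly used for PDT-type global tests, namely (g−1)/g times the sum of the squared genotype-specific statistics; this is an assumption, not a confirmed transcription. For a bi-allelic marker (g = 3, two degrees of freedom) the χ2 upper-tail probability has the closed form exp(−x/2), which avoids any external library. The function names are hypothetical.

```python
from math import exp

def geno_global(t_stats):
    """Assumed global statistic: (g - 1) / g times the sum of squared
    genotype-specific statistics, where g = len(t_stats)."""
    g = len(t_stats)
    return (g - 1) / g * sum(t * t for t in t_stats)

def chi2_sf_df2(x):
    """Upper-tail probability of a chi-square variable with exactly
    2 degrees of freedom (closed form; valid only for df = 2)."""
    return exp(-x / 2.0)

# Example with three hypothetical genotype-specific statistics.
t_global = geno_global([1.0, 2.0, -1.0])
p_value = chi2_sf_df2(t_global)
```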

Candidate Gene Analysis for Alzheimer Disease

Alzheimer disease (AD) is a leading cause of dementia in the elderly and is known to have a complex etiology with strong genetic and environmental components. Many susceptibility genes have been reported to date, but only four AD genes, amyloid precursor protein (APP) (Goate et al. 1991), presenilin 1 and 2 (PS1, PS2) (Sherrington et al. 1995; Levy-Lahad et al. 1995; Rogaev et al. 1995), and apolipoprotein E (APOE) (Corder et al. 1993), have been confirmed. In addition to these susceptibility genes, we have been interested in mapping genetic modifiers for age-at-onset (AAO) of AD using quantitative trait approaches. We previously reported glutathione S-transferase omega-1 (GSTO1) and GSTO2 as potential AAO genes for AD (Li et al. 2003; Li et al. 2006) based on results from the OM and MK methods implemented in the QTDT program. Here, we tested seven single nucleotide polymorphisms (SNPs) from these two studies to compare our proposed EMK program and the QTDT MK methods.


A series of computer simulations were used to examine the type I error and power of the allele-EMK and geno-EMK methods under different genetic models and sample sizes. We examined whether the geno-EMK is a valid test for nuclear families and extended general pedigrees. Further, we assessed the validity of the allele-EMK method using simulated two-generation extended pedigree data.

We assume a bi-allelic marker A ($A_1$ and $A_2$) with population frequencies $p_1$ and $p_2$, and a bi-allelic QTL Q ($Q_1$ and $Q_2$) with population frequencies $q_1$ and $q_2$. The linkage disequilibrium is $D = P(A_1Q_1) - p_1q_1$, where $P(A_1Q_1)$ is the population haplotype frequency for $A_1Q_1$. Traits resulting from the three QTL genotypes, $Q_1Q_1$, $Q_1Q_2$, and $Q_2Q_2$, are assumed to follow normal distributions. The mean of each genotype-specific trait is defined as $\mu_{11} = a$, $\mu_{12} = d$, and $\mu_{22} = -a$. A common variance $\sigma_G^2$ is assumed for all QTL genotypes, where $\sigma_G^2 = 2q_1q_2[a + d(q_2 - q_1)]^2 + (2q_1q_2d)^2$. The residual trait variance is assumed to be $\sigma_E^2$. These parameters lead to the broad-sense heritability of $H^2 = \sigma_G^2/(\sigma_G^2 + \sigma_E^2)$ (Falconer 1996).

Parameters used in the simulations are listed in Table 1. Given the allele frequencies of the marker A and QTL Q, and the linkage disequilibrium D, the four population haplotype frequencies for marker A and QTL Q can be calculated by $P(A_1Q_1) = p_1q_1 + D$, $P(A_1Q_2) = p_1q_2 - D$, $P(A_2Q_1) = p_2q_1 - D$, and $P(A_2Q_2) = p_2q_2 + D$. The haplotypes of each parent were simulated based on the given haplotype frequencies. We then randomly drew one haplotype from each parent to form two haplotypes for each offspring.
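The haplotype step can be illustrated with a short Python sketch. It is not the simulation code used in the study; the function names and the string labels for haplotypes are hypothetical, and each parental haplotype is passed on intact (no recombination between marker and QTL), which matches the description of drawing one haplotype from each parent.

```python
import random

def haplotype_freqs(p1, q1, D):
    """Four marker-QTL haplotype frequencies from the allele frequencies
    P(A1) = p1, P(Q1) = q1 and the disequilibrium coefficient D."""
    p2, q2 = 1.0 - p1, 1.0 - q1
    return {
        "A1Q1": p1 * q1 + D,
        "A1Q2": p1 * q2 - D,
        "A2Q1": p2 * q1 - D,
        "A2Q2": p2 * q2 + D,
    }

def draw_parent(freqs, rng):
    """Draw two haplotypes for a parent from the population frequencies."""
    names = list(freqs)
    weights = [freqs[n] for n in names]
    return tuple(rng.choices(names, weights=weights, k=2))

def draw_offspring(mother, father, rng):
    """Transmit one randomly chosen haplotype from each parent."""
    return (rng.choice(mother), rng.choice(father))

rng = random.Random(2008)  # arbitrary seed, for reproducibility only
freqs = haplotype_freqs(p1=0.2, q1=0.2, D=0.12)
mother, father = draw_parent(freqs, rng), draw_parent(freqs, rng)
child = draw_offspring(mother, father, rng)
```

With both allele frequencies at 0.2, D cannot exceed 0.2 × 0.8 = 0.16 without making a haplotype frequency negative, so D = 0.16 corresponds to complete disequilibrium.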

Table 1.  Parameters used in the simulation study

Parameter                              Values
Marker allele frequency P(A1)          0.2, 0.5
QTL allele frequency P(Q1)             0.2, 0.5
Scale of dominance (k = d/a)*          −2, −1.5, −1, 0, 1, 1.5, 2
Sibship size (number of siblings)      2, 5
Number of families simulated           200, 500
Heritability (H2)                      0.1
Number of iterations                   10,000

*a: additive effect; d: dominance effect.

QTLs were simulated with additive, dominant, recessive, and overdominant models. For simplicity, we assumed the dominance effect (d) is the product of the additive effect (a) and a scale of dominance (k) (d=k×a). Therefore, the genetic models of recessive, additive, and dominant are reflected by k=−1, 0, and 1, respectively. An overdominant model is the one with k> 1 or k< −1. We assumed the quantitative traits Y follow normal distributions with corresponding mean and variance, where inline image, and inline image.
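As a worked check of this parameterization, the sketch below computes the population trait mean, the genetic variance, and the residual variance implied by a, k, the QTL allele frequency, and the heritability. The population mean uses the standard single-locus expression a(q1 − q2) + 2·q1·q2·d, which is an assumption here but reproduces the values of μ quoted in the Results (−1.84 for k = −1 and −2.48 for k = −2, with a = 2 and q1 = 0.2). Solving the heritability definition for the residual variance is likewise one reading of the setup; the function name is hypothetical.

```python
def trait_parameters(a, k, q1, h2):
    """Return (population mean, genetic variance, residual variance).

    a: additive effect; k: scale of dominance (d = k * a);
    q1: frequency of QTL allele Q1; h2: broad-sense heritability.
    """
    q2 = 1.0 - q1
    d = k * a
    # Genotype means are a (Q1Q1), d (Q1Q2), -a (Q2Q2).
    mu = a * (q1 - q2) + 2.0 * q1 * q2 * d
    var_g = 2.0 * q1 * q2 * (a + d * (q2 - q1)) ** 2 + (2.0 * q1 * q2 * d) ** 2
    # From h2 = var_g / (var_g + var_e), solved for var_e.
    var_e = var_g * (1.0 - h2) / h2
    return mu, var_g, var_e

mu_rec, var_g_rec, var_e_rec = trait_parameters(a=2.0, k=-1.0, q1=0.2, h2=0.1)
mu_over, _, _ = trait_parameters(a=2.0, k=-2.0, q1=0.2, h2=0.1)
```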

Our simulation studies evaluated various pedigree structures including nuclear families with or without parental information, and two-generation general pedigrees (Figure 1). We simulated datasets with either 200 or 500 nuclear families, or 200 general pedigrees for each replicate. A total of 10,000 replicates were generated each time. The type I error of the allele-EMK and geno-EMK were estimated under the cases of no association between marker and QTL (linkage disequilibrium coefficient D= 0) and statistical power was estimated for D≠ 0. We used 0.05 as the significance level for all estimates.

Figure 1.

General pedigrees used in simulations.

To demonstrate that the EMK tests are not affected by population admixture, we also investigated the type I error rates of the allele-EMK and geno-EMK tests using simulated admixture population data. We simulated 500 two-sib nuclear families that are a mixture of two equal size subpopulations with different allele frequencies at the marker and QTL. The marker and QTL allele frequencies were 0.3 for the first subpopulation and 0.1 for the second subpopulation.


Type I Error Rates

Table 2 presents the type I error rates for each genotype, global geno-EMK, allele-EMK, and QTDT MK tests in 200 nuclear families with two and five sibs simulated under different genetic models. Except for the geno-EMK11 genotype tests (P= 0.040 in the recessive model, P= 0.038 in the additive model, and P= 0.039 in the dominant model), the type I error estimates are very close to the nominal significance level of 0.05. The exception is probably the result of the low frequency of the 11 genotype (P(A1A1) = 0.04) and the small number of observations for these data. Compared to the allele-EMK and geno-EMK tests, the QTDT MK test is consistently conservative, especially for the two-sib nuclear families without parental genotypes case (P≈ 0.041 ∼ 0.044).
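As a rough yardstick for reading these estimates, the Monte Carlo sampling error of an estimated rejection rate can be computed from the binomial variance. The sketch below is an illustration, not part of the original analysis; it assumes independent replicates and a normal approximation, and the function name is hypothetical.

```python
from math import sqrt

def mc_interval(alpha=0.05, n_rep=10_000, z=1.96):
    """Approximate 95% range for an estimated rejection rate when the
    true rate is alpha and n_rep independent replicates are used."""
    se = sqrt(alpha * (1.0 - alpha) / n_rep)
    return alpha - z * se, alpha + z * se

low, high = mc_interval()
```

With 10,000 replicates the standard error is about 0.0022, so estimates between roughly 0.046 and 0.054 are compatible with a true rate of 0.05 under these assumptions.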

Table 2.  Type I error rates for data simulated from 200 families under different genetic models and family structures

Model   Method*       With parents          Without parents
                      2 sibs    5 sibs      2 sibs    5 sibs
        Allele-EMK    0.050     0.048       0.048     0.050
        QTDT MK       0.049     0.044       0.044     0.046
        Allele-EMK    0.050     0.049       0.046     0.050
        QTDT MK       0.047     0.046       0.042     0.047
        Allele-EMK    0.051     0.048       0.047     0.051
        QTDT MK       0.046     0.047       0.042     0.048

*G11 = geno-EMK 11 test; G12 = geno-EMK 12 test; G22 = geno-EMK 22 test; Global = geno-EMK global test.

Overall, our error rate estimates in the allele-EMK test are closer to the nominal significance level than those in the geno-EMK global test. Families with two sibs generally show slightly lower type I error rates than those with five sibs and general pedigree cases. This was expected, because the overall sample size is larger in the five sibs and general pedigree cases than in the two-sib case.

In the case of an admixture population, simulations showed type I error rates close to the nominal significance level in the geno-EMK global test (P≈ 0.044 ∼ 0.052) and allele-EMK test (P≈ 0.048 ∼ 0.053). These results demonstrate that the EMK is valid for testing association regardless of whether population substructure exists.

Power Estimates

The statistical power for both geno-EMK and allele-EMK methods was evaluated for all combinations of genetic models, parameters, and pedigree structures. Unlike geno-PDT for testing disease risk (Martin et al. 2003), we found that geno-EMK has power patterns similar to those of allele-EMK for quantitative traits simulated under dominant, additive, and recessive models. Here, we present power curves across different degrees of linkage disequilibrium (D) (Figures 2 and 3). Interestingly, geno-EMK has higher power than allele-EMK under the overdominant model with k=−2 (Figure 4). Overall, power is very sensitive to the degree of D between marker and QTL, and the availability of parental data. In all cases, maximum power is obtained when the marker is in perfect disequilibrium (e.g. D= 0.16) with the trait locus and when parental genotypes are available. This is expected because the parental controls provide more accurate estimates of the expected genotype score than sibling controls. Moreover, the tests in general pedigrees show the highest statistical power due to the large sample size per pedigree. For all scenarios, higher power was observed for data simulated under additive and dominant genetic models than for data simulated under the recessive model.

Figure 2.

Power comparison among different genotypes under 500 two-sib nuclear families in additive model. Both marker and QTL allele frequencies were 0.2. The heritability was 0.1 and additive genetic effect a= 2.

Figure 3.

Power comparison among QTDT MK (MK), allele-EMK (A-EMK), and geno-EMK (G-EMK) global test under 500 two-sib nuclear families and 200 general pedigrees in additive model. Both marker and QTL allele frequencies were 0.2. The heritability was 0.1 and additive genetic effect a= 2.

Figure 4.

Power comparison among QTDT MK (MK), allele-EMK (A-EMK), and geno-EMK (G-EMK) global test under 500 two-sib nuclear families in overdominant model (k=−2). Both marker and QTL allele frequencies were 0.2. The heritability was 0.1 and additive genetic effect a= 2.

It should be noted that the allele-EMK test has slightly higher power than the original MK under nuclear families without parental genotype (Figure 3 (b)). Therefore, our allele-EMK method may serve as a good alternative for studies with missing parental data, in particular, for the genetic studies of late-onset diseases.

Figure 2 shows the results of geno-EMK power comparisons for all three genotypes and for the global test for data simulated under the additive model. The global test shows a similar power pattern to the main associated genotype, which here is 22. The same pattern was observed in the recessive and dominant models. Therefore, the global test can serve as an initial overall assessment to support the evidence of individual genotype association.

In general, the allele-EMK test has greater power than the geno-EMK global test (Figure 3), with a few exceptions. First, the genotype-specific test may have greater power than the allele-EMK and QTDT MK tests in some cases such as the recessive model (k=−1). For example, assume the population mean μ=−1.84 and trait means of Q1Q1, Q1Q2, and Q2Q2 of 2, −2, and −2, respectively (Figure 5). Under the assumption of strong LD (D= 0.12), for which marker and QTL genotypes are mostly identical, the geno-EMK 11 test showed more power (92.43%) than the 12 (12.66%) and 22 (16.79%) tests under simulations of 500 two-sib nuclear families with parents. This may be because the trait mean of Q1Q1 has a much greater difference from μ than the trait means of Q1Q2 and Q2Q2. Because the allele-based test counts data across two genotypes (11 and 12 for allele 1), the allele-EMK test and the QTDT MK test (power = 64.51% and 74.3%, respectively) are less powerful.

Figure 5.

Relationship of genotypic trait means for QTL genotype Q1Q1,Q1Q2, and Q2Q2, and population mean μ. The additive genetic effect a= 2.

Second, the geno-EMK test has much higher power than the allele-EMK test in data with and without parental genotypes under the overdominant model (Figure 4). For the overdominant model of k=−2 illustrated in Figure 5 (the population trait mean μ=−2.48), the expected trait mean for Q2Q2 is closer to μ than those for Q1Q1 and Q1Q2. Therefore we found that geno-EMK has higher power in 11 (power = 72.16%) and 12 (power = 92.91%) than in 22 (power = 61.23%) for D= 0.12 in 500 two-sib nuclear families with parents. However, the allele-based tests have much lower power in this case (allele-EMK power = 11.24% and QTDT MK power = 12.77%). This may be because genotypes 11 and 12 have opposite quantitative effects (μ11= 2 and μ12=−4), which diminishes the allelic association signal.

Analysis in Age-At-Onset Data

Table 3 shows the results of the allele-EMK and geno-EMK tests for the seven SNPs in the GSTO1 and GSTO2 genes for age-at-onset data in families with Alzheimer disease (AD). There were 711 families in the AD dataset. Li et al. (2003) reported significant findings for rs4925 in GSTO1 (P= 0.023) and rs2297235 in GSTO2 (P= 0.024) based on the QTDT MK test. Our allele-EMK and geno-EMK tests supported the findings of the QTDT MK test with smaller p-values. The allele-EMK test p-values were 0.006 (rs4925) and 0.005 (rs2297235), and the global geno-EMK test p-values were both 0.009 (rs4925 and rs2297235). More interestingly, genotype 22 at rs4925 (P= 0.006) in GSTO1 and genotype 11 at rs2297235 (P= 0.007) in GSTO2 are, among the seven SNPs we tested, the genotypes most significantly associated with early age-at-onset of Alzheimer disease. This example shows that our EMK program can handle real data analysis and yield more informative insights into significant results than the existing methods.

Table 3.  QTDT MK (MK), allele-EMK (A-EMK), and geno-EMK (Global, G11, G12, and G22) test results for GSTO1 and GSTO2 in AD families

Gene    dbSNP no.*           MK      A-EMK   Global Geno-EMK   G11     G12     G22
GSTO1   rs11191972 (SNP5)    0.281   0.091   0.239             0.090   0.882   0.251
        rs2164624 (SNP6)     0.117   0.107   0.062             0.045   0.039   0.770
        rs4925 (SNP7)        0.023   0.006   0.009             0.400   0.015   0.006
        rs1147611 (SNP8)     0.132   0.087   0.147             0.066   0.138   0.670
GSTO2   rs2297235 (SNP9)     0.024   0.005   0.009             0.007   0.022   0.240
        rs157077 (SNP10)     0.380   0.059   0.150             0.129   0.690   0.073
        rs156697 (SNP11)     0.113   0.039   0.097             0.050   0.119   0.398

*SNPs 5–11 listed in Li et al. 2003.


Family-based association methods have played a central role in candidate gene studies for complex human diseases. Computer programs for performing association tests in real data will become more and more important as we move toward whole genome association studies. In this study, we extended the allele-based MK method to multi-generation families and developed a genotype association method for quantitative traits based on the framework of the MK method. We evaluated the validity and power of these two new methods by simulating various family datasets and admixture populations. Our simulation studies showed that both geno-EMK and allele-EMK tests have the correct type I error rate for all pedigrees.

The allele-EMK test has slightly higher power than the original MK method under nuclear families without parental genotypes. This may be due to replacing the overall trait mean with the family-specific trait mean in Equations (1) and (2). Since parental genotypes are mostly missing in late-onset diseases such as Alzheimer disease, the allele-EMK method could serve as a good alternative to the original MK test.

The geno-EMK global test maintains the nominal significance level even in the case when the type I error of a particular genotype test is too conservative. Moreover, the global test shows similar power to the test of the main associated genotype. Therefore, we recommend using the geno-EMK global test as an initial overall assessment to support the evidence of individual genotype association. For instance, geno-EMK genotype tests have significant findings for rs2164624 in GSTO1 (P= 0.045 for 11 test and P= 0.039 for 12 test), but the global test did not provide a significant result (P= 0.062). Therefore, this SNP may not be as important as the other two SNPs (rs4925 and rs2297235). Furthermore, neither the allele-EMK nor the QTDT MK test revealed significant results for rs2164624.

The statistical power is comparable between the global geno-EMK and allele-EMK tests. However, the geno-EMK test for an individual genotype may have higher power than allele-EMK or the original MK method in some cases such as the recessive model. We also found that the geno-EMK test has much higher power than allele-based tests under the overdominant model. Overall, the geno-EMK has the advantage of offering genotype specific association results and can be more powerful for some genetic models.

Using EMK, significant SNPs rs4925 and rs2297235 and their early AAO genotypes were found, which reproduces Li et al. (2003) with smaller p-values. It also echoes our previous reports that rs4925 allele 1 (A nucleotide) carriers have a maximum 6.8 year delay of AAO compared to individuals with the 22 genotype (CC) of rs4925 (Li et al. 2006). Furthermore, the allele-EMK test found rs156697, which is in LD with rs2297235 (r2= 0.78; Li et al. (2003)), to be moderately significant (P= 0.039), while the MK test did not (P= 0.113).

In conclusion, we have shown that our EMK program is a robust tool to analyze quantitative traits in family data. While the QTDT program has the flexibility of choosing different testing methods, we consider that our EMK program will be a better alternative for the MK method. The EMK software for conducting the geno-EMK and allele-EMK tests is written in C++ and available for UNIX and Windows platforms. It can be publicly accessed at the Duke Center for Human Genetics Web site (http://wwwchg.duhs.duke.edu/research/software.html).


We would like to thank Dr. Andrew Dellinger for his helpful suggestions to improve this paper. This work was supported by a research grant from the American Federation for Aging Research (AFAR), a new investigator grant (NIRG-02-3603) and an investigator initiative research grant (IIRG-05-14708) from the Alzheimer's Association. The Alzheimer data were supported by grants NS311530, AG021547, AG19757, and AG05128 from NIH.