Efficient Approximation of P-value of the Maximum of Correlated Tests, with Applications to Genome-Wide Association Studies


  • Qizhai Li,

    1. Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, MD 20892, USA
    2. Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100080, P. R. China
  • Gang Zheng,

    1. Office of Biostatistics Research, Office of the Director, National Heart, Lung and Blood Institute, Bethesda, MD 20892, USA
  • Zhaohai Li,

    1. Department of Statistics, George Washington University, Washington, D.C., 20052, USA
    2. Biometry and Mathematical Statistics Branch, National Institute of Child Health and Human Development, Bethesda, MD 20892, USA
  • Kai Yu

    Corresponding author
    1. Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, MD 20892, USA
      *Corresponding author: Kai Yu, Ph.D., 6120 Executive Boulevard, Room 8050, Rockville, MD 20852. E-mail: yuka@mail.nih.gov


Genome-wide association study (GWAS), typically involving 100,000 to 500,000 single-nucleotide polymorphisms (SNPs), is a powerful approach to identify disease susceptibility loci. In a GWAS, single-marker analysis, which tests one SNP at a time, is usually used as the first stage to screen SNPs across the genome in order to identify a small fraction of promising SNPs with relatively low p-values for further and more focused studies. For single-marker analysis, the trend test derived for an additive genetic model is often used. This may not be robust when the additive assumption is not appropriate for the true underlying disease model. A robust test, MAX, based on the maximum of three trend test statistics derived for recessive, additive, and dominant models, has been proposed recently for GWAS. But its p-value has to be evaluated through a resampling-based procedure, which is computationally challenging for the analysis of GWAS. Obtaining the p-value for MAX with adjustment for the covariates can be even more time-consuming. In this article, we provide a simple approximation for the p-value of the MAX test with or without adjusting for the covariates. The new method avoids resampling steps and thus makes the MAX test readily applicable to GWAS. We use simulation studies as well as real datasets on 17 confirmed disease-associated SNPs to assess the accuracy of the proposed method. We also apply the method to the GWAS of coronary artery disease.


Genome-wide association study (GWAS), typically using hundreds of thousands of single nucleotide polymorphisms (SNPs) across the genome, has become a powerful tool for identifying genes or genetic markers underlying disease susceptibility (Klein et al. 2005; Hunter et al. 2007; Sladek et al. 2007; Yeager et al. 2007; The Wellcome Trust Case Control Consortium (WTCCC) 2007). In a typical current GWAS, a panel of 100K–500K SNPs is often genotyped on thousands of individuals. Single-marker analysis, testing the association between the outcome and an individual SNP, is usually used for selecting a subset of SNPs for further investigation (Hoh & Ott 2003; Marchini et al. 2005; Schaid et al. 2005; Wang et al. 2007; Skol et al. 2006; Yu et al. 2007). For example, in a two-stage GWAS (Skol et al. 2006), SNPs whose p-values (obtained from the single-marker analysis in the first stage) are less than a given threshold are evaluated further in an independent sample in the second stage.

A typical test statistic used in single-marker analysis for case-control studies is the Cochran-Armitage trend test (CATT), derived under the assumption of an additive mode of inheritance (Sasieni, 1997; Slager & Schaid, 2001; Zheng et al. 2006a). Since the CATT has an asymptotic normal distribution under the null hypothesis, ranking SNPs based on their test statistics is equivalent to ranking them on their p-values. The CATT for the additive model, however, is not very robust under other modes of inheritance, e.g., recessive or dominant models. A search for disease-related SNPs with their risk effects governed by a particular disease model might miss SNPs following other risk patterns. Furthermore, for complex diseases with low penetrance, usually none of the above simplified models is appropriate. Under these circumstances, efficiency robust tests, which retain high power across all scientifically plausible genetic models, are preferable (Sladek et al. 2007; Zheng et al. 2003; Zheng et al. 2006a). The theory of efficiency robust tests was summarized in Gastwirth (1985) and Freidlin et al. (1999). One commonly used robust test is based on the MAX statistic, the maximum of three CATTs derived under the recessive, additive, and dominant models, respectively. Empirical results show the advantages of using the MAX statistic over the CATT, derived for the additive model, to prioritize SNPs or to detect disease-associated SNPs (Zheng et al. 2006a).

Under the null hypothesis of no association, the MAX statistic does not follow the standard normal distribution asymptotically. Thus a computationally intensive resampling-based procedure is required to estimate its p-value. For example, in a GWAS of type 2 diabetes, Sladek et al. (2007) conducted 10,000 permutations per SNP to estimate p-values of MAX tests. They identified 59 SNPs, based on a p-value threshold around the level of 10⁻⁴, for further replication in an independent sample. They then used 10,000,000 permutation steps to estimate the p-values associated with the MAX test on each of the 59 chosen SNPs, based on the replication sample. The reason for this extremely large number of permutation steps was to ensure a reliable estimation for any p-value falling below the level of 10⁻⁶. For situations where the p-value of MAX is not available and a fixed number of SNPs needs to be selected for the next-stage study, Zheng et al. (2007) proposed using the MAX statistic rather than its p-value as the basis for the ranking. This approach is easy to carry out without any Monte Carlo simulation. However, the asymptotic null distribution of MAX depends on the genotype distribution of the SNP under study and therefore differs from SNP to SNP. As a result, ranks based on the MAX statistics are not on a common scale across SNPs. It would be more appropriate to rank SNPs based on their p-values.

In many GWAS, in order to account for the other covariates' effects, the logistic regression model is commonly used for the evaluation of individual markers' marginal effect. A similar MAX statistic can be defined based on three Wald (score, or likelihood ratio) test statistics, derived under the dominant, recessive, and additive genotypic effect models, respectively. Clearly, a more computationally intensive resampling procedure is required to estimate the p-value for this type of MAX statistic.

Although using the MAX statistic has various advantages over the CATT derived under an additive model, it is computationally challenging to apply it to a large-scale GWAS. In this article, we propose a simple approach to approximate the p-value of the MAX statistic without Monte Carlo simulation. The approximation formula, called the rhombus formula, is designed to estimate the two-sided test p-value for the MAX statistic. The rhombus formula is an extension of the W-formula of Efron (1997), which was originally derived to approximate the one-sided test p-value of the MAX statistic and has been applied to family-based association tests (Yan et al. 2008). To apply the rhombus formula, we need to estimate the covariance matrix of the three CATT (or Wald) tests corresponding to the additive, recessive, and dominant models. Zheng et al. (2006a) provided an analytic formula to estimate the covariance matrix for CATT-based tests. For Wald tests with adjustment for other covariate effects, we propose to use the approach of Pepe et al. (1999), which is based on the generalized estimating equation (GEE) method (Liang & Zeger, 1986), to estimate their covariance matrix numerically. We conducted extensive simulation studies to evaluate the accuracy of the proposed rhombus formula in the setting of GWAS. To illustrate our methods, we applied them to 17 confirmed disease-associated SNPs from three GWAS and to a real dataset from a GWAS of coronary artery disease (CAD) with about 350K SNPs (WTCCC, 2007).


A MAX Test Statistic Based on Trend Tests

Suppose that in a GWAS with r cases and s controls, we have genotypes measured on a large panel of SNPs for each individual. Let AA, Aa, and aa be three possible genotypes for a given SNP. The CATT derived for the additive model is often applied to detect the disease-associated SNPs or to prioritize SNPs for further analysis, if there are no other covariates to be adjusted for. For a given SNP, we denote its genotype frequencies in the case and control groups as shown in Table 1.

Table 1.  Notation for Genotype Frequencies

            AA    Aa    aa    Total
Cases       r0    r1    r2    r
Controls    s0    s1    s2    s
Total       n0    n1    n2    n

Under the notations listed in Table 1, a general form of the CATT can be written as


where ϕ = (ϕ0, ϕ1, ϕ2) is a pre-determined vector of genotype scores. Since the trend test is invariant under a linear transformation of ϕ with ϕ0 ≤ ϕ1 ≤ ϕ2, we can always set ϕ0 = 0 and ϕ2 = 1 and let ϕ1 vary between 0 and 1. Under the null hypothesis of no association, H0, Zϕ has an asymptotic N(0,1) distribution. Usually we do not know which allele confers the higher risk, so a two-sided test is recommended. Results from Sasieni (1997) and Zheng et al. (2006a) showed that the optimal choices of ϕ1 for the recessive, additive, and dominant models are ϕ1 = 0, 1/2, and 1, respectively. Denote the corresponding CATTs by ZREC, ZADD, and ZDOM.
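The displayed formula for Zϕ is not reproduced in this copy. As a hedged reconstruction, a standard way of writing the CATT in the notation used here is sketched below. It assumes that r_i and s_i are the case and control counts of the genotype with score ϕ_i, and that n_i = r_i + s_i and n = r + s denote the pooled counts; the symbols n_i and n are assumptions.

```latex
Z_\phi \;=\;
\frac{n^{1/2}\,\sum_{i=0}^{2}\phi_i\,(s\,r_i - r\,s_i)}
     {\Bigl\{\, r\,s\,\Bigl[\, n\sum_{i=0}^{2}\phi_i^{2}\,n_i
        \;-\;\Bigl(\sum_{i=0}^{2}\phi_i\,n_i\Bigr)^{2}\Bigr]\Bigr\}^{1/2}},
\qquad n_i = r_i + s_i,\quad n = r + s .
```

Under this form, ϕ = (0, 0, 1), (0, 1/2, 1), and (0, 1, 1) give ZREC, ZADD, and ZDOM, respectively.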

Based on the three CATTs, (ZREC, ZADD, ZDOM), a robust test (Sladek et al. 2007; Zheng et al. 2006a) is given by

ZMAX = max(|ZREC|, |ZADD|, |ZDOM|).

To estimate the p-value of ZMAX using the rhombus formula described later, we need to evaluate pairwise asymptotic correlation coefficients among (ZREC, ZADD, ZDOM). From Zheng et al. (2006a), under the null hypothesis, we have


where p0, p1, and p2 are the population probabilities of genotypes AA, Aa, and aa, respectively, which can be estimated by the observed frequencies in the combined case and control samples, p̂i = (ri + si)/(r + s) for i = 0, 1, 2 (see Table 1).
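The closed-form correlations of Zheng et al. (2006a) are not reproduced above. The sketch below therefore uses the general null covariance of two trend statistics with score vectors x and y, namely Σ x_i y_i p_i − (Σ x_i p_i)(Σ y_i p_i), rescaled to a correlation. This is believed to be equivalent to the closed forms, but it should be checked against the original formulas. The CATT expression coded here is the standard form, assumed because the display is missing, and all function names are illustrative.

```python
import numpy as np

# Genotype scores (phi0, phi1, phi2) for the three models, as stated in the text.
SCORES = {"REC": (0.0, 0.0, 1.0), "ADD": (0.0, 0.5, 1.0), "DOM": (0.0, 1.0, 1.0)}


def catt(case_counts, control_counts, phi):
    """Trend statistic Z_phi from genotype counts (r0, r1, r2) and (s0, s1, s2)."""
    r_i = np.asarray(case_counts, dtype=float)
    s_i = np.asarray(control_counts, dtype=float)
    phi = np.asarray(phi, dtype=float)
    r, s = r_i.sum(), s_i.sum()
    n_i, n = r_i + s_i, r + s
    numerator = np.sqrt(n) * np.sum(phi * (s * r_i - r * s_i))
    variance = r * s * (n * np.sum(phi**2 * n_i) - np.sum(phi * n_i) ** 2)
    return numerator / np.sqrt(variance)


def z_max(case_counts, control_counts):
    """Return (ZMAX, dict of the three signed CATTs)."""
    z = {m: catt(case_counts, control_counts, phi) for m, phi in SCORES.items()}
    return max(abs(v) for v in z.values()), z


def null_correlation(case_counts, control_counts):
    """3x3 null correlation matrix of (ZREC, ZADD, ZDOM) from pooled genotype frequencies."""
    n_i = np.asarray(case_counts, dtype=float) + np.asarray(control_counts, dtype=float)
    p = n_i / n_i.sum()                                # estimated (p0, p1, p2)
    x = np.array([SCORES[m] for m in ("REC", "ADD", "DOM")])
    mean = x @ p                                       # E[score] under each model
    cov = (x * p) @ x.T - np.outer(mean, mean)         # Cov of scores across models
    sd = np.sqrt(np.diag(cov))
    return cov / np.outer(sd, sd)


if __name__ == "__main__":
    cases, controls = (25, 283, 864), (10, 218, 929)   # rs1447295 counts from Table 3
    t, z = z_max(cases, controls)
    print("ZMAX:", t, z)
    print(null_correlation(cases, controls))
```

The correlation matrix depends only on the pooled genotype frequencies, which is why the null distribution of ZMAX differs from SNP to SNP.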

A MAX Test Statistic Based on Wald Tests

When there are other covariates to be adjusted for, the following logistic regression model can be used,

logit Pr(y = 1 | z, g) = α + γT z + β g,     (3)

where y, z, and g are, respectively, the outcome variable (case or control), the column vector of non-genetic covariates, and the genotype variable at the study SNP. Similar to the CATT, the genotype in model (3) can be coded according to the following three schemes: (i) g(R) = 1 for AA, and 0 for either Aa or aa, which is based on the recessive model in terms of the odds ratio; (ii) g(A) equals the number of copies (0, 1, or 2) of allele A, which is based on the additive (in logit scale) model; and (iii) g(D) = 1 for either AA or Aa, and 0 for aa, which is based on the dominant model. For each type of predictor, say g(R), denote the corresponding coefficients in model (3) by (α(R), γ(R), β(R)). The null hypothesis for g(R), for example, can be written as H0: β(R) = 0. The standard likelihood ratio test, score test, and Wald test can be used to test this null hypothesis while adjusting for the effect of z.

Here we focus on the Wald test. Depending on which model is assumed, g(R), g(A), or g(D), we could have three different Wald tests, denoted by WREC, WADD, and WDOM, each of which is asymptotically optimal under the assumed model. To have a more robust test when the underlying genetic model is unknown, we define the following MAX statistic based on the Wald test,

WMAX = max(|WREC|, |WADD|, |WDOM|).

In order to evaluate the p-value of WMAX using the rhombus formula, which will be described later, we need to estimate pairwise asymptotic correlation coefficients among (WREC, WADD, WDOM). However, unlike the case for the CATT, we do not have explicit formulas for the correlation coefficients. Instead, we propose to use the approach of Pepe et al. (1999) to estimate them numerically.

Covariance Matrix Estimation

Pepe et al. (1999) originally proposed to use the GEE method to compare several predictors in terms of their strength of association with a common outcome. In our application, we have three predictors, g(R), g(A), and g(D), and one common outcome y. The association between y and each of g(R), g(A), and g(D) is measured by β(R), β(A), and β(D), respectively. The procedure of Pepe et al. (1999) provides a way to estimate the covariance matrix of the estimates of (β(R), β(A), β(D)), and thus to estimate the correlation coefficients among the three Wald test statistics.

Here is an outline of how to apply the procedure of Pepe et al. (1999). Let {(yi, zi, gi): i = 1, …, n} be the observed values for a sample of n subjects. Then form the following coefficient vector by combining coefficients from the three models (recessive, additive, and dominant),


For the ith subject, based on its non-genetic covariates zi and three predictors gi(R), gi(A), and gi(D), create the following three rows of expanded covariates, corresponding to the coefficient vector θ,


We can estimate θ by solving the following estimating equations,


where inline image. Let θ̂ be the estimate obtained from the above estimating equations. Its covariance matrix can be estimated by the following sandwich estimate,


where inline image is the information matrix and is defined as




We can then define the three Wald tests as WREC = β̂(R)/√v(R), WADD = β̂(A)/√v(A), and WDOM = β̂(D)/√v(D), with v(R), v(A), and v(D) being the 7th to 9th diagonal elements of the estimated covariance matrix of θ̂, the estimate of θ. Each Wald test has an asymptotic N(0,1) distribution under the null hypothesis. Their correlation matrix can be obtained (after rescaling) from the corresponding principal submatrix of the estimated covariance matrix.

It can be verified that (α̂(R), γ̂(R), β̂(R)) is the same as the maximum likelihood estimate (MLE) based on the logistic regression model given by (3) with the genotype coded by g(R). The same is true for the estimates under the other two genetic models. Thus, by solving the estimating equation (4), we simultaneously obtain the MLE of (α, γT, β) under three different models based on (3). However, the estimated variances (v(R), v(A), v(D)) of (β̂(R), β̂(A), β̂(D)) are based on the robust sandwich estimate and differ from the ones based on the information matrix derived from (3). Thus our definition of the Wald statistic is slightly different from the standard Wald statistic, although the two definitions are asymptotically equivalent.
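The displays of this subsection are not reproduced in this copy, so the following is only a minimal sketch of the idea rather than the authors' implementation. It assumes an independence working correlation, under which the stacked estimating equations separate into three ordinary logistic score equations, and it stacks the parameters model by model as (α, γ, β) for REC, ADD, and DOM. That ordering is an assumption and differs from the one implied by the "7th to 9th diagonal elements" in the text. The function names `fit_logistic` and `wald_max` are hypothetical.

```python
import numpy as np


def fit_logistic(X, y, n_iter=25):
    """Newton-Raphson MLE for logistic regression; X already contains the intercept."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ theta))
        info = X.T @ (X * (mu * (1.0 - mu))[:, None])
        theta = theta + np.linalg.solve(info, X.T @ (y - mu))
    return theta


def wald_max(y, z, copies_of_A):
    """Sandwich-based Wald tests for the three codings, their correlation, and WMAX.

    y: 0/1 outcome; z: covariates (n or n x q); copies_of_A: 0, 1, or 2 copies of allele A.
    """
    y = np.asarray(y, dtype=float)
    z = np.asarray(z, dtype=float).reshape(len(y), -1)
    g = np.asarray(copies_of_A)
    codings = {
        "REC": (g == 2).astype(float),   # g(R) = 1 for AA
        "ADD": g.astype(float),          # g(A) = number of copies of A
        "DOM": (g >= 1).astype(float),   # g(D) = 1 for AA or Aa
    }
    scores, infos, betas = [], [], []
    for code in codings.values():
        X = np.column_stack([np.ones_like(y), z, code])
        theta = fit_logistic(X, y)
        mu = 1.0 / (1.0 + np.exp(-X @ theta))
        scores.append(X * (y - mu)[:, None])                   # per-subject score rows
        infos.append(X.T @ (X * (mu * (1.0 - mu))[:, None]))   # model-based information
        betas.append(theta[-1])
    d = infos[0].shape[0]
    bread = np.zeros((3 * d, 3 * d))                           # block-diagonal information
    for m, info in enumerate(infos):
        bread[m * d:(m + 1) * d, m * d:(m + 1) * d] = info
    stacked = np.hstack(scores)                                # n x 3d stacked scores
    bread_inv = np.linalg.inv(bread)
    cov = bread_inv @ (stacked.T @ stacked) @ bread_inv        # sandwich estimate
    idx = [d - 1, 2 * d - 1, 3 * d - 1]                        # positions of the three betas
    cov_beta = cov[np.ix_(idx, idx)]
    se = np.sqrt(np.diag(cov_beta))
    wald = np.array(betas) / se
    corr = cov_beta / np.outer(se, se)
    return dict(zip(codings, wald)), corr, float(np.max(np.abs(wald)))
```

The off-diagonal blocks of the middle ("meat") matrix are what carry the dependence among the three fits; the block-diagonal information matrix alone would treat them as independent.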

A Rhombus Formula to Approximate the P-value of MAX

Once we obtain the estimates of the pairwise correlation coefficients among (ZREC, ZADD, ZDOM) or (WREC, WADD, WDOM), we can use the approximation method developed in this section to calculate the p-value of ZMAX or WMAX. We describe the method in its general form. Assume that there are k (k = 3 in our application) test statistics, T1, T2, …, Tk, each of which is used to test the null hypothesis H0, under which each of these k test statistics approximately follows the standard normal distribution, whose density and distribution functions are denoted by φ(x) and Φ(x), respectively. We further assume that the correlation coefficient cor(Ti, Tj), for i, j ∈ {1, 2, …, k}, is known or can be estimated consistently. Let Tmax = max{|T1|, |T2|, …, |Tk|} be the MAX statistic. Given an observed value Tmax = t, we are interested in calculating the p-value P(Tmax > t) under H0.

Letting T*max = max{T1, T2, …, Tk}, Efron (1997) derived a formula (called the W-formula) to approximate P(T*max > t*), with t* being an observed value of T*max. Thus Efron (1997) dealt with a one-sided rejection region, whereas we need the probability of a rejection region that is symmetric about the origin. Following the techniques of Efron (1997), we derived a tight upper bound for P(Tmax > t). The derivation is given in the Appendix. By analogy with Efron's W-formula, we call our bound the rhombus formula (5); in it, Lij = arccos(cor(Ti, Tj)) and I{·} is an indicator function. We can see from (5) that the upper bound depends on how the k test statistics are indexed, whereas P(Tmax > t) itself does not. To obtain a tighter bound in practice, we can evaluate the upper bound under all possible orderings of the k test statistics and take the smallest value as the estimate of the p-value. This strategy is feasible for ZMAX and WMAX with k = 3.
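The display for (5) is not reproduced in this copy. A reconstruction that is consistent with the surrounding definitions (the angles Lij, the indicator function, and the polar-coordinate argument outlined in the Appendix) is sketched below. Its exact typeset form is an assumption and may differ from the authors' expression.

```latex
P(T_{\max} > t) \;\le\; 2\,\Phi(-t) \;+\; \frac{4\,\varphi(t)}{t}
\sum_{i=1}^{k-1}\Bigl[
  I\bigl\{0 \le L_{i,i+1} \le \tfrac{\pi}{2}\bigr\}
    \Bigl\{\Phi\Bigl(\frac{t\,L_{i,i+1}}{2}\Bigr) - \frac12\Bigr\}
  \;+\;
  I\bigl\{\tfrac{\pi}{2} < L_{i,i+1} \le \pi\bigr\}
    \Bigl\{\Phi\Bigl(\frac{t\,(\pi - L_{i,i+1})}{2}\Bigr) - \frac12\Bigr\}
\Bigr]
```

In this form the indicator simply selects the acute angle min(L, π − L) between consecutive statistics, and the first term is the two-sided p-value of a single test.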

It should be pointed out that the rhombus formula provides a theoretical upper bound for the true p-value if (T1, T2, …, Tk) follows a joint normal distribution with a known correlation matrix. In real applications, (T1, T2, …, Tk) is asymptotically normal. The correlation matrix is also estimated. Therefore the true p-value of the MAX test is not necessarily less than the bound calculated by the rhombus formula. However, from the numerical examples in both simulations and real data applications, we observed that the values given by the rhombus formula tended to overestimate the true p-values.
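As an illustration of the ordering strategy described above, the following sketch evaluates a bound of the rhombus type under every ordering of the k statistics and keeps the smallest value. The bound coded here is a reconstruction (first term 2Φ(−t), then one term per consecutive pair using the acute angle between them), since formula (5) is not reproduced in this copy; it should not be read as the authors' exact expression. The function names are hypothetical.

```python
import itertools
import math


def norm_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def norm_cdf(x):
    # erfc keeps precision in the far tails, which matters for very small p-values.
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def rhombus_bound(t, corr, order):
    """Reconstructed rhombus-type bound for one ordering of the test statistics."""
    total = 2.0 * norm_cdf(-t)
    for a, b in zip(order[:-1], order[1:]):
        rho = max(-1.0, min(1.0, corr[a][b]))   # guard against rounding outside [-1, 1]
        angle = math.acos(rho)
        angle = min(angle, math.pi - angle)     # role of the indicator: use the acute angle
        total += 4.0 * norm_pdf(t) / t * (norm_cdf(t * angle / 2.0) - 0.5)
    return total


def rhombus_pvalue(t, corr):
    """Smallest bound over all orderings, capped at 1; t is the observed MAX statistic."""
    if t <= 0:
        return 1.0
    k = len(corr)
    best = min(rhombus_bound(t, corr, order)
               for order in itertools.permutations(range(k)))
    return min(1.0, best)


if __name__ == "__main__":
    # Illustrative correlation matrix for (REC, ADD, DOM); values are made up.
    corr = [[1.0, 0.80, 0.33], [0.80, 1.0, 0.83], [0.33, 0.83, 1.0]]
    print(rhombus_pvalue(4.5, corr))
```

With k = 3 there are only six orderings (three distinct ones up to reversal), so the exhaustive search costs essentially nothing per SNP.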

Simulations Design

To evaluate the accuracy of the rhombus formula for approximating the p-values associated with ZMAX and WMAX, we conducted simulation studies to estimate the empirical Type I error rate at various significance levels. We simulated genotypes for 1,000,000 null SNPs for cases and controls, with various sample sizes. We considered two scenarios: S1, all cases and controls were sampled from a homogeneous population; S2, the study population consisted of two subpopulations. For each scenario, Hardy-Weinberg equilibrium (HWE) was assumed within each subpopulation. Under S1, the MAX test ZMAX was applied, with its p-value estimated by the rhombus formula (5). Under S2, we used WMAX with an adjustment for the (known) subpopulation structure, i.e., we entered the covariate z in model (3), with z = 0 for subjects from subpopulation 1 and z = 1 for subjects from subpopulation 2. Note that the purpose of S2 is not to evaluate the effect of the population substructure, but to demonstrate the use of the WMAX test. We assumed the number of cases (r) was the same as that of controls (s), with r = s = 500, 1,000, 1,500, and 2,000. Under S1, we assumed that minor allele frequencies (MAFs) of all 1,000,000 SNPs were independently generated from the uniform distribution U[0.1, 0.5], and we randomly assigned genotypes to cases and controls according to the genotype frequencies under HWE. Under S2, for any given SNP, its MAFs in the two subpopulations were generated by two independent random draws from a Beta distribution with parameters p(1 − FST)/FST and (1 − p)(1 − FST)/FST, where FST = 0.01 (a typical value for divergent European populations) and p was the ancestral population MAF drawn from U[0.1, 0.5] (Price et al. 2006). We further assumed that 60% and 40% of cases were chosen from subpopulations 1 and 2, respectively, while 40% and 60% of controls were sampled from subpopulations 1 and 2, respectively.

Under each scenario, based on the p-value estimated by the rhombus formula, we estimated the empirical Type I error rate of MAX by averaging results over the 1,000,000 null SNPs. The nominal level α was set to 0.0001, 0.001, 0.01, 0.05, 0.1, and 0.2.
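The data-generating steps of the two scenarios can be sketched as follows. This is a minimal illustration under stated assumptions, not the simulation code used for Table 2: genotype categories are indexed by the number of minor-allele copies, the 60/40 and 40/60 subpopulation splits are applied as fixed counts, and the function names are hypothetical.

```python
import numpy as np


def simulate_s1(rng, r, s):
    """S1: one homogeneous population; returns genotype counts for cases and controls."""
    q = rng.uniform(0.1, 0.5)                           # minor allele frequency
    probs = [(1 - q) ** 2, 2 * q * (1 - q), q ** 2]     # HWE genotype frequencies
    return rng.multinomial(r, probs), rng.multinomial(s, probs)


def simulate_s2(rng, r, s, fst=0.01):
    """S2: two subpopulations; returns individual-level (y, z, g) for a null SNP."""
    p = rng.uniform(0.1, 0.5)                           # ancestral MAF
    mafs = rng.beta(p * (1 - fst) / fst, (1 - p) * (1 - fst) / fst, size=2)
    r1 = int(round(0.6 * r))                            # cases from subpopulation 1
    s1 = int(round(0.4 * s))                            # controls from subpopulation 1
    y = np.repeat([1, 0], [r, s])
    z = np.concatenate([np.repeat([0, 1], [r1, r - r1]),
                        np.repeat([0, 1], [s1, s - s1])])
    g = rng.binomial(2, mafs[z])                        # HWE within each subpopulation
    return y, z, g


def empirical_type1(p_values, alphas=(0.0001, 0.001, 0.01, 0.05, 0.1, 0.2)):
    """Proportion of null SNPs whose estimated p-value is at or below each nominal level."""
    p = np.asarray(p_values, dtype=float)
    return {a: float(np.mean(p <= a)) for a in alphas}


if __name__ == "__main__":
    rng = np.random.default_rng(2008)
    print(simulate_s1(rng, 500, 500))
    y, z, g = simulate_s2(rng, 500, 500)
    print(np.bincount(g, minlength=3))
```

Each simulated SNP would then be passed to the MAX test and its p-value approximation, and `empirical_type1` applied to the collected p-values.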

Application to 17 Disease-Associated SNPs

We applied the MAX test to 17 SNPs whose association with various complex diseases had been confirmed, including 8 SNPs associated with type 2 diabetes (Sladek et al. 2007), 6 SNPs associated with breast cancer (Hunter et al. 2007), and 3 SNPs associated with prostate cancer (Yeager et al. 2007). For each of the above SNPs, we obtained its genotype counts and applied the MAX test ZMAX. We used the rhombus formula as well as the two resampling-based approaches (the parametric bootstrap and the permutation procedure) to estimate the p-values. The bootstrap method generates genotype counts for the cases and controls under the null hypothesis of no association, based on genotype frequencies in the pooled samples. The permutation procedure just randomly shuffles the case/control status among all individuals. We used 10,000,000 bootstrap or permutation steps to ensure a reliable estimation of the p-value for each SNP.
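The two resampling schemes described above can be sketched on genotype counts as follows. This is a hedged illustration, not the authors' code: the CATT expression is the standard form (assumed), shuffling case/control labels is implemented through the equivalent multivariate hypergeometric draw of case genotype counts, and the p-value is taken as the plain proportion of resampled statistics at least as large as the observed one. Function names are hypothetical.

```python
import numpy as np

# Rows: scores (phi0, phi1, phi2) for the recessive, additive, and dominant models.
PHI = np.array([(0.0, 0.0, 1.0), (0.0, 0.5, 1.0), (0.0, 1.0, 1.0)])


def zmax(case_counts, control_counts):
    """ZMAX for count arrays of shape (..., 3); vectorised over leading axes."""
    r_i = np.asarray(case_counts, dtype=float)
    s_i = np.asarray(control_counts, dtype=float)
    r = r_i.sum(axis=-1, keepdims=True)
    s = s_i.sum(axis=-1, keepdims=True)
    n_i, n = r_i + s_i, r + s
    stats = []
    with np.errstate(divide="ignore", invalid="ignore"):
        for phi in PHI:
            num = np.sqrt(n[..., 0]) * np.sum(phi * (s * r_i - r * s_i), axis=-1)
            var = (r * s)[..., 0] * (n[..., 0] * np.sum(phi**2 * n_i, axis=-1)
                                     - np.sum(phi * n_i, axis=-1) ** 2)
            stats.append(np.abs(num / np.sqrt(var)))
    # Degenerate resamples (zero variance) are treated as contributing 0.
    return np.max(np.nan_to_num(np.stack(stats), nan=0.0, posinf=0.0), axis=0)


def bootstrap_pvalue(case_counts, control_counts, n_rep, rng):
    """Parametric bootstrap: draw case and control counts from pooled genotype frequencies."""
    r_i, s_i = np.asarray(case_counts), np.asarray(control_counts)
    pooled = (r_i + s_i) / (r_i + s_i).sum()
    sim_cases = rng.multinomial(r_i.sum(), pooled, size=n_rep)
    sim_controls = rng.multinomial(s_i.sum(), pooled, size=n_rep)
    return float(np.mean(zmax(sim_cases, sim_controls) >= zmax(r_i, s_i)))


def permutation_pvalue(case_counts, control_counts, n_rep, rng):
    """Permutation: relabel cases at random, keeping the pooled genotype counts fixed."""
    r_i, s_i = np.asarray(case_counts), np.asarray(control_counts)
    n_i = r_i + s_i
    sim_cases = rng.multivariate_hypergeometric(n_i, int(r_i.sum()), size=n_rep)
    return float(np.mean(zmax(sim_cases, n_i - sim_cases) >= zmax(r_i, s_i)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cases, controls = (353, 605, 187), (396, 496, 249)   # rs7696175 counts from Table 3
    print(bootstrap_pvalue(cases, controls, 20000, rng))
    print(permutation_pvalue(cases, controls, 20000, rng))
```

With a small number of replicates, a p-value far below 1/n_rep simply comes out as 0, which is the practical limitation that motivates an analytic approximation.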

Application to the GWAS of Coronary Artery Disease (CAD)

CAD is one of the most common heart diseases. It is the main cause of death among the elderly. WTCCC (2007) reported results of a GWAS with 459,446 SNPs for CAD. To demonstrate the application of the MAX test in GWAS, we applied the test based on ZMAX to the CAD study and estimated p-values using the rhombus formula. We focused on 343,413 SNPs, after removing SNPs without SNP IDs, SNPs with a genotype count below 5 in any cell listed in Table 1, and SNPs with bad clustering properties.


Simulation Results

Table 2 reports the empirical Type I error rates when the p-value of the MAX statistic is approximated by the rhombus formula. It shows that the rhombus formula estimates the p-value reasonably well. Similar conclusions hold when MAFs are drawn from more restrictive intervals rather than uniformly from [0.1, 0.5] (results not shown). As expected for an upper bound, the rhombus formula tends to overestimate the p-value, so the empirical Type I error rate based on the estimated p-value is below the nominal level most of the time. From Table 2, the rhombus formula appears most accurate when true p-values are relatively small (less than 0.1). For example, when the nominal level is 0.05, the largest absolute difference between the empirical Type I error rate and 0.05 is less than 0.004 (at N = 500), whereas at a nominal level of 0.2 the largest absolute difference is 0.023 (also at N = 500). Thus, the rhombus formula becomes modestly conservative for less extreme p-values. These results suggest that the rhombus formula is particularly useful for GWAS, where the main focus is on SNPs with small p-values, and that it can also be applied to candidate studies with a nominal level below 0.05.

Table 2.  Empirical Type I Error Based on 1,000,000 Replicates

N (r = s)   Nominal level
            0.0001        0.001         0.01     0.05     0.1      0.2
Using the MAX test ZMAX with p-values estimated by the rhombus formula
500         1.13 × 10⁻⁴   9.17 × 10⁻⁴   0.0093   0.0463   0.0915   0.1768
1,000       9.90 × 10⁻⁵   9.46 × 10⁻⁴   0.0096   0.0476   0.0934   0.1791
1,500       1.07 × 10⁻⁴   1.03 × 10⁻³   0.0099   0.0482   0.0936   0.1789
2,000       9.70 × 10⁻⁵   1.02 × 10⁻³   0.0100   0.0484   0.0939   0.1797
Using the MAX test WMAX with p-values estimated by the rhombus formula
500         9.00 × 10⁻⁵   9.26 × 10⁻⁴   0.0093   0.0475   0.0928   0.1788
1,000       9.50 × 10⁻⁵   9.25 × 10⁻⁴   0.0098   0.0478   0.0935   0.1795
1,500       9.40 × 10⁻⁵   9.59 × 10⁻⁴   0.0097   0.0476   0.0931   0.1793
2,000       9.30 × 10⁻⁵   9.57 × 10⁻⁴   0.0098   0.0477   0.0934   0.1797

Estimated P-Values for 17 Disease-Associated SNPs

Table 3 reports the p-values obtained by the rhombus formula and two resampling-based methods for 17 confirmed SNPs from three genetic studies of type 2 diabetes (Sladek et al. 2007), breast cancer (Hunter et al. 2007), and prostate cancer (Yeager et al. 2007). Table 3 shows that p-values from the three methods generally agree well, especially when the minimum genotype count observed in the cases and controls is larger than 20. The p-value estimated by the rhombus formula tends to be slightly larger than that obtained by the resampling-based procedures. This is consistent with the fact that the rhombus formula provides a theoretical upper bound under the normality assumption. One advantage of using the rhombus formula to estimate the p-value is that it can provide a reasonably accurate approximation when the true p-value is less than 10⁻⁶, a range in which resampling requires more than 10,000,000 permutation steps. For example, for SNP rs7903146 in the study of type 2 diabetes (Sladek et al. 2007), the resampling-based estimates show only that the p-value is less than 10⁻⁷, because no simulated MAX statistic was more extreme than the observed MAX. Using the rhombus formula, we estimated the p-value to be 1.58 × 10⁻¹⁸. This information is useful in replication studies and meta-analyses when p-values from several studies are to be combined.

Table 3.  P-values of Identified SNPs in GWASs of Diabetes, Breast, and Prostate Cancers

SNP          r0    r1    r2    s0    s1    s2    Rhombus formula   10,000,000 bootstraps   10,000,000 permutations
8 confirmed SNPs associated with type 2 diabetes (Sladek et al. 2007)
rs7903146    197   348   149   335   254   65    1.58 × 10⁻¹⁸      < 1 × 10⁻⁷              < 1 × 10⁻⁷
rs13266634   54    229   411   53    293   307   1.84 × 10⁻⁵       1.52 × 10⁻⁵             1.42 × 10⁻⁵
rs1111875    77    302   315   119   308   227   6.78 × 10⁻⁶       5.40 × 10⁻⁶             7.10 × 10⁻⁶
rs7923837    66    300   328   116   296   242   2.28 × 10⁻⁶       2.20 × 10⁻⁶             2.33 × 10⁻⁶
rs7480010    301   327   66    363   246   45    2.18 × 10⁻⁵       1.76 × 10⁻⁵             2.19 × 10⁻⁵
rs3740878    25    273   386   65    249   353   1.84 × 10⁻⁵       1.70 × 10⁻⁵             1.52 × 10⁻⁵
rs11037909   25    274   387   65    251   353   1.85 × 10⁻⁵       1.81 × 10⁻⁵             1.71 × 10⁻⁵
rs1113132    25    271   390   63    251   355   4.12 × 10⁻⁵       3.68 × 10⁻⁵             3.66 × 10⁻⁵
6 reported SNPs associated with breast cancer (Hunter et al. 2007)
rs10510126   955   180   10    854   272   14    1.42 × 10⁻⁶       0.90 × 10⁻⁶             0.50 × 10⁻⁶
rs12505080   608   477   50    628   408   99    8.27 × 10⁻⁵       7.92 × 10⁻⁵             7.33 × 10⁻⁵
rs17157903   777   316   18    862   220   26    6.20 × 10⁻⁵       4.95 × 10⁻⁵             5.69 × 10⁻⁵
rs1219648    352   543   250   433   538   170   4.80 × 10⁻⁶       4.30 × 10⁻⁶             4.10 × 10⁻⁶
rs7696175    353   605   187   396   496   249   1.98 × 10⁻³       2.07 × 10⁻³             2.14 × 10⁻³
rs2420946    357   546   242   440   537   165   5.14 × 10⁻⁶       5.60 × 10⁻⁶             3.80 × 10⁻⁶
3 reported SNPs associated with prostate cancer (Yeager et al. 2007)
rs1447295    25    283   864   10    218   929   1.10 × 10⁻⁴       0.88 × 10⁻⁴             0.80 × 10⁻⁴
rs6983267    351   598   223   277   579   301   2.06 × 10⁻⁵       2.12 × 10⁻⁵             2.36 × 10⁻⁵
rs7837688    861   283   27    939   206   11    6.67 × 10⁻⁶       3.70 × 10⁻⁶             3.00 × 10⁻⁶

Application to GWAS of CAD

Figure 1 plots the estimated p-values based on the rhombus formula for 343,413 SNPs according to their positions along each chromosome. Table 4 lists all 22 SNPs with an estimated p-value (by the rhombus formula) below 10⁻⁵. Also presented in Table 4 are p-values estimated by the two resampling-based methods. From Table 4, it can be seen that results from the rhombus formula agree well with estimates from the two resampling-based methods. For those SNPs with extremely small p-values (say, less than 10⁻⁷), it is not computationally feasible to use resampling-based methods. The rhombus formula has no such limitation.

Figure 1.  P-values of MAX statistics along the chromosome for 343,413 SNPs in the GWAS of CAD.

Table 4.  Estimated P-values for 22 Chosen SNPs from the GWAS of CAD

SNP ID       Chr   Position    r0    r1    r2     s0     s1     s2     Rhombus formula   10,000,000 bootstraps   10,000,000 permutations
rs4854090    2     240888623   43    788   1025   155    1127   1584   5.88 × 10⁻⁷       7.00 × 10⁻⁷             8.00 × 10⁻⁷
rs5007171    3     45124712    29    684   1173   126    966    1803   2.03 × 10⁻⁷       4.00 × 10⁻⁷             < 10⁻⁷
rs7044859    9     22008781    449   972   502    539    1447   950    1.34 × 10⁻⁷       < 10⁻⁷                  2.00 × 10⁻⁷
rs523096     9     22009129    649   960   314    857    1442   637    3.81 × 10⁻⁶       3.70 × 10⁻⁶             3.90 × 10⁻⁶
rs518394     9     22009673    648   964   310    856    1445   631    3.77 × 10⁻⁶       2.30 × 10⁻⁶             3.90 × 10⁻⁶
rs10757264   9     22009732    509   976   437    623    1485   828    5.21 × 10⁻⁷       4.00 × 10⁻⁷             2.00 × 10⁻⁷
rs10965212   9     22013795    513   950   459    603    1428   895    2.74 × 10⁻⁹       < 10⁻⁷                  < 10⁻⁷
rs1292136    9     22014351    382   951   590    746    1482   708    1.40 × 10⁻⁸       < 10⁻⁷                  < 10⁻⁷
rs7049105    9     22018801    506   950   461    593    1439   897    3.02 × 10⁻⁹       < 10⁻⁷                  < 10⁻⁷
rs10965215   9     22019445    506   954   463    588    1447   902    1.48 × 10⁻⁹       < 10⁻⁷                  < 10⁻⁷
rs564398     9     22019547    721   925   277    921    1428   583    4.99 × 10⁻⁸       < 10⁻⁷                  < 10⁻⁷
rs7865618    9     22021005    706   929   289    888    1429   619    3.80 × 10⁻⁹       < 10⁻⁷                  < 10⁻⁷
rs10965219   9     22043687    530   956   439    613    1444   876    1.45 × 10⁻¹⁰      < 10⁻⁷                  < 10⁻⁷
rs9632884    9     22062301    592   949   381    693    1422   818    1.13 × 10⁻¹²      < 10⁻⁷                  < 10⁻⁷
rs6475606    9     22071850    588   957   380    683    1425   830    1.31 × 10⁻¹³      < 10⁻⁷                  < 10⁻⁷
rs4977574    9     22088574    382   937   605    804    1435   698    1.27 × 10⁻¹²      < 10⁻⁷                  < 10⁻⁷
rs2891168    9     22088619    383   938   605    803    1435   698    1.75 × 10⁻¹²      < 10⁻⁷                  < 10⁻⁷
rs1333042    9     22093813    365   939   617    770    1426   724    7.23 × 10⁻¹²      < 10⁻⁷                  < 10⁻⁷
rs1333048    9     22115347    619   951   354    730    1423   781    3.80 × 10⁻¹³      < 10⁻⁷                  < 10⁻⁷
rs1333049    9     22115503    586   960   378    676    1431   829    5.36 × 10⁻¹⁴      < 10⁻⁷                  < 10⁻⁷
rs17697005   16    12268766    154   987   715    361    1398   1105   9.36 × 10⁻⁶       8.70 × 10⁻⁶             9.80 × 10⁻⁶
rs688034     22    25014189    823   832   265    1386   1266   276    5.81 × 10⁻⁶       4.80 × 10⁻⁶             6.30 × 10⁻⁶


A computationally feasible single-marker analysis is usually required at the initial stage of a GWAS to scan through the genome in order to prioritize SNPs for subsequent studies or to identify SNPs that reach the global significance level. The MAX statistic, which is defined as the maximum of several statistics targeting alternative hypotheses for the disease model, is a good candidate for the single-marker analysis because it can retain high power across a wide range of disease models. This robustness is particularly attractive for a GWAS, since it is unlikely that all disease-associated markers follow the same disease model. But its application in GWAS is limited by the difficulty in assessing the significance level of the MAX test. In this paper, we derive a simple approximation formula, called the rhombus formula, for estimating the p-value of the MAX test. Multiple integration (Conneely & Boehnke, 2007) is an alternative method for calculating adjusted p-values. Compared with multiple integration, our method has an analytic expression and is more convenient to use. Our method can be applied to the MAX test with or without adjustment for the effect of covariates, that is, to the maximum of three CATTs or of three Wald tests derived for three alternative disease penetrance models. It does not require resampling steps (permutation or bootstrap) and thus is readily applicable to GWAS.

The rhombus formula provides a theoretical upper bound for p-values under the normality assumption. In real applications, using this upper bound tends to overestimate the true p-value. The formula is particularly suitable for approximating low p-values, but it is less accurate for estimating p-values above 0.2, as is evident from the simulation studies. However, this limitation should not restrict its application in GWAS, where we are interested primarily in identifying SNPs with relatively low p-values.

With the rhombus formula, the MAX test can be used routinely in GWAS. In this paper, we focus mainly on the operational aspects of the MAX test, such as how to estimate the p-value and how to do the MAX test with adjustment for covariate effects. It is important to evaluate the impact of using the MAX test on the design and analysis of the GWAS. Although it is not straightforward to derive an analytic power calculation formula for the MAX test, it is computationally feasible to evaluate its power and other properties through simulation studies, using the rhombus formula.

In practice, case-control studies are susceptible to various confounding effects. One issue in case-control design is population stratification, which leads to spurious associations when the allele (genotype) frequencies and disease prevalence change across subpopulations. Various approaches (e.g., Devlin & Roeder, 1999; Price et al. 2006; Zheng et al. 2006b, and Li & Yu, 2008) have been proposed for the correction of population stratification. We are currently investigating the effect of population stratification on the MAX test and how to apply those correction methods with MAX to GWAS.


We would like to thank the associate editor and two reviewers for their thoughtful comments, which led to an improved manuscript. We thank Sholom Wacholder and B.J. Stone for their help. This research utilized the high-performance computational capabilities of the Biowulf PC/Linux cluster at the National Institutes of Health, Bethesda, Maryland, USA (http://biowulf.nih.gov). K Yu and Q Li are supported by the Intramural Program of the National Institutes of Health. Q Li is supported in part by the Knowledge Innovation Program of the Chinese Academy of Sciences, Nos. 30465W0 and 30475V0. Z Li is supported in part by NIH grant EY014478.


Derivation of the Rhombus Formula:  For any given t, define 2k events:


Let Z ∼ N(0, I2), where 0 = (0, 0)T and I2 is the 2 × 2 identity matrix. Then Ti and Tj can be expressed as Ti = γiT Z and Tj = γjT Z, where γi and γj are unit vectors such that cor(Ti, Tj) = γiT γj. As illustrated in Figure 2, Ei and Ej are two half-spaces. Their boundaries are two tangent lines that are perpendicular to γi and γj, respectively. The distance between each tangent point and the origin is t. inline image is the shaded region lying between the two parallel lines. Let Ω be the middle rhombus-shaped region. Then, we have

Figure 2.  Graphical representation of inline image (the shaded region), where Ti = γ′i(Z1, Z2)′, Tj = γ′j(Z1, Z2)′, (Z1, Z2)′ ∼ N(0, I2), 0 = (0, 0)′, and I2 is the 2 × 2 identity matrix.

inline image is four times the probability of the event defined by triangle AOC. Expressing Z in polar coordinates, we have


Since inline image for inline image, we have


Hence, we obtain
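The displays of this derivation are not reproduced in this copy. The following sketch reconstructs one route that matches the outline above (the rhombus Ω, polar coordinates, and an elementary inequality). The notation E_i = {|T_i| ≥ t}, which merges the two one-sided events for T_i, and the use of tan θ ≥ θ on [0, π/2) are assumptions and may differ from the authors' steps. Here L = L_{i,i+1} and Ω = E_i^c ∩ E_{i+1}^c.

```latex
\begin{align*}
P(T_{\max} \ge t) &= P\Bigl(\bigcup_{i=1}^{k} E_i\Bigr)
  \;\le\; P(E_1) + \sum_{i=1}^{k-1} P\bigl(E_{i+1}\cap E_i^{c}\bigr),
  \qquad P(E_1) = 2\,\Phi(-t), \\[4pt]
P\bigl(E_{i+1}\cap E_i^{c}\bigr) &= P(E_i^{c}) - P(\Omega)
  = \bigl\{1 - 2\,\Phi(-t)\bigr\} - P(\Omega), \\[4pt]
P(\Omega) &= 1 - \frac{2}{\pi}\Bigl[\int_{0}^{L/2} + \int_{0}^{(\pi-L)/2}\Bigr]
  \exp\Bigl(-\frac{t^{2}}{2\cos^{2}\theta}\Bigr)\,d\theta, \\[4pt]
2\,\Phi(-t) &= \frac{2}{\pi}\int_{0}^{\pi/2}
  \exp\Bigl(-\frac{t^{2}}{2\cos^{2}\theta}\Bigr)\,d\theta, \\[4pt]
P\bigl(E_{i+1}\cap E_i^{c}\bigr)
  &= \frac{2}{\pi}\Bigl[\int_{0}^{L/2} - \int_{(\pi-L)/2}^{\pi/2}\Bigr]
     \exp\Bigl(-\frac{t^{2}}{2\cos^{2}\theta}\Bigr)\,d\theta
  \;\le\; \frac{2}{\pi}\int_{0}^{L/2}
     \exp\Bigl(-\frac{t^{2}\,(1+\tan^{2}\theta)}{2}\Bigr)\,d\theta \\[4pt]
  &\le \frac{2}{\pi}\,e^{-t^{2}/2}\int_{0}^{L/2} e^{-t^{2}\theta^{2}/2}\,d\theta
  \;=\; \frac{4\,\varphi(t)}{t}\Bigl\{\Phi\Bigl(\frac{t\,L}{2}\Bigr) - \frac12\Bigr\}.
\end{align*}
```

Because E_{i+1} is unchanged when γ_{i+1} is replaced by −γ_{i+1}, the same bound holds with L replaced by π − L, so the acute angle can always be used; summing the pairwise bounds over i = 1, …, k − 1 and adding 2Φ(−t) then gives a bound of the rhombus type.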