Keywords:

  • GWAS;
  • case-control study;
  • power;
  • study design;
  • Omni2.5;
  • imputation

Abstract

  1. Top of page
  2. Abstract
  3. INTRODUCTION
  4. MATERIALS AND METHODS
  5. RESULTS
  6. DISCUSSION
  7. ACKNOWLEDGMENTS
  8. References
  9. Appendix
  10. Supporting Information

Genome-wide association studies (GWAS) have been successful in their search for common genetic variants associated with complex traits and diseases. With new advances in array technologies together with available genetic reference sets, the next generation of GWAS will extend the search for associations with uncommon SNPs (1% ⩽ MAF ⩽ 10%). Two possible approaches are genotyping all participants, a prohibitively expensive option for large GWAS, or using a combination of genotyping and imputation. Here, we consider a two-platform method that genotypes all participants on a standard genotyping array, designed to identify common variants, and then supplements those data by genotyping only a small proportion of the participants on a platform that has higher coverage for uncommon SNPs. This subset of the study population is then included as part of the imputation reference set. To demonstrate the use of this two-platform design, we evaluate its potential efficiency using a newly available dataset containing 756 individuals genotyped on both the Illumina Human OmniExpress and Omni2.5 Quad. Although genotyping all individuals on the denser array would be ideal, we find that genotyping only 100 individuals on this array, in combination with imputation, leads to only a modest loss of power for detecting associations. However, the loss of power due to imputation can be more substantial if the relative risks for rare variants are significantly larger than those previously observed for common variants. Genet. Epidemiol. 36:400–408, 2012. © 2012 Wiley Periodicals, Inc.


INTRODUCTION

Genome-wide association studies (GWAS) have been successful in their search for common genetic variants associated with complex traits, including cancer and type II diabetes (Chung et al. [1], Zeggini et al. [2]). Now, a new generation of GWAS is being designed to discover uncommon variants, which have minor allele frequencies (MAF) between 1% and 10%. With improvements in both array technologies and available genetic reference sets, two approaches are looming on the horizon: (1) next generation sequencing of whole genomes and (2) a combination of genotyping on SNP arrays and imputation. Both options are expensive and, in particular, the first option is prohibitively expensive for the large studies needed to test for associations. Alternatively, some have advocated that it might be possible to rely on imputation alone, based on less-dense arrays.

Over the past two years, imputation has emerged as a stable and powerful tool for identifying candidate SNPs and for facilitating meta-analyses. The meta-analyses have identified common loci associated with a variety of traits and diseases, such as cholesterol levels (Clarke [3]), rheumatoid arthritis (Stahl et al. [4]), and multiple sclerosis (De Jager et al. [5]). By itself, the GIANT consortium has successfully used imputation to detect nearly 200 common loci associated with height and weight (Lango Allen et al. [6]). Based on a history of such successes, together with evaluations of imputation accuracy (Biernacka et al. [7], Li et al. [8], Pei [9], Servin [10], Zhao [11]), imputation offers a more economical alternative to directly genotyping all individuals (Servin [10], Scheet and Stephens [12], Li et al. [13], Li et al. [8], Li and Stephens [14], Mott et al. [15], Howie et al. [16], Hao et al. [17], Browning [18], Anderson [19], Daly et al. [20], Purcell et al. [21]). One option is to genotype all individuals on a less-dense platform (e.g., OmniExpress), and then use an imputation procedure, trained on one of the publicly available databases, to estimate the missing genotypes (1000Genomes [22], Consortium [23], Metzker [24]). With this design, if the ancestry of the study population is not adequately represented in the database, the imputation accuracy for uncommon SNPs can be less than ideal and confound study results (Coventry et al. [25], Spencer et al. [26], Crawford and Dilks [27], Howie et al. [16]).

In this paper, we consider a two-platform design that supplements a study using less-dense arrays by genotyping a small subset of the study population on a denser platform (Fridley et al. [28]). We then impute the missing genotypes for those participants not genotyped on the denser platform and perform the GWAS on the augmented dataset. Instead of depending only on a public dataset, the imputation reference set now includes a genotyped subset of the study population (Holm et al. [29], Zeggini [30], Crawford and Dilks [27], Howie et al. [16]). Our goal is to show that this two-platform method, which safeguards against confounding from inadequately chosen reference sets, can be efficient, while identifying its limitations. We provide evidence suggesting that it could be possible to observe more than 80% of the detectable associations with as few as 100 subjects genotyped on the denser array, an increase of 5–10% over the percentage possible when basing imputation only on a public reference set. However, we also show that if the relative risks (RRs) for rare variants are significantly larger than those previously observed for common variants, as some hypothesize, then the proportion detected would likely be lower. This same evidence cautions against depending on imputation if rare variants are found to have large RRs.

The two-platform design is appropriate whenever two different genotyping methods are available with one method being more inclusive, but more expensive. However, for purposes of demonstrating the potential for this design, we focus on imputing uncommon SNPs for subjects genotyped on Illumina's OmniExpress (≈ 700,000+ common SNPs) using a newly available dataset containing 319, 214, and 223 individuals from the PLCO, CPSII, and ATBC studies (Albanes et al. [31], Thomas et al. [32], Calle et al. [33]) who were genotyped on both the OmniExpress and the Omni2.5-Quad (≈ 2.5 million SNPs) (Wang et al. [34]). We assess imputation accuracy, measured by r² as a function of MAF, reaffirming that low-MAF alleles are difficult to impute. Using the resulting r² distribution, we estimate the proportion of detectable SNPs discovered as a function of the proportion of individuals fully genotyped in a two-platform study. This estimation is performed for SNPs with specific MAFs and for a set of SNPs representative of those likely to be seen in a future GWAS, specifically the content of the Omni2.5. To estimate the proportion of SNPs discovered, we build upon previous results that relate r² with the effective sample size (Hao et al. [17], Terwilliger and Hiekkalinna [35], Li et al. [8], Balding [36]).

MATERIALS AND METHODS

SAMPLES

The set of 756 individuals was genotyped as part of the imputation dataset reported elsewhere (Wang et al. [34]). For this analysis, we included 319, 214, and 223 cancer-free controls above 55 years old and from a European background selected from three prospective cohorts, PLCO, CPSII, and ATBC studies. Details about the study populations have been published previously (Albanes et al. [31], Prorok et al. [37], Gohagan et al. [38], Calle et al. [33]). MAF distributions for SNPs on the OmniExpress, only on the Omni2.5, and on both, inline image and inline image, were estimated from all 756 individuals.

GENOTYPING

Genotypes were collected on Illumina's Omni2.5M array and called using the GenTrain2 algorithm within Illumina GenomeStudio. An established quality control (QC) process was applied to samples by study to ensure that only high-quality genotypes were retained for the analytic dataset (Wang et al. [34]). Briefly, QC metrics included completion rates by sample or locus, sample heterozygosity rate, and duplicate concordance rate; standard thresholds for exclusion were then applied. Among the 170 duplicates that were included in the analysis (59, 63, and 48 for ATBC, CPSII, and PLCO), the average concordance rate was greater than 99.9%.

The Omni2.5 array contained inline image qualifying SNPs, of which inline image were included on the OmniExpress and inline image were unique to the Omni2.5. However, we only considered the subset of new SNPs that could be imputed using the CEU HapMap/1000 Genome reference set downloaded from IMPUTE2 (http://mathgen.stats.ox.ac.uk/impute/data_download_1000G_pilot_plus_hapmap3.html). We will refer to this reference set, which was built from the 1000 Genomes project June 2010 release and HapMap 3 February 2009 release, as the HapMap/1000 Genome reference set, HM1G. This restriction, which reduced inline image to 1,157,273, prevents underestimating the percentage of associations that can be discovered. Without it, monomorphic SNPs can, because of genotyping error, be counted as low-MAF, difficult-to-detect SNPs. For the discussion on power, we label the SNPs by inline image, where inline image if it is included on the OmniExpress.

IMPUTATION

The test set for assessing the accuracy of imputation included inline image=256 individuals, with 108, 72, and 76 randomly selected individuals from the PLCO, CPSII, and ATBC samples. For individuals within this test set, IMPUTE2 (Howie et al. [16]) was used to impute their genotypes in the new content from their genotypes in the OmniExpress content and from a fully genotyped training set. Training sets contained a total of inline image = 50, 100, 200, 300, 400, or 500 distinct individuals, with approximately 42%, 28%, and 30% contributed by PLCO, CPSII, and ATBC, respectively. We refer to these 500 individuals as our population-specific reference set, PSREF, and they were included as an unphased reference set. Except where noted, we also used HM1G as a reference.
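The split described above can be checked with a little arithmetic. The sketch below is illustrative only: the cohort and test-set counts come from the text, while the function names and the largest-remainder rounding rule are assumptions introduced here, not the authors' procedure. Removing the 256 test individuals from the 756 leaves a pool of 500, whose cohort shares are close to the 42%, 28%, and 30% quoted.

```python
# Cohort sizes and test-set draws as reported in the text.
COHORT_SIZES = {"PLCO": 319, "CPSII": 214, "ATBC": 223}
TEST_DRAWS = {"PLCO": 108, "CPSII": 72, "ATBC": 76}


def training_pool(cohort_sizes, test_draws):
    """Individuals left for training once the test set is removed."""
    return {c: cohort_sizes[c] - test_draws[c] for c in cohort_sizes}


def allocate(total, pool):
    """Split `total` training individuals across cohorts in proportion
    to the pool. Largest-remainder rounding is an assumed scheme."""
    pool_size = sum(pool.values())
    exact = {c: total * k / pool_size for c, k in pool.items()}
    counts = {c: int(x) for c, x in exact.items()}
    leftover = total - sum(counts.values())
    # Give the remaining slots to the cohorts with the largest fractions.
    by_fraction = sorted(exact, key=lambda c: exact[c] - counts[c], reverse=True)
    for cohort in by_fraction[:leftover]:
        counts[cohort] += 1
    return counts


pool = training_pool(COHORT_SIZES, TEST_DRAWS)
for size in (50, 100, 200, 300, 400, 500):
    print(size, allocate(size, pool))
```

The proportional allocation is one plausible way to keep the cohort mix stable across training-set sizes; the source does not say how the smaller training sets were drawn.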

The quality of the imputation procedure at SNP j was measured by inline image, the squared-correlation between the true and imputed genotypes.

  • display math(1)

where inline image and inline image are the true and imputed genotypes at SNP j for subject i in the test sample. Although inline image. When inline image and inline image was defined to be 1. If only one variable was identically 0, inline image was defined to be 0. For a given inline image and MAF, we estimated the distribution, inline image, of inline image empirically from the data.
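As a rough illustration of the accuracy measure, the following sketch computes the squared Pearson correlation between true and imputed genotypes for one SNP. It reads the edge-case conventions above as: both vectors identically 0 gives 1, exactly one identically 0 gives 0. The function name is hypothetical, and returning 0 for a constant but nonzero vector is an additional assumption that the text does not address.

```python
from statistics import fmean


def imputation_r2(true_genotypes, imputed_genotypes):
    """Squared correlation between true and imputed genotypes at one SNP."""
    if len(true_genotypes) != len(imputed_genotypes) or not true_genotypes:
        raise ValueError("need two non-empty vectors of equal length")

    true_all_zero = all(g == 0 for g in true_genotypes)
    imputed_all_zero = all(g == 0 for g in imputed_genotypes)
    if true_all_zero and imputed_all_zero:
        return 1.0  # convention from the text: both identically 0
    if true_all_zero or imputed_all_zero:
        return 0.0  # convention from the text: only one identically 0

    mean_true = fmean(true_genotypes)
    mean_imputed = fmean(imputed_genotypes)
    sxx = sum((g - mean_true) ** 2 for g in true_genotypes)
    syy = sum((h - mean_imputed) ** 2 for h in imputed_genotypes)
    sxy = sum(
        (g - mean_true) * (h - mean_imputed)
        for g, h in zip(true_genotypes, imputed_genotypes)
    )
    if sxx == 0.0 or syy == 0.0:
        return 0.0  # ASSUMPTION: constant nonzero vector treated as uninformative
    return (sxy * sxy) / (sxx * syy)


# Imputed values may be fractional dosages rather than hard calls.
print(imputation_r2([0, 1, 2, 0], [0.1, 0.9, 1.8, 0.0]))
```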

POWER CALCULATIONS

We calculated the power, or the proportion of associations likely to be detected, assuming a case/control study with individuals divided equally between the two groups. We further assumed that associations would be tested by the score statistic, inline image, at an α-level of 10⁻⁷. The MAF of the tested SNPs was either fixed or assumed to follow a distribution similar to that found on the Omni2.5. To estimate the distribution of relative risks (RRs) for common variants, we extrapolated from the effect sizes (ES) reported by Park and others (Park et al. [39]). To accommodate the hypothesis that the RR should be higher for rare SNPs, when MAF < 0.05, we multiplied the extrapolated values of RR by either 1 + 8(0.05 − MAF) or 1 + 16(0.05 − MAF), so the multiplier reached its maximum of 1.4 or 1.8, respectively, at MAF = 0. To determine the efficiency of the two-platform design, we will focus on and report the relative power, or the relative proportion discovered (rPD). rPD is defined to be the percentage of associations discovered by the two-platform design as compared to the percentage discovered by the one-platform design using only the denser array. Specific details follow.

Disease risk was assumed to follow a logistic model, inline image, where inline image is the expected value of inline image, an indicator variable for disease status, inline image is the number of minor alleles at the observed locus j, and γ1 is the log(OR). Disease prevalence was fixed at 0.1. Under the null hypothesis, inline image is assumed to be distributed as a central chi-square distribution with 1 df: inline image. Under the alternative, inline image is distributed as a noncentral chi-square distribution with 1 df and noncentrality parameter η: inline image. Therefore, power, β, is defined as inline image. Derivation of η is reserved for the Appendix, but in general, η is a function of n, MAF, RR, π, and (when inline image) inline image, where π is the proportion of individuals genotyped on the denser platform. The parameter, η, will be denoted by inline image when appropriate. To fix the power at M, we effectively choose any combination of RR and n such that inline image.
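The tail probability that defines power can be sketched without the Appendix. The code below treats the noncentrality parameter η as a given input (its derivation from n, MAF, RR, and π is not reproduced here) and uses the standard identity that a noncentral chi-square variable with 1 df equals (Z + √η)² for a standard normal Z. Function names are illustrative, and the bisection used to fix the power at a target M is an assumed numerical approach, not necessarily the authors'.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def critical_z(alpha=1e-7):
    """Two-sided normal critical value; its square is the chi-square (1 df) cutoff."""
    return -_STD_NORMAL.inv_cdf(alpha / 2.0)


def power_1df(eta, alpha=1e-7):
    """P(noncentral chi-square with 1 df and noncentrality eta exceeds the cutoff)."""
    z = critical_z(alpha)
    shift = sqrt(eta)
    return _STD_NORMAL.cdf(shift - z) + _STD_NORMAL.cdf(-shift - z)


def eta_for_power(target, alpha=1e-7, tol=1e-10):
    """Noncentrality that yields the target power, found by bisection."""
    if not alpha < target < 1.0:
        raise ValueError("target power must lie strictly between alpha and 1")
    lo, hi = 0.0, 1.0
    while power_1df(hi, alpha) < target:
        hi *= 2.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if power_1df(mid, alpha) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


for target in (0.2, 0.5, 0.8):
    print(target, eta_for_power(target))
```

Any combination of RR and n that produces the returned η gives the same power, which is the sense in which the text fixes power at M.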

To transform the ES reported by Park and others (Park et al. [39]) into RRs, we used the equality inline image and assumed MAF=0.10, resulting in a discrete distribution of RR with cumulative probabilities of 0.348, 0.553, 0.737, and 0.826 at RRs of 1.109, 1.134, 1.138, and 1.155. Sixteen more values were permitted such that the cumulative probabilities of 0.90 and 0.99 occurred at RRs of 1.20 and 1.44.
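For illustration, the four lowest RR values and their cumulative probabilities can be turned into point masses, and the rare-variant scaling described in the power calculations can be applied to them. This is a minimal sketch: the function names are hypothetical, only the four support points quoted above are included (the sixteen additional values are not listed in the text), and the MAF range check is an assumption.

```python
# Four lowest support points and cumulative probabilities quoted in the text.
RR_VALUES = (1.109, 1.134, 1.138, 1.155)
CUMULATIVE_PROBS = (0.348, 0.553, 0.737, 0.826)


def point_masses(cumulative_probs):
    """Convert cumulative probabilities into per-value probabilities."""
    masses = []
    previous = 0.0
    for cumulative in cumulative_probs:
        masses.append(cumulative - previous)
        previous = cumulative
    return masses


def rare_variant_multiplier(maf, slope):
    """1 + slope * (0.05 - MAF) when MAF < 0.05, otherwise 1.

    slope is 8 or 16 in the text, giving maxima of 1.4 and 1.8 at MAF = 0.
    """
    if not 0.0 <= maf <= 0.5:
        raise ValueError("MAF must lie in [0, 0.5]")
    return 1.0 + slope * (0.05 - maf) if maf < 0.05 else 1.0


def scaled_rr(rr, maf, slope):
    """Relative risk after the rare-variant scaling is applied."""
    return rr * rare_variant_multiplier(maf, slope)


for rr, mass in zip(RR_VALUES, point_masses(CUMULATIVE_PROBS)):
    print(rr, round(mass, 3), round(scaled_rr(rr, maf=0.02, slope=8), 3))
```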

The average power, when using the one-platform design, for all SNPs on the Omni2.5 array can be obtained from

  • display math(2)

where integration was estimated numerically. This average, like all averages in the Results section, assumes that all SNPs are equally likely to be the causal SNP. This calculation was repeated for the new content by replacing inline image with inline image.

With this notation, the rPD for a single imputed SNP can be obtained from

  • display math(3)

For Figure 3, we assumed that the distribution of inline image was derived from an external set containing 500 individuals and HM1G. Therefore, the estimated rPD for a given MAF was

  • display math(4)

The average rPD for a study with n individuals, of which inline image are genotyped, can be obtained from

  • display math(5)
  • display math(6)

When inline image, we still estimated the distribution of r² by inline image. Therefore, above this threshold, the gain in power may be underestimated if imputation accuracy improves above inline image.

RESULTS

POWER AND SAMPLE SIZE FOR A ONE-PLATFORM DESIGN

Figure 1 shows that the number of subjects needed to detect associations with uncommon variants may be prohibitively large for direct genotyping. To have a power of 0.5 to detect an SNP with MAF = 0.02 and a RR of 1.2, for example, a study would need to include 64,000 individuals. The required sample size would be reduced to 17,000 and 8,400 individuals if the RRs were raised to 1.4 and 1.6, respectively (Fig. 1).

Figure 1. The power to detect an association increases with the total number of subjects. Power is shown when MAF = 0.02 (unbroken line) and MAF = 0.05 (broken line), and when the relative risks (RRs) = 1.2, 1.4, and 1.6.

Future GWAS, using either sequencing or the new, denser arrays, will provide genotypes for SNPs with low MAF. For example, the majority of SNPs on the Omni2.5 array have a MAF below 0.10 (Wang et al. [34]). In our combined ATBC, CPSII, and PLCO samples, the median MAF was 0.08, and over 26%, 34%, and 44% of SNPs had a MAF below 0.01, 0.02, and 0.05, respectively (see Supporting Information for details). Among the 1,772,611 SNPs absent from the OmniExpress that passed QC, we refer to the 1,157,273 SNPs that could be imputed using only HM1G as the new content. All discussion will be limited to this subset of SNPs or “all” Omni2.5 SNPs, a combination of this subset and those SNPs on the OmniExpress. Restricted to this new content, the median MAF was 0.12, and over 13%, 19%, and 32% of SNPs had a MAF below 0.01, 0.02, and 0.05, respectively.

The average power, or the proportion of all associations likely to be detected in the study, can be estimated by averaging the power calculated for each SNP in the study. For this calculation (Equation (2)), assume that susceptibility SNPs have MAFs similar to those observed on the Omni2.5, and that RRs are similar to those observed for common variants in previous studies. Figure 2 shows that to attain an average power of 0.5, a study would need to include 32,000 individuals. However, if RRs for rare variants were larger than those observed for common variants, the required sample size would decrease (Fig. 2). Specifically, if RRs for SNPs with MAF <0.05 were increased by a factor as large as 1.4x, the required sample size would be 9,100 individuals, and if the RRs were increased by 1.8x, the required sample size would be further reduced to 4,400.

Figure 2. Power is averaged over all SNPs on the Omni2.5 (unbroken line) and the new SNPs on the Omni2.5 (broken line). When the relative risks (RRs) for rare variants are the same as those previously observed for common variants, power is illustrated by the purple lines (1.0x). When the RRs for SNPs with MAF < 0.05 are increased by up to 1.4x, the power is illustrated by red lines, and when increased by up to 1.8x, power is illustrated by orange lines.

IMPUTATION QUALITY: PRELUDE TO THE TWO-PLATFORM DESIGN

Among rare and uncommon SNPs, imputation quality increases with MAF. The median r² for Omni2.5 SNPs with MAF < 0.01, 0.01–0.02, and 0.02–0.05 was 0.00, 0.35, and 0.54, respectively, using only HM1G as a reference set (Fig. 3). The proportion of SNPs that performed poorly, measured as the proportion with an r² below 0.25, was 0.80, 0.40, and 0.23 for these same MAF groupings, respectively. Imputation quality noticeably improved when individuals in our own study population were also included in the reference set, especially when judging by the reduction of poorly imputed SNPs (Fig. 3). With 500 individuals from PSREF, the proportion of SNPs that performed poorly was lowered to 0.52, 0.12, and 0.04 for these MAF groupings. The largest improvement occurs for low-MAF SNPs. With 100 individuals from PSREF, we still see a significant decrease in the proportion of SNPs performing poorly. The converse does not hold: when at least 100 individuals from PSREF were in our reference set, also including HM1G provided minimal additional benefit (see Supporting Information). This asymmetry suggests that the benefit observed from using PSREF in addition to HM1G is larger than the benefit that would likely have occurred from expanding HM1G by an equivalent number of samples.

Figure 3. Imputation accuracy increases with MAF, as evidenced by the 25th, 50th, and 75th percentiles of the r² distribution for SNPs on the Omni2.5 array. Percentiles are illustrated when imputation uses the HapMap/1000G reference set and 500 (unbroken lines), 100 (short dash), or no individuals (long dash) from our population-specific reference set.

When considering the new SNPs on the Omni2.5, the 25th, 50th, and 75th percentiles of the r² values were 0.43, 0.76, and 0.91 with only HM1G as a reference set. However, by removing those SNPs with MAF < 0.02, effectively leaving only those SNPs where there is reasonable power to detect associations when the sample size is below 10,000, the quartiles increased to 0.54, 0.80, and 0.92. With 500 individuals from PSREF included in the reference set, the quartiles increased to 0.69, 0.92, and 0.97, and excluding those SNPs with MAF < 0.02, the quartiles increased to 0.80, 0.93, and 0.98. Figure 3 illustrates the extent to which adding individuals from PSREF increases the r² values.

TWO-PLATFORM DESIGN

In the two-platform design, we genotype only a proportion, π, of the n individuals on the denser platform and the remaining individuals on the existing GWAS platform designed for common variants. If we can impute a SNP with an accuracy of r² after training our imputation procedure on inline image individuals, our effective sample size is shown to be inline image (Equation (A12)). Although the effective sample size is mathematically convenient, the more relevant quantity is the relative power, or the rPD, comparing the percentage of associations discovered by the two-platform design and the percentage discovered by the one-platform design using only the denser array. Figure 4 compares the rPD achievable when using only HM1G for reference in imputation with the rPD from including HM1G and 500 individuals from PSREF.
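To make the distinction between effective sample size and rPD concrete, here is a hedged sketch. Equation (A12) is not reproduced in this text, so the code assumes a commonly used approximation: directly genotyped individuals count fully and imputed individuals count as a fraction r² of an individual, giving an effective fraction π + (1 − π)r². It further assumes the noncentrality parameter scales linearly with sample size. Both assumptions, and all function names, are introduced here for illustration and may differ from the authors' derivation.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def power_1df(eta, alpha=1e-7):
    """Power of a 1-df chi-square test with noncentrality eta."""
    z = -_STD_NORMAL.inv_cdf(alpha / 2.0)
    shift = sqrt(eta)
    return _STD_NORMAL.cdf(shift - z) + _STD_NORMAL.cdf(-shift - z)


def eta_for_power(target, alpha=1e-7):
    """Noncentrality giving approximately the target power.

    Ignores the far-tail term of the power, which is smaller than alpha / 2.
    """
    if not alpha < target < 1.0:
        raise ValueError("target power must lie strictly between alpha and 1")
    z = -_STD_NORMAL.inv_cdf(alpha / 2.0)
    return (z + _STD_NORMAL.inv_cdf(target)) ** 2


def effective_fraction(pi, r2):
    """ASSUMED form: genotyped individuals count fully, imputed ones count r2."""
    return pi + (1.0 - pi) * r2


def relative_proportion_discovered(eta_full, pi, r2, alpha=1e-7):
    """Power of the two-platform design relative to genotyping everyone."""
    eta_two_platform = eta_full * effective_fraction(pi, r2)
    return power_1df(eta_two_platform, alpha) / power_1df(eta_full, alpha)


# Same pi and r2, three values of the maximum power M.
for max_power in (0.2, 0.5, 0.8):
    eta = eta_for_power(max_power)
    print(max_power, relative_proportion_discovered(eta, pi=0.01, r2=0.8))
```

Under these assumptions the effective fraction is identical in all three runs, yet the printed ratio rises with M, which is one way to see why rPD, unlike effective sample size, depends on the power of the fully genotyped study.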

Figure 4. The relative proportion (rPD) of associated SNPs from the Omni2.5 that are discovered by the two-platform experiment. The rPD is illustrated when the maximum power, M, to detect an association is 0.2, 0.5, and 0.8, and when imputation uses the HapMap/1000G reference set and 500 (unbroken lines) or no individuals (long dash) from a population-specific, but external, imputation training set.

As the proportion, π, of subjects directly genotyped increases, rPD increases very rapidly, reaches a point of inflection, and then increases more slowly with a linear trend (Figs. 5, 6). The initial rapid increase can be attributed to the improvement in imputation quality, whereas the remaining linear increase can be attributed to directly genotyping more individuals. Without using HM1G, rPD is necessarily 0 when π = 0, making the initial rapid increase more dramatic. However, even when including HM1G, genotyping 100 individuals on the larger platform would raise the rPD from 0.28 to 0.43, 0.45 to 0.62, and 0.75 to 0.88 when the RR = 1.2, 1.4, and 1.6, respectively, for an SNP with MAF = 0.04 (Fig. 5). Genotyping a subset of individuals has similar benefits for SNPs with MAF = 0.02. However, as imputation is less accurate for SNPs with low MAF, the overall increase in rPD is slower. For example, for SNPs with MAF = 0.02 and RR = 1.6, rPD does not exceed 0.88 until inline image. Figure 4 illustrates that rPD, in addition to being sensitive to MAF, increases with power, or RR, as well, unlike the effective sample size defined by inline image.

Figure 5. The relative proportion discovered (rPD) increases with the proportion, π, of individuals directly genotyped. The rPD curves are illustrated when the relative risks (RRs) are 1.2, 1.4, and 1.6, and when MAF is 0.04 (unbroken line) and 0.02 (broken lines). Sample size is fixed at n=10,000 and both HM1G and the appropriate individuals from PSREF are used as reference sets for imputation.

Figure 6. Power is averaged over all SNPs on the Omni2.5. When the relative risks (RRs) for rare variants are the same as those previously observed for common variants, rPD is illustrated by the purple lines (1.0x). When the RRs for SNPs with MAF < 0.05 are increased by up to 1.4x and 1.8x, the rPD is illustrated by the red and orange lines, respectively. Unbroken lines illustrate rPD when both HM1G and PSREF are used as reference sets, while the dashed line illustrates rPD using only PSREF. Note the near-perfect overlap when inline image.

The average rPD, or the relative proportion discovered, among all SNPs on the Omni2.5, is always high when the RRs attributable to rare variants are similar to those observed for common variants. In our example, Figure 6 and Table 1 show that 85% of the associations that would be detected by directly genotyping all 10,000 individuals would still be detected with a two-platform approach that directly genotypes only 100 individuals. As the ancestry of our study subjects is well represented by HM1G, the rPD is 0.79 when only this public dataset is used as the imputation reference set. Even without any imputation, the minimum rPD, occurring with π = 0, is above 0 in our OmniExpress/Omni2.5 study because approximately 33% of all SNPs will be genotyped on all subjects (Fig. 6). These results, as compared to those in Figures 4 or 5, downplay the potential improvement from imputation because the average rPD is upweighted by this subset of SNPs genotyped on everyone. When the RRs for SNPs with MAF < 0.05 are increased by up to 1.8x, the rPD drops to 0.65 using only HM1G and to 0.73 using HM1G with 100 individuals from PSREF (Table 1). When low-frequency variants have higher RR, they become detectable when all 10,000 individuals are genotyped. Therefore, in absolute terms, increasing RR increases the total percentage of associations detected (Table 1), but as imputation is less accurate for these low-MAF variants, studies that have directly genotyped only inline image individuals perform comparatively poorly.

Table 1. The relative proportion (and absolute proportion) of SNPs discovered by the two-platform design for three scenarios. When the RRs for rare variants are the same as those previously observed for common variants, we consider the relationship between RR and MAF to be “none.” When the RRs for SNPs with MAF < 0.05 are increased by up to 1.4x, the relationship is considered “mild,” and when increased by up to 1.8x, the relationship is “large.” Columns indicate the proportion of individuals genotyped on the larger platform. Both HM1G and PSREF were used as reference sets for imputation.

  RR vs. MAF   Proportion of individuals genotyped
               0.00          0.01          0.50          1.0
  None         0.79 (0.14)   0.85 (0.15)   0.94 (0.16)   1 (0.17)
  Mild         0.77 (0.14)   0.82 (0.13)   0.92 (0.17)   1 (0.19)
  Large        0.65 (0.19)   0.73 (0.23)   0.88 (0.26)   1 (0.29)

When the funds available for genotyping are fixed, power is a function of the proportion, inline image, of those funds used to genotype individuals on the denser array. Here, the effective sample size is inline image, a function of the maximum number, inline image, of individuals who could be genotyped on the larger array and the relative cost, RC, of the larger array as compared to the smaller array (i.e., RC > 1). For a case/control study of up to 10,000 individuals using the Omni2.5 and OmniExpress, spending inline image of the funds on the Omni2.5 array promises at least 99% of the maximum power possible when RC > 1.5 (Supporting Information).

DISCUSSION

For the next generation of GWAS, large studies will be needed to discover uncommon susceptibility SNPs with MAFs between 1% and 10%, straining both budgets and the practical collection of samples. In this analysis, we explore how the cost of such studies can be reduced by avoiding genotyping a large number of participants with more expensive technologies. The least expensive approach would be to genotype subjects only on one of the common SNP arrays. In studies that have already performed GWAS for common variants, reliance on imputation alone would require no additional genotyping. However, we show that employing a two-platform approach, which additionally genotypes a small proportion of the study population on the larger array or with next generation sequencing, will improve our power to detect associations. Although impressive, the gains illustrated in our example are likely a lower bound, as the ancestry of individuals in PLCO, CPSII, and ATBC is well represented in HM1G. Had we focused on individuals from a less common ancestry, imputation quality would be reduced and we could expect larger gains from a two-platform design. When the second platform is next generation sequencing, the power gains could also be larger, as sequencing is likely to capture variants too rare to be included in HM1G. Such gains might be magnified by sequencing a larger number of cases and potentially identifying more associated rare variants. However, performing analyses for studies where cases were preferentially sequenced requires care to avoid biasing results (Li and Leal [40]).

In this analysis, we have shown that the two-platform approach performs better, or the inline image is generally higher, when the RRs attributed to low-frequency SNPs are similar to those observed for common SNPs. If the RRs prove to be significantly larger, SNPs with MAF below 0.02 would be detectable in the one-platform study, but a sizeable proportion of these SNPs may be imputed poorly, at least when imputation is based on 500 or fewer individuals and/or HM1G. In the near future, it is likely that databases and large-scale next generation sequencing efforts will provide more precise estimates of MAFs between 0.1% and 3%, but imputation accuracy will remain subject to issues in population genetics and ascertainment bias in reference populations.

We limited our study to imputation accuracy and inline image for reference samples based on 500 or fewer individuals. One immediate consequence is that we were unable to determine empirically whether imputation accuracy would continue to improve for poorly imputed SNPs if sample size continued to increase. Another consequence is that we underestimate inline image for large π by not allowing imputation accuracy to improve past inline image. We suspect that the bias is minimal as only a small proportion of the detectable SNPs in our example studies had low r². We also underestimated inline image for low π by not accounting for the linkage disequilibrium between the new content and the OmniExpress content. Presumably, some associations may be detected using either set of SNPs. In our study, we also presumed that the Omni2.5 identified the true alleles. Without this assumption, the improvement from adding individuals from PSREF may only show that we had improved our ability to impute the allele reported by the Omni2.5, as opposed to the actual allele.

We have presented an analysis based on currently available arrays (e.g., OmniExpress and Omni2.5), but our results can be generalized to other genotyping platforms and eventually next generation sequencing studies once the quality of calling algorithms has stabilized. The equations we derived relating effective sample size with both π and inline image are general, as are our conclusions about the dependence of inline image on sample size and RR, and the dependence of the two-platform method on the general properties of the associations. Furthermore, if the SNPs in the new content are representative of low-frequency SNPs (MAF between 0.01 and 0.1), particularly with respect to the estimates of linkage disequilibrium observed with the OmniExpress array, then the results should be applicable to denser association studies, including both SNP arrays and next generation sequencing.

ACKNOWLEDGMENTS

This study utilized the high-performance computational capabilities of the Biowulf Linux cluster at the National Institutes of Health, Bethesda, Maryland (http://biowulf.nih.gov).

References

  • 1
    Chung CC, Magalhaes WCS, Gonzalez-Bosquet J, Chanock SJ. 2010. Genome-wide association studies in cancer-current and future directions. Carcinogenesis 31:111–120.
  • 2
    Zeggini E, Scott LJ, Saxena R, Voight BF, Marchini JL, Hu T, de Bakker PI, Abecasis GR, Almgren P, Andersen G, Ardlie K, Bostrom KB, Bergman RN, Bonnycastle LL, Borch-Johnsen K, Burtt NP, Chen H, Chines PS, Daly MJ, Deodhar P, Ding C-J, Doney ASF, Duren WL, Elliott KS, Erdos MR, Frayling TM, Freathy RM, Gianniny L, Grallert H, Grarup N, Groves CJ, Guiducci C, Hansen T, Herder C, Hitman GA, Hughes TE, Isomaa B, Jackson AU, Jorgensen T, Kong A, Kubalanza K, Kuruvilla FG, Kuusisto J, Langenberg C, Lango H, Lauritzen T, Li Y, Lindgren CM, Lyssenko V, Marvelle AF, Meisinger C, Midthjell K, Mohlke KL, Morken MA, Morris AD, Narisu N, Nilsson P, Owen KR, Palmer CNA, Payne F, Perry JRB, Pettersen E, Platou C, Prokopenko I, Qi L, Qin L, Rayner NW, Rees M, Roix JJ, Sandbaek A, Shields B, Sjogren M, Steinthorsdottir V, Stringham HM, Swift AJ, Thorleifsson G, Thorsteinsdottir U, Timpson NJ, Tuomi T, Tuomilehto J, Walker M, Watanabe RM, Weedon MN, Willer CJ, Illig T, Hveem K, Hu FB, Laakso M, Stefansson K, Pedersen O, Wareham NJ, Barroso I, Hattersley AT, Collins FS, Groop L, McCarthy MI, Boehnke M, Altshuler D. 2008. Meta-analysis of genome-wide association data and large-scale replication identifies additional susceptibility loci for type 2 diabetes. Nat Genet 40:638–645.
  • 3
    Clarke R. 2007. Cholesterol fractions and apolipoproteins as risk factors for heart disease mortality in older men. Arch Intern Med 167:13731378.
  • 4
    Stahl EA, Raychaudhuri S, Remmers EF, Xie G, Eyre S, Thomson BP, Li Y, Kurreeman FAS, Zhernakova A, Hinks A, Guiducci C, Chen R, Alfredsson L, Amos CI, Ardlie KG, Barton A, Bowes J, Brouwer E, Burtt NP, Catanese JJ, Coblyn J, Coenen MJH, Costenbader KH, Criswell LA, Crusius JBA, Cui J, de Bakker PIW, De Jager PL, Ding B, Emery P, Flynn E, Harrison P, Hocking LJ, Huizinga TWJ, Kastner DL, Ke X, Lee AT, Liu X, Martin P, Morgan AW, Padyukov L, Posthumus MD, Radstake TRDJ, Reid DM, Seielstad M, Seldin MF, Shadick NA, Steer S, Tak PP, Thomson W, van der Helm-van Mil AHM, van der-Horst Bruinsma IE, van der Schoot CE, van Riel PLCM, Weinblatt ME, Wilson AG, Wolbink GJ, Wordsworth BP, Wijmenga C, Karlson EW, Toes REM, de Vries N, Begovich AB, Worthington J, Siminovitch KA, Gregersen PK, Klareskog L, Plenge R. 2010. Genome-wide association study meta-analysis identifies seven new rheumatoid arthritis risk loci. Nat Genet 42:508514.
  • 5
    De Jager PL, Jia X, Wang J, de Bakker PIW, Ottoboni L, Aggarwal NT, Piccio L, Raychaudhuri S, Tran D, Aubin C, Briskin R, Romano S, Baranzini SE, McCauley JL, Pericak-Vance MA, Haines JL, Gibson RA, Naeglin Y, Uitdehaag B, Matthews PM, Kappos L, Polman C, McArdle WL, Strachan DP, Evans D, Cross AH, Daly MJ, Compston A, Sawcer SJ, Weiner HL, Hauser SL, Hafler DA, Oksenberg JR. 2009. Meta-analysis of genome scans and replication identify cd6, irf8 and tnfrsf1a as new multiple sclerosis susceptibility loci. Nat Genet 41:776782.
  • 6
    Lango Allen H, Estrada K, Lettre G, Berndt SI, Weedon MN, Rivadeneira F, Willer CJ, Jackson AU, Visscher PM, Chatterjee N, Loos RJF, Boehnke M, McCarthy MI, Ingelsson E, Lindgren CM, Abecasis GR, Stefansson K, Frayling TM, Hirschhorn JN. 2010. Hundreds of variants clustered in genomic loci and biological pathways affect human height. Nature 467:832838.
  • 7
    Biernacka J, Tang R, Li J, McDonnell S, Rabe K, Sinnwell J, Rider D, de Andrade M, Goode E, Fridley B. 2009. Assessment of genotype imputation methods. BMC Proceedings 3:S5.
  • 8
    Li Y, Willer C, Sanna S, Abecasis G. 2009. Genotype imputation. Annu Rev Genom Hum Genet 10:387406.
  • 9
    Pei Y, Li J, Zhang L, Papasian C, Deng H. 2008. Analyses and comparison of accuracy of different genotype imputation methods. PLoS ONE 3:e3551.
  • 10
    Servin B, Stephens M. 2007. Imputation-based analysis of association studies: candidate regions and quantitative traits. PLoS Genet 3:e114.
  • 11
    Zhao Z. 2008. Imputation of missing genotypes: an empirical evaluation of impute. BMC Genet 9:85.
  • 12
    Scheet P, Stephens M. 2006. A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am J Hum Genet 78:629644.
  • 13
    Li Y, Willer CJ, Ding J, Scheet P, Abecasis GR. 2010. Mach: using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genet Epidemiol 34:816834.
  • 14
    Li N, Stephens, M. 2003. Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data. Genetics 165:22132233.
  • 15
    Mott R, Talbot CJ, Turri MG, Collins AC, Flint J. 2000. A method for fine mapping quantitative trait loci in outbred animal stocks. Proc Natl Acad Sci USA 97:1264912654.
  • 16
    Howie BN, Donnelly P, Marchini J. 2009. A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet 5:e1000529.
  • 17
    Hao K, Chudin E, McElwee J, Schadt E. 2009. Accuracy of genome-wide imputation of untyped markers and impacts on statistical power for association studies. BMC Genet 10:27.
  • 18
    Browning SR. 2006. Multilocus association mapping using variable-length markov chains. Am J Hum Genet 78:903913.
  • 19
    Anderson C. 2008. Evaluating the effects of imputation on the power, coverage, and cost efficiency of genome-wide SNP platforms. Am J Hum Genet 83:112119.
  • 20
    Daly MJ, Rioux JD, Schaffner SF, Hudson TJ, Lander ES. 2001. High-resolution haplotype structure in the human genome. Nat Genet 29:229232.
  • 21
    Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira M, Bender D, Maller J, Sklar P, de Bakker P, Daly M, Sham PC. 2007. Plink: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 81:559575.
  • 22
    1000Genomes,. 2010. A map of human genome variation from population-scale sequencing. Nature 467:10611073.
  • 23
    Consortium TIH. 2010. Integrating common and rare genetic variation in diverse human populations. Nature 467:5258.
  • 24
    Metzker ML. 2010. Sequencing technologies---the next generation. Nat Rev Genet 11:3146.
  • 25
    Coventry A, Bull-Otterson LM, Liu X, Clark AG, Maxwell TJ, Crosby J, Hixson JE, Rea TJ, Muzny DM, Lewis LR, Wheeler DA, Sabo A, Lusk C, Weiss KG, Akbar H, Cree A, Hawes AC, Newsham I, Varghese RT, Villasana D, Gross S, Joshi V, Santibanez J, Morgan M, Chang K, Walker IV H, Templeton AR, Boerwinkle E, Gibbs R, Sing CF. 2010. Deep resequencing reveals excess rare recent variants consistent with explosive population growth. Nat Commun 1:131.
  • 26
    Spencer C, Hechter E, Vukcevic D, Donnelly P. 2011. Quantifying the underestimation of relative risks from genome-wide association studies. PLoS Genet 7:e1001337.
  • 27
    Crawford DC, Dilks HH. 2011. Strategies for Genotyping. (John Wiley & Sons, Inc.).
  • 28
    Fridley BL, Jenkins G, Deyo-Svendsen ME, Hebbring S, Freimuth R. 2010. Utilizing genotype imputation for the augmentation of sequence data. PLoS ONE 5:e11018.
  • 29
    Holm H, Gudbjartsson DF, Sulem P, Masson G, Helgadottir HT, Zanon C, Magnusson OT, Helgason A, Saemundsdottir J, Gylfason A, Stefansdottir H, Gretarsdottir S, Matthiasson SE, Thorgeirsson G, Jonasdottir A, Sigurdsson A, Stefansson H, Werge T, Rafnar T, Kiemeney LA, Parvez B, Muhammad R, Roden DM, Darbar D, Thorleifsson G, Walters GB, Kong A, Thorsteinsdottir U, Arnar DO, Stefansson K. 2011. A rare variant in myh6 is associated with high risk of sick sinus syndrome. Nat Genet 43:316320.
  • 30
    Zeggini E. 2011. Next-generation association studies for complex traits. Nat Genet 43:287288.
  • 31
    Albanes D, Heinonen O, Huttunen J, Taylor P, Virtamo J, Edwards B, Haapakoski J, Rautalahti M, Hartman A, Palmgren J. 1995. Effects of alpha-tocopherol and beta-carotene supplements on cancer incidence in the alpha-tocopherol beta-carotene cancer prevention study. Am J Clin Nutr 62:1427S1430S.
  • 32
    Thomas G, Jacobs KB, Yeager M, Kraft P, Wacholder S, Orr N, Yu K, Chatterjee N, Welch R, Hutchinson A, Crenshaw A, Cancel-Tassin G, Staats BJ, Wang Z, Gonzalez-Bosquet J, Fang J, Deng X, Berndt SI, Calle EE, Feigelson HS, Thun MJ, Rodriguez C, Albanes D, Virtamo J, Weinstein S, Schumacher FR, Giovannucci E, Willett WC, Cussenot O, Valeri A, Andriole GL, Crawford ED, Tucker M, Gerhard DS, Fraumeni JF, Hoover R, Hayes RB, Hunter DJ, Chanock SJ. 2008. Multiple loci identified in a genome-wide association study of prostate cancer. Nat Genet 40:310315.
  • 33
    Calle EE, Rodriguez C, Jacobs EJ, Almon ML, Chao A, McCullough ML, Feigelson HS, Thun MJ. 2002. The American Cancer Society cancer prevention study ii nutrition cohort. Cancer 94:24902501.
  • 34
    Wang Z, Jacobs K, Yeager M, Hutchinson A, Sampson J, Chatterjee N, Albanes D, Berndt SI, Diver RW, Gapstur S, Teras L, Haiman CA, Henderson BE, Stram D, Hsing AS, Purdue M, Taylor P, Tucker M, Chanock S. 2011. Improved imputation of common and uncommon single nucleotide polymorphisms (snps) with a new reference set. Nature Precedings 44.
  • 35
    Terwilliger JD, Hiekkalinna T. 2006. An utter refutation of the ‘fundamental theorem of the hapmap’. Eur J Hum Genet 14:426437.
  • 36
    Balding D. 2006. A tutorial on statistical methods for population association studies. Nat Rev Genet 7:78191.
  • 37
    Prorok PC, Andriole GL, Bresalier RS, Buys SS, Chia D, Crawford ED, Fogel R, Gelmann EP, Gilbert F, Hasson MA, Hayes RB, Johnson CC, Mandel JS, Oberman A, O'Brien B, Oken MM, Rafia S, Reding D, Rutt W, Weissfeld JL, Yokochi L, Gohagan JK. 2000. Design of the prostate, lung, colorectal and ovarian (plco) cancer screening trial. Controlled Clinical Trials 21:273S309S.
  • 38
    Gohagan JK, Prorok PC, Hayes RB, Kramer B-S. 2000. The prostate, lung, colorectal and ovarian (plco) cancer screening trial of the national cancer institute: history, organization, and status. Control Clin Trials 21:251S272S.
  • 39
    Park J-H, Wacholder S, Gail MH, Peters U, Jacobs KB, Chanock SJ, Chatterjee N. 2010. Estimation of effect size distribution from genome-wide association studies and implications for future discoveries. Nat Genet 42:570575.
  • 40
    Li B, Leal SM. 2009. Discovery of rare variants via sequencing: implications for the design of complex trait association studies. PLoS Genet 5:e1000481.

Appendix


The score statistic, S, is defined by inline image, where

  • display math(A1)
  • display math(A2)

As we will only discuss a single SNP, the subscript “j” is dropped from all notation. In our case, the score statistic is nothing more than the number of subjects, n, multiplied by the square of the estimated correlation between inline image and inline image.

  • display math(A3)

Here inline image can refer to the true genotypes, imputed genotypes, or some combination of the two. For the two-platform design, when the SNP is included on the OmniExpress, inline image, otherwise inline image is the imputed value. The following asymptotic results for inline image are independent of the calculation used to estimate inline image. The asymptotic results rely on the following three assumptions:

  • (A1)
    The SNPs used for imputation are independent of the outcome, when inline image is known: inline image.
  • (A2)
    Locus j has an additive effect on the outcome: inline image, where k1 and k2 are constants.
  • (A3)
    The expectation of the imputed genotype is the minor allele frequency at SNP j: inline image.
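
As a concrete illustration of the score statistic described above (the number of subjects multiplied by the square of the estimated correlation between the outcome and the genotype value), the following minimal Python sketch may be helpful. The function names and the toy data are hypothetical and are not taken from the study. The genotype argument can hold true genotypes, imputed dosages, or a mixture of the two.

```python
import math


def pearson_r(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def score_statistic(outcome, genotype):
    """n times the squared estimated correlation between outcome and genotype."""
    r = pearson_r(outcome, genotype)
    return len(outcome) * r * r


# Hypothetical toy data: case/control status and genotype dosages for one SNP.
status = [1, 1, 1, 0, 0, 0, 1, 0]
dosage = [2.0, 1.0, 1.0, 0.0, 0.0, 1.0, 2.0, 0.0]
s = score_statistic(status, dosage)
```

Because the statistic depends on the genotype values only through their correlation with the outcome, the same function applies unchanged whether a SNP was genotyped or imputed.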

Theorem 1. Let assumptions A1–A3 hold. Then under the null hypothesis where inline image

  • display math(A4)

Under the alternative, conditional on inline image,

  • display math(A5)

where the noncentrality parameter, η, is defined by

  • display math(A6)

Theorem 2. Let assumptions A1–A3 hold and define inline image, and inline image. Then

  • display math(A7)

With the assumptions, this shows that we can achieve our original power, η1, if we increase the sample size by inline image, or equivalently, the effective sample size of a study based on the imputed genotype is inline image. The proof for Theorem 2 is straightforward.

Proof 1. Let inline image, and inline image be the variances of inline image, and Y, respectively. Then we can use the fact that under assumptions (A1), (A2), and (A3)

  • display math(A8)
  • display math(A9)

Therefore,

  • display math(A10)

and

  • display math(A11)
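
The sample-size interpretation of Theorem 2 can be sketched numerically. The exact factor is not recoverable from the text here, so the sketch below assumes the widely used approximation that the noncentrality parameter scales with r2, the squared correlation between the imputed and true genotypes. Under that assumption, a study of n subjects behaves like one of n × r2 subjects, and the sample size must grow by a factor of 1/r2 to keep the original power. The function names and numbers are hypothetical.

```python
def effective_sample_size(n, r2):
    """Effective sample size under the ASSUMED rule n_eff = n * r2.

    r2 is the squared correlation between imputed and true genotypes.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 < r2 <= 1.0:
        raise ValueError("r2 must lie in (0, 1]")
    return n * r2


def required_sample_size(n_target, r2):
    """Sample size needed with imputed genotypes to match the power of
    n_target subjects with true genotypes, under the same assumption."""
    if n_target <= 0:
        raise ValueError("n_target must be positive")
    if not 0.0 < r2 <= 1.0:
        raise ValueError("r2 must lie in (0, 1]")
    return n_target / r2


# Hypothetical example: 10,000 subjects, imputation r2 of 0.8.
n_eff = effective_sample_size(10000, 0.8)
n_needed = required_sample_size(10000, 0.8)
```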

Theorem 3. Let assumptions A1–A3 hold and, in addition, let inline image, where inline image is a set of covariates for individual i (i.e., the SNPs used for imputation). Define inline image if inline image and inline image if inline image. Furthermore, define inline image, and inline image. Then

  • display math(A12)

With the assumptions, this shows that we can achieve our original power, η1, if we increase the sample size by inline image, or equivalently, the effective sample size of the two-platform study that genotypes a fraction, π, of individuals on the larger array is inline image. The proof follows.

Proof 2.
  • display math(A13)
  • display math(A14)

Without loss of generality, assume that inline image. Therefore, the score statistic is

  • display math(A15)
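
The effective sample size of the two-platform design in Theorem 3 can be explored with a small simulation. The sketch below makes two assumptions that are not confirmed by the text: that the imputed value equals the conditional expectation of the true genotype given the covariates, and that the fraction π of individuals genotyped on the larger array is chosen at random. Under those assumptions, the squared correlation between the mixed genotype (true for a fraction π, imputed otherwise) and the true genotype works out to π + (1 − π) × r2, where r2 is the squared correlation between imputed and true genotypes. The data-generating model, function names, and parameter values are all hypothetical.

```python
import random


def squared_corr(x, y):
    """Squared sample Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)


def two_platform_effective_n(n, pi, r2):
    """ASSUMED closed form: n * (pi + (1 - pi) * r2)."""
    if not (0.0 <= pi <= 1.0 and 0.0 <= r2 <= 1.0):
        raise ValueError("pi and r2 must lie in [0, 1]")
    return n * (pi + (1.0 - pi) * r2)


def simulate_two_platform(pi, n_sim=20000, seed=1):
    """Return (r2 of imputed vs. true, r2 of mixed vs. true) from a toy model."""
    rng = random.Random(seed)
    true_g, imputed, mixed = [], [], []
    for _ in range(n_sim):
        # Hypothetical covariate: a background-specific allele frequency.
        p = rng.choice([0.05, 0.30])
        g = float((rng.random() < p) + (rng.random() < p))  # true genotype 0/1/2
        g_hat = 2.0 * p  # imputed value = E[G | covariate] in this toy model
        true_g.append(g)
        imputed.append(g_hat)
        # A random fraction pi is genotyped on the larger array.
        mixed.append(g if rng.random() < pi else g_hat)
    return squared_corr(imputed, true_g), squared_corr(mixed, true_g)


r2_imputed, r2_mixed = simulate_two_platform(pi=0.2)
predicted = 0.2 + (1.0 - 0.2) * r2_imputed
```

In this toy model `r2_mixed` and `predicted` should agree up to sampling error. The agreement depends on the conditional-expectation assumption and need not hold for other imputation schemes.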

Supporting Information


Disclaimer: Supplementary materials have been peer-reviewed but not copyedited.

Filename | Format | Size | Description
gepi21634-sup-0001-figures1.pdf | | 54K | Supplemental Material
