SAMPLES
The set of 756 individuals was genotyped as part of the imputation dataset reported elsewhere (Wang et al. [34]). For this analysis, we included 319, 214, and 223 cancer-free controls, all above 55 years old and from a European background, selected from three prospective cohorts: the PLCO, CPSII, and ATBC studies, respectively. Details about the study populations have been published previously (Albanes et al. [31], Prorok et al. [37], Gohagan et al. [38], Calle et al. [33]). MAF distributions for SNPs on the OmniExpress, only on the Omni2.5, and on both were estimated from all 756 individuals.
GENOTYPING
Genotypes were collected on Illumina's Omni2.5M array and called using the Gentrain2 algorithm within Illumina GenomeStudio. An established quality control (QC) process was applied to samples by study to ensure that only high-quality genotypes were retained for the analytic dataset (Wang et al. [34]). Briefly, QC metrics included completion rates by sample or locus, sample heterozygosity rate, and duplicate concordance rate; standard thresholds for exclusion were then applied. Among the 170 duplicates that were included in the analysis (59, 63, and 48 for ATBC, CPSII, and PLCO, respectively), the average concordance rate was greater than 99.9%.
The Omni2.5 array contained
qualifying SNPs, of which
were included on the OmniExpress and
were unique to the Omni2.5. However, we only considered the subset of new SNPs that could be imputed using the CEU HapMap/1000 Genome reference set downloaded from IMPUTE2 (http://mathgen.stats.ox.ac.uk/impute/data_download_1000G_pilot_plus_hapmap3.html). We will refer to this reference set, which was built from the 1000 Genomes project June 2010 release and HapMap 3 February 2009 release, as the HapMap/1000 Genome reference set, HM1G. This restriction, which reduced
to 1,157,273, prevents underestimating the percentage of associations that can be discovered. Without it, monomorphic SNPs can, because of genotyping error, be counted as low-MAF, difficult-to-detect SNPs. For the discussion on power, we label the SNPs by
, where
if it is included on the OmniExpress.
IMPUTATION
The test set for assessing the accuracy of imputation included
= 256 individuals, with 108, 72, and 76 randomly selected individuals from the PLCO, CPSII, and ATBC samples, respectively. For individuals within this test set, IMPUTE2 (Howie et al. [16]) was used to impute their genotypes in the new content from their genotypes in the OmniExpress content and from a fully genotyped training set. Training sets contained a total of
= 50, 100, 200, 300, 400, or 500 distinct individuals, with approximately 42%, 28%, and 30% contributed by PLCO, CPSII, and ATBC, respectively. We refer to the full set of 500 individuals as our population-specific reference set, PSREF; it was included as an unphased reference set. Except where noted, we also used HM1G as a reference.
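The sample split described above can be sketched as follows. This is a minimal illustration, not the procedure actually used: the sample IDs are hypothetical, and the sketch assumes that the training sets are nested subsets of the 500 individuals left over after the test set is drawn, which the text does not state.

```python
import random

# Cohort sizes and per-cohort test-set counts as reported in the text.
COHORT_SIZES = {"PLCO": 319, "CPSII": 214, "ATBC": 223}
TEST_COUNTS = {"PLCO": 108, "CPSII": 72, "ATBC": 76}
TRAINING_SIZES = [50, 100, 200, 300, 400, 500]


def split_samples(seed=0):
    """Draw a cohort-stratified test set and training sets of increasing size."""
    rng = random.Random(seed)
    test, pool = [], []
    for cohort, size in COHORT_SIZES.items():
        # Hypothetical sample IDs, used only for illustration.
        ids = [f"{cohort}_{i}" for i in range(size)]
        chosen = set(rng.sample(ids, TEST_COUNTS[cohort]))
        test.extend(sorted(chosen))
        pool.extend(s for s in ids if s not in chosen)
    # Assumption: training sets are nested prefixes of one shuffled pool, so
    # cohort proportions match the pool only approximately for small sizes.
    rng.shuffle(pool)
    training = {n: pool[:n] for n in TRAINING_SIZES}
    return test, training
```

Because 756 minus 256 leaves exactly 500 individuals, the largest training set in this sketch coincides with the full remaining pool.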
POWER CALCULATIONS
We calculated the power, or the proportion of associations likely to be detected, assuming a case/control study with individuals divided equally between the two groups. We further assumed that associations would be tested by the score statistic, $T$, at an α-level of $10^{-7}$. The MAF of the tested SNPs was either fixed or assumed to follow a distribution similar to that found on the Omni2.5. To estimate the distribution of relative risks (RRs) for common variants, we extrapolated from the effect sizes (ES) reported by Park and others (Park et al. [39]). To accommodate the hypothesis that the RR should be higher for rare SNPs, when MAF < 0.05 we multiplied the extrapolated values of RR by either 1 + 8(0.05 − MAF) or 1 + 16(0.05 − MAF), so that the maximum multipliers were 1.4 and 1.8, attained at MAF = 0. To determine the efficiency of the two-stage design, we focus on and report the relative power, or the relative proportion discovered (rPD). rPD is defined as the percentage of associations discovered by the two-stage design relative to the percentage discovered by the one-stage design using only the denser array. Specific details follow.
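The rare-variant adjustment of the RR described above can be written as a short function. The function and parameter names are illustrative choices rather than names from the original analysis; `slope` stands for the factor 8 or 16.

```python
def rare_variant_multiplier(maf, slope):
    """Multiplier applied to the extrapolated RR when MAF < 0.05.

    slope is 8 or 16 in the text; SNPs with MAF >= 0.05 are left unchanged.
    """
    if not 0.0 <= maf <= 0.5:
        raise ValueError("MAF must lie in [0, 0.5]")
    if maf < 0.05:
        return 1.0 + slope * (0.05 - maf)
    return 1.0


def adjusted_rr(rr, maf, slope):
    """Extrapolated RR scaled by the rare-variant multiplier."""
    return rr * rare_variant_multiplier(maf, slope)
```

At MAF = 0 the multiplier reaches its maximum of 1 + 0.05 × slope, which gives the 1.4 and 1.8 quoted in the text.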
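Disease risk was assumed to follow a logistic model, $\operatorname{logit}(\mu_i) = \gamma_0 + \gamma_1 G_{ij}$, where $\mu_i$ is the expected value of $Y_i$, an indicator variable for disease status, $G_{ij}$ is the number of minor alleles at the observed locus j, and γ1 is the log(OR). Disease prevalence was fixed at 0.1. Under the null hypothesis, the score statistic $T$ is assumed to follow a central chi-square distribution with 1 df: $T \sim \chi^2_1$. Under the alternative, $T$ follows a noncentral chi-square distribution with 1 df and noncentrality parameter η: $T \sim \chi^2_1(\eta)$. Therefore, power, β, is defined as $\beta = P\left(T > \chi^2_{1,\,1-\alpha} \mid \eta\right)$, where $\chi^2_{1,\,1-\alpha}$ is the $1-\alpha$ quantile of the central chi-square distribution with 1 df. Derivation of η is reserved for the Appendix, but in general, η is a function of n, MAF, RR, π, and (when SNP j is not included on the OmniExpress) the accuracy of its imputation, where π is the proportion of individuals genotyped on the denser platform. The parameter, η, will be written with its arguments made explicit, for example $\eta(n, \mathrm{MAF}, \mathrm{RR}, \pi)$, when appropriate. To fix the power at M, we effectively choose any combination of RR and n such that $\beta = M$.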
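The power of a 1-df test with a given noncentrality parameter can be computed with the standard library alone. The sketch below takes η as an input, since its derivation is in the Appendix, and relies on the fact that a noncentral chi-square variable with 1 df can be written as (Z + √η)² for a standard normal Z. The function name is an illustrative choice.

```python
import math
from statistics import NormalDist


def power_1df(eta, alpha=1e-7):
    """P(T > c) for T ~ noncentral chi-square (1 df, noncentrality eta).

    c is the upper-alpha critical value of the central chi-square with 1 df,
    which equals the square of the two-sided normal critical value.
    """
    if eta < 0:
        raise ValueError("eta must be non-negative")
    z_crit = -NormalDist().inv_cdf(alpha / 2.0)  # sqrt of the critical value
    shift = math.sqrt(eta)
    # T = (Z + shift)^2 exceeds z_crit^2 when Z > z_crit - shift
    # or Z < -z_crit - shift; erfc gives accurate normal tail areas.
    upper = 0.5 * math.erfc((z_crit - shift) / math.sqrt(2.0))
    lower = 0.5 * math.erfc((z_crit + shift) / math.sqrt(2.0))
    return upper + lower
```

With η = 0 the function returns the α-level itself, and it increases with η, which is what makes it possible to fix the power at a target value by adjusting RR and n.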
To transform the ES reported by Park and others (Park et al. [39]) into RRs, we used the equality
and assumed MAF=0.10, resulting in a discrete distribution of RR with cumulative probabilities of 0.348, 0.553, 0.737, and 0.826 at RRs of 1.109, 1.134, 1.138, and 1.155. Sixteen more values were permitted such that the cumulative probabilities of 0.90 and 0.99 occurred at RRs of 1.20 and 1.44.
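To make the discrete RR distribution concrete, the cumulative probabilities quoted above can be differenced into point masses. Only the four reported support points are used here; the sixteen additional values are not listed in the text and are not filled in. The function name is illustrative.

```python
# Support points and cumulative probabilities as reported in the text.
REPORTED_RRS = [1.109, 1.134, 1.138, 1.155]
REPORTED_CUMULATIVE = [0.348, 0.553, 0.737, 0.826]


def masses_from_cumulative(cumulative):
    """Difference a non-decreasing cumulative sequence into point masses."""
    masses, previous = [], 0.0
    for value in cumulative:
        if value < previous:
            raise ValueError("cumulative probabilities must be non-decreasing")
        masses.append(value - previous)
        previous = value
    return masses


masses = masses_from_cumulative(REPORTED_CUMULATIVE)
# Probability left for the additional, larger RR values not listed here.
remaining = 1.0 - REPORTED_CUMULATIVE[-1]
```

Under these figures, the four reported RRs carry masses of about 0.348, 0.205, 0.184, and 0.089, leaving about 0.174 for the larger RR values.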
The average power, when using the one-platform design, for all SNPs on the Omni2.5 array can be obtained from
(2)
where integration was estimated numerically. This average, like all averages in the Results section, assumes that all SNPs are equally likely to be the causal SNP. This calculation was repeated for the new content by replacing
with
.
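One simple way to approximate such an average numerically is a weighted sum over a binned MAF distribution and the discrete RR distribution. The sketch below is an assumption-laden illustration, not the integration scheme actually used: `power_fn` is a hypothetical callable returning the power for a given MAF and RR, and the MAF and RR distributions are treated as independent, with any MAF-dependent adjustment of the RR applied inside `power_fn`.

```python
def average_power(power_fn, maf_bins, rr_support):
    """Average power over MAF bins and discrete RR values.

    maf_bins: list of (maf, weight) pairs, e.g. bin midpoints and SNP counts.
    rr_support: list of (rr, probability) pairs.
    Weights are normalized, so raw counts can be passed directly.
    """
    maf_total = sum(weight for _, weight in maf_bins)
    rr_total = sum(prob for _, prob in rr_support)
    if maf_total <= 0 or rr_total <= 0:
        raise ValueError("weights must sum to a positive value")
    total = 0.0
    for maf, weight in maf_bins:
        for rr, prob in rr_support:
            total += power_fn(maf, rr) * weight * prob
    return total / (maf_total * rr_total)
```

Weighting each MAF bin by its SNP count corresponds to the assumption that every SNP is equally likely to be the causal SNP. Repeating the calculation for the new content then only requires passing a different set of MAF bins.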
With this notation, the rPD for a single imputed SNP can be obtained from
(3)
For Figure 3, we assumed that the distribution of
was derived from an external set containing 500 individuals and HM1G. Therefore, the estimated rPD for a given MAF was
(4)