Power and Sample Size for Testing Associations of Haplotypes with Complex Traits

Authors


*Corresponding Author: Daniel J. Schaid, Ph.D., Harwick 775, Section of Biostatistics, Mayo Clinic, 200 First Street, SW, Rochester, MN 55905. Tel: 507-284-0639. Fax: 507-284-9542. E-mail: schaid@mayo.edu

Summary

Evaluation of the association of haplotypes with either quantitative traits or disease status is common practice, and under some situations provides greater power than the evaluation of individual marker loci. The focus on haplotype analyses will increase as more single nucleotide polymorphisms (SNPs) are discovered, either because of interest in candidate gene regions, or because of interest in genome-wide association studies. However, there is little guidance on the determination of the sample size needed to achieve the desired power for a study, particularly when linkage phase of the haplotypes is unknown, and when a subset of tag-SNP markers is measured. There is a growing wealth of information on the distribution of haplotypes in different populations, and it is not unusual for investigators to measure genetic markers in pilot studies in order to gain knowledge of the distribution of haplotypes in the target population. Starting with this basic information on the distribution of haplotypes, we derive analytic methods to determine sample size or power to test the association of haplotypes with either a quantitative trait or disease status (e.g., a case-control study design), assuming that all subjects are unrelated. Our derivations cover both phase-known and phase-unknown haplotypes, allowing evaluation of the loss of efficiency due to unknown phase. We also extend our methods to when a subset of tag-SNPs is chosen, allowing investigators to explore the impact of tag-SNPs on power. Simulations illustrate that the theoretical power predictions are quite accurate over a broad range of conditions. Our theoretical formulae should provide useful guidance when planning haplotype association studies.

Introduction

Evaluating the association of haplotypes with disease or quantitative phenotypes can be a potent method to study the genetic basis of complex human traits. Although a wide variety of statistical methods have been developed to test associations with haplotypes, and to estimate parameters that describe haplotype associations (for a review, see Schaid, 2004), there has been little guidance on how to determine sample size and power for haplotype association studies. The purpose of this report is to illustrate how sample size and power can be computed for studies of the association of haplotypes with traits. Both directly measured haplotypes and haplotypes with unknown phase are considered, as well as subsets of markers used to tag haplotypes. Furthermore, computations are illustrated for both quantitative traits and retrospective case-control study designs.

When multiple marker loci are discussed in the literature, the terms used to describe phased versus un-phased marker haplotype data are often inconsistent. Hence, before we present the technical methods on how to compute power for haplotype association studies, it is worthwhile clarifying the terminology that we shall use, as defined elsewhere (Brumfield et al. 2000; Clayton et al. 2004). The term haplotype refers to the set of alleles at linked loci on a single chromosome, inherited from a subject's parent. A pair of autosomal haplotypes, one haplotype from each parent, is a genotype. When the haplotypes are directly measured, the combination of alleles on each haplotype is known, which is referred to as phased marker data. When considering alleles from multiple marker loci whose parental origins are unknown, the underlying pair of haplotypes for a subject is ambiguous when the subject has more than one heterozygous marker locus. That is, marker phase is unknown. The collection of single-locus marker genotypes whose phase is unknown is a diplotype. Hence, it is possible, and very likely, for a diplotype to have more than one underlying genotype (pair of haplotypes). As a cautionary note, the terms diplotype and genotype are sometimes used in the literature according to the ways we have defined them, and sometimes interchanged.

When measuring a single genetic marker, it is well known that the factors that influence power are the size of the effect of the causative genotype on the phenotype, the frequency of the causative allele, the amount of linkage disequilibrium (LD) between a causative allele and the measured genetic marker, and how closely the frequencies of the causative allele and the marker allele match (Zondervan & Cardon, 2004). When measuring multiple marker loci, the strength of LD among the marker loci will additionally influence power. Although the benefits of haplotype analyses versus multi-marker tests for association that ignore phase have been widely discussed and debated in the literature (e.g., Chapman et al. 2003; Clayton et al. 2004), it appears that the greatest gain in power provided by haplotype analyses occurs when linkage disequilibria exist among the marker loci at orders higher than pair-wise LD (Nielsen et al. 2004).

O'Hely & Slatkin (2003) provided a quantitative evaluation of the increase in sample size needed to compensate for unknown haplotype phase when performing case-control studies. They assumed that the causative locus has two alleles with a multiplicative effect on the disease penetrance. To compare sample sizes between situations where phase is unknown versus where it is known, they derived a ratio Q, where the numerator of Q is the sample size needed for a specified power when phase is unknown, and the denominator is that for when phase is known. When Q > 1, there is a loss in efficiency due to unknown haplotype phase. They showed that the value of Q depends on the pattern of LD between the causative allele and the marker alleles, but Q is independent of the frequency of the causative allele. In general, Q is large when the amount of LD among the markers is weak, but Q approaches unity when the markers are in strong LD. These results indicate that the allele frequencies and strength of LD among the markers will have a large impact on the sample size of a study when haplotype phase is unknown. Although the results from O'Hely & Slatkin (2003) provide useful guidelines, they are restricted to case-control studies, with the assumption of multiplicative effects of the causative allele. Also, because their derivations are based on disequilibria parameters and formulae that depend on the number of marker loci, it is cumbersome to implement them when an investigator assumes a set of haplotypes and their frequencies, and corresponding odds-ratios for pairs of haplotypes.

In the Statistical Methods section, we derive methods to compute sample size and power for the association of haplotypes with either quantitative traits or disease status in retrospective case-control studies. One of our main assumptions is that an investigator knows the distribution of haplotypes in the target population. This is a reasonable assumption, because it is common for investigators to genotype a sample of subjects in order to gain preliminary data on the distribution of markers in the genomic region of interest. Furthermore, data available in public databases, such as from the International HapMap project (http://www.hapmap.org/), provide this type of information.

For quantitative traits, power depends on the multiple correlation coefficient, R2, which provides a simple measure of the percentage of the variance of the quantitative trait that is explained by the haplotypes. However, the value of R2 alone is not very informative, because R2 depends on both the sizes of the effects of the haplotypes on the trait and the distribution of the haplotypes in the population. For this reason, we specify the alternative hypothesis according to the shift in the mean of the trait caused by the haplotypes (i.e., the regression of the quantitative trait on a haplotype design matrix), and illustrate how the specified regression coefficients and haplotype frequencies can be translated into an R2 value. For case-control studies, we specify the alternative hypothesis according to the anticipated genotype odds ratios. We show how the relative efficiency of genotype versus diplotype data can be computed, in order to evaluate the increase in sample size due to unknown linkage phase. Finally, we consider the impact on power of choosing a subset of markers to tag haplotypes.

Statistical Methods

The methods proposed to compute sample size and power are based on either the non-central chi-square distribution or the non-central F-distribution. The general form of the power function is Power = P(T > c; df, NCP), the probability that the test statistic T exceeds the critical value c corresponding to a specified Type-I error rate, computed under either the non-central chi-square or the non-central F-distribution, where df is the degrees of freedom (a single df for the chi-square distribution, or both a numerator and denominator df for the F-distribution), and NCP is the non-centrality parameter. The NCP depends on the product of the sample size and a term that measures the strength of effect of the haplotypes on the trait. Hence, the main focus of this paper is calculation of the NCP for various scenarios.
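
As a concrete illustration, power for a given NCP can be computed directly from the non-central distributions. The minimal sketch below is in R (the language of our haplo.power software described later); the function names are illustrative and are not part of haplo.power.

# Power of a chi-square score statistic with df degrees of freedom
power.chisq <- function(ncp, df, alpha = 0.05) {
  crit <- qchisq(1 - alpha, df)                      # critical value c for Type-I error alpha
  pchisq(crit, df, ncp = ncp, lower.tail = FALSE)    # P(T > c; df, NCP)
}

# Power of an F-test with df1 numerator and df2 denominator degrees of freedom
power.F <- function(ncp, df1, df2, alpha = 0.05) {
  crit <- qf(1 - alpha, df1, df2)
  pf(crit, df1, df2, ncp = ncp, lower.tail = FALSE)
}

power.chisq(ncp = 13.6, df = 6)   # roughly 0.80 at alpha = 0.05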

Power for Association of Haplotypes with Quantitative Traits

For a quantitative trait Y, we assume the regression equation Y = β_o + β′X + ɛ, where β_o is the intercept, the vector β represents the effects of the haplotypes on the average trait value, the vector X has quantitative codes for the pair of haplotypes a subject possesses, and ɛ is an error term with mean zero. A critical aspect of this regression equation is how a pair of haplotypes is coded in the vector X. For example, assuming that there are K unique haplotypes, one can evaluate the effect of each unique pair of haplotypes [K(K+1)/2 such pairs], although this approach is likely to have weak power because of the large number of degrees of freedom. Alternatively, one can consider K terms, treating haplotypes as having either dominant, recessive or additive effects. For our exposition we shall assume additive effects of haplotypes. This simplifies the presentation of the score statistics, and this approach will likely have sufficient power as long as the true effects are not recessive (Schaid, 2002). However, our methods are general, allowing investigators to choose the type of genetic effects that they wish to assess.

When considering the additive effects of haplotypes on a trait, we create scores for a subject's pair of haplotypes, where the scores represent haplotype dosages, and place these scores into a vector denoted Xi for the ith subject. The kth element of Xi is either 0, 1, or 2, according to the number of haplotypes of type k. If there are K unique haplotype categories, Xi would have length K. However, for identifiability, one haplotype category is ignored, treating it as a “baseline” for the construction of a design matrix, so the length of Xi is (K−1). For other types of scorings, such as dominant or recessive, see Schaid (1996).
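
For example, a minimal sketch (R; illustrative names) of this additive dosage coding, with the Kth haplotype category treated as the baseline:

# hap1, hap2: indices (1..K) of a subject's two haplotypes; K: number of unique haplotypes
additive.score <- function(hap1, hap2, K) {
  x <- numeric(K)
  x[hap1] <- x[hap1] + 1      # dosage of each haplotype carried
  x[hap2] <- x[hap2] + 1
  x[-K]                       # drop the baseline category for identifiability
}

additive.score(hap1 = 2, hap2 = 2, K = 4)   # homozygous for haplotype 2: (0, 2, 0)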

To test the association of haplotypes with a quantitative trait, score statistics for either genotype or diplotype data have been proposed and discussed elsewhere (Schaid et al. 2002; Chapman et al. 2003; Clayton et al. 2004). For large samples, score statistics have a chi-square distribution. Alternatively, one could use an F-test to compare the regression of the trait on the haplotype design matrix with the null model that has only an intercept (Zaykin et al. 2002). The numerator degrees of freedom of the F-test is the same as the degrees of freedom for the chi-square score statistic, and for large sample sizes the null distributions of the score statistic and F-test converge to the same chi-square distribution. However, as we show by simulations, the score-statistic can be conservative for small sample sizes. Furthermore, the F-test may give greater power than the score statistic when the effects of haplotypes are large. Although we shall base our sample size and power derivations on the F-test, it is easiest to first consider derivations for the score statistic, and then make minor changes to give results for the F-test. This approach highlights some of the statistical deficiencies of the score statistics, which is important because of the attention they have received in the literature.

Genotype Data

To illustrate our derivations, we first assume that we observe genotypes (i.e., haplotypes are directly observed). The score statistic to test the null hypothesis of no association, H_o: β_1 = ⋯ = β_{K−1} = 0, is T = U′V^{−1}U, where

U = \sum_{i=1}^{N} X_i (Y_i - \bar{Y}),

V = \hat{\sigma}^2_Y \sum_{i=1}^{N} (X_i - \bar{X})(X_i - \bar{X})',

and where \hat{\sigma}^2_Y is the sample variance of Y. If other covariates are to be adjusted for, then in place of \hat{\sigma}^2_Y we would use the mean square error of the regression of the trait on only the other covariates. The statistic T has an asymptotic chi-square distribution with (K−1) degrees of freedom. Under an alternative hypothesis, T has a non-central chi-square distribution with a non-centrality parameter that can be computed by replacing U and V by their expected values in the expression for T. This non-centrality parameter can then be used to compute either sample size or power.
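
A minimal sketch of computing this statistic from observed data, consistent with the expressions for U and V above (R; y is the trait vector, X the N × (K − 1) score matrix built as described, and the names are illustrative):

# Global score statistic for phase-known haplotype scores
score.stat <- function(y, X) {
  U  <- colSums(X * (y - mean(y)))        # U = sum_i X_i (Y_i - Ybar)
  s2 <- mean((y - mean(y))^2)             # sample variance of Y
  Xc <- sweep(X, 2, colMeans(X))          # center the scores
  V  <- s2 * (t(Xc) %*% Xc)               # V = s2 * sum_i (X_i - Xbar)(X_i - Xbar)'
  drop(t(U) %*% solve(V) %*% U)           # T = U' V^{-1} U, approx. chi-square on (K-1) df
}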

To determine the expected value of U, we use the basic regression equation, Y_i = β_o + β′X_i + ɛ_i, subtract μ_Y (the population expected value of Y) from both sides of the equation, multiply both sides of the equation by X, and then take the expectation,

E[X(Y - \mu_Y)] = \beta_o E[X] + E[XX']\beta - \mu_Y E[X].

Substituting β_o = μ_Y − β′E[X] into the above expression and rearranging leads to E[U] = N V_X β, where

V_X = \sum_{G} P(G)\,(X_G - E[X_G])(X_G - E[X_G])'.

In this expression for V_X, G denotes a genotype (a pair of haplotypes) and P(G) its probability, and the summation is over all possible genotypes. The value of E[X_G] is computed over this same distribution. Assuming Hardy-Weinberg (HW) proportions for pairs of haplotypes, P(G = (h_i, h_j)) = [2 − I(i = j)] p_i p_j, where I() has values of 1 or 0 according to whether its argument is true or false, and p_i is the population frequency of the ith haplotype. The expected value of V is simply E[V] = N σ²_Y V_X. Substituting E[U] and E[V] into the expression for T results in the non-centrality parameter for the score statistic,

NCP_{genotype} = N \beta' V_X \beta / \sigma^2_Y.
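
Given assumed haplotype frequencies and additive effects, V_X and this NCP can be computed directly. The sketch below (R; illustrative names) assumes HW proportions, additive scoring, all supplied frequencies positive, and the Kth haplotype as the baseline.

# p: K haplotype frequencies (all > 0, summing to 1); beta: (K - 1) additive effects
ncp.genotype <- function(p, beta, N, sd.y = 1) {
  K  <- length(p)
  G  <- subset(expand.grid(h1 = 1:K, h2 = 1:K), h1 <= h2)        # all unordered haplotype pairs
  pG <- ifelse(G$h1 == G$h2, p[G$h1]^2, 2 * p[G$h1] * p[G$h2])   # HW genotype probabilities
  X  <- t(mapply(function(a, b) tabulate(c(a, b), nbins = K), G$h1, G$h2))[, -K, drop = FALSE]
  EX <- colSums(pG * X)                       # E[X_G]
  Xc <- sweep(X, 2, EX)                       # X_G - E[X_G]
  VX <- t(Xc) %*% (pG * Xc)                   # V_X = sum_G P(G)(X_G - E[X_G])(X_G - E[X_G])'
  N * drop(t(beta) %*% VX %*% beta) / sd.y^2  # NCP = N beta' V_X beta / sigma^2_Y
}

ncp.genotype(p = c(0.5, 0.3, 0.2), beta = c(0.3, 0.1), N = 200)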

Diplotype Data

When marker data are measured without directly observing the underlying pairs of haplotypes, haplotypes are ambiguous when a subject has more than one heterozygous marker locus. The score statistic for diplotype data is similar to that for genotype data, but with Xi replaced by its expectation given the observed diplotype data. Using Di to denote a diplotype (i.e., the un-phased marker data), this conditional expectation is

X^*_i = E[X_i \mid D_i] = \sum_{G \in D_i} X_G\, P(G \mid D_i),

where the summation is over only those genotypes that are consistent with diplotype D_i, and P(G | D_i) is the posterior probability of genotype G, given D_i,

P(G \mid D_i) = \frac{P(G)}{\sum_{G' \in D_i} P(G')}.

These posterior probabilities are calculated under the null hypothesis. In practice, this is accomplished by using the expectation-maximization (EM) algorithm to estimate the haplotype frequencies (Excoffier & Slatkin, 1995), assuming Hardy-Weinberg proportions for pairs of haplotypes, and using the posterior probabilities that contribute to the EM algorithm. Because of the ambiguity of haplotypes, the X*_i scores tend to be shrunk toward the overall mean score, X̄*. This implies that diplotype data will have less power than genotype data. This approach for incomplete data is well founded (Dempster et al. 1977), and is the basis for the development of score statistics for ambiguous haplotypes and general traits (Schaid et al. 2002; Chapman et al. 2003; Clayton et al. 2004), as well as a simplified expectation substitution regression method for incomplete haplotype data (Zaykin et al. 2002).
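
As a small illustration of these two expressions (R; it assumes the genotypes consistent with a given diplotype have already been enumerated, and the names are illustrative):

# pairs: 2-column matrix of haplotype index pairs consistent with one diplotype D
# p: haplotype frequencies; K: number of unique haplotypes
expected.score <- function(pairs, p, K) {
  pG   <- ifelse(pairs[, 1] == pairs[, 2],
                 p[pairs[, 1]]^2, 2 * p[pairs[, 1]] * p[pairs[, 2]])  # HW P(G)
  post <- pG / sum(pG)                                                # P(G | D)
  X    <- t(apply(pairs, 1, tabulate, nbins = K))[, -K, drop = FALSE] # additive scores X_G
  colSums(post * X)                                                   # X* = E[X | D]
}

# A double heterozygote consistent with haplotype pairs (1,4) or (2,3)
expected.score(pairs = rbind(c(1, 4), c(2, 3)), p = c(0.4, 0.3, 0.2, 0.1), K = 4)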

Parallel to the non-centrality parameter for genotype data, the non-centrality parameter for diplotype data can be shown to be

NCP_{diplotype} = N \beta' V_{X^*} \beta / \sigma^2_Y,

where now we use

V_{X^*} = \sum_{D} P(D)\,(X^*_D - E[X^*])(X^*_D - E[X^*])'.

Here, the expectation is with respect to the distribution of all diplotype configurations. The probability of diplotype D is the sum of the probabilities of the genotypes that are consistent with D. Note that E[X*] in V_{X*} is equivalent to E[X] in V_X.

Relationship of Non-Centrality Parameters: Score statistic and F-Test

One of the main distinctions between the score statistic and the F-test is that the score statistic performs computations under the null hypothesis, and so uses the sample variance of Y, whereas the F-test performs the regression of Y on the haplotype design matrix, and uses the variance of the residuals. A second distinction is that the F-test makes use of the degrees of freedom used to estimate the residual variance, which is important for small sample sizes. To see the impact of the residual variance on the non-centrality parameter, we first illustrate how the score statistic non-centrality parameter can be expressed in terms of the usual multiple correlation coefficient, R2, for the regression of the trait on the haplotype scores.

The multiple correlation coefficient portrays the percentage of the total variance of Y that is explained by the haplotypes. As we shall show, R2 = β′Vβ/σ²_Y, where V is either V_X for genotype data, or V_{X*} for diplotype data. To see why this is so, note that the total variance of Y can be partitioned as V(Y) = σ²_Y = E[V(Y|X)] + V(E[Y|X]). The multiple correlation coefficient is R2 = V(E[Y|X])/σ²_Y. To determine V(E[Y|X]) for genotype data, we need to average the squared difference between the predictions of Y based on a model with haplotype effects versus those based on a model without haplotype effects,

V(E[Y \mid X]) = \sum_{G} P(G)\,(\beta_o + \beta' X_G - \mu_Y)^2.

After substituting β_o = μ_Y − β′E[X_G] into the above equation and rearranging,

V(E[Y \mid X]) = \sum_{G} P(G)\,[\beta'(X_G - E[X_G])]^2 = \beta' V_X \beta.

A similar proof follows for diplotype data. Hence, the non-centrality parameter for the score statistic can be expressed as NCP=NR2. This illustrates that the percentage of the variance of Y explained by the haplotypes depends on the magnitude of the haplotype effects, β, as well as the distribution of the genotypes when haplotypes are directly observed, or the distribution of the diplotypes when haplotypes are not directly observed. Small values of β with large probabilities [either P(G) or P(D)], or large values of β with small probabilities, can give the same value of R2, and hence the same power.

Note that the score statistic uses computations under the null hypothesis, which is why σ²_Y is used in the denominator of the non-centrality parameters. In contrast, if we actually perform the regression of Y on X and use the least squares estimates of β to compute the variance of the residuals, denoted σ²_ɛ, then we would substitute σ²_ɛ for σ²_Y. But σ²_ɛ = E[V(Y|X)] = σ²_Y − V(E[Y|X]), so that σ²_ɛ = σ²_Y − β′V_Xβ. Using this latter term in the denominator of the non-centrality parameter and simplifying results in

NCP_F = \frac{N \beta' V_X \beta}{\sigma^2_Y - \beta' V_X \beta} = \frac{N R^2}{1 - R^2},

which is the non-centrality parameter for the F-test. This implies that the score statistic and F-test will have similar power when R2 is small, but that the F-test can have greater power as R2 increases.

Because of the conservativeness of the score statistic, and its potential for lesser power than the F-test, we propose use of the F-test non-centrality parameter for computing sample size and power. One could simply assume a given magnitude of R2 to compute the NCP, but whether it is feasible to achieve a given value of R2 will depend on the distribution of haplotypes and the magnitude of the corresponding regression coefficients. For this reason, we advocate using existing data to estimate the distribution of haplotypes, using this to compute V_X, and then, with specified β, computing R2 = β′V_Xβ/σ²_Y. Or, for a given value of R2, determine the necessary values of β. This would provide greater insight into whether the anticipated R2 can be achieved with realistic haplotype effects represented by β.
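
For planning, the required N for a target R2 can be found by searching for the sample size at which the F-test NCP, N R2/(1 − R2), with (K − 1) numerator degrees of freedom and (N − K) denominator degrees of freedom (the residual degrees of freedom for a regression with an intercept and K − 1 haplotype terms), attains the desired power. A sketch (R; illustrative names):

# Smallest N giving the target power for the haplotype F-test
sample.size.F <- function(R2, K, power = 0.80, alpha = 0.05) {
  pow <- function(N) {
    ncp  <- N * R2 / (1 - R2)
    crit <- qf(1 - alpha, df1 = K - 1, df2 = N - K)
    pf(crit, K - 1, N - K, ncp = ncp, lower.tail = FALSE)
  }
  ceiling(uniroot(function(N) pow(N) - power, lower = K + 2, upper = 1e7)$root)
}

sample.size.F(R2 = 0.01, K = 7)   # close to the N_genotype of 1355 in Table 3 for Hap-1 (7 haplotypes)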

Power for Haplotype Associations with Case-Control Data

Genotype Data

The above derivations for a quantitative trait are based on a random sample of subjects. For retrospective case-control studies with diplotype data, the score statistics presented above remain valid, although the estimated regression coefficients for the haplotype X matrix can be biased, as discussed elsewhere (Epstein & Satten, 2003; Satten & Epstein, 2004; Schaid, 2004). For case-control data, Y has values of 1 for cases and 0 for controls. Let n_case denote the number of cases, n_control the number of controls, N = n_case + n_control the total sample size, and θ = n_case/N the sample fraction of cases. Then, for genotype data, the term U for the score statistic becomes

U = \sum_{i=1}^{N} X_i (Y_i - \bar{Y}) = N\theta(1-\theta)\,(\bar{X}_{case} - \bar{X}_{control}).

The last expression for U emphasises that the score statistic is proportional to the difference in the mean score vectors between cases and controls. To evaluate the expectation of U under a given alternative hypothesis, we evaluate the expected values of X conditional on case and control status. This follows from the retrospective study design, in contrast to the random sample probabilities used for the derivations of the power for a quantitative trait. The resulting expectation is

E[U] = N\theta(1-\theta)\,\{E[X \mid case] - E[X \mid control]\} = N\theta(1-\theta) \sum_{G} X_G\,[P(G \mid case) - P(G \mid control)],

which emphasizes that the expected score depends on both the scoring of genotypes and the difference in probabilities of genotypes for cases versus controls. To simplify later expressions, we let δ = E[X | case] − E[X | control], so that E[U] = Nθ(1 − θ)δ. For case-control data, an estimate of the variance matrix of U is

V = \theta(1-\theta) \sum_{i=1}^{N} (X_i - \bar{X})(X_i - \bar{X})'.

To evaluate the expected value of V under an alternative hypothesis, first consider the expected value of X̄. It is a weighted average of expectations, weighted by the case and control sample fractions,

E[\bar{X}] = \theta\, E[X \mid case] + (1-\theta)\, E[X \mid control],

where P(G | case) is used to evaluate E[X | case] and P(G | control) is used to evaluate E[X | control]. Similarly, E[V] is a weighted average,

E[V] = N\theta(1-\theta)\,\{\theta V_{X,case} + (1-\theta) V_{X,control}\},

where

V_{X,case} = \sum_{G} P(G \mid case)\,(X_G - E[\bar{X}])(X_G - E[\bar{X}])',

V_{X,control} = \sum_{G} P(G \mid control)\,(X_G - E[\bar{X}])(X_G - E[\bar{X}])'.

We shall use V_X = {θV_{X,case} + (1 − θ)V_{X,control}}, so that E[V] = Nθ(1 − θ)V_X. Substituting E[U] for U and E[V] for V in the expression T = U′V^{−1}U, the non-centrality parameter for genotype data can be expressed as

NCP_{genotype} = N\theta(1-\theta)\, \delta' V_X^{-1} \delta.

Besides the effects of δ and V_X on the non-centrality parameter, it should be clear that, for a fixed sample size, the non-centrality parameter is maximized when there are equal numbers of cases and controls (θ = 1/2).

Logistic Model for Distribution of Haplotype Pairs for Cases and Controls

At this point it is worthwhile considering how P(G | case) and P(G | control) can be modelled for power computations. We assume that the influence of a pair of haplotypes on disease status is determined by the logistic regression model,

P(case \mid G) = \frac{\exp(\beta_o + \beta' X_G)}{1 + \exp(\beta_o + \beta' X_G)}.

Here, the coding of X_G is at the discretion of the investigator. The coding could allow for either dominant, recessive, or additive effects of haplotypes, or the coding could be more general to model the effects of pairs of haplotypes. The corresponding vector β represents the log-odds-ratios. The conditional probability P(G | case) can be expressed as

P(G \mid case) = \frac{P(case \mid G)\, P(G)}{\pi} = \frac{\exp(\beta_o + \beta' X_G)}{1 + \exp(\beta_o + \beta' X_G)} \cdot \frac{P(G)}{\pi}, \qquad (1)

where π is the population disease prevalence, P(G) is based on HW proportions, and the intercept βo is chosen to solve the equation

\pi = \sum_{G} \frac{\exp(\beta_o + \beta' X_G)}{1 + \exp(\beta_o + \beta' X_G)}\, P(G).

In a similar fashion, the conditional probability P(G | control) can be expressed as

P(G \mid control) = \frac{[1 - P(case \mid G)]\, P(G)}{1 - \pi} = \frac{1}{1 + \exp(\beta_o + \beta' X_G)} \cdot \frac{P(G)}{1 - \pi}. \qquad (2)

It is important to recognize that the analysis model that we propose, scoring additive effects of haplotypes, does not need to match the assumed underlying effects of pairs of haplotypes. By developing this general approach, one can consider the potential loss in power by analyzing data by an additive model when the true underlying model is different.
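
To make these pieces concrete, the sketch below (R; illustrative names; additive log-odds scoring with the Kth haplotype as baseline and all frequencies positive) solves for β_o at a given prevalence, forms P(G|case) and P(G|control) as in expressions (1) and (2), and returns the genotype-data NCP for a total sample size N with case fraction θ.

# p: haplotype frequencies; beta: (K - 1) log odds ratios; prev: disease prevalence
ncp.cc.genotype <- function(p, beta, N, theta = 0.5, prev = 0.10) {
  K  <- length(p)
  G  <- subset(expand.grid(h1 = 1:K, h2 = 1:K), h1 <= h2)
  pG <- ifelse(G$h1 == G$h2, p[G$h1]^2, 2 * p[G$h1] * p[G$h2])      # HW P(G)
  X  <- t(mapply(function(a, b) tabulate(c(a, b), nbins = K), G$h1, G$h2))[, -K, drop = FALSE]
  eta <- drop(X %*% beta)
  # choose the intercept so that sum_G P(case | G) P(G) equals the prevalence
  b0  <- uniroot(function(b) sum(plogis(b + eta) * pG) - prev, c(-30, 30))$root
  pen <- plogis(b0 + eta)                             # P(case | G)
  p.case <- pen * pG / prev                           # P(G | case), expression (1)
  p.cont <- (1 - pen) * pG / (1 - prev)               # P(G | control), expression (2)
  delta  <- colSums(p.case * X) - colSums(p.cont * X) # E[X | case] - E[X | control]
  EXbar  <- theta * colSums(p.case * X) + (1 - theta) * colSums(p.cont * X)
  Xc     <- sweep(X, 2, EXbar)
  VX     <- theta * t(Xc) %*% (p.case * Xc) + (1 - theta) * t(Xc) %*% (p.cont * Xc)
  N * theta * (1 - theta) * drop(t(delta) %*% solve(VX) %*% delta)
}

The required sample size then follows by equating this NCP to the value needed for the desired power of a chi-square test with (K − 1) degrees of freedom.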

Diplotype Data

When analyzing diplotype data, we cannot directly score the underlying haplotypes. Instead, the analysis requires us to consider all possible genotypes consistent with the observed diplotypes, score the possible genotypes, and then average these scores using the posterior probabilities as weights to compute X* = E[X | D]. The posterior probabilities are computed by the EM algorithm by pooling cases and controls (i.e., assuming the null hypothesis to be true). These X* scores for diplotype data are used in place of the X scores for genotype data in the U vector,

U^* = \sum_{i=1}^{N} X^*_i (Y_i - \bar{Y}) = N\theta(1-\theta)\,(\bar{X}^*_{case} - \bar{X}^*_{control}),

and in the variance matrix,

V^* = \theta(1-\theta) \sum_{i=1}^{N} (X^*_i - \bar{X}^*)(X^*_i - \bar{X}^*)'.

Although this is the approach we used for quantitative traits, there is a difference in how the posterior probabilities are computed, in order to compute the expected X* scores for a specified alternative hypothesis. For quantitative traits we computed the posterior probabilities under the null hypothesis, since we assumed ascertainment does not depend on the quantitative trait, and the associations of haplotypes with quantitative traits are not likely to have a large impact on the distribution of genotypes. In contrast, for case-control data large odds ratios can create large differences in the genotype probabilities between cases and controls, and so it is necessary to consider the alternative hypothesis when computing the posterior probabilities. To do so, our power calculations parallel actual analyses by use of exemplary data. That is, we first assume a logistic model for the influence of genotypes on disease status. The probabilities of the genotypes are based on case/control status, P(G | case) and P(G | control). We then enumerate all possible genotypes, and collapse these into distinguishable diplotype categories. The probabilities of the diplotypes are also based on case/control status, P(D | case) and P(D | control). These probabilities are then used to determine the posterior probabilities in the pool of all cases and controls as follows.

With actual data, one would pool marker data for cases and controls in order to run the EM algorithm to estimate haplotype frequencies and posterior probabilities, assuming the null hypothesis is true. We parallel this by using the enumerated diplotype data as exemplary data, with different weights for cases and controls, and running the usual EM algorithm on this weighted exemplary data. This general approach for power calculations has been discussed elsewhere (Self & Mauritsen, 1988; Longmate, 2001). If M denotes the set of all possible diplotype configurations, then the cases and controls have the same exemplary data M, but the weight for diplotype D for a case is θP(D | case), whereas that for a control is (1 − θ)P(D | control), and the sum of the weights over all diplotypes for both cases and controls is unity. Running the EM algorithm on this weighted exemplary data provides a mechanism to compute the haplotype frequencies in the pooled data along with posterior probabilities, P(G | D), but under the alternative hypothesis. These posterior probabilities are then used to compute the expected X* scores.
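
A sketch of the collapsing and weighting step (R; it assumes the enumerated haplotype-index pairs G and the conditional genotype probabilities p.case and p.cont from the previous sketch, plus a K × L matrix hap of haplotype alleles; names are illustrative):

# Collapse genotype probabilities into diplotype probabilities and form the
# case/control weights for the exemplary data
diplotype.weights <- function(hap, G, p.case, p.cont, theta = 0.5) {
  # identify each genotype by its unordered per-locus genotypes, i.e., its diplotype
  id <- apply(G, 1, function(g) {
    a <- hap[g[1], ]; b <- hap[g[2], ]
    paste(pmin(a, b), pmax(a, b), sep = "/", collapse = " ")
  })
  pD.case <- tapply(p.case, id, sum)                 # P(D | case)
  pD.cont <- tapply(p.cont, id, sum)                 # P(D | control)
  data.frame(diplotype = names(pD.case),
             w.case = theta * pD.case,               # weight of a case with diplotype D
             w.cont = (1 - theta) * pD.cont)         # weight of a control with diplotype D
}

The usual EM algorithm is then run on these weighted diplotype counts to obtain the pooled haplotype frequencies and the posterior probabilities P(G | D) under the alternative hypothesis.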

The expected value of U* is E[U*] = Nθ(1 − θ){E[X* | case] − E[X* | control]}. Because the X*_i scores are shrunk toward the overall mean, δ* = {E[X* | case] − E[X* | control]} tends to have smaller values than when genotype data are used, implying a reduction in power. The expected scores for cases and controls are computed according to

E[X^* \mid case] = \sum_{D} P(D \mid case)\, X^*_D,

E[X^* \mid control] = \sum_{D} P(D \mid control)\, X^*_D.

It can be shown that E[V*] = Nθ(1 − θ)V_{X*}, where

V_{X^*} = \theta V_{X^*,case} + (1-\theta) V_{X^*,control},

V_{X^*,case} = \sum_{D} P(D \mid case)\,(X^*_D - E[\bar{X}^*])(X^*_D - E[\bar{X}^*])',

V_{X^*,control} = \sum_{D} P(D \mid control)\,(X^*_D - E[\bar{X}^*])(X^*_D - E[\bar{X}^*])',

E[\bar{X}^*] = \theta\, E[X^* \mid case] + (1-\theta)\, E[X^* \mid control].

Using the above expressions for E[U*] and E[V*], the non-centrality parameter for diplotype data can be expressed as

NCP_{diplotype} = N\theta(1-\theta)\, \delta^{*\prime} V_{X^*}^{-1} \delta^*.

Relative Efficiency of Genotype versus Diplotype Data

We can use the non-centrality parameters to evaluate the relative efficiency of using genotype versus diplotype data by taking their ratio,

RE = \frac{NCP_{diplotype}}{NCP_{genotype}} = \frac{\delta^{*\prime} V_{X^*}^{-1} \delta^*}{\delta' V_X^{-1} \delta}.

This ratio gives the reduction in sample size when genotype data are used compared to diplotype data. For example, when RE= 0.5 the sample size required for a specified power would be half as large for genotype data compared to diplotype data.

Power When Using Haplotype Tag SNPs

To reduce genotyping efforts, it is becoming common practice to choose a subset of informative single-nucleotide polymorphisms (SNPs), called tag-SNPs, instead of all available SNPs, along a haplotype region of interest. Although a large number of papers have been published on this topic, some common strategies are: 1) to choose haplotype-tag SNPs by choosing SNPs that provide accurate predictions of the underlying common haplotypes (Stram et al. 2003b); 2) to choose genotype-tag SNPs which ignore haplotype phase, but rather provide accurate predictions of other SNP genotypes within a block (Chapman et al. 2003); and 3) to choose LD-tag SNPs which do not require pre-determined blocks, but rather SNPs with high pair-wise LD are grouped into sets and only one representative SNP per set is chosen (Carlson et al. 2004). By measuring only tag-SNPs, haplotypes involving all the SNPs would be collapsed into fewer haplotype categories, which can reduce haplotype information. Whether this leads to an increase or decrease in power depends on the strength of LD between the tag-SNPs and the causative variant, and whether the non-tagged SNPs provide additional information on the association of the haplotype with the trait, beyond that provided by the tag-SNPs. For example, if any of the tag-SNPs are causative variants, then reducing haplotypes to those with only tag-SNPs would increase power. On the other hand, if the tag-SNPs are in weak LD with the causative variant, yet other non-tagged SNPs are in high LD with the causative variant, then reducing to tag-SNPs can decrease power. The ambiguity caused by reducing full haplotypes to tag-SNP haplotypes is analogous to that caused by using diplotype data instead of genotype data, so our derivations of the non-centrality parameter for diplotype data can be slightly generalized to allow for evaluating the power when using tag-SNPs. We illustrate this for a quantitative trait and known phase of tag-SNPs; calculations for unknown phase of the tag-SNPs and for case-control data easily follow. Let G denote a genotype when all SNPs are measured, and let Gt denote a genotype for when only the tag-SNPs are measured. Similar to diplotype data, we use the expected score for the complete genotype data, conditional on the observed tagged data Gt, to compute the score statistic. This conditional expectation is

X^{**} = E[X \mid G_t] = \sum_{G \in G_t} X_G\, P(G \mid G_t),

where

P(G \mid G_t) = \frac{P(G)}{\sum_{G' \in G_t} P(G')}.

The expected variance matrix is simply

E[V] = N \sigma^2_Y V_{X^{**}}, \quad \text{where } V_{X^{**}} = \sum_{G_t} P(G_t)\,(X^{**}_{G_t} - E[X^{**}])(X^{**}_{G_t} - E[X^{**}])'.

A major distinction between scoring the complete genotype data versus the subset of tagged genotype data is that the degrees of freedom for the tagged data are very likely to be fewer than those for the complete genotype data, due to the choice of tag-SNPs. If the complete genotype data have H distinguishable haplotypes, and the tagged genotype data have H_t < H distinguishable haplotypes, then the dimension of V_{X**} is (H − 1), yet its rank is (H_t − 1). Because V_{X**} is not of full rank, we use a generalized inverse in the calculation of the non-centrality parameter,

NCP_{tag} = N \beta' V_{X^{**}} (V_{X^{**}})^{-} V_{X^{**}} \beta / \sigma^2_Y.

A key point is that we use the probability of the complete genotype data, conditional on the observed tag-SNP genotypes, to compute expectations. If the phase of the tag-SNPs is unknown, we would use the probability of the complete genotype data, conditional on the diplotype data composed only of the tag-SNPs.

We can use the non-centrality parameters and their corresponding degrees of freedom to determine the difference in sample size or power between the use of all SNPs versus the use of only a subset of tag-SNPs. Because degrees of freedom can differ, we cannot simply measure the relative efficiency by the ratio of non-centrality parameters. Rather, if we specify the Type-I error rate, desired power, the frequency and effects of haplotypes (β) based on all SNPs, and the subset of tag-SNPs, we can compute the corresponding non-centrality parameters and the required sample sizes, and then determine relative efficiency by the ratio of estimated sample sizes.
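
A sketch of the final NCP calculation (R; using the Moore-Penrose generalized inverse ginv from the MASS package, with V_{X**} and β assumed to have been computed as described above; names are illustrative):

library(MASS)   # provides ginv()

# VXtag: variance matrix of the conditional scores E[X | Gt] (rank Ht - 1 < H - 1)
# beta: effects of the full-haplotype scores; sd.y: trait standard deviation
ncp.tag <- function(VXtag, beta, N, sd.y = 1) {
  N * drop(t(beta) %*% VXtag %*% ginv(VXtag) %*% VXtag %*% beta) / sd.y^2
}

The corresponding power or sample size is then computed as before, but with (H_t − 1) numerator degrees of freedom.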

Simulation Methods

Simulations for Type-I Error

Zaykin et al. (2002) proposed regressing a quantitative trait on the conditional haplotype scores, X*_i, and through extensive simulations showed that the resulting F-test maintains the nominal Type-I error rate. Their simulations varied the sample size from 25 to 500, along with various departures from Hardy-Weinberg equilibrium for the pairs of haplotypes. We have performed similar simulations to compare the Type-I error rate of the score statistic with that of the F-test for quantitative traits, and to evaluate the Type-I error rate of the score statistic for case-control data. For these simulations we focused on diplotype data and assumed that the markers were in linkage equilibrium, which is the worst case situation because the haplotypes have the greatest ambiguity (Fallin & Schork, 2000; Stram et al. 2003a), and hence the asymptotic properties of the test statistics are expected to be most adversely affected (Zaykin et al. 2002). Five diallelic marker loci were simulated, with the minor allele frequency for each locus set at 0.33 (the median frequency from a prior genome scan using the Affymetrix 10K SNP array; Schaid et al. 2004). We allowed for the marker genotypes to depart from HW proportions by setting the HW disequilibrium parameter to 75% of its most extreme value (either positive for an excess of homozygotes, or negative for an excess of heterozygotes). That is, genotypes at each marker locus were randomly generated from the genotype proportions P_AA = p² + d, P_AB = 2p(1 − p) − 2d, and P_BB = (1 − p)² + d, where p is the minor allele frequency, and d is the HW disequilibrium parameter, which is bounded according to max[−p², −(1 − p)²] ≤ d ≤ p(1 − p) (Weir, 1996). A quantitative trait was simulated from a standard normal distribution, and for a case-control design the fraction of cases was set at θ = 0.5.
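
For reference, a sketch of this genotype-generation step for one diallelic locus (R; names are illustrative):

# Simulate n single-locus genotypes with minor allele frequency p and
# HW disequilibrium coefficient d (d = 0 gives HW proportions)
sim.genotypes <- function(n, p = 0.33, d = 0) {
  probs <- c(AA = p^2 + d, AB = 2 * p * (1 - p) - 2 * d, BB = (1 - p)^2 + d)
  stopifnot(all(probs >= 0), abs(sum(probs) - 1) < 1e-8)
  sample(names(probs), size = n, replace = TRUE, prob = probs)
}

# 75% of the most extreme positive HWD (excess homozygotes), as in the simulations
table(sim.genotypes(n = 1000, d = 0.75 * 0.33 * (1 - 0.33)))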

Simulations for Power

As our theoretical derivations illustrate, power depends on the distribution of haplotypes. To evaluate the adequacy of our power calculations, we simulated five-locus haplotypes, each locus having two alleles, from three different distributions, ranging from strong pair-wise LD to no LD among the alleles from the loci. The first two distributions were from real data (Xu et al. 2001) for which the haplotype phase was known: Hap-1 is from the NAT2 locus on chromosome 8, covering 0.5 kb, and Hap-2 is from the X chromosome, covering 140 kb. The X chromosome haplotypes were used because phase was known, yet we used them as if they were autosomal in order to create pairs of haplotypes per subject. The third distribution, Hap-3, was such that all haplotypes are equally frequent, and hence alleles from all loci were in linkage equilibrium with each other. The pair-wise values of |D′| had a range of 0.93–1.0 for Hap-1, and a range of 0.10–0.71 for Hap-2. The 32 haplotypes and their frequencies for each of these three distributions are given in Table 1.

Table 1. Haplotypes and their frequencies for three different haplotype distributions. An asterisk (*) marks the hypothetical "risk haplotypes"; NA indicates that the haplotype does not exist (i.e., frequency = 0.0). Hap-1: NAT2 locus on chromosome 8, covering 0.5 kb. Hap-2: Chromosome X, covering 140 kb. Hap-3: Equal haplotype frequencies, linkage equilibrium.

Haplotype  Risk  Locus alleles   Haplotype frequency
number           1 2 3 4 5       Hap-1    Hap-2    Hap-3
 1         *     1 1 1 1 1       0.235    0.0060   0.0312
 2         *     1 1 1 1 2       0.006    0.0392   0.0312
 3         *     1 1 1 2 1       NA       0.0191   0.0312
 4         *     1 1 1 2 2       NA       NA       0.0312
 5               1 1 2 1 1       NA       0.1237   0.0312
 6               1 1 2 1 2       NA       0.0131   0.0312
 7               1 1 2 2 1       NA       0.0453   0.0312
 8               1 1 2 2 2       NA       0.0131   0.0312
 9               1 2 1 1 1       NA       0.0845   0.0312
10               1 2 1 1 2       0.012    0.1177   0.0312
11               1 2 1 2 1       NA       0.0322   0.0312
12               1 2 1 2 2       NA       NA       0.0312
13               1 2 2 1 1       0.031    0.1630   0.0312
14               1 2 2 1 2       0.444    0.1700   0.0312
15               1 2 2 2 1       NA       0.0976   0.0312
16               1 2 2 2 2       NA       0.0131   0.0312
17         *     2 1 1 1 1       0.025    0.0060   0.0312
18         *     2 1 1 1 2       NA       NA       0.0312
19         *     2 1 1 2 1       0.247    NA       0.0312
20         *     2 1 1 2 2       NA       NA       0.0312
21               2 1 2 1 1       NA       0.0060   0.0312
22               2 1 2 1 2       NA       NA       0.0312
23               2 1 2 2 1       NA       NA       0.0312
24               2 1 2 2 2       NA       NA       0.0312
25               2 2 1 1 1       NA       0.0060   0.0312
26               2 2 1 1 2       NA       0.0191   0.0312
27               2 2 1 2 1       NA       0.0060   0.0312
28               2 2 1 2 2       NA       NA       0.0312
29               2 2 2 1 1       NA       0.0060   0.0312
30               2 2 2 1 2       NA       0.0131   0.0312
31               2 2 2 2 1       NA       NA       0.0312
32               2 2 2 2 2       NA       NA       0.0312

For simulated power for a quantitative trait, two haplotypes were independently sampled for each subject from one of the distributions. The overall mean of the simulated trait was zero, with unit variance. All haplotypes that have alleles 1–1 at the second and third loci were considered “risk haplotypes” (see Table 1), increasing the mean of the trait by an amount β. The magnitude of β was set to achieve a specified model R2, and so the value of β depended on the distribution of the haplotypes. The value of R2 ranged from 0.01 to 0.25, and for each R2 value the sample size required to achieve power of either 80% or 90%, with a nominal Type-I error rate of 5%, was computed according to our derived asymptotic formulae. For each of these scenarios the empiric power was computed by simulations using 1,000 replicates. This allowed us to evaluate the adequacy of our formulae for regions of high power, yet over a range of effect sizes, a range of sample sizes, and different haplotype distributions.
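
Because the additive risk score (the number of risk haplotypes carried) has variance 2 p_r(1 − p_r) under HW proportions, where p_r is the total frequency of the risk haplotypes, a target R2 with unit total trait variance corresponds to β = sqrt(R2/[2 p_r(1 − p_r)]). A sketch of this back-calculation (R; the value 0.513 is the total Hap-1 risk-haplotype frequency from Table 1):

# beta needed for a target R2, assuming HW proportions, additive effects,
# and unit total trait variance; p.r is the total risk-haplotype frequency
beta.for.R2 <- function(R2, p.r) sqrt(R2 / (2 * p.r * (1 - p.r)))

beta.for.R2(R2 = 0.01, p.r = 0.513)   # about 0.141, matching Table 3 for Hap-1
beta.for.R2(R2 = 0.25, p.r = 0.513)   # about 0.707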

Our approach for simulated power for a case-control study paralleled that for a quantitative trait, with two haplotypes independently sampled for each subject from one of the distributions, and haplotypes with alleles 1–1 at the second and third loci considered “risk haplotypes,” increasing the log-odds of disease by an amount β. The odds-ratio per high-risk haplotype was set to either 1.25, 1.50, 2.0, or 2.5. The prevalence of the disease in the general population was set to 0.10, which determines the intercept parameter. For each scenario we calculated the sample size required to achieve an asymptotic power of either 80% or 90%, with a nominal Type-I error rate of 5%, and an equal number of cases and controls, and calculated the empiric power by simulations. For these simulations the prevalence was combined with the distribution of the haplotypes and the value of β to determine the distribution of haplotypes, conditional on disease status (see expressions 1 and 2). We then sampled from these conditional distributions to generate genotypes for cases and controls, which were then analyzed by the score statistics.

Simulation Results

Quantitative Trait F-test: Type-I Error

The results in Table 2 show that the score statistic can be conservative for small sample sizes (N < 500), particularly for a nominal Type-I error rate of 0.01; for N= 500 and a nominal Type-I error rate of 0.05, the score statistic behaved as expected. In contrast, the F-test maintained the appropriate Type-I error rate for almost all situations of Hardy Weinberg equilibrium and disequilibrium, in agreement with findings by Zaykin et al. (2002).

Table 2. Empirical Type-I error rates for the score statistic and F-test when there is Hardy-Weinberg equilibrium (HWD = 0), excess heterozygotes (HWD = −0.75), and excess homozygotes (HWD = +0.75). Values falling outside the following 95% confidence intervals differ from the nominal Type-I error rate: 0.01: 0.004–0.016; 0.05: 0.037–0.064.

                  Nominal 0.01         Nominal 0.05
HWD      N        Score    F-test      Score    F-test
 0        25      0.0      0.009       0.0      0.052
          50      0.0      0.005       0.008    0.037
         100      0.002    0.010       0.026    0.049
         500      0.014    0.017       0.056    0.060
−0.75     25      0.0      0.007       0.0      0.048
          50      0.0      0.009       0.015    0.047
         100      0.002    0.006       0.029    0.044
         500      0.008    0.009       0.056    0.062
+0.75     25      0.0      0.010       0.0      0.049
          50      0.0      0.011       0.009    0.052
         100      0.002    0.010       0.021    0.046
         500      0.001    0.003       0.048    0.056

Quantitative Trait F-Test: Power

Theoretical and simulated power for a quantitative trait are presented in Table 3. We determined the sample sizes needed for both genotype data (Ngenotype) and diplotype data (Ndiplotype) to achieve theoretical power of either 80% or 90%, for when the R2 for genotype data was either 0.01, 0.05, 0.10, or 0.25. Table 3 also illustrates the diminished value of R2 for diplotype data due to unknown phase. Sample sizes for the simulations were based on the diplotype data (Ndiplotype), so that the power for the diplotype data (last column of Table 3) should be compared to the theoretical values of 80% or 90%. The simulated power for the genotype data simply presents the gain in power if haplotype phase were known. From the last column in Table 3 it can be seen that the simulated power is generally quite close to the theoretical power. Although the simulated power was sometimes lower than predicted, particularly for smaller sample sizes and theoretical power of 90% (see the values falling outside the confidence intervals noted in Table 3), the underestimate of power was less than 5%, and most often less than 3%.

Table 3. Theoretical and simulated power for quantitative trait analysis using F-tests. Simulations were based on the sample size required to achieve the desired power when using diplotype data (N_diplotype). Dist. = haplotype distribution; R2(gen) and R2(dip) = R2 for genotype and diplotype data; beta(gen) = haplotype effect for genotype data; N(gen) and N(dip) = sample sizes required to achieve the theoretical power; Sim.(gen) and Sim.(dip) = simulated power using genotype and diplotype data. Simulated power values falling outside the following 95% confidence intervals differ from the theoretical power: 0.80: 0.775–0.825; 0.90: 0.881–0.919.

Dist.  R2(gen)  R2(dip)  beta(gen)  Power  N(gen)  N(dip)  Sim.(gen)  Sim.(dip)
1      0.01     0.01     0.141      0.8    1355    1355    0.813      0.813
                                    0.9    1731    1731    0.912      0.908
       0.05     0.05     0.316      0.8     265     265    0.799      0.792
                                    0.9     337     337    0.895      0.887
       0.10     0.10     0.447      0.8     129     129    0.800      0.791
                                    0.9     163     163    0.872      0.871
       0.25     0.25     0.707      0.8      48      48    0.815      0.810
                                    0.9      59      59    0.905      0.902
2      0.01     0.007    0.276      0.8    2091    2826    0.919      0.799
                                    0.9    2603    3519    0.974      0.895
       0.05     0.037    0.618      0.8     414     561    0.928      0.785
                                    0.9     512     696    0.978      0.895
       0.10     0.074    0.874      0.8     205     278    0.921      0.773
                                    0.9     251     343    0.971      0.875
       0.25     0.185    1.382      0.8      80     109    0.929      0.756
                                    0.9      95     132    0.975      0.852
3      0.01     0.008    0.163      0.8    2485    2982    0.891      0.791
                                    0.9    3068    3683    0.958      0.888
       0.05     0.042    0.365      0.8     495     595    0.894      0.779
                                    0.9     607     730    0.966      0.909
       0.10     0.083    0.516      0.8     247     297    0.889      0.777
                                    0.9     300     361    0.943      0.870
       0.25     0.208    0.816      0.8      99     119    0.884      0.793
                                    0.9     117     141    0.963      0.876

Case-Control Score Statistic: Type-I Error

The results in Table 4 show that the score statistic can be quite conservative for small sample sizes, and that it is only for large sample sizes (e.g., N= 500) that the empirical Type-I error rate approached the nominal level. This pattern did not seem to be affected by the presence or absence of HW disequilibrium. This suggests that, in practice, it may be worthwhile performing permutations to compute p-values when the sample size is not very large.

Table 4. Empirical Type-I error rates of the score statistic for diplotypes with case-control data. Values falling outside the following 95% confidence intervals differ from the nominal Type-I error rate: 0.01: 0.004–0.016; 0.05: 0.037–0.064.

                  Nominal Type-I error rate
HWD      N        0.01     0.05
 0        25      0.001    0.020
          50      0.0      0.013
         100      0.003    0.014
         500      0.003    0.034
−0.75     25      0.013    0.048
          50      0.002    0.009
         100      0.003    0.021
         500      0.004    0.032
+0.75     25      0.0      0.002
          50      0.0      0.001
         100      0.001    0.011
         500      0.009    0.038

Case-Control Score Statistic: Power

Theoretical and simulated power for case-control data are presented in Table 5. We determined the sample sizes needed for both genotype data (Ngenotype) and diplotype data (Ndiplotype) to achieve theoretical power of either 80% or 90%, for when the OR for the high-risk haplotypes, using genotype data, was either 1.25, 1.5, 2.0, or 2.5. Sample sizes for the simulations were based on the diplotype data (Ndiplotype), so that the power for the diplotype data (last column of Table 5) should be compared to the theoretical values of 80% or 90%. The simulated power for the genotype data illustrates the gain in power if haplotype phase were known. From the last column in Table 5 it can be seen that the simulated power is generally quite close to the theoretical power. Although the sample sizes were generally large, it is somewhat surprising that the simulated power is quite close to the theoretical predictions for the smaller sample sizes (e.g., less than 500), because simulations under the null hypothesis suggest that the score test tends to be conservative for small sample sizes. These encouraging results suggest that our theoretical power predictions should be accurate and useful for planning studies.

Table 5. Theoretical and simulated power for case-control data using score statistics. Simulations were based on the sample size required to achieve the desired power when using diplotype data (N_diplotype). Dist. = haplotype distribution; N(gen) and N(dip) = sample sizes required to achieve the theoretical power; Sim.(gen) and Sim.(dip) = simulated power using genotype and diplotype data. Simulated power values falling outside the following 95% confidence intervals differ from the theoretical power: 0.80: 0.775–0.825; 0.90: 0.881–0.919.

Dist.  OR    Power  N(gen)  N(dip)  Sim.(gen)  Sim.(dip)
1      1.25  0.8     2214    2214   0.815      0.815
             0.9     2830    2830   0.899      0.898
       1.5   0.8      684     684   0.798      0.792
             0.9      876     876   0.904      0.903
       2.0   0.8      248     248   0.809      0.805
             0.9      316     316   0.912      0.909
       2.5   0.8      150     150   0.829      0.829
             0.9      192     192   0.908      0.903
2      1.25  0.8    11978   15982   0.936      0.798
             0.9    14934   19926   0.978      0.891
       1.5   0.8     3454    4566   0.918      0.793
             0.9     4306    5692   0.975      0.905
       2.0   0.8     1112    1452   0.916      0.799
             0.9     1386    1810   0.971      0.909
       2.5   0.8      616     798   0.923      0.765
             0.9      768     996   0.976      0.901
3      1.25  0.8     5134    6110   0.886      0.814
             0.9     6350    7558   0.96       0.895
       1.5   0.8     1526    1806   0.891      0.81
             0.9     1888    2234   0.952      0.884
       2.0   0.8      518     608   0.882      0.792
             0.9      642     754   0.954      0.917
       2.5   0.8      300     352   0.896      0.786
             0.9      372     434   0.941      0.891

Discussion

We derived the non-centrality parameters for score statistics and F-tests to test the association of a quantitative trait with haplotypes, when the haplotype data is either directly or indirectly observed (i.e., genotypes versus diplotypes), and analogous non-centrality parameters for score statistics when using case-control data. Our simulations illustrate that the theoretical power predictions are quite accurate, and hence should provide guidance when designing haplotype association studies. A significant advantage of our approach is that it is easy to determine sample size and power for planning studies when the distribution of haplotypes is available from prior studies. A second advantage is that it is straightforward to evaluate the sizes of the effects of specified haplotypes that are needed to achieve a specified value of R2, the percentage of the variance of a quantitative trait explained by the haplotypes. A third advantage is that we provide a means to evaluate the relative efficiency of using phase-known versus phase-unknown haplotype data. Finally, we extended our calculations to evaluate the impact of choosing tag-SNPs on power.

Our statistical approach for diplotype data is similar to that presented by Chapman et al. (2003), but in a different context. Chapman et al. assume that haplotype phase is known, and that the observed markers on haplotypes are used to predict the unobserved genotype at a causal locus. In that context, they showed that the non-centrality parameter is η = (N − 1)R2_C R2_Y, where R2_C is the multiple correlation coefficient for the prediction of the causal genotype from the phased marker haplotype data, and R2_Y is the multiple correlation coefficient for the prediction of the trait from the causal genotype. Their goal was to choose haplotype-tag SNPs (ht-SNPs) that maximize R2_C. Because they assume the causal genotype has only two alleles, and that the allelic effects are additive, there is only one regression coefficient. In contrast, we consider K haplotypes, requiring (K − 1) regression coefficients, making the non-centrality parameter a quadratic form.

An advantage of our approach is that it allows one to evaluate the impact of the choice of tag-SNPs on the power of haplotype association studies when diplotype data is observed. For example, one can assign to haplotypes their frequencies and effect sizes, and then compute power for diplotype data assuming that all markers are measured. Then, a subset of tag-SNPs could be chosen, and power recomputed with this subset. The diplotype categories for the complete set of markers will be collapsed to a reduced set of new diplotype categories for the subset of tag-SNPs, resulting in loss of power, which can be evaluated by our general approach.

Somewhat similar to our approach, Stram et al. (2003b) provide a method to choose haplotype-tag SNPs, such that the tag SNPs provide accurate predictions of the underlying haplotypes. Their method is based on an R2_haplotype for predicting a particular haplotype from the tag-SNPs. An advantage of their approach is that if a specific sample size N is needed to achieve adequate power when all markers are measured, then the increase in sample size needed to allow for increased haplotype ambiguity due to selecting only a subset of haplotype tag-SNPs can be readily computed as N/R2_haplotype. This approach, however, does not consider a global test statistic to evaluate all haplotypes simultaneously, nor does it provide the initial sample size (N) needed for when all markers are measured. In contrast, our methods consider all haplotypes simultaneously via the F-test for quantitative traits, or a global score statistic for case-control data, both for all markers and for a subset of tag-SNPs.

Software

Our software for power calculations, called haplo.power, is written in the S programming language, runs in both the S-PLUS and R computing environments, and will be available from our Web site (http://mayoresearch.mayo.edu/mayo/research/biostat/schaid.cfm).

Acknowledgements

This work was supported by the U.S. Public Health Service, National Institutes of Health, contract grant number GM65450.
