## Introduction

Evaluating the association of haplotypes with disease or quantitative phenotypes can be a potent method to study the genetic basis of complex human traits. Although a wide variety of statistical methods have been developed to test associations with haplotypes, and to estimate parameters that describe haplotype associations (for a review, see Schaid, 2004), there has been little guidance on how to determine sample size and power for haplotype association studies. The purpose of this report is to illustrate how sample size and power can be computed for studies of the association of haplotypes with traits. Both directly measured haplotypes and haplotypes with unknown phase are considered, as well as subsets of markers used to tag haplotypes. Furthermore, computations are illustrated for both quantitative traits and retrospective case-control study designs.

When multiple marker loci are discussed in the literature, the terms used to describe phased versus un-phased marker haplotype data are often inconsistent. Hence, before we present the technical methods on how to compute power for haplotype association studies, it is worthwhile clarifying the terminology that we shall use, as defined elsewhere (Brumfield *et al.* 2000; Clayton *et al.* 2004). The term *haplotype* refers to the set of alleles at linked loci on a single chromosome, inherited from a subject's parent. A pair of autosomal haplotypes, one haplotype from each parent, is a *genotype*. When the haplotypes are directly measured, the combination of alleles on each haplotype is known, which is referred to as *phased* marker data. When considering alleles from multiple marker loci whose parental origins are unknown, the underlying pair of haplotypes for a subject is ambiguous when the subject has more than one heterozygous marker locus. That is, marker phase is unknown. The collection of single-locus marker genotypes whose phase is unknown is a *diplotype*. Hence, it is possible, and very likely, for a diplotype to have more than one underlying genotype (pair of haplotypes). As a cautionary note, the terms diplotype and genotype are sometimes used in the literature according to the ways we have defined them, and sometimes interchanged.

When measuring a single genetic marker, it is well known that the factors that influence power are the size of the effect of the causative genotype on the phenotype, the frequency of the causative allele, the amount of linkage disequilibrium (LD) between a causative allele and the measured genetic marker, and how closely the allele frequencies match between a causative allele and the marker allele (Zondervan & Cardon, 2004). When measuring multiple marker loci, the strength of LD among the marker loci will additionally influence power. Although the benefits of haplotype analyses versus multi-marker tests for association that ignore phase have been widely discussed and debated in the literature (e.g., Chapman *et al.* 2003; Clayton *et al.* 2004), it appears that the greatest gain in power provided by haplotype analyses occurs when linkage disequilibria exist among the marker loci at orders higher than pair-wise LD (Nielsen *et al.* 2004).

O'Hely & Slatkin (2003) provided a quantitative evaluation of the increase in sample size needed to compensate for unknown haplotype phase when performing case-control studies. They assumed that the causative locus has two alleles with a multiplicative effect on the disease penetrance. To compare sample sizes between situations where phase is unknown versus where it is known, they derived a ratio *Q*, where the numerator of *Q* is the sample size needed for a specified power when phase is unknown, and the denominator is that for when phase is known. When *Q* > 1, there is a loss in efficiency due to unknown haplotype phase. They showed that the value of *Q* depends on the pattern of LD between the causative allele and the marker alleles, but *Q* is independent of the frequency of the causative allele. In general, *Q* is large when the amount of LD among the markers is weak, but *Q* approaches unity when the markers are in strong LD. These results indicate that the allele frequencies and strength of LD among the markers will have a large impact on the sample size of a study when haplotype phase is unknown. Although the results from O'Hely & Slatkin (2003) provide useful guidelines, they are restricted to case-control studies, with the assumption of multiplicative effects of the causative allele. Also, because their derivations are based on disequilibria parameters and formulae that depend on the number of marker loci, it is cumbersome to implement them when an investigator assumes a set of haplotypes and their frequencies, and corresponding odds-ratios for pairs of haplotypes.

In the Statistical Methods section, we derive methods to compute sample size and power for the association of haplotypes with either quantitative traits or disease status in retrospective case-control studies. One of our main assumptions is that an investigator knows the distribution of haplotypes in the target population. This is a reasonable assumption, because it is common for investigators to genotype a sample of subjects in order to gain preliminary data on the distribution of markers in the genomic region of interest. Furthermore, data available in public databases, such as from the International HapMap project (http://www.hapmap.org/), provide this type of information.

For quantitative traits, power depends on the multiple correlation coefficient, *R*^{2}, which provides a simple measure of the percentage of the variance of the quantitative trait that is explained by the haplotypes. However, the value of *R*^{2} is not very informative, because *R*^{2} depends on both the sizes of the effects of the haplotypes on the trait and the distribution of the haplotypes in the population. For this reason, we specify the alternative hypothesis according to the shift in the mean of the trait caused by the haplotypes (i.e., the regression of the quantitative trait on a haplotype design matrix), and illustrate how the specified regression coefficients and haplotype frequencies can be translated into an *R*^{2} value. For case-control studies, we specify the alternative hypothesis according to the anticipated genotype odds ratios. We show how the relative efficiency of genotype versus diplotype data can be computed, in order to evaluate the increase in sample size due to unknown linkage phase. Finally, we consider the impact of choosing a subset of markers to tag haplotypes on power.

### Statistical Methods

The methods proposed to compute sample size and power are based on either the non-central chi-square distribution, or the non-central F-distribution. The general form for the power function is *Power* = *P*(*T* > *c*; *df*, *NCP*) = 1 − *P*(*T* ≤ *c*; *df*, *NCP*), where *P*() is computed from either the non-central chi-square distribution or the non-central F-distribution, *T* is the test statistic, *c* is the critical value that corresponds to a specified Type-I error rate, *df* is the degrees of freedom (either a single *df* for the chi-square distribution, or both a numerator and denominator *df* for the F-distribution), and *NCP* is the non-centrality parameter. The *NCP* depends on the product of the sample size and a term that measures the strength of effect of the haplotypes on the trait. Hence, the main focus of this paper is calculation of the *NCP* for various scenarios.
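As a rough illustration of the chi-square case of this power function, the sketch below evaluates the non-central chi-square distribution as a Poisson mixture of central chi-square distributions, using only the Python standard library. The function names (`gamma_p`, `ncx2_cdf`, `chi2_critical`, `power_chi2`) and the argument `effect` (the per-subject term that multiplies the sample size to give the *NCP*) are assumptions made for this example, and the series used are only adequate for moderate values of the *NCP*. It is not the implementation used for the results in this paper.

```python
import math

def gamma_p(a, x):
    """Regularized lower incomplete gamma function P(a, x), by power series."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / a
    total = term
    n = 1
    while abs(term) > 1e-16 * abs(total) and n < 10000:
        term *= x / (a + n)
        total += term
        n += 1
    return total * math.exp(-x + a * math.log(x) - math.lgamma(a))

def chi2_cdf(x, df):
    """Central chi-square CDF."""
    return gamma_p(df / 2.0, x / 2.0)

def ncx2_cdf(x, df, ncp, terms=200):
    """Non-central chi-square CDF as a Poisson(ncp / 2) mixture of central CDFs."""
    lam = ncp / 2.0
    weight = math.exp(-lam)  # Poisson probability of j = 0
    total = 0.0
    for j in range(terms):
        total += weight * chi2_cdf(x, df + 2 * j)
        weight *= lam / (j + 1)
    return total

def chi2_critical(alpha, df):
    """Critical value c with P(T > c) = alpha under the null, by bisection."""
    lo, hi = 0.0, 1.0
    while chi2_cdf(hi, df) < 1.0 - alpha:
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if chi2_cdf(mid, df) < 1.0 - alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def power_chi2(n, effect, df, alpha=0.05):
    """Power = 1 - P(T <= c; df, NCP), with NCP = n * effect."""
    c = chi2_critical(alpha, df)
    return 1.0 - ncx2_cdf(c, df, n * effect)

# Hypothetical inputs: 200 subjects, per-subject effect 0.05, 3 degrees of freedom.
print(round(power_chi2(200, 0.05, df=3), 3))
```

Sample size for a target power can then be found by increasing `n` until `power_chi2` reaches the target, since the power is monotone in the *NCP*.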

### Power for Association of Haplotypes with Quantitative Traits

For a quantitative trait *Y*, we assume the regression equation *Y*=β_{o}+β′*X*+ɛ, where β_{o} is the intercept, the vector β represents the effects of the haplotypes on the average trait value, the vector *X* has quantitative codes for the pair of haplotypes a subject possesses, and ɛ is an error term with mean zero. A critical aspect of this regression equation is how a pair of haplotypes is coded in the vector *X*. For example, assuming that there are *K* unique haplotypes, one can evaluate the effect of each unique pair of haplotypes [*K*(*K*+1)/2 such pairs], although this approach is likely to have weak power because of the large number of degrees of freedom. Alternatively, one can consider *K* terms, treating haplotypes as having either dominant, recessive or additive effects. For our exposition we shall assume additive effects of haplotypes. This simplifies the presentation of the score statistics, and this approach will likely have sufficient power as long as the true effects are not recessive (Schaid, 2002). However, our methods are general, allowing investigators to choose the type of genetic effects that they wish to assess.

When considering the additive effects of haplotypes on a trait, we create scores for a subject's pair of haplotypes, where the scores represent haplotype dosages, and place these scores into a vector denoted *X*_{i} for the *i*^{th} subject. The *k*^{th} element of *X*_{i} is either 0, 1, or 2, according to the number of haplotypes of type *k*. If there are *K* unique haplotype categories, *X*_{i} would have length *K*. However, for identifiability, one haplotype category is ignored, treating it as a “baseline” for the construction of a design matrix, so the length of *X*_{i} is (*K*−1). For other types of scorings, such as dominant or recessive, see Schaid (1996).
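The additive coding described above can be sketched in a few lines. The haplotype labels and the helper name `dosage_scores` are hypothetical and serve only to show how a pair of haplotypes maps to a score vector of length (*K*−1) once a baseline category is dropped.

```python
def dosage_scores(genotype, haplotypes, baseline):
    """Additive score vector X_i: copies (0, 1 or 2) of each non-baseline haplotype."""
    columns = [h for h in haplotypes if h != baseline]
    return [sum(1 for h in genotype if h == col) for col in columns]

# Hypothetical haplotype labels for two SNPs; "AC" is treated as the baseline.
haplotypes = ["AC", "AT", "GC", "GT"]

print(dosage_scores(("AT", "GT"), haplotypes, baseline="AC"))  # [1, 0, 1]
print(dosage_scores(("AT", "AT"), haplotypes, baseline="AC"))  # [2, 0, 0]
print(dosage_scores(("AC", "AC"), haplotypes, baseline="AC"))  # [0, 0, 0]
```

A subject who carries two copies of the baseline haplotype receives a vector of zeros, so each regression coefficient is interpreted relative to the baseline.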

To test the association of haplotypes with a quantitative trait, score statistics for either genotype or diplotype data have been proposed and discussed elsewhere (Schaid *et al.* 2002; Chapman *et al.* 2003; Clayton *et al.* 2004). For large samples, score statistics have a chi-square distribution. Alternatively, one could use an F-test to compare the regression of the trait on the haplotype design matrix with the null model that has only an intercept (Zaykin *et al.* 2002). The numerator degrees of freedom of the F-test is the same as the degrees of freedom for the chi-square score statistic, and for large sample sizes the null distributions of the score statistic and F-test converge to the same chi-square distribution. However, as we show by simulations, the score statistic can be conservative for small sample sizes. Furthermore, the F-test may give greater power than the score statistic when the effects of haplotypes are large. Although we shall base our sample size and power derivations on the F-test, it is easiest to first consider derivations for the score statistic, and then make minor changes to give results for the F-test. This approach highlights some of the statistical deficiencies of the score statistics, which is important because of the attention they have received in the literature.

### Genotype Data

To illustrate our derivations, we first assume that we observe genotypes (i.e., haplotypes are directly observed). The score statistic to test the null hypothesis of no association, *H*_{o}:β_{1}=⋯=β_{K−1}= 0, is *T*=*U*′*V*^{−1}*U*, where

$$U = \sum_{i=1}^{N} (Y_i - \bar{Y})\,X_i, \qquad V = \hat{\sigma}^2_Y \sum_{i=1}^{N} (X_i - \bar{X})(X_i - \bar{X})',$$

and where *N* is the number of subjects and $\hat{\sigma}^2_Y$ is the sample variance of *Y*. If other covariates are to be adjusted for, then in place of $\hat{\sigma}^2_Y$ we would use the mean square error of the regression of the trait on only the other covariates. The statistic *T* has an asymptotic chi-square distribution with (*K*−1) degrees of freedom. Under an alternative hypothesis, *T* has a non-central chi-square distribution with a non-centrality parameter that can be computed by replacing *U* and *V* by their expected values in the expression for *T*. This non-centrality parameter can then be used to compute either sample size or power.

To determine the expected value of *U*, we use the basic regression equation, *Y*_{i}=β_{o}+β′*X*_{i}+ɛ_{i}, subtract μ_{Y} (the population expected value of *Y*) from both sides of the equation, multiply both sides of the equation by *X*′, and then take the expectation,

$$E[(Y - \mu_Y)\,X'] = (\beta_o - \mu_Y)\,E[X'] + \beta' E[XX'].$$

Substituting β_{o}=μ_{Y}−β′*E*[*X*] into the above expression and rearranging leads to *E*[*U*]=*N*β′*V*_{X}, where

$$V_X = \sum_{G} (X_G - E[X_G])(X_G - E[X_G])'\,P(G).$$

In this expression for *V*_{X}, *G* denotes a genotype (a pair of haplotypes) and *P*(*G*) its probability, and the summation is over all possible genotypes. The value of *E*[*X*_{G}] is computed over this same distribution. Assuming Hardy-Weinberg (HW) proportions for pairs of haplotypes, *P*(*G*=*h*_{i}, *h*_{j}) =[2 −*I*(*i*=*j*)]*p*_{i}*p*_{j}, where *I*() has values of 1 or 0 according to whether its argument is true or false, and *p*_{i} is the population frequency of the *i*^{th} haplotype. The expected value of *V* is simply *E*[*V*]=*N*σ^{2}_{Y}*V*_{X}. Substituting *E*[*U*] and *E*[*V*] into the expression for *T* results in the non-centrality parameter for the score statistic,

$$NCP = \frac{N\,\beta' V_X \beta}{\sigma^2_Y}.$$
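To make the genotype-data calculation concrete, the following sketch enumerates all pairs of haplotypes under HW proportions, builds *V*_{X} from additive dosage scores, and evaluates the score statistic non-centrality parameter as *N*β′*V*_{X}β/σ^{2}_{Y}. The haplotype frequencies, effect sizes, and function names are hypothetical choices for illustration. Haplotype index 0 is taken as the baseline category.

```python
from itertools import combinations_with_replacement

def hw_genotypes(freqs):
    """Yield ((i, j), P(G)) for unordered haplotype pairs under Hardy-Weinberg."""
    for i, j in combinations_with_replacement(range(len(freqs)), 2):
        yield (i, j), (1.0 if i == j else 2.0) * freqs[i] * freqs[j]

def dosage(pair, n_cols):
    """Additive scores for haplotypes 1..K-1 (haplotype 0 is the baseline)."""
    return [sum(1 for h in pair if h == k + 1) for k in range(n_cols)]

def v_x(freqs):
    """Covariance matrix V_X of the dosage scores over the genotype distribution."""
    m = len(freqs) - 1
    geno = [(dosage(g, m), p) for g, p in hw_genotypes(freqs)]
    mean = [sum(p * x[a] for x, p in geno) for a in range(m)]
    return [[sum(p * (x[a] - mean[a]) * (x[b] - mean[b]) for x, p in geno)
             for b in range(m)] for a in range(m)]

def quad_form(beta, v):
    """beta' V beta."""
    m = len(beta)
    return sum(beta[a] * v[a][b] * beta[b] for a in range(m) for b in range(m))

def ncp_score(n, beta, freqs, var_y):
    """Score statistic NCP = N * beta' V_X beta / sigma^2_Y."""
    return n * quad_form(beta, v_x(freqs)) / var_y

# Hypothetical inputs: four haplotypes, only the second one shifts the trait mean.
freqs = [0.4, 0.3, 0.2, 0.1]
beta = [0.5, 0.0, 0.0]
print(ncp_score(100, beta, freqs, var_y=1.0))
```

Under HW proportions with additive coding, the diagonal elements of `v_x(freqs)` equal 2*p*_{k}(1 − *p*_{k}) and the off-diagonal elements equal −2*p*_{j}*p*_{k}, which gives a quick check on the enumeration.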

### Diplotype Data

When marker data are measured without directly observing the underlying pairs of haplotypes, haplotypes are ambiguous when a subject has more than one heterozygous marker locus. The score statistic for diplotype data is similar to that for genotype data, but with *X*_{i} replaced by its expectation given the observed diplotype data. Using *D*_{i} to denote a diplotype (i.e., the un-phased marker data), this conditional expectation is

$$X^{*}_i = E[X_i \mid D_i] = \sum_{G \in D_i} X_G\,P(G \mid D_i),$$

where the summation is over only those genotypes that are consistent with diplotype *D*_{i}, and *P*(*G*∣*D*_{i}) is the posterior probability of genotype *G*, given *D*_{i},

$$P(G \mid D_i) = \frac{P(G)}{\sum_{G' \in D_i} P(G')}.$$

These posterior probabilities are calculated under the null hypothesis. In practice, this is accomplished by using the expectation-maximization (EM) algorithm to estimate the haplotype frequencies (Excoffier & Slatkin, 1995), assuming Hardy-Weinberg proportions for pairs of haplotypes, and using the posterior probabilities that contribute to the EM algorithm. Because of the ambiguity of haplotypes, the $X^{*}_i$ scores tend to be shrunk toward the overall mean score, $E[X^{*}]$. This implies that diplotype data will have less power than genotype data. This approach for incomplete data is well founded (Dempster *et al.* 1977), and is the basis for the development of score statistics for ambiguous haplotypes and general traits (Schaid *et al.* 2002; Chapman *et al.* 2003; Clayton *et al.* 2004), as well as a simplified expectation substitution regression method for incomplete haplotype data (Zaykin *et al.* 2002).
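The posterior weighting can be illustrated for two biallelic markers, where the only ambiguous diplotype is the double heterozygote. In this sketch the haplotype frequencies are taken as known (hypothetical values) rather than estimated by the EM algorithm, haplotypes are written as strings of 0/1 alleles, and a diplotype is represented by the count of "1" alleles at each locus. These representations and the function names are assumptions made for the example.

```python
from itertools import combinations_with_replacement

def diplotype_of(h1, h2):
    """Un-phased data: number of '1' alleles at each locus."""
    return tuple(int(a) + int(b) for a, b in zip(h1, h2))

def posterior(diplotype, freqs):
    """P(G | D) under Hardy-Weinberg, over genotypes G consistent with D."""
    probs = {}
    for h1, h2 in combinations_with_replacement(sorted(freqs), 2):
        if diplotype_of(h1, h2) == diplotype:
            probs[(h1, h2)] = (1.0 if h1 == h2 else 2.0) * freqs[h1] * freqs[h2]
    total = sum(probs.values())  # P(D); assumed to be positive
    return {g: p / total for g, p in probs.items()}

def expected_scores(diplotype, freqs, baseline):
    """X* = E[X | D]: posterior-weighted additive dosage of each non-baseline haplotype."""
    cols = [h for h in sorted(freqs) if h != baseline]
    post = posterior(diplotype, freqs)
    return [sum(p * g.count(c) for g, p in post.items()) for c in cols]

# Hypothetical haplotype frequencies for two SNPs.
freqs = {"00": 0.4, "01": 0.1, "10": 0.2, "11": 0.3}

# Double heterozygote: consistent with 00/11 and with 01/10.
print(posterior((1, 1), freqs))
print(expected_scores((1, 1), freqs, baseline="00"))
```

For these frequencies the pair 00/11 receives posterior probability 0.24/0.28 = 6/7 and the pair 01/10 receives 1/7, so the expected dosage of each haplotype for the double heterozygote is fractional rather than 0 or 1.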

Parallel to the non-centrality parameter for genotype data, the non-centrality parameter for diplotype data can be shown to be

$$NCP = \frac{N\,\beta' V_{X^{*}} \beta}{\sigma^2_Y},$$

where now we use

$$V_{X^{*}} = \sum_{D} (X^{*}_D - E[X^{*}])(X^{*}_D - E[X^{*}])'\,P(D).$$

Here, the expectation is with respect to the distribution of all diplotype configurations. The probability of diplotype *D* is the sum of the probabilities of the genotypes that are consistent with *D*. Note that *E*[*X**] in *V*_{X*} is equivalent to *E*[*X*] in *V*_{X}.
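The loss of information from unknown phase can be examined by computing both covariance matrices from the same haplotype frequencies. The sketch below, with hypothetical frequencies and function names, collapses genotypes onto diplotypes for two biallelic markers and compares β′*V*_{X*}β with β′*V*_{X}β. Because the score statistic non-centrality parameter is proportional to *N*, the inverse of this ratio approximates the factor by which the sample size would need to grow for diplotype data to match genotype data. It assumes every haplotype frequency is positive.

```python
from itertools import combinations_with_replacement

def genotype_table(freqs, baseline):
    """Rows of (diplotype, dosage vector X_G, P(G)) under Hardy-Weinberg."""
    cols = [h for h in sorted(freqs) if h != baseline]
    rows = []
    for g in combinations_with_replacement(sorted(freqs), 2):
        p = (1.0 if g[0] == g[1] else 2.0) * freqs[g[0]] * freqs[g[1]]
        d = tuple(int(a) + int(b) for a, b in zip(*g))  # '1' alleles per locus
        rows.append((d, [g.count(c) for c in cols], p))
    return rows

def covariance(points):
    """Covariance matrix of a list of (vector, probability) pairs."""
    m = len(points[0][0])
    mean = [sum(p * x[a] for x, p in points) for a in range(m)]
    return [[sum(p * (x[a] - mean[a]) * (x[b] - mean[b]) for x, p in points)
             for b in range(m)] for a in range(m)]

def v_matrices(freqs, baseline):
    """Return (V_X, V_X*) for genotype and diplotype data."""
    rows = genotype_table(freqs, baseline)
    v_x = covariance([(x, p) for _, x, p in rows])
    by_d = {}  # diplotype -> (P(D), sum over consistent G of P(G) * X_G)
    for d, x, p in rows:
        pd, sx = by_d.get(d, (0.0, [0.0] * len(x)))
        by_d[d] = (pd + p, [s + p * xi for s, xi in zip(sx, x)])
    v_xstar = covariance([([s / pd for s in sx], pd) for pd, sx in by_d.values()])
    return v_x, v_xstar

def quad(beta, v):
    """beta' V beta."""
    m = len(beta)
    return sum(beta[a] * v[a][b] * beta[b] for a in range(m) for b in range(m))

# Hypothetical inputs: only haplotype "11" affects the trait (columns: 01, 10, 11).
freqs = {"00": 0.4, "01": 0.1, "10": 0.2, "11": 0.3}
beta = [0.0, 0.0, 1.0]
v_x, v_xstar = v_matrices(freqs, baseline="00")
efficiency = quad(beta, v_xstar) / quad(beta, v_x)
print(efficiency)
```

For these inputs only the double heterozygote is ambiguous, and the ratio works out to 45/49, so roughly 9% more subjects would be needed with un-phased data under this particular scenario.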

### Relationship of Non-Centrality Parameters: Score Statistic and F-Test

One of the main distinctions between the score statistic and the F-test is that the score statistic performs computations under the null hypothesis, and so uses the sample variance of *Y*, whereas the F-test performs the regression of *Y* on the haplotype design matrix, and uses the variance of the residuals. A second distinction is that the F-test makes use of the degrees of freedom used to estimate the residual variance, which is important for small sample sizes. To see the impact of the residual variance on the non-centrality parameter, we first illustrate how the score statistic non-centrality parameter can be expressed in terms of the usual multiple correlation coefficient, *R*^{2}, for the regression of the trait on the haplotype scores.

The multiple correlation coefficient portrays the percentage of the total variance of *Y* that is explained by the haplotypes. As we shall show, *R*^{2}=β′*V*β/σ^{2}_{Y}, where *V* is either *V*_{X} for genotype data, or *V*_{X*} for diplotype data. To see why this is so, note that the total variance of *Y* can be partitioned as *V*(*Y*) =σ^{2}_{Y}=*E*[*V*(*Y*∣*X*)]+*V*(*E*[*Y*∣*X*]). The multiple correlation coefficient is *R*^{2}=*V*(*E*[*Y*∣*X*])/σ^{2}_{Y}. To determine *V*(*E*[*Y*∣*X*]) for genotype data, we need to average the squared difference between the predictions of *Y* based on a model with haplotype effects versus those based on a model without haplotype effects,

$$V(E[Y \mid X]) = \sum_{G} \left[(\beta_o + \beta' X_G) - \mu_Y\right]^2 P(G).$$

After substituting β_{o}=μ_{Y}−β′*E*[*X*_{G}] into the above equation and rearranging,

$$V(E[Y \mid X]) = \sum_{G} \left[\beta'(X_G - E[X_G])\right]^2 P(G) = \beta' V_X \beta.$$

A similar proof follows for diplotype data. Hence, the non-centrality parameter for the score statistic can be expressed as *NCP*=*NR*^{2}. This illustrates that the percentage of the variance of *Y* explained by the haplotypes depends on the magnitude of the haplotype effects, β, as well as the distribution of the genotypes when haplotypes are directly observed, or the distribution of the diplotypes when haplotypes are not directly observed. Small values of β with large probabilities [either *P*(*G*) or *P*(*D*)], or large values of β with small probabilities, can give the same value of *R*^{2}, and hence the same power.

Note that the score statistic uses computations under the null hypothesis, which is why σ^{2}_{Y} is used in the denominator of the non-centrality parameters. In contrast, if we actually perform the regression of *Y* on *X* and use the least squares estimates of β to compute the variance of the residuals, denoted σ^{2}_{ɛ}, then we would substitute σ^{2}_{ɛ} for σ^{2}_{Y}. But, σ^{2}_{ɛ}=*E*[*V*(*Y*∣*X*)]=σ^{2}_{Y}−*V*(*E*[*Y*∣*X*]), so that σ^{2}_{ɛ}=σ^{2}_{Y}−β′*V*_{X}β. Using this latter term in the denominator of the non-centrality parameter and simplifying results in

$$NCP = \frac{N\,\beta' V_X \beta}{\sigma^2_Y - \beta' V_X \beta} = \frac{N R^2}{1 - R^2},$$

which is the non-centrality parameter for the F-test. This implies that the score statistic and F-test will have similar power when *R*^{2} is small, but that the F-test can have greater power as *R*^{2} increases.
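The two non-centrality parameters, *NR*^{2} for the score statistic and *NR*^{2}/(1 − *R*^{2}) for the F-test, are easy to tabulate side by side. The short sketch below does so for a few hypothetical values of *R*^{2} with *N* = 200. The function names are illustrative only.

```python
def ncp_score(n, r2):
    """Score statistic NCP: N * R^2."""
    return n * r2

def ncp_f(n, r2):
    """F-test NCP: N * R^2 / (1 - R^2)."""
    return n * r2 / (1.0 - r2)

# Hypothetical sample size and range of R^2 values.
n = 200
for r2 in (0.01, 0.05, 0.10, 0.25):
    s, f = ncp_score(n, r2), ncp_f(n, r2)
    print(f"R2={r2:.2f}  score NCP={s:6.2f}  F NCP={f:6.2f}  ratio={f / s:.3f}")
```

The ratio of the two parameters is 1/(1 − *R*^{2}), so the difference is negligible when haplotypes explain about 1% of the trait variance and becomes appreciable only when *R*^{2} is large.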

Because of the conservativeness of the score statistic, and its potential for lesser power than the F-test, we propose use of the F-test non-centrality parameter for computing sample size and power. One could simply assume a given magnitude of *R*^{2} to compute the *NCP*, but whether it is feasible to achieve a given value of *R*^{2} will depend on the distribution of haplotypes and the magnitude of the corresponding regression coefficients. For this reason, we advocate using existing data to estimate the distribution of haplotypes, using this to compute *V*_{X}, and then with specified β, compute *R*^{2}=β′*V*_{X}β/σ^{2}_{Y}. Or, for a given value of *R*^{2}, determine the necessary values of β. This would provide greater insight into whether the anticipated *R*^{2} can be achieved with realistic haplotype effects represented by β.
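Both directions of this translation can be sketched briefly. The example assumes HW proportions and additive coding, in which case *V*_{X} has the closed form 2[diag(*p*) − *pp*′] over the non-baseline haplotype frequencies *p* (the covariance of a multinomial count with two draws). The frequencies, effect sizes, and function names are hypothetical. For the reverse direction, the sketch rescales a chosen pattern of effects by a common factor, which is only one of many ways to reach a target *R*^{2}.

```python
import math

def v_x_additive(p):
    """V_X for additive dosage under Hardy-Weinberg: 2 * (diag(p) - p p')."""
    m = len(p)
    return [[2.0 * ((p[a] if a == b else 0.0) - p[a] * p[b]) for b in range(m)]
            for a in range(m)]

def r_squared(beta, p, var_y):
    """R^2 = beta' V_X beta / sigma^2_Y."""
    v = v_x_additive(p)
    m = len(beta)
    return sum(beta[a] * v[a][b] * beta[b] for a in range(m) for b in range(m)) / var_y

def scale_beta(beta, p, var_y, target_r2):
    """Rescale beta by a common factor so that R^2 equals target_r2."""
    factor = math.sqrt(target_r2 / r_squared(beta, p, var_y))
    return [b * factor for b in beta]

# Hypothetical non-baseline haplotype frequencies (baseline frequency is 0.4).
p = [0.3, 0.2, 0.1]
beta = [0.5, 0.0, 0.0]  # shift in trait mean per haplotype copy, in trait units

print(r_squared(beta, p, var_y=1.0))
print(scale_beta(beta, p, var_y=1.0, target_r2=0.05))
```

Seeing the rescaled β in trait units makes it easier to judge whether a proposed *R*^{2} corresponds to a plausible per-haplotype shift in the trait mean.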