Using Tree-Based Recursive Partitioning Methods to Group Haplotypes for Increased Power in Association Studies

Authors


*Corresponding Author: Kai Yu, Ph.D., Washington University School of Medicine, Division of Biostatistics, 660 S. Euclid, Campus Box 8067, St. Louis, MO 63110, Phone: (314) 362-3765, Fax: (314) 362-2693. E-mail: kai@wubios.wustl.edu

Summary

Motivated by the increasing availability of high-density single nucleotide polymorphism (SNP) markers across the genome, various haplotype-based methods have been developed for candidate gene association studies, and even for genome-wide association studies. Although haplotype approaches dramatically reduce the multiple comparisons problem (compared with single-SNP analysis), the number of distinct haplotypes can still be relatively large, which increases the degrees of freedom and decreases the power of the corresponding test statistic. Grouping haplotypes is a way to reduce the degrees of freedom. We propose a procedure that uses a tree-based recursive partitioning algorithm to group haplotypes into a small number of clusters, and conducts the association test based on groups of haplotypes instead of individual haplotypes. The method can be used for both population-based and family-based association studies, with known or ambiguous phase information. Simulation studies suggest that the proposed method has the correct type I error rate, and is more powerful than some existing haplotype-based tests.

Introduction

As the number of single nucleotide polymorphism (SNP) markers increases rapidly across the genome, candidate gene and genome-wide association studies with dense SNPs have become popular in the genetic dissection of complex diseases. Typically, a list of candidate genes or regions is selected according to prior knowledge, such as linkage evidence, and a set of high-density markers is identified and genotyped in each candidate gene/region. To assess associations between candidate genes and the trait of interest, we can consider single-marker analyses, or study haplotypes formed by several tightly linked SNP markers. Single-marker analyses can be quite informative; however, the multiple comparisons problem becomes daunting. In general, haplotype-based methods tend to be more powerful in identifying genes predisposing to complex diseases (Akey et al. 2001; Morris & Kaplan, 2002; Zhao et al. 2003), although a recent study by Conti & Gauderman (2004) suggests a multi-marker genotype-based model as a viable alternative to haplotype-based approaches.

Various haplotype-based association tests have been developed for both population-based and family-based association analyses. In each situation a group of case haplotypes is compared with a group of control haplotypes in terms of haplotype frequencies or similarities. In a population-based study, case haplotypes are obtained from affected individuals and control haplotypes from unaffected individuals. In a family-based TDT type of analysis, case haplotypes can be parental haplotypes transmitted to affected offspring, while control haplotypes are the non-transmitted parental haplotypes.

A drawback of haplotype-based association tests is the relatively large number of observed (distinct) haplotypes, which increases the degrees of freedom (df) of the chosen test statistic (Seltman et al. 2001; Thomas et al. 2003; Zhang et al. 2003; Tzeng 2004). For example, with m observed unique haplotypes, a goodness-of-fit test that compares haplotype frequencies in cases with those in controls asymptotically follows a chi-square distribution with m − 1 df under the null hypothesis of no association. The same is true for the haplotype transmission/disequilibrium test for family-based studies (Clayton, 1999), as well as for the global haplotype score test developed for population-based case-control studies (Schaid et al. 2002).

Large degrees of freedom reduce the power of haplotype analyses, thus also limiting the modelling capacity for incorporating other factors. One way to limit the degrees of freedom is to construct tests based on haplotype similarity comparisons (De Vries et al. 1996; Van der Meulen & Te Meerman 1997; Bourgain et al. 2000, 2001, 2002; Zhang et al. 2003; Tzeng et al. 2003; Qian 2004; Yu et al. 2004b; Yu et al. in press). In such an approach, haplotype similarity in cases is compared with that in controls, around each marker in the region. The final test statistic is based on the maximum of the contrast among all comparisons. As a result, the degrees of freedom of the test (in a broad sense) equals the number of markers considered in the analysis (Zhang et al. 2003), which in general is much lower than the number of haplotypes observed.

Another approach, which is the focus of this paper, is to group haplotypes into clusters and perform analyses based on clusters of haplotypes, instead of individual haplotypes (Templeton et al. 1987; Templeton, 1995; Li et al. 2001; Seltman et al. 2001, 2003; Thomas et al. 2003; Molitor et al. 2003b; Tzeng et al. 2003; Durrant et al. 2004; Tzeng, 2004). To group haplotypes a distance metric is needed to quantify the closeness between any given pair of haplotypes. The expectation is that haplotypes grouped into the same cluster will induce similar risks. Most distance metrics in the literature fall into two classes. One class is based on the mutation model, which assumes that after disease susceptibility mutations are introduced, the haplotype diversity is mainly driven by marker mutations, and rarely by recombination over generations (Templeton et al. 1987; Templeton, 1995; Seltman et al. 2001, 2003; Durrant et al. 2004; Tzeng, 2004). The other type of metric is based on the recombination model, which assumes that recombination is the major force causing the variation in present-day haplotypes (De Vries et al. 1996; Van der Meulen & Te Meerman, 1997; Bourgain et al. 2000, 2001, 2002; Thomas et al. 2003; Yu et al. 2004a). Molitor et al. (2003a) suggest a more flexible metric that incorporates both recombination and mutation factors, although it requires an extra user-specified parameter.

In this paper we adopt the recombination model, and develop a haplotype grouping scheme for association studies of binary traits. Under the recombination model, carrier haplotypes that inherited the disease-causing mutation from a common ancestral haplotype can be viewed as trimmed (by recombination over generations) versions of the original ancestral haplotype in the vicinity of the disease locus (MacLean et al. 2000). Building on this reasoning, we develop a clustering procedure based on a recursive partitioning algorithm. The method groups haplotypes into clusters, with each cluster consisting of structurally similar haplotypes. The procedure generates a list of candidate grouping schemes, based on which the chosen test statistic (we use Pearson's chi-square statistic as an illustration) can be calculated. The final global test is based on the smallest empirical p-value estimated from those statistics. The significance level is evaluated through the permutation procedure of Ge et al. (2003). In the setting of family-based association studies, simulation results show that the resulting test has the correct type I error rate in stratified populations, and is more powerful than the test based on the original (ungrouped) haplotype distribution, as well as the test based on the grouped haplotype distribution using the grouping scheme suggested by Tzeng et al. (2003).

Methods

We first describe the method for situations where haplotype phase information is known, and later we extend the discussion to data with ambiguous phase information. Suppose that in a study of a binary trait a candidate region is evaluated with K + 1 tightly linked markers, and that a sample of case haplotypes and a sample of control haplotypes are collected. A haplotype can be represented as a vector H = (h_1, …, h_{K+1}), where h_k ∈ {1, …, a_k} labels the a_k alleles at marker k, 1 ≤ k ≤ K + 1. Let the set of all C observed distinct haplotypes be denoted by H = {H_i, i = 1, …, C}. Denote the corresponding haplotype frequencies observed in the sample of N case haplotypes as p = {p_i, i = 1, …, C}, with $\sum_{i=1}^{C} p_i = 1$, where p_i is the proportion of cases having the haplotype H_i. Similarly, let haplotype frequencies observed in a sample of M control haplotypes be q = {q_i, i = 1, …, C}, and let haplotype frequencies observed in the joint sample combining both cases and controls be r = {r_i, i = 1, …, C}. In candidate gene association studies, the null hypothesis is that there is no association between haplotypes in the considered gene/region and the disease. To test this null hypothesis, the following Pearson's chi-square statistic can be used:

$$\chi^2 = \frac{NM}{N+M}\,\sum_{i=1}^{C}\frac{(p_i - q_i)^2}{r_i} \qquad (1)$$

In the following discussion, we call the chi-square statistic calculated in (1), which is based on the original ungrouped haplotype distribution, the “Original.” For a given haplotype grouping scheme, each haplotype cluster defines a new observation category. The grouped haplotype frequency in a category is the total sum of frequencies of haplotypes belonging to that category. Then the Pearson's chi-square statistic can be calculated based on those grouped frequencies. One type of grouping scheme (called “Group-DR”) is given by Tzeng et al. (2003). In this scheme, a rare haplotype is merged with a corresponding common haplotype that differs from it by a one-step mutation. According to Tzeng et al. (2003), rare haplotypes are those with frequencies less than 0.2rmax, where rmax is the largest haplotype frequency in the joint sample.
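As a concrete sketch of these two pieces, the code below implements the two-sample form of Pearson's statistic and a reading of the Tzeng et al. (2003) merging rule. The function names (`pearson_chi2`, `group_dr`) and the tie-breaking by frequency are our own choices, not taken from the paper.

```python
def pearson_chi2(p, q, N, M):
    """Two-sample Pearson chi-square comparing case frequencies p with
    control frequencies q; the pooled frequency r_i is the sample-size
    weighted mean of p_i and q_i."""
    stat = 0.0
    for pi, qi in zip(p, q):
        ri = (N * pi + M * qi) / (N + M)
        if ri > 0:
            stat += (pi - qi) ** 2 / ri
    return stat * N * M / (N + M)

def group_dr(haplotypes, r, threshold=0.2):
    """Sketch of the 'Group-DR' rule: a haplotype whose pooled frequency
    is below threshold * max(r) is merged with a common haplotype that
    differs from it at exactly one marker (one-step mutation). Ties are
    broken here by picking the most frequent neighbour (our assumption)."""
    rmax = max(r)
    freq = dict(zip(haplotypes, r))
    groups = {h: h for h in haplotypes}  # each haplotype starts alone
    for h in haplotypes:
        if freq[h] < threshold * rmax:
            neighbours = [g for g in haplotypes
                          if sum(a != b for a, b in zip(h, g)) == 1
                          and freq[g] >= threshold * rmax]
            if neighbours:
                groups[h] = max(neighbours, key=lambda g: freq[g])
    return groups
```

Grouped frequencies are then obtained by summing `p` (and `q`) over each value of `groups`, and `pearson_chi2` is applied to the grouped vectors.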

With reduced degrees of freedom, the test based on clusters of haplotypes tends to be more powerful than that based on individual haplotypes, if an appropriate grouping scheme is used (resulting in a much smaller number of clusters than the original number of haplotypes). Also an appropriate grouping scheme is potentially helpful for identifying common ancestral haplotypes.

Tree-based Recursive Partitioning Algorithm

We propose using a tree-based recursive partitioning algorithm to group haplotypes into clusters. Recursive partitioning is a method that builds a tree model in the presence of interactions, using a given set of covariates (observed features of the independent variables, such as marker allele types) either to predict the outcome or to partition the dataset into relatively more homogeneous (by certain criteria) subsets. It is a very general and flexible model-building technique; see Breiman et al. (1984) and Zhang & Singer (1999) for detailed descriptions of the methodology. Therneau & Atkinson (1997) developed a general software package implementing the algorithm. Applications of recursive partitioning in linkage and association studies include work by Zhang & Bonney (2000), Shannon et al. (2001) and Province et al. (2001). In the recursive partitioning procedure, the full dataset (called the root node) is first split into two offspring nodes by a splitting rule. The splitting process can be repeated recursively on offspring nodes until terminated by certain stopping rules, such as the target node containing too few subjects. A splitting rule is defined by a covariate and a binary partition that divides the corresponding covariate space into two non-overlapping sub-regions; the rule is chosen to achieve the best performance under a user-designed criterion. Usually the tree built by the recursive partitioning process is too large. To find the right-sized tree from the overgrown tree, a cross-validation procedure can be used (Breiman et al. 1984; Zhang & Singer, 1999), but this step is not relevant here, since the tree we build is based on haplotypes from the pooled sample, ignoring their case/control status.

A Working Example

Before getting into a more detailed description, we use the following artificial example to illustrate some of the terminology used in the tree-based recursive partitioning method. Suppose we are studying a simple dominant disease, and have collected 160 case haplotypes from affected individuals and 160 control haplotypes from unaffected individuals. The candidate region consists of 4 SNP markers, with the disease susceptibility locus flanked by SNP markers 2 and 3. Assume that the disease is caused by a single mutation at the susceptibility locus, introduced on the original ancestral haplotype (1, 1, 1, 1) generations ago, and that among the 160 case haplotypes half contain the mutation. Due to recombination over generations, suppose the 80 mutation-bearing haplotypes are evenly distributed among the following four haplotypes: (1, 1, 1, 1), (2, 1, 1, 2), (2, 1, 1, 1) and (1, 1, 1, 2). We further assume that all 16 possible haplotypes occur uniformly among the 240 non-carrier haplotypes (160 from controls, 80 from cases). Haplotype frequencies in cases and controls are summarized in Table 1. For every haplotype four covariates (x1, x2, x3, x4) are defined, each corresponding to one marker. That is, for haplotype (1, 2, 1, 2), its covariate x1 is 1, x2 is 2, and so on.

Table 1. Haplotypes and their frequencies for the working example

Haplotype^a    Frequencies in controls    Frequencies in cases
1 1 1 1        10/160                     25/160
1 1 1 2        10/160                     25/160
1 1 2 1        10/160                     5/160
1 1 2 2        10/160                     5/160
1 2 1 1        10/160                     5/160
1 2 1 2        10/160                     5/160
1 2 2 1        10/160                     5/160
1 2 2 2        10/160                     5/160
2 1 1 1        10/160                     25/160
2 1 1 2        10/160                     25/160
2 1 2 1        10/160                     5/160
2 1 2 2        10/160                     5/160
2 2 1 1        10/160                     5/160
2 2 1 2        10/160                     5/160
2 2 2 1        10/160                     5/160
2 2 2 2        10/160                     5/160

Note: ^a Highlighted haplotypes (1 1 1 1, 1 1 1 2, 2 1 1 1, 2 1 1 2) are found in carrier chromosomes.

Based on the haplotype frequencies r_i, 1 ≤ i ≤ 16, in the joint sample and their covariates, the recursive partitioning algorithm is used to group all 16 haplotypes into clusters. The resulting binary tree is given in Fig. 1. In this figure, the root node (node 1) contains all 16 haplotypes; it has a total weight of 1 and a score of 1.59. The weight of a node is the sum of the frequencies of the haplotypes belonging to that node. The score is the maximum of the average haplotype similarities measured around all focal points (defined in (2) later).

Figure 1.

A binary tree partitioning for the working example.

Using a selection criterion (defined in (3) later), the algorithm first chooses covariate x2 to split the root node into node 2 and node 3. Based on this splitting rule, haplotypes having allele 2 at marker 2 are grouped into node 2; while haplotypes with allele 1 at marker 2 are put into node 3. In this particular example, due to the symmetric nature of the dataset, there actually exist multiple optimal splitting rules for the root node. Splitting rules based on x3 can achieve the same optimality level as the rule based on x2. In situations like this, the algorithm just randomly picks one. Thus, the resulting tree might not be unique.

Nodes 2 and 3 are further split. In Fig. 1 we show a tree having 4 splits and 5 terminal nodes (nodes without any offspring nodes). A sub-tree of an existing tree is a tree obtained by cutting off some lower branches from the existing tree; a more formal definition of sub-trees is given on page 64 of Breiman et al. (1984). For example, the tree consisting of nodes 1, 2 and 3 is a sub-tree of the original tree, as is the tree consisting of nodes 1, 2, 3, 6 and 7. For any sub-tree, its terminal nodes define clusters of haplotypes. For example, for the sub-tree including nodes 1, 2, 3, 6, and 7, its terminal nodes 2, 6 and 7 divide all 16 possible haplotypes into 3 disjoint clusters. In particular, node 6 defines a group that encompasses haplotypes with allele 1 on both markers 2 and 3. This group of haplotypes happens to include all carrier haplotypes. Comparing haplotype frequencies between cases and controls based on the haplotype clusters defined by nodes 2, 6 and 7, we obtain a Pearson's chi-square statistic of 45.7 with 2 df. This test clearly is more powerful than the comparison based on the original ungrouped haplotype distribution, which results in the same Pearson's chi-square statistic of 45.7, but with a much larger df (15).
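The two 45.7 values can be checked directly from Table 1. The following sketch (our own code, using the standard two-sample form of Pearson's statistic) reproduces both the ungrouped statistic and the statistic based on the clusters defined by nodes 2, 6 and 7:

```python
# Frequencies from Table 1 (counts out of 160 haplotypes per group).
carriers = ["1111", "1112", "2111", "2112"]          # 25/160 in cases
all_haps = [a + b + c + d for a in "12" for b in "12" for c in "12" for d in "12"]
cases = {h: (25 if h in carriers else 5) / 160 for h in all_haps}
controls = {h: 10 / 160 for h in all_haps}

def chi2(groups, N=160, M=160):
    """Two-sample Pearson chi-square over the given haplotype grouping."""
    stat = 0.0
    for g in groups:
        p = sum(cases[h] for h in g)
        q = sum(controls[h] for h in g)
        r = (N * p + M * q) / (N + M)
        stat += (p - q) ** 2 / r
    return stat * N * M / (N + M)

ungrouped = chi2([[h] for h in all_haps])                     # 15 df
node6 = [h for h in all_haps if h[1] == "1" and h[2] == "1"]  # allele 1 at markers 2 and 3
node2 = [h for h in all_haps if h[1] == "2"]                  # allele 2 at marker 2
node7 = [h for h in all_haps if h not in node6 + node2]
grouped = chi2([node2, node6, node7])                         # 2 df; both values are ~45.71
```

In this symmetric example the two statistics coincide exactly (320/7 ≈ 45.71); the power gain comes entirely from the drop in degrees of freedom, from 15 to 2.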

Splitting Rules for Grouping Haplotypes

We want to build a tree based on haplotype frequencies in the joint sample, ignoring each haplotype's case/control status. To grow the tree recursively we need to define how a splitting rule is chosen. The goal is to find a splitting rule that divides a node into two more homogeneous offspring nodes; that is, within the same offspring node, we expect haplotypes to be similar to each other. To quantify the haplotype similarity in a node, a pairwise similarity metric λ_k(H_i, H_j) is needed for any given pair of haplotypes from H. The metric measures two haplotypes' similarity around a focal point located between markers k and k + 1, with 1 ≤ k ≤ K. There are a number of ways of defining the similarity metric. We choose to use the "identity length measure" of Bourgain et al. (2000): λ_k(H_i, H_j) equals the number of consecutive markers around the focal point over which the two haplotypes share the same allele. For node A, its score is calculated as the maximum of the average haplotype similarities measured at all focal points, that is

$$\text{score}(A) = \max_{1 \le k \le K}\; \frac{1}{w_A^2} \sum_{H_i \in A}\,\sum_{H_j \in A} r_i\, r_j\, \lambda_k(H_i, H_j) \qquad (2)$$

where $w_A = \sum_{H_i \in A} r_i$ is the weight for node A. For any split that generates two offspring nodes A_L and A_R, its performance is measured by

$$\Delta = w_{A_L}\,\text{score}(A_L) + w_{A_R}\,\text{score}(A_R) - w_A\,\text{score}(A) \qquad (3)$$

To split a node, we screen all possible binary partitions of every covariate, and choose a splitting rule that gives the largest Δ defined by (3).
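A minimal sketch of these ingredients is given below. The identity length measure follows the verbal definition above; `node_score` and `split_gain` implement our reading of (2) and (3) (frequency-weighted average pairwise similarity, maximized over focal points, and the weighted gain of a split), and the function names are ours.

```python
def identity_length(h1, h2, k):
    """Identity length measure of Bourgain et al. (2000): the number of
    consecutive markers with matching alleles around the focal point
    between markers k+1 and k+2 (k is 0-based here)."""
    n = 0
    i = k
    while i >= 0 and h1[i] == h2[i]:       # extend leftwards from marker k
        n += 1
        i -= 1
    i = k + 1
    while i < len(h1) and h1[i] == h2[i]:  # extend rightwards from marker k+1
        n += 1
        i += 1
    return n

def node_score(haps, r):
    """Score of a node: maximum over focal points of the frequency-weighted
    average pairwise similarity (our reading of equation (2))."""
    w = sum(r)
    K = len(haps[0]) - 1
    best = 0.0
    for k in range(K):
        s = sum(ri * rj * identity_length(hi, hj, k)
                for hi, ri in zip(haps, r) for hj, rj in zip(haps, r))
        best = max(best, s / w ** 2)
    return best

def split_gain(left, r_left, right, r_right):
    """Weighted gain of a split (our reading of equation (3)): increase in
    weighted within-node similarity relative to the parent node."""
    wl, wr = sum(r_left), sum(r_right)
    parent = node_score(left + right, list(r_left) + list(r_right))
    return wl * node_score(left, r_left) + wr * node_score(right, r_right) \
        - (wl + wr) * parent
```

For instance, splitting the four equally frequent haplotypes {1111, 1112, 2221, 2222} by the allele at marker 2 yields a strictly positive gain, since each offspring node is far more internally similar than the parent.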

There are several options for defining covariates for a haplotype. One obvious choice is to define a covariate for every marker. For example, when considering haplotypes consisting of 4 SNPs, each haplotype then has 4 covariates, with each covariate taking one of two possible values, 1 or 2 (the allelic types). To search for more complicated splitting rules, a second option is to create a covariate for every pair of markers. Thus for haplotypes of 4 SNPs, the second option creates 6 covariates, with each covariate taking one of 4 possible values, labelled "a", "b", "c" and "d", representing observations (1,1), (1,2), (2,1) and (2,2) on the two considered markers, respectively. For a covariate with 4 possible values there are at most 2^3 − 1 = 7 possible binary partitions: a|(b, c, d), b|(a, c, d), c|(a, b, d), d|(a, b, c), (a, b)|(c, d), (a, c)|(b, d), and (a, d)|(b, c). For example, a|(b, c, d) denotes the partition that puts all haplotypes having value "a" for the covariate into one offspring node, and groups the remainder into the other. In our experience, the two options generate comparable results in terms of the final performance of the association test, with the latter having a slight advantage over the first. Hence we recommend using the second option to define covariates. More discussion on choices for defining covariates is given in the Discussion section.
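The partition count can be checked with a short enumeration (our own helper): for a covariate with c categories there are 2^(c−1) − 1 binary partitions once mirror images are identified, which gives 7 when c = 4.

```python
from itertools import combinations

def binary_partitions(categories):
    """All distinct binary partitions of a set of covariate categories.
    Fixing the first category on the left side removes mirror duplicates,
    leaving 2**(c-1) - 1 partitions for c categories."""
    cats = list(categories)
    rest = cats[1:]
    parts = []
    for size in range(len(rest) + 1):
        for combo in combinations(rest, size):
            left = [cats[0], *combo]
            right = [c for c in cats if c not in left]
            if right:  # skip the trivial partition with an empty side
                parts.append((tuple(left), tuple(right)))
    return parts

parts = binary_partitions(["a", "b", "c", "d"])  # the 7 partitions listed above
```

Each partition corresponds to one candidate splitting rule for the covariate, to be scored by the gain criterion.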

Association Tests Based on Grouped Haplotypes

Using the splitting rule defined above, a full tree model T can be generated recursively until each terminal node contains only one unique type of haplotype. The full tree T has the same number of terminal nodes as the number of distinct haplotypes observed in the joint sample. From this overgrown tree there are many sub-trees; each sub-tree defines a haplotype grouping scheme. We want to identify an optimal sub-tree based on which the association test provides the strongest evidence against the null hypothesis. An exhaustive search over the full tree is not computationally efficient and, more importantly, involves an excessively large number of comparisons, which would make the procedure less powerful. Here we use the idea described in Yu et al. (2004b). The procedure consists of three steps. First, a sequence of candidate sub-trees of T is identified through a step-forward search algorithm. Second, for each sub-tree from the identified sequence, grouped haplotype frequencies are compared between cases and controls using Pearson's chi-square statistic, and the associated raw (unadjusted for multiple comparisons) p-value is estimated by a permutation procedure. Third, the global test statistic for the association is based on the minimum raw p-value observed in the comparison sequence. To assess the significance level of this test, the permutation procedure developed by Ge et al. (2003) is used to avoid the more computationally intensive nested two-level permutation procedure. More details about the testing procedure are given in Appendix I.
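The key computational trick is that a single set of permutations can serve double duty: it supplies the raw p-value of each candidate statistic and, reused, the null distribution of the minimum p-value. The sketch below (our own simplified code in the spirit of Ge et al. (2003), not the authors' implementation; `minp_test` and `stats_fn` are hypothetical names) illustrates the idea for a generic vector of statistics, one per candidate grouping scheme:

```python
import random

def minp_test(stats_fn, cases, controls, n_perm=1000, seed=1):
    """Single-level permutation assessment of the min-p statistic.
    stats_fn maps a (case, control) split to a tuple of statistics,
    one per candidate grouping scheme."""
    observed = stats_fn(cases, controls)
    pooled = list(cases) + list(controls)
    n = len(cases)
    rng = random.Random(seed)
    perm_stats = []
    for _ in range(n_perm):
        rng.shuffle(pooled)                      # permute case/control labels
        perm_stats.append(stats_fn(pooled[:n], pooled[n:]))

    def raw_p(stats):
        # raw p-value of each statistic from the shared permutation distribution
        return tuple(sum(ps[j] >= stats[j] for ps in perm_stats) / n_perm
                     for j in range(len(stats)))

    obs_minp = min(raw_p(observed))
    # null distribution of min-p reuses the SAME permutations (no nesting)
    null_minps = [min(raw_p(ps)) for ps in perm_stats]
    return sum(mp <= obs_minp for mp in null_minps) / n_perm
```

In the paper's setting, `stats_fn` would compute the grouped Pearson chi-square statistic for each candidate sub-tree in the step-forward sequence.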

Data with Ambiguous Phase Information

In previous discussions we assumed haplotypes are known for both case and control samples. For population-based case-control studies, what we can observe are genotypes from a sample of unrelated individuals; for family-based association studies, parental haplotypes and their transmission statuses may not be uniquely determined. Thus in both study designs, case and control haplotypes are not directly available. In order to apply the proposed method to data with ambiguous phase information, we use the EM algorithm to estimate the case haplotype frequencies p = {p_i, i = 1, …, C}, control haplotype frequencies q = {q_i, i = 1, …, C} and pooled haplotype frequencies r = {r_i, i = 1, …, C}. More details are given in Appendix II.
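To make the EM idea concrete, here is a minimal sketch for the simplest case of two diallelic SNPs, where only the double heterozygote is phase-ambiguous. This is our own illustrative code, not the Appendix II procedure (which also handles family data and more markers); genotypes are coded as copies of allele "1" at each SNP.

```python
def resolutions(g1, g2):
    """Unordered haplotype pairs consistent with a two-SNP genotype
    (g1, g2), where g counts copies of allele '1' at that SNP."""
    a1 = ["1"] * g1 + ["2"] * (2 - g1)   # the two alleles at SNP 1
    a2 = ["1"] * g2 + ["2"] * (2 - g2)   # the two alleles at SNP 2
    pairs = {tuple(sorted((a1[0] + a2[0], a1[1] + a2[1]))),
             tuple(sorted((a1[0] + a2[1], a1[1] + a2[0])))}
    return sorted(pairs)

def em_haplotype_freqs(genotypes, n_iter=50):
    """EM estimation of haplotype frequencies from unphased two-SNP
    genotypes. Only (1, 1) has two resolutions: {11, 22} vs {12, 21}."""
    f = {"11": 0.25, "12": 0.25, "21": 0.25, "22": 0.25}
    for _ in range(n_iter):
        counts = {h: 1e-12 for h in f}
        for g in genotypes:
            res = resolutions(*g)
            # E-step: weight each phase by its probability under current f
            weights = [f[h1] * f[h2] * (1 if h1 == h2 else 2)
                       for h1, h2 in res]
            total = sum(weights)
            for (h1, h2), w in zip(res, weights):
                counts[h1] += w / total
                counts[h2] += w / total
        # M-step: renormalise the expected haplotype counts
        n = sum(counts.values())
        f = {h: c / n for h, c in counts.items()}
    return f
```

With strong linkage disequilibrium in the data, the double heterozygotes are resolved almost entirely to the cis phase, as expected.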

Simulation Studies

Focussing on family-based association studies, we use simulations to evaluate the performance of the proposed tree-based haplotype grouping scheme when using the Pearson's chi-square statistic. All analyses are done with phase information unknown. We consider a candidate region consisting of 12 diallelic markers with 0.1 cM map density. The disease-susceptibility locus has two alleles, E and e.

Type I Errors

To verify that the proposed method has the correct type I error rates and is robust to population stratification, the simulation design of Zhang et al. (2003) and Yu et al. (in press) is adopted. We consider stratified populations consisting of two subpopulations. The disease locus is 10 cM away from the first marker. We assume there is no association between the disease locus and the candidate region within each subpopulation. The disease allele frequency is 0.2 in the first subpopulation, and 0.3 in the second. Penetrances γEE, γEe and γee in the first subpopulation are 0.3, 0.15, and 0.02 for genotypes with two, one and zero copies of the disease allele E, respectively. For the second subpopulation, the corresponding penetrances are 0.2, 0.11, and 0.02.

Haplotypes in each subpopulation are generated using the method described by Lam et al. (2000) and Tzeng et al. (2003). This procedure can generate random linkage disequilibrium among alleles on normal chromosomes. Each subpopulation is founded by 100 individuals and grows for 150 generations. For the first 50 generations the expected population size remains constant; the population then grows exponentially for another 100 generations to a present-day population with an expected size of 10,000 individuals. To generate the 200 founder haplotypes for the first subpopulation, we assume that all markers' minor allele frequencies are 0.5, and generate each founder haplotype by randomly assigning allelic types at every marker independently. For the second subpopulation, the 200 founder haplotypes are randomly generated assuming that all minor allele frequencies are f2, which is chosen to be 0.1, 0.3, and 0.5 in our simulations. For every generated present-day haplotype, the allele at the disease locus is chosen independently (because there is no association) according to the disease allele frequency in the corresponding subpopulation.
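The founder step, and one generation of the forward simulation, can be sketched as follows. This is a simplified stand-in for the Lam et al. (2000) scheme, whose details (mutation, population growth schedule) we have not reproduced; the function names and the single-crossover transmission model are our own simplifications.

```python
import random

def founder_haplotypes(n_haps, n_markers, maf, rng=None):
    """Generate founder haplotypes by independently assigning the minor
    allele (coded 2) with probability maf at every marker, matching the
    independent-assignment founder setup described in the text."""
    rng = rng or random.Random(0)
    return [tuple(2 if rng.random() < maf else 1 for _ in range(n_markers))
            for _ in range(n_haps)]

def next_generation(haps, n_offspring, rec_prob, rng):
    """One generation of random mating with at most one crossover per
    transmitted haplotype (a simplifying assumption of this sketch)."""
    out = []
    for _ in range(n_offspring):
        h1, h2 = rng.sample(haps, 2)          # two parental haplotypes
        if rng.random() < rec_prob:
            cut = rng.randrange(1, len(h1))   # single crossover point
            out.append(h1[:cut] + h2[cut:])
        else:
            out.append(h1)
    return out
```

Iterating `next_generation` over many generations (with a growing `n_offspring`) is what lets random drift and recombination build up linkage disequilibrium among the markers.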

For each considered minor allele frequency f2 in founders for the second subpopulation, we simulate 2000 stratified populations. From each stratified population, we collect a dataset of 200 nuclear families with one affected offspring, among which 200·pp families are ascertained from the first subpopulation and the remaining families are from the second subpopulation. In the simulations we choose pp to be 1/2, 1/4 and 1/6. Details on how nuclear families are generated can be found in Zhang et al. (2003).

Based on 2000 replications, the estimated type I error rates of the proposed test are given in Table 2. The table shows that the method yields approximately the correct type I error rate at nominal levels of 0.05 and 0.01.

Table 2. Type I error rates of the test based on 2000 replications. The sample size is 200 families with one affected child in each family

f2     pp     Significance level
              0.05     0.01
0.1    1/2    0.051    0.010
       1/4    0.043    0.009
       1/6    0.054    0.009
0.3    1/2    0.055    0.013
       1/4    0.053    0.011
       1/6    0.052    0.014
0.5    1/2    0.055    0.012
       1/4    0.049    0.008
       1/6    0.053    0.011

Note: f2 is the minor allele frequency in founders for the second subpopulation; pp is the proportion of nuclear families sampled from the first subpopulation.

Power Comparisons

We evaluate the performance of association tests based on the following three haplotype grouping schemes: one induced by the recursive partitioning algorithm proposed in this paper (called "Group-RP"); one based on the original ungrouped haplotype distribution (called "Original"); and one based on the grouping scheme described by Tzeng et al. (2003) (called "Group-DR").

We consider a candidate region of 12 markers with 0.1 cM map density. The disease-susceptibility locus is located between markers 6 and 7. Assuming no population stratification, each study population is generated using the same procedure as described above, except that at the 51st generation a disease-causing mutation is introduced at the disease locus on one or two randomly selected haplotypes. To initialize the 200 founder haplotypes for a study population, we simulate alleles at each marker independently according to the allele frequencies, with the minor allele frequency for each marker drawn from a uniform distribution over the interval 0.1–0.5. From each simulated population, given a disease model, nuclear families with one affected offspring are ascertained using the same procedure as Zhang et al. (2003).

Simulations are designed to compare the methods under various genetic conditions, characterized by the disease model and the level of founder heterogeneity. For all disease models considered, the disease allele frequency is set to 0.2, with baseline penetrance γee = 0.05. The genotype relative risk, defined as RR = γEE/γee, is varied from 2 to 6 for three disease models: recessive, additive, and dominant. The level of founder heterogeneity is determined by the number of ancestral haplotypes involved, and by the percentage of carrier haplotypes derived from each founder. In the simulations, the number of ancestral haplotypes is chosen to be one or two. In the scenario with two ancestral haplotypes, 1/2 or 1/3 of the sampled carrier haplotypes are derived from one founder.

Under each genetic condition considered, 500 datasets from 500 randomly simulated study populations are generated. Each dataset consists of 200 nuclear families with one affected offspring. The significance level is set at 0.05. Power comparison results for the various founder heterogeneity scenarios are given in Figs 2 to 4. From all the figures, it is clear that the test based on the proposed grouping scheme has the best performance. It can also be noted that the test based on each of the two grouping schemes outperforms the test based on the original ungrouped haplotype distribution.

Figure 2.

Power comparisons for three tests based on 500 replications. The sample size is 200 families with one affected child in each family. There is only one ancestral haplotype involved.

Figure 3.

Power comparisons for three tests based on 500 replications. The sample size is 200 families with one affected child in each family. There are two ancestral haplotypes involved, one is the founder for one third of mutation carrier haplotypes, the other is responsible for the remaining two thirds of carrier haplotypes.

Figure 4.

Power comparisons for three tests based on 500 replications. The sample size is 200 families with one affected child in each family. There are two ancestral haplotypes involved, one is the founder for one half of mutation carrier haplotypes, the other is responsible for the remaining half of carrier haplotypes.

Discussion

It is generally believed that association tests based on multi-marker haplotypes tend to be more powerful than single marker analyses. A drawback of haplotype-based association tests is the relatively large number of observed haplotypes, which increases the degrees of freedom for the corresponding test statistic. To reduce the degrees of freedom in haplotype-based tests, a recursive partitioning algorithm is proposed to group haplotypes into clusters, with each cluster consisting of haplotypes with similar haplotype structure, and (hopefully) similar risk liability. The proposed grouping strategy can be used with existing association test procedures for both population-based and family-based association tests, with known or ambiguous phase information. Simulation studies suggest that when used with the standard Pearson's chi-square statistic, the proposed procedure has the correct type I error rate in stratified populations, and enhanced power compared with tests based on other haplotype grouping strategies.

A major component of the recursive partitioning algorithm is how the splitting rule is defined. A splitting rule is characterized as a binary partition of a covariate space. For each haplotype, as described in the Methods section, we recommend forming a categorical covariate from every pair of markers. Clearly, there are many other ways of defining covariates. For example, a covariate can be defined based on a single marker, or based on several consecutive or non-consecutive markers. When studying haplotypes with tightly linked markers, markers adjacent to each other could be highly correlated; thus, forming covariates only from adjacent markers may provide a very limited number of choices to split the data. On the other hand, if we allow covariates to be defined by markers far away from each other, we may increase the chance of finding an optimal binary partition. In the setup of the simulation study described above, we compare tests based on the recursive partitioning algorithm using various covariate definitions, including ones based on one marker, two markers, two adjacent markers, three markers, and three adjacent markers. We find that all tests have similar performance (results not shown); none outperforms the others over all the considered simulation scenarios. In general, using covariates defined by every two markers tends to have the best overall performance.

Once an over-grown tree has been built, candidate haplotype grouping schemes are found in a step-forward fashion. An alternative strategy is to use a backward pruning procedure: starting with the full tree, we sequentially prune away one pair of terminal nodes at a time from the existing tree. The pruning process can generate a sequence of candidate sub-trees just as the step-forward procedure in Algorithm 1 does. We found that both procedures usually end up with identical lists of sub-trees.

In this paper, we use Pearson's chi-square statistic for the association test. Other test statistics can be used. For example, instead of Pearson's chi-square statistic, the likelihood ratio test statistic G2 (Christensen, 1997) can be used. Another option is to use the logistic-regression model framework to test the association of clusters of haplotypes with the disease, as in Durrant et al. (2004). In general, we would expect the recursive partitioning based haplotype grouping scheme to perform well in conjunction with various test statistics. Also, more extensive simulation studies are needed to compare the proposed test with other existing association testing procedures (e.g. Clayton, 1999; McIntyre et al. 2000; Schaid et al. 2002; Becker & Knapp, 2004b).

The haplotype similarity metric used in this paper is based on the recombination model, and has been used widely in haplotype-sharing methods (e.g. Bourgain et al. 2001; Zhang et al. 2003). Knapp & Becker (2004) point out that this metric is sensitive to genotyping errors. In family based studies, transmitted haplotypes (cases) are partially checked for genotyping errors, whereas there is no such checking for non-transmitted haplotypes (controls). This unbalanced checking for genotyping errors could result in inflated type I error rates for tests based on haplotype similarity comparisons, such as the HS-TDT of Zhang et al. (2003). The proposed recursive partitioning algorithm, however, is less vulnerable to this effect because it builds the tree from haplotype frequencies in the joint sample, instead of treating the case and control samples separately.

Acknowledgements

The work was supported in part by National Heart, Lung, and Blood Institute grants 5R01HL056567-07 and U01 HL54473, and National Institute of General Medical Sciences grants GM28719 and U01 GM63340. We thank the editor and three referees for their thoughtful comments that greatly improved an earlier version of the manuscript.

Appendix I

In this section, we give details on the three steps of the association testing procedure once a full tree T is built. First, we want to identify a set of D sub-trees {Td, d = 1, …, D} that are good candidates for haplotype grouping schemes. The sub-tree Td, 1 ≤ d ≤ D, has d terminal nodes, so it divides the haplotype space H into d groups. Here D is predetermined by the user, and is the largest number of haplotype groups to be considered; it is usually much smaller than the number of unique haplotypes observed in the sample. Based on our experience, taking D to be around 10 is sufficient for a set of 30 to 60 haplotypes. For the sub-tree Td to be a good haplotype grouping scheme, haplotypes assigned to the same terminal node should be similar around a certain focal point, so that they have likely inherited the disease mutation from the same founder and thus induce the same level of disease susceptibility. A measure of the fitness of a sub-tree T can be,

image(4)

where the summation is over all terminal nodes of the sub-tree. Sub-trees T1 and T2 are trivial to find, since there is only one sub-tree of T having 1 or 2 terminal nodes. To find the sub-tree Td for d ≥ 3, one could search through all possible sub-trees with d terminal nodes and pick the one with the largest fitness measure defined by (4); this exhaustive search becomes infeasible as d grows. Instead, we use the following greedy step-forward search algorithm to find the candidate sub-trees approximately.

Algorithm 1: Procedure for Finding Candidate Haplotype Grouping Schemes

  • a. Start with T1, the sub-tree consisting of only the top node of the full tree T.
  • b. Given the sub-tree Td, its successor Td+1 is chosen as the sub-tree that has the largest fitness measure defined by (4) among all sub-trees containing Td and having d + 1 terminal nodes.
  • c. Repeat Step b until we obtain the sub-tree TD with D terminal nodes.
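The steps above can be sketched as follows. This is a schematic implementation under simplifying assumptions of ours: the full tree is a nested dict whose per-node value 'fit' stands in for the terminal-node terms summed in (4), and the gain of splitting a terminal node is the resulting change in total fitness.

```python
def candidate_subtrees(root, D):
    """Greedy step-forward search of Algorithm 1 (sketch).  Each node is
    a dict with a per-node fitness 'fit' (a stand-in for the terminal-node
    terms summed in equation (4)) and, if internal, 'left'/'right'
    children.  Returns the terminal-node sets of the sub-trees T_1..T_D."""
    frontier = [root]           # terminal nodes of the current sub-tree
    schemes = [list(frontier)]
    while len(frontier) < D:
        best, best_gain = None, None
        for node in frontier:
            if 'left' in node:  # this terminal node can still be split
                gain = node['left']['fit'] + node['right']['fit'] - node['fit']
                if best_gain is None or gain > best_gain:
                    best, best_gain = node, gain
        if best is None:        # every terminal node is a leaf of the full tree
            break
        frontier = [n for n in frontier if n is not best]
        frontier += [best['left'], best['right']]
        schemes.append(list(frontier))
    return schemes

# Toy full tree with three leaves
leafA, leafB, leafC = {'fit': 2.0}, {'fit': 1.0}, {'fit': 1.5}
mid = {'fit': 1.8, 'left': leafB, 'right': leafC}
root = {'fit': 1.0, 'left': leafA, 'right': mid}
schemes = candidate_subtrees(root, 3)
```

Because each T_{d+1} is required to contain T_d, one greedy pass yields the whole nested sequence of grouping schemes in a single traversal.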

Next, based on the haplotype grouping scheme defined by Td, 1 ≤ d ≤ D, we can calculate Pearson's chi-square statistic χd to compare grouped haplotype frequencies between cases and controls. The following algorithm can be used to estimate the corresponding raw p-values pd for χd, 1 ≤ d ≤ D.

Algorithm 2: Procedure for Obtaining Raw p-values

  • a. Generate B (say, 10,000) pairs of “case” and “control” samples by randomly permuting the case/control labels of the observed dataset.
  • b. For the bth generated pair of case and control samples, compare their grouped haplotype frequencies under each haplotype grouping scheme defined by Td, 1 ≤ d ≤ D, and obtain the corresponding statistics χ(b)d.
  • c. The raw p-value associated with χd, 1 ≤ d ≤ D, can be estimated as the proportion of χ(b)d, 1 ≤ b ≤ B, that are larger than χd.
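Algorithm 2 can be sketched as below. The helper names are ours; `statistic` stands for the test statistic (e.g. Pearson's chi-square on grouped counts) computed under one grouping scheme, and the toy usage at the end uses a deliberately simple statistic.

```python
import random

def raw_p_values(case_haps, ctrl_haps, grouping_schemes, statistic, B=1000, seed=1):
    """Permutation estimate of the raw p-values p_d (Algorithm 2 sketch).
    `grouping_schemes[d]` maps a haplotype to its group index;
    `statistic(case, ctrl, group)` is the test statistic for one scheme."""
    rng = random.Random(seed)
    observed = [statistic(case_haps, ctrl_haps, g) for g in grouping_schemes]
    exceed = [0] * len(grouping_schemes)
    pooled = list(case_haps) + list(ctrl_haps)
    n_case = len(case_haps)
    for _ in range(B):
        rng.shuffle(pooled)  # Step a: permute case/control labels
        perm_case, perm_ctrl = pooled[:n_case], pooled[n_case:]
        for d, g in enumerate(grouping_schemes):
            if statistic(perm_case, perm_ctrl, g) > observed[d]:  # Step c
                exceed[d] += 1
    return [e / B for e in exceed]

# Toy usage: one scheme grouping haplotypes by their first allele, and a
# statistic that is the absolute difference in group-0 counts.
def stat(case, ctrl, group):
    return abs(sum(group(h) == 0 for h in case) -
               sum(group(h) == 0 for h in ctrl))

ps = raw_p_values([(0, 0)] * 8 + [(1, 1)] * 2,
                  [(0, 0)] * 2 + [(1, 1)] * 8,
                  [lambda h: h[0]], stat, B=500)
```

Since all D schemes are evaluated on the same permuted datasets, the p-values pd are correlated, which is what the one-layer adjustment in Algorithm 3 exploits.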

In the third step, the association test is based on min1≤d≤D pd, the minimum of the observed raw p-values. To assess the significance level of min1≤d≤D pd, the traditional two nested layers of permutation (Westfall & Young, 1993) are not computationally feasible. Recently, Ge et al. (2003) gave an algorithm, in the context of microarray analysis, that needs only one layer of permutations. The trick is to use the statistics generated by one layer of permutations to approximate the distribution of the empirical p-value. Here we adopt Ge's algorithm to evaluate the significance level.

Algorithm 3: Procedure for Obtaining the p-value for min1≤d≤D pd

  • a. Let p* = min1≤d≤D pd.
  • b. Follow Steps a and b in Algorithm 2 to generate an additional B null datasets (separate from those generated in Algorithm 2), and obtain the test statistics χ(b)d, 1 ≤ d ≤ D, for the bth generated dataset, 1 ≤ b ≤ B.
  • c. Based on χ(b)d, with 1 ≤ d ≤ D and 1 ≤ b ≤ B, use the algorithm of Ge et al. (2003) to obtain raw p-values p(b)d.
  • d. Let p(b) = min1≤d≤D p(b)d, 1 ≤ b ≤ B. The p-value for min1≤d≤D pd is estimated as the proportion of p(b), 1 ≤ b ≤ B, that are less than p*.

Note in (c) of the above algorithm, p(b)d is the raw p-value corresponding to χ(b)d, and is estimated as

p(b)d = #{b′: 1 ≤ b′ ≤ B, χ(b′)d ≥ χ(b)d} / B.

More details are given in Ge et al. (2003).
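Putting steps (c) and (d) together, a schematic of the one-layer computation might look like this (function name ours; O(B²D) as written, and sorting each column of the permuted statistics would speed it up):

```python
def min_p_significance(p_raw, perm_stats):
    """One-layer evaluation of the p-value for min_d p_d, in the spirit
    of Ge et al. (2003).  `p_raw` holds the observed raw p-values p_d;
    `perm_stats[b][d]` is the statistic from the bth of B extra
    permutations for grouping scheme d."""
    B, D = len(perm_stats), len(p_raw)
    p_star = min(p_raw)
    count = 0
    for b in range(B):
        # Step c: raw p-value of the bth permuted dataset, estimated
        # within the same set of B permutations
        p_b = min(
            sum(perm_stats[bp][d] >= perm_stats[b][d] for bp in range(B)) / B
            for d in range(D)
        )
        count += p_b < p_star  # Step d
    return count / B
```

The same B permuted statistics thus serve double duty: each dataset is scored against all the others, avoiding the inner permutation layer entirely.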

Appendix II

Here we give more details on how to handle data with ambiguous phase information in both population-based and family-based association studies.

Population Based Association Studies

Given genotypes for a sample of N affected individuals and a sample of M normal individuals, we first use the standard E-M algorithm (Excoffier & Slatkin, 1995) to estimate the MLE r = {ri, i = 1, …, C} of the haplotype frequencies in the pooled sample (both affected and unaffected). Similarly, we can find the MLEs p = {pi, i = 1, …, C} and q = {qi, i = 1, …, C} of the case and control haplotype frequencies. To avoid the rare situation where ri = 0 but pi > 0 or qi > 0 for some i (which would inflate the statistic calculated by (1) to infinity), we force pi = 0 and qi = 0 whenever ri = 0 in the E-M steps for finding p and q. Based on r, an over-grown tree model can be built in the same way as for data with known phase information. Similarly, Algorithm 1 given in Appendix I can be used to find candidate haplotype grouping schemes Td, 1 ≤ d ≤ D. Based on each Td, Pearson's chi-square statistic χd can be calculated using the estimated p, q, and r.

To estimate the raw p-values for χd, 1 ≤ d ≤ D, we can use Algorithm 2 with its Step (a) replaced by a bootstrap procedure (Tzeng et al. 2003): null datasets are generated by randomly drawing case and control haplotypes with probabilities given by r. The same bootstrap procedure can be used in Algorithm 3 to find the significance level of min1≤d≤D pd.
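A sketch of the bootstrap step that replaces the label permutation (names are ours):

```python
import random

def bootstrap_null(r_freqs, haplotypes, n_case, n_ctrl, rng=None):
    """Bootstrap replacement for Step (a) of Algorithm 2 (after Tzeng et
    al. 2003): draw 'case' and 'control' haplotypes from the pooled
    frequency estimate r, so both samples follow the same null
    distribution."""
    rng = rng or random.Random()
    case = rng.choices(haplotypes, weights=r_freqs, k=n_case)
    ctrl = rng.choices(haplotypes, weights=r_freqs, k=n_ctrl)
    return case, ctrl

# Two haplotypes with pooled frequencies 0.7 / 0.3
case, ctrl = bootstrap_null([0.7, 0.3], ['h1', 'h2'], 10, 20, random.Random(0))
```

Sampling from r rather than permuting labels is what makes the procedure applicable when individual phase is ambiguous and only frequency estimates are available.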

Family Based Association Studies

For a set of F nuclear families, with each family having multiple affected offspring, we adopt the approach of Yu et al. (in press) to estimate the transmitted (case) and non-transmitted (control) haplotype frequencies. First, to estimate the haplotype frequencies among the 2F parents, we use the E-M algorithm under the restriction of family information (Abecasis et al. 2001; Rohde & Fuerst, 2001; Becker & Knapp, 2004a). For the nth nuclear family, 1 ≤ n ≤ F, among all parental haplotype assignments compatible with the observed family genotypes, we assign haplotypes (hn(1), hn(2)) to the father and (hn(3), hn(4)) to the mother using the most likely assignment (based on the estimated haplotype frequencies), as described by Zhang et al. (2003) and Zhao et al. (2000). Then for the parental haplotype hn(1), its transmission frequency δn(1), the estimated probability of being transmitted to an affected offspring, is calculated as

δn(1) = (1/cn) ∑i=1,…,cn ξin(1),

where cn is the number of affected children in the nth family, ξin(1) = 1 if only hn(1) in (hn(1), hn(2)) can have been transmitted to the ith affected child (assuming the haplotype transmitted from the other parent is either hn(3) or hn(4)), and ξin(1) = 1/2 if both hn(1) and hn(2) are compatible with the ith affected child. Also for hn(1), its non-transmission frequency, the estimated probability of not being transmitted to an affected offspring, is 1 − δn(1). The transmission and non-transmission frequencies of the other parental haplotypes are calculated similarly.

Haplotype frequencies r = {ri, i = 1, …, C} in the pooled sample of both transmitted and non-transmitted haplotypes can be estimated from the assigned parental haplotypes {hn(j), 1 ≤ n ≤ F, 1 ≤ j ≤ 4}, that is,

ri = (1/(4F)) ∑n=1,…,F ∑j=1,…,4 I{hn(j) = Hi},

where Hi denotes the ith unique haplotype and I{·} is the indicator function.

The case haplotype frequencies p = {pi, i = 1, …, C} can be estimated as:

pi = (1/(2F)) ∑n=1,…,F ∑j=1,…,4 δn(j) I{hn(j) = Hi}.

Similarly, the control haplotype frequencies q = {qi, i = 1, …, C} can be estimated as:

qi = (1/(2F)) ∑n=1,…,F ∑j=1,…,4 (1 − δn(j)) I{hn(j) = Hi}.

After the haplotype frequencies p, q, and r have been obtained, the procedure described for phase-unknown population-based studies can be used for family data. The only difference is how null datasets are generated. When dealing with families with multiple affected offspring, we use the permutation procedure proposed by Monks & Kaplan (2000): for each family, the transmitted and non-transmitted status of the parental haplotypes for all the offspring are simultaneously permuted. This is equivalent to the following: for the nth nuclear family, keep its four estimated parental haplotypes (hn(1), hn(2)) and (hn(3), hn(4)) unchanged, and randomly permute the two transmission frequencies within each of the pairs (δn(1), δn(2)) and (δn(3), δn(4)).
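The within-family permutation described above can be sketched as follows (function name ours; `deltas` holds one (δn(1), δn(2), δn(3), δn(4)) tuple per family):

```python
import random

def permute_transmissions(deltas, rng=None):
    """Monks & Kaplan (2000) style within-family permutation sketch.
    Each parental pair of transmission frequencies (d1, d2) and (d3, d4)
    is independently swapped with probability 1/2, while the estimated
    parental haplotypes themselves stay fixed."""
    rng = rng or random.Random()
    permuted = []
    for d1, d2, d3, d4 in deltas:
        if rng.random() < 0.5:
            d1, d2 = d2, d1
        if rng.random() < 0.5:
            d3, d4 = d4, d3
        permuted.append((d1, d2, d3, d4))
    return permuted

out = permute_transmissions([(0.8, 0.2, 0.6, 0.4)], random.Random(0))
```

Swapping whole pairs at once keeps the within-family correlation among multiple affected offspring intact under the null, which is the point of the Monks & Kaplan procedure.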
