Keywords: linkage disequilibrium; candidate gene association studies; founder heterogeneity; haplotype similarity analyses

Summary

Taking advantage of increasingly available high-density single nucleotide polymorphism (SNP) markers across the genome, various types of transmission/disequilibrium tests (TDT) using haplotype information have been developed. A practical challenge arising in such studies is the possibility that transmitted haplotypes have inherited disease-causing mutations from different ancestral chromosomes, or do not bear any disease-causing mutations (founder heterogeneity). To reduce the loss of signal strength due to founder heterogeneity, we propose the SP-TDT test, which combines a sequential peeling procedure with a haplotype similarity based TDT method. The proposed SP-TDT method is applicable to nuclear families of any size, with or without ambiguous phase information. Simulation studies suggest that the SP-TDT method has the correct type I error rate in stratified populations, and enhanced power compared with some existing haplotype similarity based TDT methods. Finally, we apply the proposed method to study the association of the leptin gene with obesity using data from the National Heart, Lung, and Blood Institute Family Heart Study.


Introduction

The transmission/disequilibrium test (TDT) (Spielman et al. 1993) and its extensions have been popular tools for testing genetic linkage and association between a marker and a susceptibility locus. One of the main advantages of TDT-type analyses is their robustness to population stratification. Also, for the detection of linkage to complex trait loci, TDT-type methods may have greater power than traditional linkage analyses under certain circumstances (Risch & Merikangas, 1996). Motivated by the increasing availability of high-density single nucleotide polymorphism (SNP) markers within genes and across the genome, recent developments of TDT-type tests have focused on how to use haplotype information from multiple closely linked markers efficiently (Lazzeroni & Lange, 1998; Clayton, 1999; Clayton & Jones, 1999; Dudbridge et al. 2000; Bourgain et al. 2000, 2001, 2002; Rabinowitz & Laird, 2000; Zhao et al. 2000; Li et al. 2001; Seltman et al. 2001, 2003; Zhang et al. 2003; Qian, 2004). In general, studying multiple markers simultaneously in the genetic dissection of complex traits tends to be more powerful than studying a single marker (Akey et al. 2001; Morris & Kaplan, 2002).

Seltman et al. (2001) pointed out the increase in the degrees of freedom (df) as a drawback of haplotype-based tests. In particular, for m realized haplotypes, such tests in general follow a χ2 distribution with m − 1 df under the null hypothesis of no linkage or association. Increasing the degrees of freedom limits the power of haplotype-based tests. One way to reduce the degrees of freedom is to focus TDT-type tests on sets of haplotypes instead of on individual haplotypes; there are several approaches to grouping haplotypes into a small number of subgroups. For example, Seltman et al. (2001) used the haplotype evolutionary relationship, and Li et al. (2001) used a clustering method based on haplotype similarities. Another way to limit the degrees of freedom is the haplotype similarity based TDT approach, which compares haplotype similarity among transmitted haplotypes to that among non-transmitted haplotypes, such as the maximum identity length contrast (MILC) method (Bourgain et al. 2000, 2001, 2002) and the haplotype-sharing TDT (HS-TDT) method (Zhang et al. 2003). The advantage of such approaches is that the degrees of freedom, in a broad sense, equal the number of markers considered (Zhang et al. 2003), which is in general much smaller than the number of haplotypes observed.

A challenge in the genetic dissection of complex traits is the existence of founder heterogeneity; that is, haplotypes in affected individuals may not bear any disease-causing mutations, or may inherit disease-causing mutations from different ancestral haplotypes. In the setting of TDT-type analyses under founder heterogeneity, the set Ω of haplotypes transmitted to affected offspring can be a union Ω = Q_0 ∪ Q_1 ∪ ⋯ ∪ Q_r of subsets Q_i, with Q_0 containing non-carrier haplotypes (haplotypes without any disease-susceptibility mutations), and Q_i, 1 ≤ i ≤ r, containing haplotypes with a mutation inherited from a common ancestral haplotype. We call the subsets Q_i, 1 ≤ i ≤ r, clusters. The existence of founder heterogeneity can adversely affect the detection of disease-susceptibility genes (Morris & Kaplan, 2002): it reduces the power of TDT-type tests by diminishing the difference in both frequencies and similarity between transmitted and non-transmitted haplotypes.

Assuming the existence of several relatively large clusters in Ω, we can ideally mitigate the adverse effect of founder heterogeneity by removing those haplotypes that are most likely non-carriers or carriers belonging to relatively small clusters. Based on this motivation, Yu et al. (2004b) proposed a sequential peeling procedure for population-based case-control studies. In this approach, a peeling procedure is used to sequentially delete case haplotypes that are unlikely to belong to larger clusters. As a result, a sequence of nested and more homogeneous case subsets (with a decreased degree of founder heterogeneity) is identified. The haplotype similarity in each identified case subset is then compared with the haplotype similarity in controls, and to adjust for multiple comparisons the overall statistical significance is assessed through a random permutation procedure. Through simulation studies, Yu et al. (2004b) showed that their approach is more powerful than standard haplotype similarity comparison methods, such as the MILC of Bourgain et al. (2000, 2001, 2002).

In the present paper, we extend the idea of the sequential peeling procedure of Yu et al. (2004b) to nuclear family data, and propose a more powerful TDT test based on haplotype similarity comparison. This method, called the sequential peeling TDT (SP-TDT), is applicable to binary traits and to nuclear families of any size, with or without ambiguous phase information. Simulation results show that the proposed method has the correct type I error rate in stratified populations, and is more powerful than some existing haplotype similarity based TDT methods, such as the HS-TDT method of Zhang et al. (2003). As a real data application, we use the SP-TDT method to study the association of the leptin gene (LEP) with obesity using data from the National Heart, Lung, and Blood Institute Family Heart Study (Jiang et al. 2004).

Methods

Suppose that in a study of a qualitative trait, a candidate region/gene is evaluated with K + 1 tightly linked markers, and that N nuclear families are sampled with a variable number of affected offspring in each family. A haplotype can be represented as a vector H = (h_1, …, h_{K+1}), where h_k ∈ {1, …, t_k} labels the t_k alleles at marker k, 1 ≤ k ≤ K + 1. The null hypothesis we wish to test is that there is no linkage or no association between the considered candidate region and the disease-susceptibility locus. More specifically, assuming a disease-susceptibility locus with two alleles D and d, "no linkage" means that there is no linkage between the disease-susceptibility locus and any one of the K + 1 markers considered. For "no association", when there is no population stratification, it is expected that for any haplotype h, Pr(D, h) = Pr(D) Pr(h), where Pr(D, h) is the probability of a chromosome carrying both D and h, and Pr(D) and Pr(h) are the disease allele and haplotype frequencies in the considered population. When population stratification exists, "no association" means that there is no association in any of the subpopulations from which families are collected (Monks & Kaplan, 2000).

Next we describe the proposed method for the case where the phase information is assumed known, and then we extend the discussion to family data with ambiguous phase information.

Known Phase Information

Here it is assumed that for each sampled family we know all four parental haplotypes and how they are transmitted to the affected offspring. Suppose there are N sampled nuclear families. For the nth family, 1 ≤ n ≤ N, we denote the two haplotypes of the father as (H_{2n-1}(1), H_{2n-1}(2)) and the two from the mother as (H_{2n}(1), H_{2n}(2)). In the following discussion, all 2N pairs of parental haplotypes are denoted by F = {(H_i(1), H_i(2)), i = 1, …, 2N}. For the ith parental haplotype pair (H_i(1), H_i(2)), the transmission status to the affected children can be summarized by the transmission frequencies (δ_i(1), δ_i(2)), where δ_i(1) (or δ_i(2)) is the proportion of affected children receiving haplotype H_i(1) (or H_i(2)). For example, if (H_i(1), H_i(2)) is from a parent of a nuclear family with three affected offspring, H_i(1) is transmitted to one affected child, and H_i(2) is inherited by the other two, then δ_i(1) = 1/3 and δ_i(2) = 2/3. For a parent of a nuclear family with one affected offspring, one of δ_i(1) and δ_i(2) is 0 and the other is 1. We always have δ_i(1) + δ_i(2) = 1.
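
To make the bookkeeping concrete, the following Python sketch (the function name and toy haplotypes are ours, purely for illustration) computes the transmission frequencies of a parental haplotype pair from the haplotypes received by the affected offspring.

```python
from fractions import Fraction

def transmission_frequencies(hap_pair, transmitted_to_affected):
    """Return (delta_1, delta_2) for one parental haplotype pair.

    hap_pair: tuple (H1, H2), the parent's two haplotypes (e.g. tuples of allele labels).
    transmitted_to_affected: one entry per affected child, giving the haplotype
        that child received from this parent.
    """
    h1, _h2 = hap_pair
    n_affected = len(transmitted_to_affected)
    delta1 = Fraction(sum(h == h1 for h in transmitted_to_affected), n_affected)
    return delta1, 1 - delta1

# Example from the text: three affected offspring, one receives H1 and two receive H2,
# giving transmission frequencies (1/3, 2/3).
H1, H2 = (1, 2, 2), (2, 2, 1)   # hypothetical three-marker haplotypes
print(transmission_frequencies((H1, H2), [H1, H2, H2]))   # (Fraction(1, 3), Fraction(2, 3))
```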

Under the null hypothesis of no linkage or no association, Zhang et al. (2003) proved that H_i(1) and H_i(2) have the same probability of being transmitted to an affected child. Thus, under the null hypothesis, haplotype similarity among transmitted haplotypes should be at about the same level as that among non-transmitted haplotypes. Under the alternative, where the disease-susceptibility locus falls within the considered region and there are disproportionately large clusters of transmitted haplotypes that inherit the mutation from relatively few common founders, those transmitted haplotypes tend to exhibit more allele sharing than non-transmitted haplotypes at markers near the mutation locus. This is the rationale for TDT analyses based on haplotype similarity comparison. This type of analysis depends on a pair-wise similarity metric λ_k(a, b), which measures the similarity between any given pair of haplotypes a and b around the sub-interval between markers k and k + 1, with 1 ≤ k ≤ K.

There are a number of ways of defining the pair-wise similarity metric λ_k(a, b); Tzeng et al. (2003) give a summary of them. For example, Bourgain et al. (2000) use the "identity length measure", which is the length of the contiguous region around a focal point over which the two haplotypes are IBS (identical by state). In this paper, we use the "weighted similarity measure" of Yu et al. (2004a) as the similarity metric. Comparing it with other commonly used similarity metrics, Yu et al. (2004b) demonstrated through simulations the power advantage of such a similarity metric in the context of population-based case-control studies. The weighted similarity measure can be defined as

$$\lambda_k(a, b) = \lambda_k^L(a, b) + \lambda_k^R(a, b),$$

where λ_k^L(a, b) and λ_k^R(a, b) measure similarity to the left and to the right of the considered focal point through the maximum consecutive matching length. Specifically, we set λ_k^R(a, b) = 0 if a_{k+1} ≠ b_{k+1}, and otherwise

$$\lambda_k^R(a, b) = \sum_{i \in I_k^R} w_i^R.$$

Here I_k^R = {k + 1, k + 2, …} is the largest set of contiguous marker labels, starting at k + 1, for which haplotypes a and b share the same allele. The weights w_i^R are chosen to favor IBS at rare alleles more than at common alleles, and are defined as

  • image

Here A_i is the allele at marker i on a randomly selected haplotype from the population; we estimate these probabilities from the non-transmitted haplotypes. λ_k^L(a, b) is defined similarly. If we let all w_i^L, i ∈ I_k^L, and all w_i^R, i ∈ I_k^R, equal 1, then λ_k(a, b) is the number of consecutive IBS markers around the considered sub-interval, which is very similar to the identity length measure (Bourgain et al. 2000) if we assume that markers are evenly spaced.
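
The sketch below illustrates a similarity metric of this general form in Python. Since the exact weight formula of Yu et al. (2004a) is not reproduced above, the weight used here (one minus the estimated frequency of the matched allele) is only one plausible choice that favors rare-allele matches; it is an assumption for illustration, not the published definition.

```python
def weighted_similarity(a, b, k, allele_freq):
    """Similarity of haplotypes a and b around the sub-interval between markers k
    and k+1 (0-based here): matching alleles are accumulated to the right
    (starting at marker k+1) and to the left (starting at marker k), stopping at
    the first mismatch on each side, with rarer matched alleles weighted more.

    a, b: sequences of allele labels, one per marker.
    allele_freq: allele_freq[i][allele] = estimated frequency of `allele` at
        marker i (estimated from the non-transmitted haplotypes in the paper).
    """
    def weight(i):
        # Illustrative weight favoring matches at rare alleles; not the exact
        # formula of Yu et al. (2004a).
        return 1.0 - allele_freq[i][a[i]]

    right = 0.0
    for i in range(k + 1, len(a)):
        if a[i] != b[i]:
            break
        right += weight(i)

    left = 0.0
    for i in range(k, -1, -1):
        if a[i] != b[i]:
            break
        left += weight(i)

    # With all weights equal to 1 this reduces to the number of consecutive IBS
    # markers around the focal sub-interval.
    return left + right
```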

When analyzing families with multiple affected children, it is possible that a parental haplotype is transmitted to some affected children but not to others. To take this transmission-status ambiguity into consideration, for the parental haplotype pair (H_i(1), H_i(2)) we treat its "transmitted haplotype" as a mixture of H_i(1) and H_i(2), with δ_i(1) and δ_i(2) as the mixture proportions. Similarly, its "non-transmitted haplotype" can be regarded as a mixture of H_i(1) and H_i(2), with 1 − δ_i(1) and 1 − δ_i(2) as the mixture proportions. The haplotype similarity between the transmitted haplotypes of the ith and jth parents can then be estimated as

$$\mathrm{sim}_k^T(i, j) = \sum_{u=1}^{2} \sum_{v=1}^{2} \delta_i(u)\, \delta_j(v)\, \lambda_k\bigl(H_i(u), H_j(v)\bigr).$$

Based on that, we can measure the average similarity for transmitted haplotypes from the set F of all parents as

$$S_k^T(F) = \binom{2N}{2}^{-1} \sum_{1 \le i < j \le 2N} \mathrm{sim}_k^T(i, j). \qquad (1)$$

Similarly, the average similarity for non-transmitted haplotypes from all parents can be calculated as

$$S_k^{NT}(F) = \binom{2N}{2}^{-1} \sum_{1 \le i < j \le 2N} \mathrm{sim}_k^{NT}(i, j), \qquad (2)$$

where sim_k^{NT}(i, j) is defined as sim_k^T(i, j) with the mixture proportions δ replaced by 1 − δ.

The null hypothesis can be tested using the following statistic,

$$S(F) = \max_{1 \le k \le K} \bigl[\, S_k^T(F) - S_k^{NT}(F) \,\bigr]. \qquad (3)$$

When every sampled nuclear family has only one affected child and the "identity length measure" of Bourgain et al. (2000) is used, the test statistic (3) reduces to the MILC method of Bourgain et al. (2000), which is also equivalent to the HS-TDT of Zhang et al. (2003).

To evaluate the p-value of the test given by (3), the permutation procedure proposed by Monks & Kaplan (2000) can be used. We simultaneously permute the transmitted and non-transmitted status of the parental haplotypes for all affected children in a family; in this way we can generate null datasets in the presence of linkage. This is equivalent to randomly permuting the two haplotypes within each pair (H_i(1), H_i(2)) while leaving the transmission frequencies (δ_i(1), δ_i(2)) unchanged; that is, if after the permutation the haplotype pair becomes (H_i(2), H_i(1)), the transmission frequencies for H_i(2) and H_i(1) are set to δ_i(1) and δ_i(2), respectively. In the following discussion, we call the test statistic (3) that uses the weighted similarity metric the weighted haplotype similarity based TDT method (WT-TDT).
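
To fix ideas, here is a minimal Python sketch of the WT-TDT computation. It assumes that the statistic in (3) is the maximum, over focal sub-intervals, of the difference between the average transmitted and non-transmitted similarities (our reading of equations (1) to (3)); the `sim(a, b, k)` callback can be the weighted similarity sketch given earlier, and all function names are ours.

```python
import itertools
import random

def pair_similarity(pair_i, pair_j, delta_i, delta_j, k, sim):
    """Mixture-weighted similarity around focal interval k between two parents'
    'transmitted' haplotypes and between their 'non-transmitted' haplotypes."""
    t = nt = 0.0
    for u, v in itertools.product(range(2), repeat=2):
        s = sim(pair_i[u], pair_j[v], k)
        t += delta_i[u] * delta_j[v] * s
        nt += (1 - delta_i[u]) * (1 - delta_j[v]) * s
    return t, nt

def statistic(pairs, deltas, n_intervals, sim):
    """S(.): maximum over focal sub-intervals of (average transmitted similarity
    minus average non-transmitted similarity) over all pairs of parents."""
    best = float("-inf")
    for k in range(n_intervals):
        t_sum = nt_sum = 0.0
        n_pairs = 0
        for i, j in itertools.combinations(range(len(pairs)), 2):
            t, nt = pair_similarity(pairs[i], pairs[j], deltas[i], deltas[j], k, sim)
            t_sum += t
            nt_sum += nt
            n_pairs += 1
        best = max(best, (t_sum - nt_sum) / n_pairs)
    return best

def wt_tdt_p_value(pairs, deltas, n_intervals, sim, n_perm=1000, seed=0):
    """Monks-Kaplan style permutation: randomly swap the two haplotypes within
    each parental pair while keeping the transmission frequencies attached to
    the (unswapped) positions."""
    rng = random.Random(seed)
    observed = statistic(pairs, deltas, n_intervals, sim)
    hits = sum(
        statistic([(p[1], p[0]) if rng.random() < 0.5 else p for p in pairs],
                  deltas, n_intervals, sim) >= observed
        for _ in range(n_perm)
    )
    return hits / n_perm
```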

Since founder heterogeneity almost always exists in studies of complex diseases, we expect some transmitted haplotypes to be non-carriers, or carriers from relatively small clusters. Those transmitted haplotypes adversely affect the power of haplotype similarity based TDT methods by diminishing S(F). To increase the power of the WT-TDT method under founder heterogeneity, we adopt the idea of the sequential peeling method described in Yu et al. (2004b), and call the improved method SP-TDT. The goal of the SP-TDT method is to test the null hypothesis based on a subset of F that has a decreased degree of founder heterogeneity and provides the strongest evidence against the null hypothesis. There are three steps in SP-TDT, very similar to those given in Yu et al. (2004b); here we provide a brief summary. First, we remove parental haplotype pairs whose transmitted haplotypes are most likely non-carriers or carriers from small clusters. This is achieved by sequentially peeling away parental haplotype pairs so as to maximize the test statistic S(.) given by (3) for the remaining pairs. The peeling process generates a sequence of nested, increasingly homogeneous subsets F_0 ⊃ F_1 ⊃ ⋯ ⊃ F_M, each with S(F_j) as large as possible, where F_0 = F is the original data. The algorithm for this first step is as follows.

Algorithm 1: Procedure for Top-down Peeling

  • a. Start with F_0 = F, the original set of parental haplotype pairs.
  • b. Given the set F_j, construct a subset F_{j+1} by deleting a proportion α (we use α = 0.1) of the haplotype pairs in F_j, chosen so as to (approximately) maximize S(F_{j+1}).
  • c. Repeat Step b until we obtain a subset F_M of small size (say, |F_M| / |F_0| ≤ 5%, where |F_j| is the number of parental haplotype pairs in F_j).

The goal in Step b is not feasible to achieve exactly unless the number of possible subsets is small, so we use a more computationally efficient approximation: we sequentially delete single parental haplotype pairs from F_j, where at each deletion the deleted pair is the one whose removal yields the largest similarity contrast S(.) for the remaining pairs, as sketched below.
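
A Python sketch of this greedy approximation follows; the `score` callback stands for the similarity contrast S(.) evaluated on a subset of parental pairs (for example, the `statistic` sketch above restricted to those pairs), and the stopping values mirror the α = 0.1 and 5% choices in Algorithm 1.

```python
def top_down_peeling(n_pairs, score, alpha=0.1, min_fraction=0.05):
    """Greedy sketch of Algorithm 1. `score(indices)` evaluates the similarity
    contrast S(.) on the subset of parental pairs given by `indices`; returns
    the nested subsets F_0, F_1, ..., F_M as lists of indices."""
    current = list(range(n_pairs))
    subsets = [list(current)]
    while len(current) > max(2, min_fraction * n_pairs):
        n_delete = max(1, int(round(alpha * len(current))))
        for _ in range(n_delete):
            if len(current) <= 2:
                break
            # Remove the single pair whose deletion yields the largest S(.)
            # on the remaining pairs.
            best = max(current, key=lambda i: score([j for j in current if j != i]))
            current.remove(best)
        subsets.append(list(current))
    return subsets
```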

Next we use the following algorithm to test the null hypothesis based on each S(F_j), 0 ≤ j ≤ M, and to estimate the corresponding raw p-value p_j (the p-value that has not been adjusted for multiple comparisons).

Algorithm 2: Procedure for Obtaining Raw p-value

  • a. Generate B (say, 10,000) datasets {F^(b), b = 1, …, B} of parental haplotype pairs under the null hypothesis. Each F^(b) is generated by randomly permuting the two haplotypes within every observed parental haplotype pair (H_i(1), H_i(2)), i = 1, …, 2N, and leaving the transmission frequencies (δ_i(1), δ_i(2)) unchanged.
  • b. For the bth generated null dataset F^(b), go through Algorithm 1 to find {F_j^(b), j = 0, …, M}, and calculate S_j^(b) = S(F_j^(b)) for j = 0, …, M.
  • c. The raw p-value associated with S(F_j) is estimated as p_j = (1/B) Σ_{b=1}^{B} I(S_j^(b) ≥ S(F_j)), where I(.) is the indicator function.

Finally, the test statistic of the SP-TDT is min_{0≤j≤M} p_j, the observed minimum raw p-value. To assess the significance level of min_{0≤j≤M} p_j, the traditional double permutation procedure is not computationally feasible. Recently, Ge et al. (2003) proposed an algorithm that needs only one layer of permutations. The original algorithm was developed for finding adjusted p-values for multiple comparisons in microarray analysis; the trick is to reuse the statistics generated by one layer of permutations to estimate the empirical p-value for each comparison. Becker & Knapp (2004b) and Yu et al. (2004b) modified Ge's algorithm and applied it in the context of genetic association studies. Here we use an algorithm similar to that given by Yu et al. (2004b) to estimate the p-value for min_{0≤j≤M} p_j.

Algorithm 3: Procedure for Obtaining the p-value for min_{0≤j≤M} p_j

  • a. From the "observed" raw p-values {p_j, j = 0, …, M}, let p_min = min_{0≤j≤M} p_j.
  • b. Follow Steps a and b of Algorithm 2 to generate an additional B null datasets {F^(b), b = 1, …, B} (separate from those generated in Algorithm 2), and calculate the statistics S_j^(b) = S(F_j^(b)), 0 ≤ j ≤ M, for the bth generated null dataset, 1 ≤ b ≤ B.
  • c. Based on S_j^(b), 0 ≤ j ≤ M and 1 ≤ b ≤ B, use the algorithm of Ge et al. (2003) to obtain raw p-values p_j^(b), 0 ≤ j ≤ M, 1 ≤ b ≤ B.
  • d. Let p^(b) = min_{0≤j≤M} p_j^(b) for 1 ≤ b ≤ B. Finally, the p-value for min_{0≤j≤M} p_j is estimated as
    $$\hat{p} = \frac{1}{B} \sum_{b=1}^{B} I\bigl(p^{(b)} \le p_{\min}\bigr).$$

Note that in Step c of the above algorithm, p_j^(b) is the raw p-value corresponding to S_j^(b), and it is estimated as

$$p_j^{(b)} = \frac{1}{B} \sum_{b'=1}^{B} I\bigl(S_j^{(b')} \ge S_j^{(b)}\bigr). \qquad (4)$$

In the standard double permutation procedure, p_j^(b) would be obtained through another layer of permutations. According to Ge et al. (2003), however, p_j^(b) can be estimated by (4), which requires only one level of permutation.
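
The following Python sketch puts Algorithms 2 and 3 and equation (4) together. It assumes the permutation statistics have already been computed (one row per permuted dataset, one column per nested subset) and, for simplicity, reuses the same permutations for Algorithms 2 and 3, whereas the text generates a separate set for Algorithm 3; all names are ours.

```python
import numpy as np

def sp_tdt_adjusted_p(obs_stats, perm_stats):
    """Single-permutation-layer assessment of the minimum raw p-value.

    obs_stats: length-(M+1) array of the observed S(F_j).
    perm_stats: (B, M+1) array of S_j^(b) from B null datasets, each obtained by
        the within-pair haplotype swap and run through Algorithm 1.
    Returns (raw_p, p_min, adjusted_p).
    """
    obs_stats = np.asarray(obs_stats, dtype=float)
    perm_stats = np.asarray(perm_stats, dtype=float)

    # Algorithm 2, step c: raw p-value for each nested subset.
    raw_p = (perm_stats >= obs_stats).mean(axis=0)
    p_min = raw_p.min()

    # Equation (4): the raw p-value of each permuted statistic is computed
    # against the permutation distribution of the same subset, then minimized
    # within each permutation.  (For very large B the (B, B, M+1) comparison
    # below should be done in chunks.)
    perm_raw_p = (perm_stats[:, None, :] >= perm_stats[None, :, :]).mean(axis=0)
    perm_min_p = perm_raw_p.min(axis=1)

    # Algorithm 3, step d: adjusted p-value for the observed minimum raw p-value.
    adjusted_p = (perm_min_p <= p_min).mean()
    return raw_p, p_min, adjusted_p
```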

Ambiguous Phase Information

In the previous subsection we described the proposed SP-TDT method for data with known phase information. In real data applications, however, parental haplotypes and their transmission statuses may not be uniquely determined. Next, we extend the approach described above to families with ambiguous phase information.

There are two commonly used approaches to dealing with phase ambiguity. The first is to assign the most likely haplotypes (based on estimated haplotype frequencies) that are compatible with the observed genotypes, and then to analyze the data as if the phase information were known. The second is to weight each possible haplotype assignment according to the estimated haplotype frequencies, and then to compose a statistic based on those weights. Zhang et al. (2003) compared the performance of two such approaches for the HS-TDT method and found that they have similar power; however, the second approach carries an extra computational burden.

We adopt the first approach, mainly for its simplicity. To estimate haplotype frequencies, the EM algorithm under the restriction of family information (Abecasis et al. 2001; Rohde & Fuerst, 2001; Becker & Knapp, 2004a) can be used. For the nth nuclear family, 1 ≤ n ≤ N, among all parental haplotype assignments that are compatible with the observed family genotypes, we assign the four parental haplotypes (H_{2n-1}(1), H_{2n-1}(2)) and (H_{2n}(1), H_{2n}(2)) using the most likely assignment, as described by Zhang et al. (2003) and Zhao et al. (2000). The transmission frequency δ_{2n-1}(1) for the parental haplotype H_{2n-1}(1) is then calculated as

$$\delta_{2n-1}(1) = \frac{1}{c_n} \sum_{i=1}^{c_n} \xi_{2n-1}^{i}(1),$$

where c_n is the number of affected children in the nth family, ξ_{2n-1}^i(1) = 1 if only H_{2n-1}(1) of (H_{2n-1}(1), H_{2n-1}(2)) can have been transmitted to the ith affected child (assuming the transmitted haplotype from the other parent is either H_{2n}(1) or H_{2n}(2)), and ξ_{2n-1}^i(1) = 1/2 if both H_{2n-1}(1) and H_{2n-1}(2) are compatible with the ith affected child. The other transmission frequencies are calculated similarly. After the parental haplotypes have been assigned and their corresponding transmission frequencies calculated, the WT-TDT and SP-TDT methods described in the previous subsection can be used.
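
A small Python sketch of this calculation follows; the string labels for each child's compatibility status are ours, and the ξ = 0 case is the implied complement of the two cases described above.

```python
def transmission_frequency_ambiguous(compatibility):
    """Transmission frequency of one parental haplotype under phase ambiguity.

    compatibility: one entry per affected child in the family:
        'only'  -- only this parental haplotype is compatible with the child
                   (given the other parent's assigned haplotypes), xi = 1;
        'both'  -- both of this parent's haplotypes are compatible, xi = 1/2;
        'other' -- only the other parental haplotype is compatible, xi = 0
                   (the implied complementary case).
    """
    xi = {"only": 1.0, "both": 0.5, "other": 0.0}
    return sum(xi[c] for c in compatibility) / len(compatibility)

# Example: two affected children; the first can only have received this haplotype,
# the second is compatible with either of this parent's haplotypes.
print(transmission_frequency_ambiguous(["only", "both"]))   # 0.75
```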

Simulation Studies

Simulations are used to evaluate the performance of the proposed methods. All analyses are done with the phase information treated as unknown. We consider a candidate region consisting of 12 diallelic markers with a 0.1 cM map density. The disease-susceptibility locus has two alleles, D and d.

Type I Errors

To verify that the proposed WT-TDT and SP-TDT methods both have the correct type I error rates and are robust to population stratification, the simulation design of Zhang et al. (2003) is adopted. We consider stratified populations consisting of two subpopulations. The disease locus is 10 cM away from the first marker, and we assume there is no association between the disease locus and the candidate region within each subpopulation. For the first subpopulation, the disease allele frequency is pD = 0.2, with penetrances fDD = Pr(affected | DD) = 0.3, fDd = Pr(affected | Dd) = 0.16 and fdd = Pr(affected | dd) = 0.02. For the second subpopulation, the disease allele frequency is pD = 0.3, and the penetrances fDD, fDd and fdd are 0.2, 0.11 and 0.02, respectively. Let pp denote the proportion of families sampled from the first subpopulation; we consider pp = 1/2, 1/4 and 1/6 in the simulations.

Haplotypes in each subpopulation are generated using the method described by Lam et al. (2000) and Tzeng et al. (2003). Each subpopulation is founded by 100 individuals, starting from 150 generations ago. At any given generation, individuals are paired at random and give birth to a number of children. The number of offspring per couple is randomly generated according to a Poisson distribution with a mean determined by the exponential growth rate of the current generation. For the first 50 generations, the expected population size remains constant. Then the population grows exponentially for another 100 generations to the present-day population, with an expected size of 10,000 individuals. To generate 200 founder haplotypes for the first subpopulation, we assume that all markers' minor allele frequencies are 0.5, and generate each founder haplotype by randomly assigning allelic types at every marker independently. For the second subpopulation, those 200 founder haplotypes are randomly generated assuming that all minor allele frequencies are q2, which is chosen to be 0.1, 0.3, and 0.5 in our simulations. For every present-day haplotype generated, the allele at the disease locus is chosen independently according to the disease allele frequency in the corresponding subpopulation.
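
The forward simulation itself involves random mating and recombination over 150 generations and is not reproduced here; the Python sketch below only illustrates two of the stated ingredients, generating founder haplotypes with independent markers at given minor allele frequencies, and deriving a per-couple mean offspring number from the stated growth (the latter is our reading of the design; all names are ours).

```python
import numpy as np

def founder_haplotypes(n_haplotypes, mafs, seed=1):
    """Generate founder haplotypes with markers drawn independently; 1 codes the
    minor allele (drawn with probability equal to the marker's minor allele
    frequency) and 0 the major allele."""
    rng = np.random.default_rng(seed)
    mafs = np.asarray(mafs, dtype=float)
    return (rng.random((n_haplotypes, mafs.size)) < mafs).astype(int)

# 200 founder haplotypes over 12 markers: MAF 0.5 for the first subpopulation,
# MAF q2 (0.1, 0.3 or 0.5) for the second.
founders_pop1 = founder_haplotypes(200, [0.5] * 12, seed=1)
founders_pop2 = founder_haplotypes(200, [0.1] * 12, seed=2)

# One reading of "mean determined by the exponential growth rate of the current
# generation": the expected population size grows from about 100 individuals at
# generation 50 to about 10,000 at generation 150, so each couple has on average
# 2 * g children, where g is the per-generation growth factor.
g = (10_000 / 100) ** (1 / 100)    # roughly 1.047
mean_children_per_couple = 2 * g   # roughly 2.09
```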

For a given stratified-population configuration, characterized by q2 (the minor allele frequency of the founder haplotypes in the second subpopulation) and pp (the mixture proportion), we simulate 2000 stratified populations. From each simulated stratified population we generate a large number of nuclear families with one offspring, and then collect 200 pp families with one affected child from the first subpopulation and 200 (1 − pp) families with one affected child from the second subpopulation. Each nuclear family is simulated using the same procedure as described in Zhang et al. (2003).

In order to estimate type I error rates in studies of families with multiple affected children, we also simulate datasets consisting of 200 nuclear families with at least two affected offspring, ascertained from families having 8 children.

The estimated type I error rates of the two tests are given in Tables 1 and 2: Table 1 summarizes the estimates for families with a single affected child, and Table 2 those for families with multiple affected offspring. From Tables 1 and 2, it can be concluded that both WT-TDT and SP-TDT maintain type I error rates close to the nominal levels of 0.05 and 0.01.

Table 1. Type I error rates of the SP-TDT and WT-TDT tests. The sample size is 200 families, with one affected child in each family.

                   Significance level = 0.05      Significance level = 0.01
 q2      pp        SP-TDT       WT-TDT            SP-TDT       WT-TDT
 0.1     1/2       0.051        0.056             0.011        0.012
 0.1     1/4       0.051        0.057             0.009        0.009
 0.1     1/6       0.049        0.053             0.011        0.009
 0.3     1/2       0.049        0.050             0.009        0.007
 0.3     1/4       0.052        0.048             0.011        0.011
 0.3     1/6       0.060        0.056             0.013        0.014
 0.5     1/2       0.057        0.045             0.011        0.012
 0.5     1/4       0.054        0.058             0.011        0.013
 0.5     1/6       0.058        0.054             0.013        0.014

Note: q2 is the minor allele frequency in founders for the second subpopulation; pp is the proportion of nuclear families sampled from the first subpopulation.
Table 2. Type I error rates of the SP-TDT and WT-TDT tests. The sample size is 200 families, with at least two affected children in each family.

                   Significance level = 0.05      Significance level = 0.01
 q2      pp        SP-TDT       WT-TDT            SP-TDT       WT-TDT
 0.1     1/2       0.051        0.061             0.009        0.013
 0.1     1/4       0.047        0.049             0.011        0.011
 0.1     1/6       0.048        0.049             0.013        0.014
 0.3     1/2       0.049        0.054             0.012        0.011
 0.3     1/4       0.048        0.049             0.008        0.009
 0.3     1/6       0.052        0.055             0.009        0.011
 0.5     1/2       0.048        0.057             0.008        0.011
 0.5     1/4       0.043        0.044             0.007        0.010
 0.5     1/6       0.055        0.058             0.012        0.011

Note: q2 is the minor allele frequency in founders for the second subpopulation; pp is the proportion of nuclear families sampled from the first subpopulation.

Power Comparisons

The main purpose of the simulation study is to see whether there is an improvement in power by combining the sequential peeling procedure with the standard TDT based on haplotype similarity comparison. We consider a candidate region of 12 markers with 0.1 cM map density. The disease-susceptibility locus is located in the middle of the interval between markers 6 and 7.

Assuming no population stratification, each study population is generated using the same procedure as described above, except that at the 51st generation a disease-causing mutation is introduced at the disease locus on one or two randomly selected haplotypes. To generate the 200 founder haplotypes for a study population, we simulate alleles at each marker independently according to the allele frequencies; the minor allele frequency for each marker is chosen from a uniform distribution over the interval 0.1 to 0.5. To mimic a common disease, we only retain populations in which the relative frequency of the disease mutation is between 0.1 and 0.3. The above process generates study populations from which samples of carrier and non-carrier haplotypes can be drawn. For non-carrier haplotypes in the resulting study populations, the average linkage disequilibrium between adjacent markers, as measured by D′, is around 0.74.

There are many genetic factors influencing the power of TDT-type analyses; two major factors are considered here. The first is the underlying disease model. The disease allele frequency is set to pD = 0.2, with baseline penetrance fdd = 0.05. The relative risk of genotype DD versus dd, defined as RR = fDD/fdd, is varied from 2 to 6 for three disease models: recessive, additive, and dominant. The second factor is the level of founder heterogeneity, characterized by the number of ancestral haplotypes involved and their corresponding cluster sizes. We consider scenarios with one or two ancestral haplotypes. When there are two ancestral haplotypes, one of them is assumed to be the common founder for 1/2 or 1/3 of the sampled carrier haplotypes.
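
For reference, here is a short Python sketch of how the three-genotype penetrances could be derived from the baseline penetrance and the relative risk; the midpoint definition of the additive model is our assumption of what "additive" means here, since the text does not spell it out.

```python
def penetrances(model, rr, f_dd=0.05):
    """Penetrances (f_DD, f_Dd, f_dd) for a given relative risk RR = f_DD / f_dd."""
    f_DD = rr * f_dd
    if model == "recessive":
        f_Dd = f_dd                    # one copy of D confers no extra risk
    elif model == "dominant":
        f_Dd = f_DD                    # one copy of D is as risky as two
    elif model == "additive":
        f_Dd = (f_DD + f_dd) / 2.0     # assumed: midpoint on the penetrance scale
    else:
        raise ValueError(f"unknown model: {model}")
    return f_DD, f_Dd, f_dd

for model in ("recessive", "additive", "dominant"):
    print(model, penetrances(model, rr=4))
```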

Under each genetic model (defined by the above two factors), we generate 200 datasets from 200 randomly simulated study populations. Each dataset consists of 200 nuclear families with one affected child. In order to compare powers in studies of families with multiple affected children, we also simulate datasets consisting of 200 nuclear families with at least two affected offspring, ascertained from families having 8 children.

The methods compared are summarized in Table 3; the significance level is set at 0.05. When analyzing families with only one affected child, the HS-TDT method of Zhang et al. (2003) is equivalent to the MILC method of Bourgain et al. (2000, 2001, 2002). When analyzing nuclear families with multiple affected offspring, there are two ways to apply the SP-TDT and WT-TDT methods: one is to include all affected children from each family, and the other is to randomly select a trio from each family. Power comparison results are given in Figures 1 to 6; each point in the figures is based on 200 replicated datasets. From all the figures it is clear that the SP-TDT performs best among the considered methods in all the situations considered, which shows that the sequential peeling procedure is a useful strategy for dealing with founder heterogeneity. The WT-TDT method has the second best performance, which demonstrates the advantage of using the weighted similarity metric. Comparing Figure 2 with Figure 3, and Figure 5 with Figure 6, we see that when there is no dominant ancestral founder (that is, when the two founders are responsible for the same number of carriers), the advantage of the SP-TDT over the WT-TDT is not as pronounced as when there is a dominant founder, although the SP-TDT still performs best. Finally, when analyzing families with multiple affected children, Figures 4 to 6 show that both the SP-TDT and WT-TDT methods perform better than the SP-TDT1 and WT-TDT1 methods that use randomly selected trios. This suggests that there is an advantage in including all affected offspring in the analysis, rather than discarding some of them.

Table 3. Test Statistics Compared

 Test statistic   Description
 SP-TDT           Test proposed in the present article
 WT-TDT           Test statistic given by formula (3) using the weighted similarity metric
 HS-TDT           Test proposed by Zhang et al. (2003)
 SP-TDT1          The SP-TDT applied to families with one randomly selected affected child (used when analyzing families with multiple affected children)
 WT-TDT1          The WT-TDT applied to families with one randomly selected affected child (used when analyzing families with multiple affected children)

Figure 1. Power comparisons for three tests. The sample size is 200 families with one affected child in each family. There is only one ancestral haplotype involved.

Figure 2. Power comparisons for three tests. The sample size is 200 families with one affected child in each family. There are two ancestral haplotypes involved: one is the founder for one third of the mutation carrier haplotypes, and the other is the founder for the remaining two thirds.

Figure 3. Power comparisons for three tests. The sample size is 200 families with one affected child in each family. There are two ancestral haplotypes involved: one is the founder for one half of the mutation carrier haplotypes, and the other is the founder for the remaining half.

Figure 4. Power comparisons for five tests. The sample size is 200 families with at least two affected children in each family. There is only one ancestral haplotype involved.

Figure 5. Power comparisons for five tests. The sample size is 200 families with at least two affected children in each family. There are two ancestral haplotypes involved: one is the founder for one third of the mutation carrier haplotypes, and the other is the founder for the remaining two thirds.

Figure 6. Power comparisons for five tests. The sample size is 200 families with at least two affected children in each family. There are two ancestral haplotypes involved: one is the founder for one half of the mutation carrier haplotypes, and the other is the founder for the remaining half.

The order in which parental haplotype pairs are removed in the sequential peeling procedure depends on many factors, such as the percentage of transmitted haplotypes that are non-carriers, the similarity level between carrier and non-carrier haplotypes, and the similarity level between carrier haplotypes from different clusters when more than one ancestral haplotype is involved. In our limited experience, in situations where one ancestral haplotype is responsible for the majority of carrier haplotypes, parents whose transmitted haplotypes are non-carriers or carriers from small clusters tend to be excluded during the early stages of the peeling process. In situations where two founders are responsible for the same number of carriers, no obvious deletion pattern can be observed; in general, the algorithm tends to preserve the cluster of carriers that shows the higher degree of similarity.

Leptin Gene and Obesity

To assess the role of the leptin gene (LEP) in association with BMI and obesity, Jiang et al. (2004) conducted a family-based association study. They genotyped 29 SNPs spanning 240 kb across the LEP gene in 82 selected pedigrees. Their analysis showed that a number of SNPs had strong associations with both a dichotomous obesity characterization (OB) and a quantitative BMI residual measurement (BMI-R).

We focus on 20 SNPs within the LEP gene (markers 8 to 27 listed in Table 2 of Jiang et al. 2004). The dichotomous obesity characterization (OB) described in Jiang et al. (2004) is used to define affected individuals. From the 82 extended pedigrees, we select 67 nuclear families with multiple affected offspring; the criterion used in the selection is that, from each extended pedigree, the nuclear family with the largest number of affected children is chosen. The 67 selected nuclear families are analyzed in two ways: first, tests are performed on the nuclear families with all of their affected offspring included, and then on trios, by including only the most affected offspring (the one with the largest BMI-R) from each family. Four tests are performed, including the ones described in the simulation section above, as well as the TRANSMIT method (Clayton, 1999; Clayton & Jones, 1999). The results are summarized in Table 4. When performing the SP-TDT, WT-TDT and HS-TDT tests, all 20 markers are considered simultaneously. When applying the TRANSMIT program, we use the same program tuning options as Jiang et al. (2004) and focus on a moving window of three markers, so a total of 18 TRANSMIT tests are performed for each dataset. The p-value reported in Table 4 for the TRANSMIT method is the smallest p-value observed over all 18 tests, without multiple-testing adjustment.

Table 4. Tests of association between LEP and OB. Entries are p-values.

 TDT-type analysis   Based on all affected (a)   Based on the most affected (b)
 SP-TDT              0.041                       0.029
 WT-TDT              0.957                       0.914
 HS-TDT              0.972                       0.969
 TRANSMIT (c)        0.047                       0.068

Notes: (a) 67 nuclear families with all affected offspring included in the analysis. (b) 67 nuclear families with the single most affected (based on BMI-R) offspring from each family included in the analysis. (c) The TRANSMIT program applied to moving windows of three-marker haplotypes; the p-values listed are the smallest over all 18 moving-window tests, without multiple-testing adjustment.

From Table 4 it is interesting to note that a significant association (at the 0.05 level) is detected only by the SP-TDT method; the results from the TRANSMIT method would not be significant after multiple-testing adjustment. In this real data example, when the WT-TDT and HS-TDT methods are applied to the whole dataset with all affected offspring included, the association signal cannot be detected, possibly due to the effect of founder heterogeneity. In fact, both the WT-TDT and HS-TDT methods yield negative statistics, which means that the average similarity level in the transmitted haplotypes is actually lower than that observed in the non-transmitted haplotypes. With the help of the peeling procedure, we identify subsets of parents whose transmitted haplotypes demonstrate excessive allele sharing. The subset showing the strongest association signal consists of 18 parents from 15 nuclear families. Based on this subset of parents, the largest difference in the haplotype similarity comparison occurs when similarity is measured around the focal point between markers 8 and 9; both markers 8 and 9 are located in the promoter region of the LEP gene. A more detailed analysis of those 15 nuclear families will be described elsewhere. When analyzing the data including only the most affected offspring, similar conclusions can be drawn.

In the analysis by Jiang et al. (2004), the TRANSMIT test (with a moving window of three markers) was applied to all affected individuals, as well as to male and female family members separately; they found that the association signals came exclusively from men. Following their approach, we also carry out sex-specific analyses. No significant association is detected by the SP-TDT, WT-TDT or HS-TDT methods in male or female members (results not shown). One possible explanation for this discrepancy is that the sample size of the sex-specific analyses may be too small (only 36 nuclear families are available for the male-specific analysis) for haplotype similarity comparison. As mentioned before, the degrees of freedom of these tests, in a broad sense, equal the number of markers considered (20 in this example); the gain in power from gathering information across all 20 markers may not be large enough to compensate for the adverse effect of the increased degrees of freedom.

Discussion

To address the problem of founder heterogeneity in the context of family based studies, we propose the SP-TDT, a method that combines a sequential peeling procedure with the haplotype similarity based TDT. The method is applicable to nuclear families of any size, with or without ambiguous phase information. Simulation studies suggest that the SP-TDT method has the correct type I error rate in stratified populations, and enhanced power compared with the WT-TDT method and the HS-TDT method of Zhang et al. (2003). We also apply the proposed SP-TDT to the study of association between the leptin gene and obesity using data from the National Heart, Lung, and Blood Institute Family Heart Study (Jiang et al. 2004).

In studies of nuclear families with multiple affected children, a parental haplotype can be a transmitted haplotype for some of the affected children but a non-transmitted haplotype for the others. To take this transmission-status ambiguity into consideration, in the SP-TDT method we use the transmission frequencies {(δ_i(1), δ_i(2)), i = 1, …, 2N} in the calculation of haplotype similarity in (1) and (2). An alternative approach would be to designate, for each parental haplotype pair (H_i(1), H_i(2)), the haplotype with the larger transmission frequency as the "transmitted" haplotype and the other as the "non-transmitted" haplotype; if the two frequencies are equal, one is chosen at random as the "transmitted" haplotype. This approach is equivalent to using a new version of the transmission frequencies {(δ*_i(1), δ*_i(2)), i = 1, …, 2N} in the calculation of the statistic S(F) given by (1) to (3), where δ*_i(1) = 1 and δ*_i(2) = 0 if δ_i(1) > 0.5, and δ*_i(1) = 0 and δ*_i(2) = 1 if δ_i(1) < 0.5. The main advantage of this alternative approach is that it requires less computation in the calculation of (1) and (2), but some preliminary simulations (results not shown) indicate that it is not as powerful as the approach used in the SP-TDT method. One possible explanation is the following: the "hard" rule for designating transmitted haplotypes sometimes assigns a non-carrier haplotype as a transmitted haplotype, whereas in the SP-TDT that non-carrier haplotype is only counted as a fraction (equal to its transmission frequency) of a transmitted haplotype, and may therefore have a smaller adverse effect.
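
In code, this alternative "hard" rule amounts to the following small Python sketch (names are ours).

```python
import random

def hard_transmission_frequencies(delta1, delta2, rng=None):
    """Alternative 'hard' rule: the haplotype with the larger transmission
    frequency is treated as fully transmitted (delta* = 1) and the other as
    fully non-transmitted (delta* = 0); a tie is broken at random."""
    rng = rng or random.Random(0)
    if delta1 > delta2:
        return 1.0, 0.0
    if delta1 < delta2:
        return 0.0, 1.0
    return (1.0, 0.0) if rng.random() < 0.5 else (0.0, 1.0)

print(hard_transmission_frequencies(1/3, 2/3))   # (0.0, 1.0)
```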

Knapp & Becker (2004) pointed out that genotyping errors may lead to an inflated type I error rate for haplotype similarity based TDT methods. The reason is that transmitted haplotypes are partially checked for genotyping errors through Mendelian inconsistency, whereas there is no such checking at all for non-transmitted haplotypes. As a result of this unbalanced checking, non-transmitted haplotypes appear less similar than transmitted haplotypes, which may lead to an inflated type I error rate. Recently, Sha et al. (in press) proposed a simple strategy: merge each rare haplotype with the most similar common haplotype. Through simulation, Sha et al. (in press) show that this strategy can control the type I error inflation over a wide range of genotyping error rates. The rationale is that, for tightly linked markers within a candidate gene, LD between markers is expected and the total number of haplotypes across the set of markers is not large; consequently, a genotyping error in a haplotype will most likely generate a new rare haplotype. By merging each rare haplotype with the corresponding most similar common haplotype, most of the typing errors can be recovered. This strategy of Sha et al. (in press) can be adopted for the SP-TDT method.
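
A minimal Python sketch of such a merging step is given below; the rare/common count threshold and the mismatch-count notion of "most similar" are our assumptions, since the exact criteria of Sha et al. are not reproduced here.

```python
from collections import Counter

def merge_rare_haplotypes(haplotypes, min_count=5):
    """Map each rare haplotype onto the most similar common haplotype.

    haplotypes: list of haplotypes (tuples of allele labels).
    min_count: haplotypes seen at least this often count as 'common' (assumed
        threshold); 'most similar' is taken as the smallest number of
        mismatching markers (assumed notion of similarity).
    """
    counts = Counter(haplotypes)
    common = [h for h, c in counts.items() if c >= min_count]
    if not common:                      # nothing to merge into
        return list(haplotypes)

    def nearest_common(h):
        return min(common, key=lambda g: sum(x != y for x, y in zip(h, g)))

    return [h if counts[h] >= min_count else nearest_common(h) for h in haplotypes]
```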

We want to point out that the sequential peeling procedure described in this paper is a very general strategy for reducing the effect of founder heterogeneity. Through simulation studies, we have shown that there is a power improvement from combining the sequential peeling procedure with TDT-type analyses based on haplotype similarity comparison. With some modifications, the same strategy can be used in conjunction with other TDT-type analyses, such as the Geary-Moran test suggested by Clayton & Jones (1999). Furthermore, since the concept of haplotype similarity comparison can also be applied to fine-scale mapping (e.g. Molitor et al. 2003), we expect that a similar strategy to the one used in the SP-TDT method could be extended to disease-locus estimation.

We have implemented the SP-TDT method in C; the software is available from the corresponding author.

Acknowledgements

The work was supported in part by NHLBI grant 5R01HL056567-07, NIH grants U01 GM63340, and U01 HL54473. We thank the editor and referees for thoughtful comments that greatly improved an earlier version of the manuscript.

References
