Suppose that in a study of a qualitative trait, a candidate region/gene is evaluated with *K* + 1 tightly linked markers, and that *N* nuclear families are sampled with a variable number of affected offspring in each family. A haplotype can be represented as a vector *H* = (*h*_{1}, … , *h*_{K+1}), where *h*_{k} ∈ {1, … , *t*_{k}} labels the *t*_{k} alleles at marker *k*, 1 ≤ *k* ≤ *K* + 1. The null hypothesis we wish to test is that there is no linkage or no association between the candidate region and the disease-susceptibility locus. More specifically, assuming a disease-susceptibility locus with two alleles *D* and *d*, “no linkage” means that there is no linkage between the disease-susceptibility locus and any of the *K* + 1 markers considered. For “no association”, when there is no population stratification, it is expected that for any haplotype **h**, Pr(*D*, **h**) = Pr(*D*) Pr(**h**), where Pr(*D*, **h**) is the probability of a chromosome carrying both *D* and **h**, and Pr(*D*) and Pr(**h**) are the disease allele and haplotype frequencies in the population considered. When population stratification exists, “no association” means that there is no association in any of the subpopulations from which families are collected (Monks & Kaplan, 2000).

#### Known Phase Information

Here it is assumed that for each sampled family we know all four parental haplotypes and how they are transmitted to the affected offspring. For the *n*th family, 1 ≤ *n* ≤ *N*, we denote the two haplotypes of the father as (*H*_{2n−1}(1), *H*_{2n−1}(2)) and the two of the mother as (*H*_{2n}(1), *H*_{2n}(2)). In the following discussion, all 2*N* pairs of parental haplotypes are denoted by **F** = {(*H*_{i}(1), *H*_{i}(2)), *i* = 1, … , 2*N*}. For the *i*th parental haplotype pair (*H*_{i}(1), *H*_{i}(2)), the transmission status to the affected children can be summarized as transmission frequencies (δ_{i}(1), δ_{i}(2)), where δ_{i}(1) (or δ_{i}(2)) is the proportion of affected children receiving haplotype *H*_{i}(1) (or *H*_{i}(2)). For example, if (*H*_{i}(1), *H*_{i}(2)) is from a parent of a nuclear family with three affected offspring, *H*_{i}(1) is transmitted to one affected child, and *H*_{i}(2) is inherited by the other two, then δ_{i}(1) = 1/3 and δ_{i}(2) = 2/3. For a parent of a nuclear family with one affected offspring, one of δ_{i}(1) and δ_{i}(2) is 0 and the other is 1. In all cases δ_{i}(1) + δ_{i}(2) = 1.
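The transmission frequencies can be computed directly from the per-child transmission record. The following is a minimal sketch, assuming each affected child is coded as 1 or 2 according to which haplotype of the parent it received; the function name and the coding are illustrative assumptions, not part of the original method description.

```python
from fractions import Fraction


def transmission_frequencies(received):
    """Return (delta_1, delta_2) for one parental haplotype pair.

    `received` holds one entry per affected child: 1 if the child received
    the parent's first haplotype, 2 if it received the second.
    (This coding is an assumption made for the example.)
    """
    if not received:
        raise ValueError("at least one affected child is required")
    if any(r not in (1, 2) for r in received):
        raise ValueError("entries must be 1 or 2")
    n_children = len(received)
    delta_1 = Fraction(sum(1 for r in received if r == 1), n_children)
    # The two proportions always sum to one.
    return delta_1, 1 - delta_1


# Three affected children: the first haplotype goes to one child,
# the second haplotype to the other two.
print(transmission_frequencies([1, 2, 2]))
```

Exact fractions are used so that the proportions for the two haplotypes sum to exactly one.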

Under the null hypothesis of no linkage or no association, Zhang *et al.* (2003) proved that *H*_{i} (1) and *H*_{i} (2) have the same probability of being transmitted to an affected child. Thus, under the null hypothesis, haplotype similarity in transmitted haplotypes would be around the same level as that in non-transmitted haplotypes. Under the alternative where the disease-susceptibility locus falls into the considered region, and there are disproportionately large clusters of transmitted haplotypes which inherit the mutation from relatively few common founders, those transmitted haplotypes are apt to exhibit more allele sharing than non-transmitted haplotypes do at markers near the mutation locus. This is the rationale for TDT analyses based on haplotype similarity comparison. This type of analysis depends on a pair-wise similarity metric λ_{k} (**a**, **b**), which measures the similarity between any given pair of haplotypes **a** and **b** around the sub-interval between markers *k* and *k*+ 1, with 1 ≤*k*≤*K*.

There are a number of ways of defining the pair-wise similarity metric λ_{k}(**a**, **b**); Tzeng *et al.* (2003) gave a summary of them. For example, Bourgain *et al.* (2000) use the “identity length measure”, which is the length of the contiguous region around a focal point over which the two haplotypes are IBS (identical by state). In this paper, we use the “weighted similarity measure” of Yu *et al.* (2004a) as the similarity metric. Comparing this to other commonly used similarity metrics, Yu *et al.* (2004b) demonstrated through simulations the power advantage of using such a similarity metric in the context of population-based case-control studies. The weighted similarity measure can be defined as

where λ^{L}_{k} (**a**, **b**) and λ^{R}_{k} (**a**, **b**) measure similarity to the left and right of the considered focal point through the maximum consecutive matching length. Specifically, we set λ^{R}_{k} (**a**, **b**) = 0 if *a*_{k+1}≠*b*_{k+1}, and

Here *I*^{R}_{k}={*k*+ 1, *k*+ 2, …} is the largest set of contiguous marker labels for which haplotypes **a** and **b** share the same allele. The weights *w*^{R}_{i} are chosen to favor IBS in rare alleles more than in common alleles, and are defined as

Here *A*_{i} is the allele at marker *i* on a randomly selected haplotype from the population. We estimate those probabilities based on non-transmitted haplotypes. λ^{L}_{k} (**a**, **b**) is defined similarly. If we let all *w*^{L}_{i}, *i*∈*I*^{L}_{k} and *w*^{R}_{i}, *i*∈*I*^{R}_{k} equal 1, then λ_{k} (**a**, **b**) is the number of consecutive IBS markers around the considered sub-interval, which is very similar to the identity length measure (Bourgain *et al.* 2000) if we assume markers are evenly spaced.
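To make the left/right construction concrete, here is a hedged sketch of a similarity function around one sub-interval. It assumes that the two one-sided terms are added, that each one-sided term is the sum of the weights over the run of consecutive matching markers, and that the weights are supplied by the caller. The function names, the 0-based indexing, and the default of unit weights are assumptions made for the example; with unit weights the result is the number of consecutive IBS markers around the sub-interval.

```python
def one_sided(a, b, start, step, weights):
    """Sum weights over the run of matching markers starting at `start`
    and moving in direction `step` (+1 = right, -1 = left)."""
    total = 0.0
    i = start
    while 0 <= i < len(a) and a[i] == b[i]:
        total += weights[i]
        i += step
    return total


def similarity(a, b, k, w_left=None, w_right=None):
    """Similarity of haplotypes a and b around the sub-interval between
    markers k and k + 1 (0-based indices, an assumption of this sketch)."""
    if len(a) != len(b):
        raise ValueError("haplotypes must have the same number of markers")
    if not 0 <= k < len(a) - 1:
        raise ValueError("k must index a sub-interval between two markers")
    ones = [1.0] * len(a)
    w_left = ones if w_left is None else w_left
    w_right = ones if w_right is None else w_right
    left = one_sided(a, b, k, -1, w_left)        # zero if a[k] != b[k]
    right = one_sided(a, b, k + 1, +1, w_right)  # zero if a[k+1] != b[k+1]
    return left + right


a = [1, 2, 2, 1, 3]
b = [1, 2, 2, 2, 3]
print(similarity(a, b, 1))  # unit weights: count of consecutive IBS markers
```

Passing allele-frequency-based weights through `w_left` and `w_right` would let matches at rare alleles count for more than matches at common alleles.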

When analyzing families with multiple affected children, it is possible that one parental haplotype is transmitted to some affected children but not to others. To take this transmission status ambiguity into consideration, for the parental haplotype pair (*H*_{i}(1), *H*_{i}(2)), we treat its “transmitted haplotype” as a mixture of *H*_{i}(1) and *H*_{i}(2), with δ_{i}(1) and δ_{i}(2) as their mixture proportions. Similarly, its “non-transmitted haplotype” can be regarded as a mixture of *H*_{i}(1) and *H*_{i}(2), with 1 − δ_{i}(1) and 1 − δ_{i}(2) as their mixture proportions. Then the haplotype similarity between the transmitted haplotypes from the *i*th and *j*th parents can be estimated as the expected similarity under the two mixtures, Σ_{u=1}^{2} Σ_{v=1}^{2} δ_{i}(*u*) δ_{j}(*v*) λ_{k}(*H*_{i}(*u*), *H*_{j}(*v*)).
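The mixture view translates into a short double sum over the two haplotypes of each parent. The sketch below assumes a caller-supplied pair-wise similarity function `sim` and represents each parent as a pair of haplotypes with a pair of mixture proportions; all names are illustrative assumptions.

```python
def mixture_similarity(pair_i, props_i, pair_j, props_j, sim):
    """Expected similarity between two 'mixture haplotypes'.

    pair_* : (H(1), H(2)) for one parent
    props_*: mixture proportions for H(1) and H(2)
    sim    : pair-wise similarity function, sim(a, b) -> number
    """
    return sum(
        props_i[u] * props_j[v] * sim(pair_i[u], pair_j[v])
        for u in (0, 1)
        for v in (0, 1)
    )


def transmitted_similarity(pair_i, delta_i, pair_j, delta_j, sim):
    # Transmitted mixture: proportions are the transmission frequencies.
    return mixture_similarity(pair_i, delta_i, pair_j, delta_j, sim)


def nontransmitted_similarity(pair_i, delta_i, pair_j, delta_j, sim):
    # Non-transmitted mixture: proportions are 1 - delta.
    comp_i = tuple(1 - d for d in delta_i)
    comp_j = tuple(1 - d for d in delta_j)
    return mixture_similarity(pair_i, comp_i, pair_j, comp_j, sim)


def matches(a, b):
    """Toy similarity for the example: number of matching markers."""
    return sum(x == y for x, y in zip(a, b))


parent_i = ((1, 1), (1, 2))   # one affected child, received H_i(1)
parent_j = ((1, 1), (2, 1))   # two affected children, one of each
print(transmitted_similarity(parent_i, (1, 0), parent_j, (0.5, 0.5), matches))
print(nontransmitted_similarity(parent_i, (1, 0), parent_j, (0.5, 0.5), matches))
```

When every family has a single affected child, each proportion is 0 or 1 and the double sum reduces to the similarity between the two haplotypes actually transmitted (or not transmitted).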

Based on that, we can measure the average similarity for transmitted haplotypes from the set **F** of all parents as

- (1)

Similarly, the average similarity for non-transmitted haplotypes from all parents can be calculated as

- (2)

The null hypothesis can be tested using the following statistic,

- (3)

When every sampled nuclear family has only one affected child, and the “identity length measure” of Bourgain *et al.* (2000) is used, the test statistic (3) becomes the MILC method of Bourgain *et al.* (2000), which is also equivalent to the HS-TDT of Zhang *et al.* (2003).

To evaluate the *p* value of the test given by (3), the permutation procedure proposed by Monks & Kaplan (2000) can be used. We simultaneously permute the transmitted and non-transmitted status of the parental haplotypes for all affected children in the family. By doing this, we can generate null datasets in the presence of linkage. This is equivalent to randomly permuting the two haplotypes in each pair (*H*_{i}(1), *H*_{i}(2)) while leaving the transmission frequencies (δ_{i}(1), δ_{i}(2)) unchanged; that is, if after the permutation the haplotype pair becomes (*H*_{i}(2), *H*_{i}(1)), the transmission frequencies for *H*_{i}(2) and *H*_{i}(1) are set to δ_{i}(1) and δ_{i}(2), respectively. In the following discussion, we call the test based on statistic (3) with the weighted similarity metric the weighted haplotype similarity based TDT method (WT-TDT).
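The permutation step can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the test statistic is passed in as a generic callable `statistic(pairs, deltas)`, each pair is swapped independently with probability 1/2, large values of the statistic are taken as evidence against the null, and the *p* value uses the common (hits + 1)/(permutations + 1) estimator. All function and parameter names are hypothetical.

```python
import random


def permute_pairs(pairs, rng):
    """Swap the two haplotypes within each parental pair with probability
    1/2. The transmission frequencies are deliberately left untouched."""
    return [(h2, h1) if rng.random() < 0.5 else (h1, h2) for h1, h2 in pairs]


def permutation_p_value(pairs, deltas, statistic, n_perm=1000, seed=0):
    """One-sided permutation p value for `statistic(pairs, deltas)`."""
    rng = random.Random(seed)
    observed = statistic(pairs, deltas)
    hits = sum(
        statistic(permute_pairs(pairs, rng), deltas) >= observed
        for _ in range(n_perm)
    )
    # Adding one to numerator and denominator keeps the estimate above zero.
    return (hits + 1) / (n_perm + 1)
```

Because only the haplotype labels within each pair are exchanged, the number of affected children per family and the observed transmission frequencies are preserved in every permuted dataset.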

Since founder heterogeneity almost always exists in studies of complex diseases, we expect some transmitted haplotypes to be non-carriers, or carriers from relatively small clusters. Those transmitted haplotypes would adversely affect the power of haplotype similarity based TDT methods by diminishing *S*(**F**). To increase the power of the WT-TDT method under founder heterogeneity, we adopt the idea of the sequential peeling method described in Yu *et al.* (2004b), and call the improved method SP-TDT. The goal of the SP-TDT method is to test the null hypothesis based on a subset of **F** that has a decreased degree of founder heterogeneity and provides the strongest evidence against the null hypothesis. There are three steps involved in SP-TDT, which are very similar to those given in Yu *et al.* (2004b); here we provide only a brief summary. First, we want to remove parental haplotype pairs whose transmitted haplotypes are most likely non-carriers, or carriers from small clusters. This goal is achieved by sequentially peeling away parental haplotype pairs so as to maximize the test statistic *S*(·) given by (3) for the remaining pairs. The peeling process generates a sequence of nested subsets **F**_{0} ⊃ **F**_{1} ⊃ … ⊃ **F**_{M}, each with *S*(**F**_{j}) as large as possible, where **F**_{0} = **F** is the original data. The algorithm for this first step is very similar to the one given in Yu *et al.* (2004b).
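As a rough illustration of the peeling idea only, and not the exact procedure of Yu *et al.* (2004b), a greedy version can be sketched as below. It assumes that one pair is removed per step, that the pair removed is the one whose removal gives the largest statistic on the remaining pairs, and that peeling stops after a fixed number of steps or when two pairs remain. The names `greedy_peel`, `statistic`, and `max_steps` are hypothetical.

```python
def greedy_peel(items, statistic, max_steps):
    """Greedy sequential peeling.

    Returns a list of (subset, statistic value) for the nested subsets
    F_0, F_1, ..., where each subset drops one element from the previous
    one, chosen to maximize `statistic` on what remains.
    """
    current = list(items)
    path = [(list(current), statistic(current))]
    for _ in range(max_steps):
        if len(current) <= 2:  # a pair-wise statistic needs at least two
            break
        # Index of the element whose removal gives the largest statistic.
        best = max(
            range(len(current)),
            key=lambda r: statistic(current[:r] + current[r + 1:]),
        )
        current = current[:best] + current[best + 1:]
        path.append((list(current), statistic(current)))
    return path


# Toy example: the "statistic" is the mean of the remaining values.
def mean(values):
    return sum(values) / len(values)


for subset, value in greedy_peel([5, 1, 4, 0], mean, max_steps=2):
    print(subset, value)
```

Since the subset is chosen to maximize the statistic, any significance assessment has to repeat the whole peeling search on each permuted dataset rather than reuse the subset selected on the observed data.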

#### Ambiguous Phase Information

In the previous subsection we described the proposed SP-TDT method for data with known phase information. In real data applications, however, parental haplotypes and their transmission statuses may not be uniquely determined. Next, we extend the approach described above to families with ambiguous phase information.

There are two commonly used approaches for dealing with phase ambiguity. The first is to assign the most likely haplotypes (based on estimated haplotype frequencies) that are compatible with the observed genotypes, and then to analyze the data as if the phase information were known. The second is to weight each possible haplotype assignment according to the estimated haplotype frequencies, and then to compose a statistic based on those weights. Zhang *et al.* (2003) compared the performance of these two approaches for the HS-TDT method and found that they have similar power. However, the second approach imposes an additional computational burden.

We adopt the first approach mainly for its simplicity. To estimate haplotype frequencies, the E-M algorithm under the restriction of family information (Abecasis *et al.* 2001; Rohde & Fuerst, 2001; Becker & Knapp, 2004a) can be used. For the *n*th nuclear family, 1 ≤ *n* ≤ *N*, among all parental haplotype assignments that are compatible with the observed family genotypes, we assign the four parental haplotypes (*H*_{2n−1}(1), *H*_{2n−1}(2)) and (*H*_{2n}(1), *H*_{2n}(2)) using the most likely assignment as described by Zhang *et al.* (2003) and Zhao *et al.* (2000). Then the transmission frequency δ_{2n−1}(1) for the parental haplotype *H*_{2n−1}(1) is calculated as the average δ_{2n−1}(1) = (1/*c*_{n}) Σ_{i=1}^{c_n} ξ^{i}_{2n−1}(1),

where *c*_{n} is the number of affected children in the *n*th family, ξ^{i}_{2n−1}(1) = 1 if only *H*_{2n−1}(1) in (*H*_{2n−1}(1), *H*_{2n−1}(2)) can be transmitted to the *i*th affected child (assuming the transmitted haplotype from the other parent is either *H*_{2n}(1) or *H*_{2n}(2)), ξ^{i}_{2n−1}(1) = 1/2 if *H*_{2n−1}(1) and *H*_{2n−1}(2) are both compatible with the *i*th affected child, and ξ^{i}_{2n−1}(1) = 0 if only *H*_{2n−1}(2) is compatible. Other transmission frequencies are calculated similarly. After the parental haplotypes are assigned and their corresponding transmission frequencies are calculated, the WT-TDT and SP-TDT methods described in the previous subsection can be used.
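A small sketch of this calculation is given below. It assumes that, for each affected child, we already know whether each of the parent's two assigned haplotypes is compatible with the child's genotype; the compatibility check itself is not shown, and the function names and input coding are illustrative assumptions.

```python
from fractions import Fraction


def xi(h1_compatible, h2_compatible):
    """Per-child contribution for the parent's first haplotype."""
    if h1_compatible and h2_compatible:
        return Fraction(1, 2)   # transmission is ambiguous
    if h1_compatible:
        return Fraction(1)      # only the first haplotype fits
    if h2_compatible:
        return Fraction(0)      # only the second haplotype fits
    raise ValueError("child is incompatible with both parental haplotypes")


def ambiguous_transmission_frequency(compatibility):
    """Average of xi over the affected children of one family.

    `compatibility` holds one (h1_compatible, h2_compatible) tuple of
    booleans per affected child.
    """
    if not compatibility:
        raise ValueError("at least one affected child is required")
    return sum(xi(c1, c2) for c1, c2 in compatibility) / len(compatibility)


# Three affected children: one unambiguous for H(1), one ambiguous,
# one unambiguous for H(2).
children = [(True, False), (True, True), (False, True)]
delta_1 = ambiguous_transmission_frequency(children)
print(delta_1, 1 - delta_1)
```

When no child is ambiguous, every contribution is 0 or 1 and the result coincides with the transmission frequency used in the known-phase case.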