The consensus approach to resolving heterogeneity among haplotype inferrals: a comment on


doi: 10.1111/j.1365-294X.2009.04146.x

Steven Hecht Orzack, Fax: 425 732 1926; E-mail:

Understanding the molecular basis of differences among individuals with respect to complex phenotypes likely requires the correct identification of the DNA sequence inherited from each parent (Hoehe 2003). Many algorithms for making such inferrals from diploid sequence data have recently been developed (see Gusfield & Orzack 2005 for review). Such methods take as input a set of ‘ambiguous’ diploid sequences; the output is a set of ‘haplotype’ pairs, one pair per individual, with each haplotype in the pair representing the sequence inherited from one of the two parents. These ‘haplotype inferral’ methods differ in their assumptions and can produce different results for any given data set (Orzack et al. 2003, 2008; Adkins 2004; Sabbagh & Darlu 2005; Marchini et al. 2006; Zhang & Zhao 2006); in addition, different sample paths of the same algorithm can produce different results (Clark 1990; Orzack et al. 2003). How one reconciles different inferrals for any given ambiguous genotype therefore becomes a central issue. Without such a reconciliation (or an a priori reason to choose one inferral over another), it is unclear how further analyses based on inferred haplotypes (such as association testing) should proceed, at least in any simple way. Choosing randomly from among competing sets of inferrals that may differ greatly in quality is an unsatisfactory prospect in this context (see below).

Huang et al. (2008) proposed a ‘consensus vote approach’ to resolving differences among competing inferrals so as to avoid an arbitrary choice from among competing sets of inferrals. They described their method as follows (p. 1932):

‘Therefore, if we compare the inference results of multiple algorithms on a given data set, the haplotypes and phase assignments that are completely agreed among all algorithms (thus with the highest consensus votes) should be most likely to be reliable; similarly, those with lower consensus votes should be less reliable, and hence need further experimental verification. We termed such an analysis “consensus vote” (CV) approach......

If proven effective, such an approach would allow us to confine experimental verification to those individuals whose phase assignment really deserves confirmation while maintaining the quality of statistical inference.’

Huang et al. cited Orzack et al. (2003) in regard to a small technical point (see p. 1943) but did not acknowledge that we had previously proposed this ‘consensus’ method in our paper (p. 915):

‘Accordingly, we explored consensus methods in which multiple inferrals for a given ambiguous genotype are combined to generate a single inferral; we show that the set of these “consensus” inferrals for all ambiguous genotypes is more accurate than the typical single set of inferrals chosen at random. We also use a consensus prediction to divide ambiguous genotypes into those whose algorithmic inferral is certain or almost certain and those whose less certain inferral makes molecular inferral preferable.’

Huang et al.'s implementation of consensus tallied results from a variety of algorithms, whereas our implementation tallied results from different sample paths of the same algorithm. (The term ‘sample path’ refers to a single execution of an algorithm; each is initiated by a different random number, which causes the sample path to be unique with respect to, say, the initial frequency distribution of haplotypes or the input order of genotypes.) Nonetheless, their approach and our approach share the essential central feature proposed by Orzack et al. (2003) that an ensemble of inferrals be assessed and a consensus ‘vote’ taken (see below and also Fullerton et al. 2004).
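The shared central feature, tallying an ensemble of inferrals for one ambiguous genotype and taking a ‘vote’, can be illustrated with a minimal sketch. The function name `consensus_inferral` and the toy sequences are assumptions made for illustration; this is not the code used by Orzack et al. (2003) or by Huang et al. (2008). The inferrals may come from different sample paths of one algorithm or from different algorithms.

```python
from collections import Counter

def consensus_inferral(inferrals):
    """Return (most common haplotype pair, number of votes) for one genotype.

    `inferrals` holds one inferred haplotype pair per sample path (or per
    algorithm). Each pair is sorted first so that (a, b) and (b, a) are
    counted as the same inferral. Ties are broken arbitrarily (by first
    occurrence), which a real analysis would need to handle explicitly.
    """
    votes = Counter(tuple(sorted(pair)) for pair in inferrals)
    pair, count = votes.most_common(1)[0]
    return pair, count

# Toy data (hypothetical): five sample paths for one ambiguous genotype.
paths = [("ACG", "TCA"), ("TCA", "ACG"), ("ACA", "TCG"),
         ("ACG", "TCA"), ("ACG", "TCA")]
pair, count = consensus_inferral(paths)
print(pair, count)  # ('ACG', 'TCA') 4
```

Here four of the five hypothetical sample paths agree, so the consensus inferral carries four votes; a threshold on that count then decides whether the inferral is accepted.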

Both approaches are motivated by the understanding that assessing the heterogeneity among different sets of inferrals is essential for any stochastic inferral algorithm. Otherwise, the reliability of any given set of inferrals remains uncertain. The need for this assessment is typically ignored. For example, almost all published analyses of ambiguous genotype data using Clark's method rely on only one sample path (e.g. Xu et al. 2002) although Clark correctly noted the need to run more than one sample path. There are other algorithms for which different sample paths can generate different inferrals (see Fig. 6 in Orzack et al. 2003).

Huang et al. (p. 1931) and many others have referred to the method of Clark (1990) as a ‘parsimony’ method since Clark suggested (pp. 117–118) that the set of inferred haplotype pairs that leaves the smallest number of genotypes unresolved is the most accurate. Clark did not provide any proof of this claim and it is not true in general. For example, consider the application of his algorithm to the human ApoE genotypes presented by Orzack et al. (2003). Each of 10 000 sample paths started with a different input order of ambiguous genotypes as determined by a random number generator (further simulation details available upon request). Since the true haplotype pair for each genotype was known, one can assess the inferral accuracy of each sample path.
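To make the dependence on input order concrete, the following is a simplified sketch of the general idea behind Clark's method: haplotypes that are known without ambiguity are used to resolve ambiguous genotypes, and each newly inferred partner haplotype is added to the known set. It is not Clark's implementation, nor the code used for the simulations reported here. The genotype encoding (`0` and `1` for homozygous sites, `2` for a heterozygous site), the function names and the way the seed sets the input order are all assumptions made for illustration.

```python
import random

def is_ambiguous(genotype):
    """Two or more heterozygous sites ('2') leave the phase unknown."""
    return genotype.count("2") >= 2

def complement(hap, genotype):
    """Partner of `hap` within `genotype`, or None if `hap` is incompatible."""
    partner = []
    for h, g in zip(hap, genotype):
        if g == "2":
            partner.append("1" if h == "0" else "0")
        elif g == h:
            partner.append(h)
        else:
            return None
    return "".join(partner)

def resolve(genotype, known):
    """Resolve `genotype` with the first compatible known haplotype."""
    for hap in known:
        partner = complement(hap, genotype)
        if partner is not None:
            return hap, partner
    return None

def clark_sample_path(genotypes, seed):
    """One sample path: the seed fixes the input order of the genotypes."""
    order = list(range(len(genotypes)))
    random.Random(seed).shuffle(order)
    known = []  # haplotypes in order of discovery
    for i in order:
        g = genotypes[i]
        if not is_ambiguous(g):
            for hap in (g.replace("2", "0"), g.replace("2", "1")):
                if hap not in known:
                    known.append(hap)
    pending = [i for i in order if is_ambiguous(genotypes[i])]
    solution, progress = {}, True
    while pending and progress:
        progress = False
        for i in list(pending):
            pair = resolve(genotypes[i], known)
            if pair is not None:
                solution[i] = pair
                if pair[1] not in known:
                    known.append(pair[1])
                pending.remove(i)
                progress = True
    return solution, pending  # pending = genotypes left unresolved

# Toy data: genotype "22" can be resolved as 00/11 or as 01/10,
# depending on which unambiguous haplotype is encountered first.
toy = ["00", "01", "22"]
print({clark_sample_path(toy, seed)[0][2] for seed in range(20)})
```

In this toy case every sample path resolves the ambiguous genotype, yet different input orders can yield different haplotype pairs, which is the heterogeneity the consensus approach is meant to address.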

Each of the 10 000 sample paths resolves all of the ambiguous genotypes. Hence, all sample paths have the same number of unresolved genotypes (none); however, there is substantial heterogeneity among these sample paths with respect to inferral accuracy (see Table 1). Note that the declining relationship between number of haplotypes used and inferral accuracy suggests that sample paths using fewer haplotypes may often be more accurate (see Orzack et al. 2003); this relationship is expected if evolutionary dynamics are at least approximately those of the infinite-sites model.
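The two quantities compared here, the number of haplotypes used by a solution and its inferral accuracy, can be computed as in the sketch below. The helper names and the toy genotypes are hypothetical; accuracy follows the definition given for Table 1 (the number of correctly inferred haplotype pairs).

```python
def haplotypes_used(solution):
    """Number of distinct haplotypes appearing anywhere in a solution."""
    return len({hap for pair in solution.values() for hap in pair})

def accuracy(solution, truth):
    """Number of genotypes whose inferred pair equals the true pair."""
    return sum(
        1 for gid, pair in solution.items()
        if sorted(pair) == sorted(truth[gid])
    )

# Toy data (hypothetical): three ambiguous genotypes with known true pairs.
truth = {"g1": ("00", "11"), "g2": ("00", "11"), "g3": ("00", "11")}
path_a = {"g1": ("00", "11"), "g2": ("11", "00"), "g3": ("00", "11")}
path_b = {"g1": ("01", "10"), "g2": ("00", "11"), "g3": ("01", "10")}

for name, path in (("A", path_a), ("B", path_b)):
    print(name, haplotypes_used(path), accuracy(path, truth))
```

Both toy sample paths resolve every genotype, so they are tied on the number of unresolved genotypes, but path A uses two haplotypes and is fully accurate while path B uses four and is correct for only one genotype.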

Table 1. Analysis of 10 000 sample paths of Clark's (1990) method for haplotype inferrals as applied to the 80 human ApoE genotypes presented by Orzack et al. (2003). There are 47 ambiguous genotypes in the data set. A solution is a set of inferred haplotype pairs for the entire set of genotypes. Accuracy is defined as the number of correctly inferred haplotype pairs. SE denotes standard error.

No. of haplotypes used   No. of sample paths   No. of distinct solutions   Accuracy average (SE)
14                          4                    1                         40.00 (0.00)
15                         12                    5                         39.00 (0.26)
16                        126                   44                         35.17 (0.21)
17                        508                  140                         31.71 (0.12)
18                        900                  243                         29.58 (0.11)
19                       1191                  316                         26.55 (0.15)
20                       1659                  415                         22.29 (0.18)
21                       2087                  445                         14.54 (0.20)
22                       2332                  285                          8.72 (0.17)
23                       1140                  118                          6.42 (0.21)
24                         41                   10                          4.12 (0.14)

The power of the consensus method is revealed in Table 2. This method can be applied to the ‘full’ set of 10 000 sample paths or it can be applied to a restricted set, say, the 16 sample paths that have a set of inferrals containing the ‘minimum +1’ number of haplotypes (14 or 15 haplotypes, as shown in Table 1); the restricted choice is motivated by the declining relationship shown in Table 1. Each consensus inferral that surpasses a given threshold (e.g. 1000 or 12) is scored as correct (C) or incorrect (IC). Consider a full consensus threshold of, say, 5000, which indicates that more than 50% of the 10 000 sample paths produced the same predicted haplotype pair for a given ambiguous genotype. There are 20 individuals that surpass this threshold; 16 of these individuals have correctly inferred haplotype pairs. This 80% accuracy far surpasses the accuracy of a randomly chosen sample path. As shown in Table 1, more than half of the sample paths (5600) correctly infer at most 15 of the 47 ambiguous genotypes; this is approximately 33% accuracy. In this example, more stringent consensus thresholds (> 5000) generate perfect accuracy but at the price of eliminating all but one of the ambiguous genotypes. However, for the same data set, these more stringent thresholds for full consensus did perform well in terms of balancing accuracy and inclusion of genotypes when applied to the results of another inferral algorithm (see Table 5 of Orzack et al. 2003).
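The counts quoted from Table 1 can be checked with a few lines of arithmetic. The sketch below transcribes the table and recomputes the total number of sample paths, the size of the ‘minimum +1’ set and the 5600 figure. It assumes that the 5600 sample paths are those in rows whose average accuracy is at most 15 (21 or more haplotypes); the row averages only approximate the accuracy of individual sample paths. The final quantity, the average accuracy of a randomly chosen sample path, is derived here from the table and is not a figure stated in the text.

```python
# Rows of Table 1:
# (haplotypes used, sample paths, distinct solutions, average accuracy)
TABLE_1 = [
    (14, 4, 1, 40.00), (15, 12, 5, 39.00), (16, 126, 44, 35.17),
    (17, 508, 140, 31.71), (18, 900, 243, 29.58), (19, 1191, 316, 26.55),
    (20, 1659, 415, 22.29), (21, 2087, 445, 14.54), (22, 2332, 285, 8.72),
    (23, 1140, 118, 6.42), (24, 41, 10, 4.12),
]

total_paths = sum(paths for _, paths, _, _ in TABLE_1)
min_haps = min(haps for haps, _, _, _ in TABLE_1)

# Sample paths using the minimum or the minimum +1 number of haplotypes.
min_plus_one_paths = sum(p for h, p, _, _ in TABLE_1 if h <= min_haps + 1)

# Sample paths in rows whose average accuracy is at most 15 (assumption).
low_accuracy_paths = sum(p for _, p, _, acc in TABLE_1 if acc <= 15)

# Path-weighted average accuracy of a randomly chosen sample path.
mean_accuracy = sum(p * acc for _, p, _, acc in TABLE_1) / total_paths

print(total_paths)         # 10000
print(min_plus_one_paths)  # 16
print(low_accuracy_paths)  # 5600
print(round(mean_accuracy, 2))
```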

Table 2. Consensus calculations for Clark's algorithm. C denotes correct inferral. IC denotes incorrect inferral.

Full consensus
Threshold   > 1000   > 2000   > 3000   > 4000   > 5000   > 6000   > 7000   > 8000   > 9000   > 9500
IC              20       18       14        7        4        0        0        0        0        0

Minimum +1 consensus
Threshold   > 2   > 3   > 4   > 5   > 6   > 7   > 8   > 9   > 10   > 11   > 12   > 13   > 14   > 15   = 16
IC            7     7     7     7     7     7     7     7      7      7      6      6      6      3      3

The performance of the minimum +1 consensus calculation is even better than for full consensus. Here, a 50% consensus threshold generates an accuracy of greater than 80% (40 correct, 7 incorrect). The accuracy improves to more than 90% (37 correct, 3 incorrect) when the most stringent threshold (all sample paths must agree) is used.
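The accuracy figures quoted for the consensus thresholds follow from the counts of correct (C) and incorrect (IC) inferrals given in the text. The sketch below recomputes them, together with the fraction of the 47 ambiguous genotypes that surpass each threshold; the function names and the ‘coverage’ label are assumptions made for illustration.

```python
N_AMBIGUOUS = 47  # ambiguous genotypes in the ApoE data set

def consensus_accuracy(correct, incorrect):
    """Fraction of genotypes surpassing the threshold that are correct."""
    return correct / (correct + incorrect)

def coverage(correct, incorrect, n_ambiguous=N_AMBIGUOUS):
    """Fraction of ambiguous genotypes that surpass the threshold."""
    return (correct + incorrect) / n_ambiguous

# (C, IC) counts as quoted in the text.
cases = {
    "full consensus, > 5000 of 10 000": (16, 4),
    "minimum +1, 50% threshold": (40, 7),
    "minimum +1, all 16 agree": (37, 3),
}

for label, (c, ic) in cases.items():
    print(f"{label}: accuracy {consensus_accuracy(c, ic):.1%}, "
          f"coverage {coverage(c, ic):.1%}")
```

The comparison makes the trade-off explicit: a stricter threshold raises accuracy among the genotypes retained but leaves more genotypes without a consensus inferral.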

Widely applicable rules for choosing a consensus threshold are yet to be determined. However, it seems likely that basing consensus on a restricted set of sample paths (e.g. those with minimum +1 haplotypes) will generally produce sets of inferrals that have markedly superior accuracy as compared to that for a randomly chosen sample path (S.H. Orzack, unpublished results). At minimum, when confronted with competing inferrals for a given ambiguous genotype, the consensus approach provides a way to generate an inferral in a principled and transparent manner.

Finally, the heterogeneity of number of haplotypes used in different sample paths (see Table 1) also illustrates that Clark's method is not a ‘parsimony’ method in the sense that it provides a set of inferrals that minimizes the number of distinct haplotypes needed to generate an inferred haplotype pair for all individuals in a given sample (Gusfield & Orzack 2005; p. 18–7). His method has been commonly but incorrectly described as being a parsimony method in this sense (e.g. Niu et al. 2002; p. 158); Huang et al.'s discussion of parsimony compounds this confusion because they (p. 1931) conflate this claim about haplotype number with Clark's claim about parsimony with respect to number of unresolved genotypes (see above).


S. O. is partially supported by NSF SEI 0513910, NIH R01 DA015789-01A2, NIA P01-AG0225000-01, and NICHD R03 HD055685-01A2. A web interface for haplotype inferral and consensus calculation as described in Orzack et al. (2003) is available at

Steven Orzack's research interests include haplotype inferral, population genetics, phylogenetic methods, life history evolution, demography, philosophy of biology, and the history of biology.