Identity-by-Descent-Based Phasing and Imputation in Founder Populations Using Graphical Models

Accurate knowledge of haplotypes, the combination of alleles co-residing on a single copy of a chromosome, enables powerful gene mapping and sequence imputation methods. Since humans are diploid, haplotypes must be derived from genotypes by a phasing process. In this study, we present a new computational model for haplotype phasing based on pairwise sharing of haplotypes inferred to be Identical-By-Descent (IBD). We apply the Bayesian network based model in a new phasing algorithm, called systematic long-range phasing (SLRP), that can capitalize on the close genetic relationships in isolated founder populations, and show with simulated and real genome-wide genotype data that SLRP substantially reduces the rate of phasing errors compared to previous phasing algorithms. Furthermore, the method accurately identifies regions of IBD, enabling linkage-like studies without pedigrees, and can be used to impute most genotypes with very low error rate. Genet. Epidemiol. 2011. © 2011 Wiley Periodicals, Inc.35:853-860, 2011

We aim to estimate the Maximum A Posteriori configuration in the Bayesian network in Figure 1. The network has observed variables g a j ∈ {Hom 0, Het, Hom 1, M issing} for the genotypes and hidden variables h a j ∈ {00, 01, 10, 11} for the diplotypes and p a,b j for IBD relationships. The IBD indicator variables p a,b j can take values (1) IBD on first haplotypes, (2) IBD on first and second haplotype of a and b respectively, (3) IBD on second and first haplotype, (4) IBD on second haplotypes or (5) not IBD. The Bayesian network defines a probability distribution P(h, p|g) ∝ P(g|h, p)P(h, p) = P(g|h)P(p|h)P (h) (1) Equation (2) is the factorization defined by the Bayesian network in Figure 1. The only factor that is non trivial to compute is the last one, for the probability of an IBD state given the IBD state on previous marker and the diplotypes on current marker. This can be computed by observing that Equation (3) follows from the formula for conditional probability and Equation (4) from the chain rule. Equation (5) uses the Markov property of (p j ) in the numerator and marginalisation in the denominator. Finally Equation (6) follows from application of the chain rule and Markov property in a similar manner as in equations (4) and (5).
The prior probability of diplotype P(h i j ) is mostly irrelevant when the genotypes are observed since it is strongly dominated by the likelihood of the observed genotype. In the implementation the diplotype prior is essentially uniform, but with a small random noise added to break symmetries and to help convergence of the message passing algorithm. Since the noise level is low, this does not have a strong effect on the posterior but does provide an easy way to break symmetries in the model.
The "transition" probabilities P(p a,b j |p a,b j−1 ) are given by matrix exp(dQ) where d is the genetic distance between markers j − 1 and j and Q is the rate matrix with rate g of gaining an IBD relationship and rate l of losing IBD. This results in a rate matrix The expected length of a non-IBD IBS segment (the expected waiting time in the 5th state) is (4g) −1 and the expected length of an IBD segment is l −1 .
The probability P(IBD at site j|no IBD at state j − 1) is additionally bound from above with the kinship coefficient given as an input parameter (default 0.04).
Conditional joint probability of diplotypes P(h a j , h b j |p) is the product of the allele frequencies of the freely variable alleles (three frequencies in IBD states, four in the non-IBD state) when the IBD state is consistent with the diplotypes and zero when the IBD state is inconsitent with the diplotypes.
The allele frequencies are estimated from the data using the Beta-Binomial model with prior frequency distribution Beta(1, 1) which is the equivalent of one heterozygous "pseudoindividual". We find approximate MAP settings of the variables in Figure 1 by the Min-Sum algorithm [Kschischang et al., 2001] which is a variant of a general class of message passing algorithms similar to e.g. Cluster Variation Method [Kikuchi, 1951], LDPC [Gallager, 1962] and Turbo Code [Berrou et al., 1993] decoding, Forward-Backward, Viterbi [Durbin et al., 1998] and Loopy Belief Propagation [Pearl, 1988] algorithms. Instead of maximizing the product in the right hand side of Equation (2) the algorithm minimizes the negative logarithm counterpart of it The algorithm proceeds by iteratively updating "messages" sent along the edges between the variables (nodes) of the network in Figure 1. The algorithm is initialized by setting all messages to zero.
To describe the Min-Sum algorithm for Bayesian networks, we follow the presentation in [Kschischang et al., 2001] with notation transferred from the ( , ) semiring to (min, ). In Bayesian networks the messages are always a function of the parent variable. Let π Y (x) denote a message sent from parent X to child Y (down an edge) and λ Y (x) a message sent from child Y to parent X (up an edge). Intuitively the messages code for "beliefs" of the source variable about the values of the target variable. We denote the parents of a variable X by A(X), its children by D(X) and its value by x. For every variable X we have message λ X (a) to its parent variable and message π D (x) to its child variable D ∈ D(X) The messages are further shifted so that the minimum value of each message vector equals zero, i.e. min x π D (x) = min x λ A (x) = 0 for all D and A.
The final "beliefs" are obtained by calculating whose min approximates the −log maximum posterior probability and arg min is the most probable value for x. For proofs and motivation for the formulas, see [Kschischang et al., 2001].
One potential issue with the Min-Sum algorithm is ensuring convergence.
To improve it we do not use the messages calculated in Equations (9)-(10) as is, but instead dampen the update by setting the new message to be a weighted average of the previous and the newly calculated message. The weight, or damping factor, is given by the user (default 0.75) with higher values resulting in slower but more robust convergence [Murphy et al., 1999].
With these parameters, and the default expected non-IBD segment length of 1 cM and expected IBD segment length of 10 cM, we have observed the message updates to converge within our default of 30 iterations.

Message equations for SLRP
Derived from equations (9) and (10) the Min-Sum message update formulas for the Bayesian network in Figure 1 are The messages in Equations (12) [Murphy et al., 1999] The MAP phase for a diplotype H a j is found as which approximately corresponds to the max-marginal of the diplotype, i.e. most probable phase when other variables are set to their maximal value.
A diplotype phase is not called if the two most probable phases are within a defined MAP probability ratio of each other (default 2). The implementation stores only the current diplotype beliefs and the messages coming from the IBD variables to the diplotype variables.
The Min-Sum algorithm is executed only over markers and pairs of individuals that are found IBD during preprocessing.

Computational performance
To enable phasing of large datasets including tens of thousands of individuals, the implementation includes features to distribute the workload over multiple processors on shared and non-shared memory architectures using threads or MPI. Furthermore there are options to decrease the memory requirement by using single, instead of double, precision floating point numbers and by doing the message passing in a wave over the chromosome. Also, there is an option to to prune the set of plausible IBD segments such that each site is covered at least by given number of most likely plausible IBD segments. Briefly, in the pruning mode, we search, for each individual, all plausible IBD segments and retain the most likely such that each site is covered by the given number of segments, if possible. In essence, this method provides a soft lower bound for depth of coverage of plausible IBD segments. In practice, most sites seem to be covered by 2-3 times the limit coverage, when given sufficiently large number of individuals to phase. Neither the wave or the pruning modes were used in the experiments presented in the main paper.
In the wave mode, illustrated in Figure
The results for different thresholds (used in post processing) are given in Table 3. The results reported in the main text are for threshold 10 −6 which was recomended in the fastIBD paper.