Haplotype reconstruction for genetically complex regions with ambiguous genotype calls: Illustration by the KIR gene region

Advances in DNA sequencing technologies have enabled genotyping of complex genetic regions exhibiting copy number variation and high allelic diversity, yet it is impossible to derive exact genotypes in all cases, often resulting in ambiguous genotype calls, that is, partially missing data. An example of such a gene region is the killer‐cell immunoglobulin‐like receptor (KIR) genes. These genes are of special interest in the context of allogeneic hematopoietic stem cell transplantation. For such complex gene regions, current haplotype reconstruction methods are not feasible as they cannot cope with the complexity of the data. We present an expectation–maximization (EM)‐algorithm to estimate haplotype frequencies (HTFs) which deals with the missing data components, and takes into account linkage disequilibrium (LD) between genes. To cope with the exponential increase in the number of haplotypes as genes are added, we add three components to a standard EM‐algorithm implementation. First, reconstruction is performed iteratively, adding one gene at a time. Second, after each step, haplotypes with frequencies below a threshold are collapsed in a rare haplotype group. Third, the HTF of the rare haplotype group is profiled in subsequent iterations to improve estimates. A simulation study evaluates the effect of combining information of multiple genes on the estimates of these frequencies. We show that estimated HTFs are approximately unbiased. Our simulation study shows that the EM‐algorithm is able to combine information from multiple genes when LD is high, whereas increased ambiguity levels increase bias. Linear regression models based on this EM, show that a large number of haplotypes can be problematic for unbiased effect size estimation and that models need to be sparse. In a real data analysis of KIR genotypes, we compare HTFs to those obtained in an independent study. Our new EM‐algorithm‐based method is the first to account for the full genetic architecture of complex gene regions, such as the KIR gene region. This algorithm can handle the numerous observed ambiguities, and allows for the collapsing of haplotypes to perform implicit dimension reduction. Combining information from multiple genes improves haplotype reconstruction.

haplotypes as genes are added, we add three components to a standard EMalgorithm implementation.First, reconstruction is performed iteratively, adding one gene at a time.Second, after each step, haplotypes with frequencies below a threshold are collapsed in a rare haplotype group.Third, the HTF of the rare haplotype group is profiled in subsequent iterations to improve estimates.A simulation study evaluates the effect of combining information of multiple genes on the estimates of these frequencies.We show that estimated HTFs are approximately unbiased.Our simulation study shows that the EMalgorithm is able to combine information from multiple genes when LD is high, whereas increased ambiguity levels increase bias.Linear regression models based on this EM, show that a large number of haplotypes can be problematic for unbiased effect size estimation and that models need to be sparse.In a real data analysis of KIR genotypes, we compare HTFs to those obtained in an independent study.Our new EM-algorithm-based method is the first to account for the full genetic architecture of complex gene regions, such as the KIR gene region.This algorithm can handle the numerous observed ambiguities, and allows for the collapsing of haplotypes to perform implicit dimension reduction.Combining information from multiple genes improves haplotype reconstruction.

| INTRODUCTION
The advance of DNA sequencing technologies has resulted in the generation of an enormous amount of genotype data which can be used to detect associations between genes and diseases or normal traits.Most often, this data only contains unphased genotypes, that is, not indicating which alleles lie on the same chromosome (Browning & Browning, 2011).Arguably, the analysis of genotype data is best exploited through phased haplotypes, because an allelic variation in a gene can alter upstream gene expression, potentially affecting gene functioning (Browning & Browning, 2011;Tewhey et al., 2011).Haplotypes can be obtained at considerable cost experimentally, by long-range sequencing or via genotyping of family members (Becker & Knapp, 2004;Böhringer & Pfeiffer, 2009;Dudbridge, 2003Dudbridge, , 2008;;Osoegawa et al., 2019).Alternatively, haplotypes can be reconstructed statistically (Excoffier & Slatkin, 1995).
Many of the haplotype reconstruction methods are based on an expectation-maximization (EM)-algorithm, which is able to handle diploid, unphased genotypes (Excoffier & Slatkin, 1995).The unobserved phase of haplotypes is treated as missing data and maximum likelihood estimates (MLEs) can be derived when Hardy-Weinberg equilibrium (HWE) is assumed (Browning & Browning, 2011;Dudbridge, 2003Dudbridge, , 2008;;Excoffier & Slatkin, 1995;Schäfer et al., 2017).Important for the EM-algorithm is to limit the number of possible haplotypes, to keep the algorithm computationally feasible.Iterative strategies have been devised, for example, so-called partition-ligation where a divide-and-conquer strategy is used (Qin et al., 2002).Another common approach is collapsing of rare haplotypes (Sinnwell & Schaid, 2018).It is also possible to exploit information from first-degree relatives to limit the number of possible haplotypes (Becker & Knapp, 2004;Solloch et al., 2020).
In many cases, haplotype reconstruction is an intermediate step in a disease association analysis, yielding posterior probabilities as input for outcome analyses.In such analyses, either the effects of alleles separately or of haplotypes can be estimated in regression models.Additionally, haplotype reconstruction can be helpful in resolving ambiguous genotype calls.Also Bayesian approaches are possible, which allow one to incorporate additional assumptions, such as sequence similarity, or additional, external information (Browning & Browning, 2011;Stephens et al., 2001).Incorporating such information can improve haplotype reconstruction under certain circumstances (Browning & Browning, 2009).
There is a general trade-off within and across all methods with regard to the computation costs and phasing accuracy.As a consequence, haplotype reconstructions might differ for a given data set, posing an additional analysis challenge (Al Bkhetan et al., 2019).Additionally, the stochastic nature of some algorithms requires care in choosing various parameters to ensure reproducible analyses (Becker & Knapp, 2004).Nevertheless, low-coverage next-generation sequencing (NGS) data combined with reference data might be informative enough to challenge conventional singlenucleotide polymorphism arrays in genotyping efficiency (Rubinacci et al., 2021).
Recently, more genetically complex gene regions, such as the killer-cell immunoglobulin-like receptor (KIR) gene region have been investigated.KIR receptors play an important role in regulating natural killer cell functioning.In the context of allogeneic hematopoietic stem cell transplantation (alloHCT), the KIR genotype of the stem cell donor might hence determine immune reactions against residual leukemic cells and thus influence the risk of relapse after transplantation.So far, research groups failed to validate predictive models for patient outcomes after alloHCT based on the absence or presence of single KIRs and their cognate ligands (Boudreau et al., 2017;Schetelig et al., 2020Schetelig et al., , 2021;;Venstrom et al., 2012).Full haplotype information, which describes the KIR gene repertoire more comprehensively, might be more informative to predict outcomes (Cooley et al., 2010;Guethlein et al., 2021).
There are 17 KIR genes described in the human genome.A haplotype usually contains between 7 and 12 genes and due to copy number variation (CNV), multiple or no gene copies per haplotype are possible, with up to four copies per gene having been reported (Jiang et al., 2012;Wagner et al., 2018).Additionally, extensive allelic polymorphism occurs, with diversity ranging between 16 and 158 alleles for a single gene (Béziat et al., 2016;Wagner et al., 2018).Due to sequence similarities, it is difficult for current typing approaches to genotype a KIR gene correctly on its allelic level, resulting in ambiguities.However, sequence differences between alleles can greatly influence receptor functioning (Béziat et al., 2016).The high diversity of the KIR gene region has been attributed to its role in the immune system in analogy with other gene regions, like, human leukocyte antigen (HLA) (Wagner et al., 2018).
Due to this complexity, a given genotype call results in a high number of potential diplotypes for each individual, which cannot be handled by conventional EM-algorithms.Additionally, due to limitations in the genotyping process, several forms of ambiguity remain in marginal genotype calls.Therefore, apart from haplotype phase, ambiguities have to be dealt with.To our knowledge current methods are infeasible for such type of data (Natsuhiko, 2012;Sakaue et al., 2022;Sinnwell & Schaid, 2018;Vukcevic et al., 2015).We therefore propose a method to reconstruct haplotypes for complex gene regions, such as the KIR loci (Juhos et al., 2016) that can accommodate various forms of ambiguity.Inferring the haplotype structure, on the one hand, allows one to help resolving observed ambiguities, on the other hand allows one to infer true, biological effects.Such biological implications are usually tackled in a second step after reconstruction, when haplotypes are associated with an outcome by means of regression analyses.It is therefore important to develop practical strategies to model haplotype associations when a high number of haplotypes can be used as predictors.
Our paper is structured as follows.In Section 2, we introduce the data and present an EM-based algorithm for genetically complex regions which can handle CNVs, ambiguity in genotype calls, high allelic polymorphism, and computationally scales to many genes.In Section 2, we also present simulation studies, which are used in Section 3 to evaluate the properties of the method under different scenarios.We analyze an extensive real KIR data set.We close with a discussion and give an outlook on future directions.

| Structure of genotype data
In their most general form, genotype calls contain the number of gene copies and the allelic variant of each gene copy.Different allelic variants of the same gene are indicated by an asterisk, for example, KIR2DL1*001.Further explanation of the KIR allele nomenclature can be found in Appendix A. Due to limitations in the genotyping process, different forms of ambiguity can occur (Wagner et al., 2018).First, allelic ambiguity denotes the case of several possible alleles for a single gene copy.Second, genotypic ambiguity corresponds to multiple possibilities for the complete genotype.Third, CNV implies that there are possibly none or multiple copies of a gene on a single haplotype.As there is no location information for multiple copies, we consider multiple copies arranged in tandem when present on the same chromosome.Notationally, these copies are then separated by a hat (" ˆ").Illustrations for these three forms are shown in Supporting Information Table S1.Fourth, allelic variants reported as "NEW" have not been described before.Fifth, a genotype call reported as "POS" implies that the number of gene copies and the possible allelic variants are unknown but that the gene is present.Sixth, a negative genotype call ("NEG") indicates that there are no copies of that gene in the donor's genome.Technically, "NEG" does not represent an ambiguity, but it is a nonstandard genotype call.Additionally, the well-known phasing ambiguities occur when combining two heterozygous loci.

| KIR data
The KIR genotyping data were collected from 4077 hematopoietic stem cell donors who were registered and provided written informed consent at the Collaborative Biobank (CoBi).Approval for reuse of the data in the current study was obtained from CoBi.Each of the donors had supplied a graft for an unrelated HLA compatible patient with Myelodysplastic Syndromes or Acute Myeloid Leukemia patient.More detailed information about these cohorts is described elsewhere (Schetelig et al., 2020(Schetelig et al., , 2021)).For each of the 17 KIR genes, a separate genotype call was obtained via highresolution short-amplicon-based NGS of exons 3, 4, 5, 7, 8, and 9. Data preprocessing was performed by combining genotyping results of the KIR2DS4 and KIR2DS4N into a single locus due to high sequence similarities, and by collapsing all allelic ambiguities that originate from nonsequenced exons into an artificial allele (indicated in results by a trailing "c") (Solloch et al., 2020).Subsequent data processing steps were performed to call genotypes as described elsewhere (Wagner et al., 2018).
In these data, all genotyped individuals have at least one genotype call that has one of the above-mentioned ambiguities, with on average ambiguities for 9.68 KIR genes (out of the 16 genes) on the allelic level.An overview of the frequencies of ambiguities is shown in van der BURG ET AL.
Table 1 and Supporting Information Table S2.For the analysis, a subset of seven genes was chosen based on their potential effects reported earlier (Boudreau et al., 2017;Cooley et al., 2010;Guethlein et al., 2021;Schetelig et al., 2020Schetelig et al., , 2021;;Venstrom et al., 2012).Donors that had a "POS" call for any of these seven genes were excluded to reduce complexity, leaving a total of 3547 donors in the data set.
A simple illustrative example of these genotype calls and how the information of genes is combined is shown in Supporting Information Table S3.For each gene, a set of diplotypes is compatible with the genotype call, which are combined to form a multilocus diplotype.In our diplotype notation, the two haplotypes of a diplotype are separated by a plus ("+") and the alleles of the different loci on a haplotype by a minus ("−").If there is no gene copy for a gene, it is indicated with "NEG."In the example of Supporting Information Table S3, ambiguities are omitted in the genotype calls which would otherwise make the diplotype lists of each gene more extensive.

| Statistical methods
For this study, the information of the independently obtained genotype calls are combined to form multilocus haplotypes and diplotypes.The DNA sequence variations observed at a single locus are called alleles, although a tuple of alleles spanning multiple loci on the same chromosome is called a haplotype.When considering a variable number of loci, we also include single alleles as a special case of a haplotype to simplify the notation.We number the distinct haplotypes j M = 1, …, .Each diplotype is a pair of haplotypes, that is, H H H = ( , ) 2 .We number the distinct diplotypes k D = 1, …, .We stress, that we treat CNV occurring within a locus as alleles, that is, the complete variation within a locus including tandem copies is enumerated each being treated as a separate allele.
Haplotype frequencies (HTFs) can be estimated by optimizing the likelihood (see below).Under the hypothetical complete data scenario, no ambiguities are present and the true diplotype of each individual is known.The maximum likelihood estimator is then given as the sample frequencies of the haplotypes.However, due to the genotype and phase ambiguities in the data, for each individual, a multitude of haplotypes and diplotypes is compatible with their observed genotype.In this observed-data scenario, all these configurations have to be considered and, due to the nonexistence of a closed-form maximum likelihood estimator, require numerical optimization.independently, identically distributed and HWE holds, the likelihood is given by

| Complete data likelihood
(1) is the parameter vector of HTFs and c h h for a heterozygote diplotype and 1 otherwise).After rearranging the terms by haplotypes, the loglikelihood becomes where I is an indicator function and x ij is the number of times haplotype j occurs in the diplotype of individual i, and 1 2 is a constant that does not depend on π.
The maximum likelihood estimator of π j is given by the population frequency: (3)

| Observed-data likelihood
Observed data are given by the (multilocus) (ambiguous) genotype calls of individuals, denoted with , where L is the number of loci.In general, the genotypes are ambiguous, that is, they represent a set of possible haplotype pairs.We denote with  h h g { } ~i 1 2 the set of possible diplotypes which are compatible with genotype g i .Constructing this set requires taking into account both genotype ambiguities and phase uncertainty.The observed-data likelihood is then given by

| The EM-algorithm
Algorithmically, we derive MLEs using an EM-algorithm and start with the form as introduced by Excoffier and Slatkin (1995).The algorithm iterates by computing expectations in the E-step and updating π l ( ) in the M-step (Dempster et al., 1977).
E-step: Denote the current estimate of π in iteration l by π l ( ) .Formally, we compute the expected complete data log-likelihood as , which gives and with Bayes' rule, we get Here, indicates whether genotype g i is compatible with diplotype H i .
M-step: The M-step is the same as the ML-estimation, plugging in . For the algorithm to start, initial HTFs are based on potential haplotype occurrences in the compatible diplotypes assuming a uniform distribution.Iterations are continued until the largest change in estimated HTFs of two subsequent iterations is smaller than 1 × 10 −5 or until a maximum number of 200 iterations is reached.On the basis of simulations (see below), this limit seems sufficient to guarantee convergence.The EM-algorithm guarantees to improve the observeddata log-likelihood with each iteration of the algorithm.It may, however, only converge to a local maximum.After convergence, individual-specific frequency vectors are obtained from Equation ( 6), quantifying how likely each diplotype is for each individual.
A main limitation of the EM-algorithm is the limited number of haplotypes that can computationally be handled.Especially, for complex gene regions, the number of haplotypes can quickly become too large, making the current EM-algorithm computationally not feasible.We address this problem by first, collapsing rare haplotypes into a special category and second, by building up HTF estimation iteratively, by adding one locus a time into the reconstruction.Collapsing of the rare haplotypes occurs after each addition of a new locus.Due to ambiguities, and unlike the standard EMalgorithm, HTFs estimated for a smaller set of loci previously estimated cannot be considered fixed when van der BURG ET AL.
| 7 adding a locus.This can lead to problems with the frequency estimation of the collapsed group as explained below.Therefore, a third modification of the EMalgorithm is required in that the frequency of the collapsed group is fixed during estimation, leading to a profile EM-algorithm.

| Dimension reduction
To limit the number of haplotypes in the EM-algorithm, haplotypes with a frequency below a predefined threshold ϵ will be collapsed.Collapsing is performed after the EM-algorithm converged with the loci currently under reconstruction.To this end, all haplotypes to be collapsed are regrouped into a new, collapsed haplotype.Next, when a new locus is added into the reconstruction and HTFs were estimated, the next round of collapsing is executed.The HTFs estimated in previous steps thus cannot be assumed fixed, but are changing in general due to additional information from the added locus that might help resolve ambiguities observed in the previous reconstruction.A full optimization of all HTFs is therefore required for every step.
A newly occurring problem is that, when collapsing haplotypes into a group, full information on ambiguities cannot be retained in all cases.This happens, for example, when two individuals have ambiguous genotypes and both carry a different low-frequency haplotype in some reconstructions.If these rare haplotypes are collapsed, the ambiguities are no longer uniquely defined, because now both haplotypes are considered to be the same.As all HTFs are reoptimized in every step, this can lead to bias in estimates and lead to changes in marginal HTFs as compared with previous steps (spillback).By absorbing more ambiguities, the rare haplotype group tends to increase in frequency in subsequent steps and this problem has to be dealt with by profiling.

| Profiling collapsed haplotypes
As the collapsed group can contain hundreds of haplotypes, the problem of spill-back mentioned above can have substantial influence on HTF estimates.To avoid this, the frequency of the collapsed group is fixed in each iteration at the sum of the HTFs in this collapsed group as estimated in the previous iteration of the EMalgorithm.We call this the profile EM-algorithm.
As ambiguities cannot be carried over for the collapsed haplotypes in all cases, also the linkage disequilibrium (LD; see Section 2.2.8) is affected by collapsing.To be able to reconstruct marginal HTFs from the final reconstruction, we additionally require marginal HTFs to be unaffected by the collapsing.To this end, after adding a new gene (gene K ) into the reconstruction, the newly estimated HTFs are reparametrized into the marginal HTFs of the previously considered loci (K − 1 genes), the allele frequencies (AFs) of the newly added locus (K th gene) and the LD values between the previous loci and the new locus (Appendix B).
Subsequently, LD values for haplotypes falling into the collapsed group with other haplotypes are set to 0. Because the collapsed group consists of a large number of haplotypes it is assumed to average out.Because these haplotypes are rare this adjustment only has a small impact.

| Reconstruction algorithm
The algorithm described so far provides a reconstruction for a single, fixed order of loci.As the collapsing depends on the order in which loci are added, in principle, all orderings should be considered, leading to exponential complexity.We use the following heuristic approach to achieve quadratic complexity.For a given reconstruction, we add each of the remaining loci in turn.Each time, HTFs are marginalized back to the starting reconstruction.We then compare how much the HTFs of the starting reconstruction have changed with the new locus, by evaluating ≔  ( ) where π j (1) and π j (new) are HTFs of the starting gene with the starting reconstruction and with a newly added locus, respectively.The reconstruction is continued with the locus with maximal T new .The added locus is therefore the most informative for the reconstruction and should help resolve ambiguities that have not yet been resolved.
To start the algorithm, all loci are considered as starting locus.In Figure 1, we show all steps of the haplotype reconstruction for a fixed starting locus.

| Kullback-Leibler divergence (KLD)
To evaluate estimation performance, we use the KLD.KLD is a nonsymmetric measure indicating how well two probability distributions, π and π R , are alike.For HTFs, the KLD is given by LD is the covariance between pairs of alleles of different loci: δ i a j b ( , ),( , ) , where i j , denote the two loci, a b , are alleles at these loci.Fixing loci and alleles, we use δ for brevity.To compare LD values of different combinations, the LD is standardized, max , where δ is the unstandardized LD and δ max the theoretical maximum of δ for fixed AFs.When δ′ is zero, there is no association, although a positive (negative) δ′ means the alleles occur more (less) frequently in phase than expected under independence.

| Information R 2
One major goal is to evaluate whether adding loci to the reconstruction can improve the resolution of ambiguities observed for a given locus, that is, can ambiguities be resolved using information from additional loci?To this end, we fix a given locus and marginalize the estimated frequency vector per individual to get AFs per individual for the given gene.This marginalization is performed after each addition of a locus and allows one to evaluate accuracy gain (or loss) as loci are added to the reconstruction.We quantify this change by calculating an information R 2 in analogy to an imputation accuracy, that is, , where I Y is the observed Fisher information and I X is the expected complete Fisher information of all probabilities estimated for each individual in the EMalgorithm (Appendix C).Informally, the information R 2 can be interpreted as the proportion of information retained in the observed (i.e., incomplete) data compared with data with known haplotypes, that is, the effective sample size.

| Majority vote (MV) analysis
To determine the benefit of applying the profile EMalgorithm in association analyses, we could not compare it to other methods mentioned in the introduction as they cannot cope with ambiguities and CNVs simultaneously.Instead, we compare with a simpler method.In this socalled MV method each individual gets assigned a single diplotype thus creating a complete data set which is simple to analyze.The choice of diplotype is based on the haplotype reconstruction of the profile EM-algorithm (with all rare haplotypes collapsed).To this end, we define the compatibility count of diplotypes as the count of how often a diplotype was compatible with observed genotypes across all individuals.The diplotype with the highest compatibility count in the reconstruction is then considered to be the individual's actual diplotype.When an individual has two or more diplotypes with the same highest frequency, a uniform, random draw is performed.

| Simulations
Our simulation studies are designed based on the aims, data-generating mechanisms, estimands, methods, and performance measures (ADEMP)-structure discussed in Morris et al. (2018), which provides a structured approach for the planning and reporting of a simulation study.

| Aims
The aim of the simulations is to evaluate the profile EMalgorithm and to determine the effect of varying LD values and/or the degree of genotype ambiguities on the precision of haplotype reconstruction.As a secondary goal, we want to assess the added value of haplotype reconstruction through the profile EM-algorithm introduced in this paper in estimating regression coefficients in an outcome model for allele effects, where the simulated outcome represents a continuous outcome of a patient after alloHCT, such as immune reconstitution of patient cells.

| Data-generation mechanism
Two types of simulations were performed, each with a different underlying data-generating mechanism.For both studies, diplotypes consisting of three loci were simulated, based on AFs for each locus and certain LD values between the loci.
For diplotype simulation, loci were added iteratively, where after each addition, the LD for the matching haplotypes was altered by changing it with fractions θ to . The LD patterns applied to the data differ between the two simulation studies.Simulated diplotypes were converted to genotypes.After data generation, ambiguities were added to these genotype calls based on varying ambiguity scenarios.An outcome was simulated by generating a normally distributed outcome based on the truly simulated diplotypes and allele effects.
In simulation study I, the added ambiguities and LD fractions were chosen to explore the full possible spectrum, varying between being absent and very high.Due to this variation, the effect of more extreme conditions on diplotype reconstruction can be analyzed.For the second simulation study, AFs, ambiguities, and LD values were directly obtained from the analysis of KIR2DS1, KIR2DL1, and KIR2DL3 in the real KIR data set.These genes were selected because of their different numbers of alleles, ambiguities, and AF patterns.To determine the effect of the LD between genes, LD fractions were varied around the point estimate of the real data.
The two simulation studies and the simulation of the outcome are further described in Appendix D. In short, each scenario consists of three genes, each gene with 4-7 alleles.A design matrix with intercept and all potential alleles, pairwise and three-locus haplotypes is generated, which is based on true, simulated diplotypes.Alleles are indicated by a single digit or number (e.g., "A," "001," or "NEG1"), where haplotypes are combinations of such alleles.Because all haplotypes have a fixed structure, their first allele is always from the first gene, their second allele from the second gene, and so forth.Together with regression coefficients the design matrix is used to simulate a linear outcome.For the analysis of the outcome, the EM-algorithm or MV are used to derive a design matrix which will be used for the regression.Because of the added ambiguities in the data, the design matrices of the outcome simulation and analysis are not necessarily the same.

| Estimands
The main estimand of the simulation studies are HTFs.By comparing estimated with true HTFs, the potential of the profile EM-algorithm to resolve observed ambiguities can be judged.A second estimand is effect sizes in the linear regression models.Often, association analyses based on genetic information are of interest.Our simulations give insight into such applications.

| Methods
For each scenario, 100 independent replications (n sim ) were run, each with a sample size of 2000 (n obs ).Each simulated data set was analyzed with the profile EMalgorithm and the MV analysis method.For both, the threshold for collapsing rare haplotypes was set at 1 × 10 −7 , as derived in the real data analysis (Section 4).Each data set consisted of three simulated genes.To facilitate valid comparisons between all replications, the order in which the genes were added into the reconstruction was fixed as given in Supporting Information Tables S4 and S5.Additionally, to ensure that in each replication the effect sizes can be compared, all haplotypes with a theoretical probability >0.05 (the candidate list haplotypes) were never collapsed into the collapsed haplotype, even if their frequency was estimated to be under the threshold.
The output of the EM-algorithm contains expected counts of diplotypes; these are used as predictors in linear regression with the simulation outcome per individual.More information about modeling choices is given in Appendix D.

| Performance measures
Impact of the effect of the different scenarios and methods on the estimation of the HTFs is assessed via the KLD and the root mean square error (RMSE) where M′ is the number of haplotypes for which estimates are available, and the matrix Z contains these estimates.The RMSE is estimated over all 100 replications and incorporates both the bias and the variance of the estimated HTFs.Standardized RMSE (sRMSE) values were calculated as For the comparison of the estimated HTFs with their theoretical value, KLD was estimated with the haplotypes that occur in all 100 independent replications (Section 2.2.7).Additionally, to determine whether the addition of genes into one analysis enhances the haplotype reconstruction, the information R 2 (Section 2.2.9) is estimated for all alleles of the starting gene that occurred in at least 5% of the replications.
To quantify simulation uncertainty due to using n = 100 sim , Monte Carlo Standard Errors (MCSEs) are estimated for the Mean Square Error (MSE) of the HTFs of each scenario.We did not choose the number of replications according to expected MCSE (Morris et al., 2018), due to computational complexity.Instead, we justify n sim empirically.Replicating high-ambiguity scenarios 10 times, the MCSE did not exceed 5 × 10 −4 , which we deem acceptable.
For the linear regression models, RMSE and bias were estimated for the effect size estimates.Additionally, the estimated prediction error (EPE) is calculated to evaluate the accuracy of the model estimates: with y i c the true outcome and y ˆi the predicted outcome.
3 | RESULT OF SIMULATIONS

| Simulation study I
Figure 2 illustrates that the EM-algorithm estimates are consistent throughout the different scenarios: with an increase in ambiguities the sRMSE increases, which is partly compensated by increasing the LD.In contrast, the MV analysis shows less consistency, because increasing ambiguities result in both higher and lower sRMSEs, depending on the allele or haplotypes.It is clear that the HTFs are estimated more accurately in the profile EM-algorithm than that in the MV analysis, although this is mainly prominent in the two-and three-gene haplotypes, because the sRMSE values for a single gene are consistently low and not influenced by LD in the data.The sRMSE values are also plotted for all 16 scenarios in a nested-loop plot (Supporting Information Figure S1), where the lines below the plot indicate the parameter combinations of the different scenarios (Rücker & Schwarzer, 2014).Here, KLD values are also shown which quantify the benefit of the profile EMalgorithm compared with the MV analysis.How diplotype frequencies estimated with the EM-algorithm change with the addition of genes are illustrated in Supporting Information Figure S2.
For the information R 2 calculations, the differences after each addition of a new gene are estimated and shown in Table 2. To compare the information R 2 for a given allele of the starting gene, the larger reconstruction is marginalized back to the AFs of gene 1.In principle, these differences should always be positive as new information is added, however, collapsing and lack of LD might also lead to loss of information.Lack of LD induces additional variability, potentially making HTF estimation less precise.HTFs are different between scenarios due to varying LD values, and due to sampling variations not all haplotypes occur in each replication.
Table 2a shows the information R 2 without any added LD or ambiguities.The truly simulated alleles-the alleles included in the diplotype simulation (i.e., the green-colored alleles in the table)-all already have a high amount of available information in the first gene analysis, which does not increase with additional genes.In contrast, most nonsimulated alleles-which should be estimated with frequency zero but the occurrence of which could not be excluded during reconstruction due to ambiguities-benefit from the second and third gene, that is, their absence becomes more apparent.These observations become more pronounced in the scenario with high LD (Table 2b).
With high ambiguity added to the data, the results are different (Table 2c).As expected, the amount of information in the one-gene analysis is much lower, and the impact of van der BURG ET AL.
F I G U R E 2 Comparison of HTF estimation.Line graphs of sRMSEs for HTFs for the four scenarios of simulation study I with (a) no LD and (b) medium LD.Left are the four ambiguity scenarios estimated with the profile EM-algorithm, right the four ambiguity scenarios estimated with the majority vote analysis.Each line indicates the sRMSE progression of a specific allele or HT over the four ambiguity scenarios.The four alleles plotted ("A," "B," "C," and "NEG1") are all from gene 1, where the haplotypes are combinations of genes 1 and 2 (when there are two alleles separated by a "−," e.g., "A-E") or a combination of genes 1-3 (when there are three alleles, e.g., "A-E-NEG3").These eight scenarios are together with the scenarios with low and high LD plotted together in Supporting Information Figure S1.EM, expectation-maximization; HT, haplotype; HTFs, haplotype frequencies; LD, linkage disequilibrium; sRMSE, standardized root mean square error.
T A B L E 2 Information R 2 simulation study I.
Note: For the scenarios with (a) no LD and no ambiguity is added; (b) high LD (0.8) and no ambiguity; (c) no LD and high ambiguity; (d) high LD and high ambiguity.The information R 2 is estimated for all haplotypes of the starting gene; average values are shown over the retained haplotypes in the independent replications.The first column of each heatmap shows the percentage of replications where the haplotype is retained, only haplotypes that are retained in at least 5% of the one-gene analyses are included.Haplotypes are ordered based on the AFs in the one-gene analysis, which are denoted in the second column.Haplotypes used for the diplotype simulation are colored green, haplotypes not used for the simulation are denoted in black.The average information R 2 of the one-gene analysis is shown in the third column, with the difference in the two-and three-gene combinations to these values shown in the fourth and fifth columns, respectively.The sixth column shows the amount of information for each haplotype in the three gene combinations.The color spectrum indicates whether the amount of information increases (green) or decreases (red) in the combination with respect to the one gene analysis.Completely white boxes indicate that there is no difference.The information R 2 heatmaps of the other 12 scenarios are displayed in Supporting Information Table S6.
| 13 adding genes into the reconstruction on the increase in information is attenuated.For most truly simulated alleles now a relative small, but positive, effect is observed with additional genes being added.Supporting Information Tables S6A and S6B show the information progression with high LD/low ambiguity and low LD/high ambiguity, respectively.The low level of ambiguity added to the data has a big effect on the benefit of adding genes, where the addition of a low level of LD only has a small impact.Overall, the combination of LD and ambiguity is beneficial for the simulated haplotypes, with throughout beneficial, albeit small, effects.
This LD/ambiguity effect is best observed when both are high (Table 2d).Haplotypes D, A ˆB, and C ˆD show a clear increase in the amount of information with multiple genes, with the amount of information for haplotype D doubling in the three-gene combination.The truly simulated alleles show a clear increase in the amount of information with multiple genes, especially compared with the scenario without any added LD.In contrast, the nonsimulated haplotypes do not seem to benefit from the added LD, compared with their values with only high levels of ambiguity.
Linear models were run based on expected allele/ haplotype counts (Supporting Information Figure S3).Two observations stand out.First, most RMSE values vary between 0 and 0.6, without any big differences between the profile EM-algorithm and the MV analysis.Second, for some haplotypes in most EM-algorithm scenarios the RMSE values are very high.These outliers are caused by some extreme estimated effect sizes in some replications, originating from the relation between the main-and the two-gene combination haplotypes.For example, the four extreme haplotypes in the first column (illustrated by the triangles on top) are "B" and all twogene combinations of "B" ("B-E," "B-F," and "B-NEG2").In one replication, "B" gets attributed an effect size of 30, although its two-gene combinations all have effects of −30.Because a donor with "B" must also have any of the three other haplotypes, the effects balance out.
This balancing out is also apparent when examining the EPE, which is comparable for the two methods (Supporting Information Figure S3).For both, higher EPE values are observed with more ambiguity in the data, although extra LD decreases the error.Interestingly, with the highest added level of LD, hardly any extreme RMSE values occur.When comparing the median values of the absolute bias of each replication, the impact of the extreme effect sizes is decreased (Supporting Information Figure S4).With low ambiguity added, the median values are comparable for both methods.However, with higher values, the bias with the profile EM-algorithm is considerably lower.Increasing the LD between genes again lowers the bias.
To determine whether the findings of the three-gene simulation study can be extrapolated to analyses with more genes (like the real data analysis with seven KIR genes), the scenario with medium LD and medium ambiguity was extended with four additional genes.Details are given in Appendix E. The resulting information R 2 plot with seven genes (Supporting Information Figure S5) shows that the trend of increasing information by adding genes continues for these additional genes.For almost all truly simulated alleles there is a clear increase in the amount of information, which is much higher than observed in the three-gene simulation studies.

| Simulation study II
Except for two special scenarios where the LD between all observed haplotype combinations equals −0.2 or −0.5 (Figure 3b), the effect of the LD adjustments has little impact on performance measures (Figure 3).These two scenarios perform worse than the rest.Possibly, this is because of lower "NEG" AFs due to the altered LD values for all three genes.In general, fewer "NEG" alleles mean more ambiguity in the data and thus more uncertainty in the estimates.
The profile EM-algorithm performs better for all scenarios, particularly because of higher RMSE values for certain haplotypes (e.g., "001-003-002") in the MV analysis.This is probably because of their low frequency, which is just above the candidate list threshold.This shows a fundamental difference between the two methods.The MV method chooses one diplotype for each individual, thereby having a bias towards frequently occurring haplotypes.As a consequence, rare haplotypes, which are more abundant with higher levels of ambiguities, are lost.In contrast, the EMalgorithm does not ignore rare haplotypes, resulting in more moderate errors.The sRMSE values are also plotted in a nested-loop plot (Supporting Information Figure S6).Again, KLD values are given to quantify the benefit of the profile EM-algorithm over the MV analysis.
Table 3b shows the information R 2 differences for one of the special scenarios with LD equal to −0.5.In contrast to the latter scenarios, most alleles of the special scenario are lost with the addition of a second gene.The heatmaps of the standard scenarios are similar to those of simulation study I (Table 3).Truly simulated haplotypes already have a high amount of information in the onegene analysis, although the nonsimulated haplotypes benefit from additional genes.More results are shown in the supplement (Supporting Information Table S7).Left are the five scenarios estimated with the profile EM-algorithm, right the five scenarios estimated with the majority vote analysis.Each line indicates the sRMSE progression of a specific allele or HT over the five LD-altered scenarios.The four alleles plotted ("001," "002," "005," "099," and "NEG1") are all from gene 1, where the haplotypes are combinations of genes 1 and 2 (when there are two alleles separated by a "−," e.g., "001-003") or a combination of genes 1-3 (when there are three alleles, e.g., "001-003-NEG3").These nine scenarios are together with the scenarios where LD is only altered for one haplotype combination plotted together in Supporting Information Figure S6.LD, linkage disequilibrium; HTFs, haplotype frequencies; sRMSE, standardized root mean square error; EM, expectation-maximization; HT, haplotype.
van der BURG ET AL.
For all 17 different scenarios, linear models were run to estimate the haplotype effect sizes (Supporting Information Figure S7).Interestingly, the main effects of both models are biased.These biases are observed in all scenarios and not limited to haplotypes with a nonzero effect size (Supporting Information Figure S8).This is also the reason why none of the haplotypes has a RMSE close to zero, as is the case for the first simulation Note: For scenario (a) the LD is not altered, the LD of all haplotypes is decreased by (b) high LD (0.5), (c) low LD (0.1), or increased by (d) high LD (0.5).The information R 2 is estimated for all haplotypes of the starting gene; average values are shown over the retained haplotypes in the independent replications.The first column of each heatmap shows the percentage of replications where the haplotype is retained; only haplotypes that in at least 5% of the one-gene analyses are retained are included.Haplotypes are ordered based on the AFs in the one-gene analysis, which are denoted in the second column.Haplotypes used for the diplotype simulation are colored green, haplotypes not used for the simulation are denoted in black.The average information R 2 of the one-gene analysis is shown in the third column, with the difference in the two-and three-gene combinations to these values shown in the fourth and fifth columns, respectively.The sixth column shows the amount of information for each haplotype in the three gene combinations.The color spectrum indicates whether the amount of information increases (green) or decreases (red) in the combination with respect to the one gene analysis.Completely white boxes indicate that there is no difference, whereas gray boxes indicate that the corresponding haplotype is completely collapsed.The information R 2 heatmaps of the other 13 scenarios are displayed in Supporting Information Table S7.
Abbreviations: AFs, allele frequencies; LD, linkage disequilibrium.study.Moreover, the extreme effect sizes for some haplotypes in the EM-algorithm are again noticeable, which are especially prevalent for haplotypes containing the "004" allele of the second gene.These haplotypes have relatively low frequencies, and therefore their effect sizes have a high variance.Due to the biases in the main effects, it is not surprising that the extreme effect sizes are also observed when looking at the median effects (Supporting Information Figure S9).As before, these extreme values are not observed in the EPEs of these linear models, which for most scenarios are comparable, with ratios slightly above one.

| Analysis design
Each of the seven selected genes (KIR2DS1, KIR2DS4, KIR3DS1, KIR2DL1, KIR2DL2, KIR2DL3, and KIR3DL1) of the 3547 selected donors was used as a starting gene in the profile EM-algorithm and for the MV analysis.Both analyses were run with varying threshold values (1 × 10 i − for ∈ i {04, 05, 06, 07, 08, 10, 12, 15}) for the collapsing of rare haplotypes.The choice of a threshold value is a trade-off between computational speed, bias, and the amount of information retained.
After each analysis, the final gene sequence was converted into one common gene order to facilitate comparisons between the analyses with the different starting genes.The optimal threshold value is determined by estimating the dissimilarity between the HTFs of all subsequent analyses with the KLD for the haplotypes that occur in both analyses.
HTFs are compared with those obtained in the MV analysis, and to the estimates from another study (Solloch et al., 2020), taken as a rough comparison.In this study, HTFs are estimated in a different sample with a comparable ethnical background with a lower number of participants.This study limited the number of compatible diplotypes by including the parent-offspring genetic relationship, and the complexity is further reduced by only allowing diplotypes where the two haplotypes both met copy number patterns of 92 previously described KIR haplotypes.Participants that did not have any remaining options were discarded from the study.Differences with the HTFs in our study are therefore expected but HTFs should be similar.

| Results
Of high importance for the analysis is the chosen threshold value because this has a big impact on the number of haplotypes retained and the estimated frequencies, for example, with a threshold value of 1 × 10 −4 the analysis with starting gene KIR2DS4 ends up with 19 uncollapsed haplotypes, although with 1 × 10 −15 471 haplotypes are uniquely retained.In general this also results in higher running time and requires more main memory (Supporting Information Figure S10), potentially exceeding available resources.
Conceptually, the optimal threshold value implies correct reconstruction of "important" haplotypes and collapsing of others.We determine this value via a sensitivity analysis for the profile EM-algorithm.Figure 4 shows the HTF dissimilarity between the subsequent analyses.Especially the differences between the higher threshold values are evident, with a high dissimilarity between the analyses with 1 × 10 −4 , 1 × 10 −5 , and 1 × 10 −6 as threshold value.For two of the seven genes, dissimilarities still decrease when running the analysis with threshold 1 × 10 −7 , where even lower threshold values do not change HTFs any more.On the basis of the HTF differences, 1 × 10 −7 seems to be the optimal threshold value for the real KIR data set and is therefore used for all ensuing analyses.
Table 4 shows the average values and standard deviations over the analyses with different starting genes for both analysis methods of the 20 haplotypes with the highest frequency in the profile EMalgorithm analysis.Although there are some apparent frequency differences between the two methods, that is, the frequency of the top two haplotypes differ by 0.009 and 0.013, respectively, the frequency order of the haplotypes closely matches.The standard deviation of the estimates in the MV method are somewhat higher than those from the profile EM-algorithm.However, for both methods the SDs are still low, indicating that the estimated HTFs are robust against the choice of starting locus.Additionally, the HTFs of a family-based study in a similar population are included (Solloch et al., 2020), which provides a rough comparison with some notable differences.For example, certain allelic variants which were unambiguously determined to be present in our data set did not occur in the rough comparison study.For some haplotypes, their frequencies are closer to those estimated in the profile EM-algorithm, whereas others coincide more with the frequencies from the MV analysis.Overall, the HTFs of the comparison study are comparable to those observed here, suggesting that with the collapsing of rare haplotypes, realistic HTFs can be obtained.
The information R 2 is estimated for all alleles of the starting gene after each reconstruction step to determine whether adding genes into the reconstruction enhances HTF estimation.Table 5 shows the information R 2 for the analysis with KIR2DL1 as the starting gene, the information R 2 progressions for the other six starting genes are given in Supporting Information Table S8.As observed above, haplotypes with a moderate or high HTF in general already have a high amount of information in the one-gene analysis, although haplotypes with low frequencies start with little available information, and benefit most from adding genes.However, also alleles with already a sufficient amount of information present can benefit from adding genes, that is, KIR2DL1*032N and KIR2DS1*005 (Supporting Information Table S8), where the differences clearly increase after the inclusion of the first and sixth genes, respectively.
The addition of genes can also have a negative effect on the amount of information in the analysis.For certain haplotypes, that is, KIR2DL1*003c ˆ004 and KIR2DL1*020, the amount of information retained in the analysis is reduced, leading to lower information R 2 differences than observed in the single starting gene analysis.This reduction can be compensated for at later steps, as can be seen for KIR2DL1*003c ˆ004.Moreover, certain haplotypes are lost (indicated by NA) due to the collapsing of the rare haplotypes.

| DISCUSSION
In this study, we developed an EM-algorithm to estimate HTFs for KIR genotype data when genotype calling is ambiguous.We allow for a wide range of ambiguities, including CNVs, thereby making this algorithm applicable to all human genetic regions.To our knowledge, other haplotype reconstruction methods do not allow for this wide range of ambiguities in the data.Our algorithm is important even if there is no intention to conduct haplotype-based analyses as the reconstruction of a single locus can be improved by taking into account additional loci.This is a major difference from the situation when genotypes are measured unambiguously (without error) where haplotype reconstruction can only contribute to impute missing genotypes.This benefit, however, can only be realized when sufficient LD is present in the data.
To deal with the high complexity of the data, our algorithm contains both algorithmic optimizations and heuristic modifications.First, we take an iterative approach, starting the reconstruction with a single locus and adding further loci one at a time.This approach is also taken by standard HTF estimation algorithms (Balliu et al., 2019;Sinnwell & Schaid, 2018).Unlike the standard approaches, our algorithm is potentially sensitive to the order of reconstruction, which is determined  Note: Average HTFs (SD) in the profile EM-algorithm and the majority vote (MV) analysis, together with the HTFs observed in a study, here used as a rough comparison (Solloch et al., 2020).Analyses were run for each of the seven starting genes with 1 × 10 −7 as a threshold value and haplotype names were recoded into a common order (KIR2DS1-KIR2DS4-KIR2DL1-KIR2DL2-KIR2DL3-KIR3DS1-KIR3DL1).The top 20 haplotypes are ordered numerically based on the estimates of the profile EM-algorithm.
in a greedy way.However, our analyses suggest that already with moderate threshold values for collapsing rare haplotypes this is not the case in practice.Second, we introduce collapsing of rare haplotypes similar to other algorithms (Sinnwell & Schaid, 2018).In our case, collapsing destroys information about ambiguities, so that potential bias is introduced.We show in simulations that bias can be minimized when the collapsing threshold is kept low, that is, smaller than the frequency associated with a single occurrence of a haplotype in the present data ( n 1 2 ).Running the reconstruction with such a low threshold allows for faithful reconstruction, keeps the complexity low, and avoids inaccuracies when computing marginal frequencies, that is, computing AFs based on HTFs spanning several loci.The collapsing step makes a third modification necessary, which we call the profile EM.When ambiguous genotype calls are compatible with a rare, collapsed haplotype, full information about ambiguities cannot be retained.As the full vector of HTFs is updated in every iterative step, marginal HTFs of previously estimated loci can change, leading to accumulation of error in the EM-algorithm (spill-back, Section 2.2.4).We avoid this by fixing the frequency of haplotypes collapsed so-far in the EM-algorithm.After convergence, both the set of collapsed haplotypes and its frequencies are updated.We justify this step by using the knowledge that the collapsed haplotypes are indeed very rare (probably not present in the sample), so that this knowledge can be safely used in a subsequent step.In summary, we can reconstruct haplotypes in large data sets containing millions of potential haplotypes such as in the analysis of seven KIR genes.Estimates of HTFs are unbiased and are more accurate than those obtained with the MV approach.One stated goal was to "borrow" information across loci, that is, use loci in LD to reduce ambiguities at a given locus.Our simulations show that benefits are greatest when both the level of ambiguities and LD are high.As a caveat, we point out that we parameterized LD in terms of δ′ which is not directly comparable across pairs of loci.For the real KIR van der BURG ET AL.
| 19 data application, we could see improvements in reconstruction for less frequent haplotypes.The gain in overall accuracy therefore clearly depends on the data set and the intended downstream analysis.
The HTFs in the first simulation study are estimated with greater precision with the profile EM-algorithm than with the MV analysis.Increasing the ambiguity in the data reduces this precision, which can only partly be T A B L E 5 Information R 2 for KIR2DL1.
Note: Column (1) shows the HTFs estimated in the one-gene analysis where the amount of information in this analysis is denoted in column (2).The differences in the amount of information with an increasing number of genes in the combination, compared with this one-gene analysis, are denoted in columns (3)-( 8).The color spectrum indicates whether the amount of information increases (green) or decreases (red) in the combination with respect to the one-gene analysis.Completely white boxes indicate that there is no difference, whereas gray boxes indicate that the corresponding haplotype completely collapsed.
Abbreviation: HTFs, haplotype frequencies.compensated by high LD.In simulation study II, the amount of ambiguity was fixed and data-derived levels of LD were already present.Small LD alterations had no noticeable effect, although large adjustments can lead to biased HTF estimation.Both methods have a bias towards frequently occurring haplotypes.However, because the EM-algorithm gives a frequency vector of HTFs for each subject, the rare haplotypes are better represented than in the all-or-nothing approach of the MV method.The downside of this is observed in the information R 2 heatmaps, where the nonsimulated haplotypes benefit more from adding loci than the truly simulated haplotypes.This is especially the case with high LD and smaller amounts of ambiguities, although the impact of the nonsimulated haplotypes remains small.Higher ambiguity levels partly attenuate this effect.
We have also investigated the behavior of linear regression models with the reconstructed haplotypes as predictors and investigated KLD and RMSEs of parameter estimates.A motivation for this simulation component is that an ideal EM-algorithm should also take into account outcome data.Such an EM-algorithm should not be subject to elaborate regression modeling to allow algorithmically driven analysis, justifying a saturated model.We show that using reconstruction based on the EM-algorithm in certain cases leads to less accurate regression coefficient estimates as compared with the MV reconstruction.This surprising result can be mainly explained by the fact that the MV reconstruction method tends to ignore rare haplotypes, thereby performing an implicit penalization.Also, keeping relatively rare haplotypes in the model can induce collinearity by inducing negative correlations.This interpretation is supported by the fact that ill-conditioned design matrices most frequently appear for simulation scenarios with low LD and/or high ambiguity, both being cases where haplotype diversity is high.A major conclusion concerning model building is that models should be kept sparse or that a penalized model should be used, although for the latter statistical testing becomes challenging (Böhringer & Lohmann, 2022).

| Computational aspects and method performance
The presented profile EM-algorithm can deal with a high diversity of scenarios.It only requires a list containing all diplotypes that are compatible with an individual's genotype which can be prespecified or automatically constructed.Some computational cost has to be expected; Supporting Information Figure S10 displays the time and required memory to run the analyses with varying number of genes, sample size, and threshold values.As a proof-of-concept we have also run simulations with seven genes which took roughly 110 GiB memory and 24 h of running time per analysis.This proof-of-concept simulation study shows that the borrowing of information of the profile EM-algorithm continues with a higher number of genes in the analysis.Currently, the EMalgorithms are run starting with a uniform distribution on compatible diplotypes.Other values can be supplied, for example, by using estimated HTFs from previous reconstructions, which might speed up convergence.

| KIR analysis
One major motivation for this study is to better understand the role of KIR genes in allogeneic stem cell transplantation.Reconstruction of haplotypes with our EM-algorithm creates new opportunities to test for the impact of donor genotype information on patient outcomes.A comparison with HTF estimates from a previous, family-based study shows good agreement (Solloch et al., 2020).Discrepancies can be explained by sampling fluctuation, but also by differences in methodology, as our algorithm is the most comprehensive in handling ambiguities so far.Albeit more validation is desirable, we are not aware of public data sets which would contain all the ambiguities considered here.

| Future work
Several extensions of our current algorithm seem interesting.First, incorporating data with a known phase would be straightforward but has not yet been implemented.Second, we have assumed that missingness is random, that is, the chance of an ambiguity only depends on the genotype, but not on the underlying diplotype.In practice, this might not be the case.For example, a given diplotype might be more difficult to call as compared with another diplotype that is also compatible with the given genotype.Although conceptually this scenario is likely, we are not aware of experimental quantification of such calling biases with respect to diplotypes.Including such knowledge would lead to Bayesian models that could include the a priori calling biases for which previous knowledge would be required.
Third, the inclusion of outcome data into the reconstruction algorithm is interesting.This might further improve accuracy in the presence of an association signal.Such an extension would be challenging for at least two reasons.First, as our outcome simulations van der BURG ET AL.
| 21 show, outcome models might be difficult to fit when larger models are considered, requiring some regularization.Second, as HTFs are updated for all loci in every iteration of the algorithm, convergence issues might arise.We see this as a challenging but promising line of future research.
In conclusion, we have developed an EM-based algorithm to reconstruct haplotypes in genetically complex regions and demonstrated robust, unbiased estimation of HTFs in all relevant scenarios.

AUTHOR SUMMARY
For certain patients with hematological malignancies, stem cell transplantation is the only curative treatment.Important for this immunological procedure is a good match between patient and donor, thus a lot of research focuses on improving this match.However, the genotyping process of promising genes can be subject to uncertainties and ambiguities (e.g., that of the KIR gene region), making it difficult to evaluate their role in patient survival.In this article we introduce an algorithm that imputes these uncertainties and ambiguities by incorporating the information of neighboring genes.Because of the dimensionality of the problem three alterations had to be made to the classical algorithm to ensure valid probability estimation.In this introduction paper, we show that the probabilities estimated in our algorithm are comparable to those observed in an independent study.With reliable probabilities, the effect of gene regions that are difficult to genotype on patient survival can be mapped in a better and more comprehensive way, which can then potentially be included in the patient-donor selection criteria.

SOFTWARE
Software is available as an R package to conduct all analyses performed in this study (https://github.com/lljvanderburg/haploemcnv). We briefly outline the main functions: • all_options: to recode the per gene genotype calls into diplotype lists which are required as input for the analysis.• EM_algorithm: to run a single EM-algorithm round for the supplied data.• HT_reconstruction: to run the profile EM-algorithm for the supplied number of diplotype lists.Multiple parameters control the reconstruction, for example, the number of starting genes used and the threshold value below which rare haplotypes are collapsed.
At the moment, installation requires the devtools packages and is performed via: remotes::install_github("lljvanderburg/haploemcnv) Input has to be provided as character strings as multilocus genotypes as follows (cf.GLSC, 2022): + is the separation of gene copies (in genotype) and haplotypes (in diplotype).− is the separation of alleles that are in the haplotype phase.ˆis the separation of multiple alleles of the same gene on a single haplotype; the so-called CNVs./ denotes allelic ambiguity in the genotype call. denotes genotype ambiguity in the genotype call.
Where the digits of the last three points are increasing in the order in which the allelic variant is described.In this paper, the interest is on the allelic variants with nonsynonymous substitutions.Thus, alleles with only differences in the last two points are considered to be the same allelic variant.
The genomic organization of KIR haplotypes is structured relative to the fixed position of the so-called framework genes.KIR3DL3 and KIR3DL2 mark the boundaries of the KIR gene cluster and KIR3DP1 and KIR2DL4 are placed centrally, between which a recombination hotspot is located (Jiang et al., 2012).This divides the gene cluster into a centromeric and a telomeric motif, with each motif expressing a specific set of KIR genes (Cooley et al., 2010).

APPENDIX B: REPARAMETRIZATION
Assume that the set of loci is partitioned into K − 1 loci, the haplotypes of which are indexed by i and a single, remaining locus, the alleles of which are indexed by j.HTFs at all loci are therefore given by the vector π ( ) , ) .We can now derive marginal HTFs and LD parameters as follows: The mapping ↦ π π π δ ( ) (( ) , ( ) ), ( ) ) i j i j i i j j i j i j ( , ) ( , ) , ( , ) is bijective and thereby defines a reparametrization.

APPENDIX C: INFORMATION R 2
To derive the R 2 , we focus on a single locus and dichotomize the problem by analyzing one allele and considering all other alleles to be the alternative allele.This implies that for each allele a separate R 2 is calculated.Let X X X = ( , … ) N 1 a dichotomized genotype vector for individuals i N = 1, …, , where X X X X = ( , , ) are the expected genotype counts (0, 1, 2) for individual i.Under HWE assumptions and with AF θ, the likelihood is then given by X θ θ θ ( , ) = ( , ) = 2 + − 2 (1 − ) .
For the second derivative of the log-likelihood we have θ X X θ θ .

F
I G U R E 3 Comparison of HTF estimation.Line graphs of sRMSEs for HTFs for the five scenarios of simulation study II with (a) positive LD or (b) negative LD added to all haplotypes.The scenario with no alterations is plotted in both (a) and (b).
T A B L E 3 Information R 2 simulation study II.
Genotype information and ambiguities for the seven KIR genes, given separately.

1
Profile EM-algorithm for full reconstruction.π ˆcontains MLEs for the starting gene.Collapsed versions are stored in π ˆcoll .loci in reconstruction _ _ contains the currently reconstructed set of loci.i iterates the number of loci and the iteration indexed by l performs the search for the next locus to choose.T l stores statistic (7) for the locus candidates.l _ next gene is the number of the locus to continue the reconstruction with.Collapsing haplotype frequencies (HTFs) imply collapsing of haplotypes in the data.EM, expectation-maximization; MLE, maximum likelihood estimate.
van der BURG ET AL.