Author contributions: P.G.: conception and design, collection and/or assembly of data, data analysis and interpretation, and manuscript writing; L.G.: collection and/or assembly of data, data analysis and interpretation, and final approval of manuscript; M.G.: data analysis and interpretation and final approval of manuscript; and M.P.: conception and design, data analysis and interpretation, and manuscript writing.
Disclosure of potential conflicts of interest is found at the end of this article.
First published online in STEM CELLS EXPRESS November 1, 2011.
Among the tools of regenerative medicine, induced pluripotent stem cells (iPSCs) are interesting because the donor genotype can be selected. The construction of banks of iPSC lines selected from human leukocyte antigen (HLA) homozygous donors has been proposed as an effective way to match a maximal number of patients receiving cell therapy from iPSC lines. However, the effort required to constitute such a bank for worldwide application has remained unexplored. We developed a probabilistic model to compute the number of donors to screen for constituting banks of best-chosen iPSC lines with homozygous HLA haplotypes (haplobanks) in four ancestry backgrounds. We estimated what percentage of the patients would be provided with single HLA haplotype matched cell lines. Genetic diversity leads to different outcomes for the four populations on every measure. A bank comprising iPSC lines representing the 20 most frequent haplotypes in each population would require screening quite different numbers of donors, between 26,000 for European Americans and 110,000 for African Americans. It would also match different fractions of the recipient population, namely, more than 50% of the European Americans and 22% of African Americans. Similarly, a bank comprising the 100 iPSC lines with the most frequent HLA haplotypes in each population would leave out only 22% of the European Americans, but 37% of the Asians, 48% of the Hispanics, and 55% of the African Americans. The constitution of a haplobank of iPSC lines is achievable through a large-scale concerted worldwide collaboration. STEM CELLS 2012; 30:180–186.
As pluripotent stem cells (PSCs) are virtually immortal and capable of giving rise to any cell phenotype, they potentially provide almost unlimited resources for cell therapy. Among them, induced PSCs (iPSCs) [1, 2] are a specific type of PSC produced from somatic cells of donors. The ability to select the donors before developing iPSC lines opens the possibility of choosing discrete genetic characteristics. Although autologous customized iPSC grafts have been envisioned, they do not seem either technically or economically sound. Allogeneic alternatives are considered [3–5] based on banks of iPSC lines from specific donors selected to limit immune cell responses.
Among various determinants, the degree of immune response to allogeneic transplants strongly correlates with the extent of matching of human leukocyte antigen (HLA) genes. HLA genes are codominantly expressed and the ones deemed the most important for matching are HLA-A, HLA-B, and HLA-DRB1. Matching unrelated donor iPSC lines with recipients for each HLA allele seems beyond reach because billions of phenotypes are theoretically possible. Indeed, attempts to match donors and patients in unrelated hematopoietic stem cell (HSC) transplantation have shown that even several million donors would not be enough. One way to considerably decrease this number, possibly down to a few thousand cell lines, has been proposed by Taylor et al., who stressed the value of hemi-similarity matching using cell lines from donors exhibiting homozygosity for each of the HLA genes. Because of the high linkage disequilibrium between HLA genes, this situation would actually consist of matching only one of the two HLA haplotypes of the recipient, with no allele mismatches for the other haplotype owing to the (haplo)-homozygosity of the donor cell line's HLA (Supporting Information Fig. S1). The repository of the cell lines would thus be a bank of iPSC lines with homozygous HLA haplotypes, that is, a haplobank.
Preliminary studies of HLA allele frequency data from selected populations [4, 6, 7] surprisingly suggested that collecting cell lines from a few dozen so-called triple-homozygous donors (i.e., donors homozygous for the HLA-A, HLA-B, and HLA-DRB1 genes) would be sufficient to provide HLA matches for most patients in Japan and the U.K. However, how such numbers would change when one considers populations with less constrained ancestry backgrounds has remained unexplored up to now.
In this study, we explored the size, optimal content, and feasibility of an iPSC bank providing a single haplotype match, with reference to human genetic diversity. We developed mathematical models using HLA haplotype frequencies from large unrelated HSC donor registries in the U.S. These models identify the most beneficial recruitment of haplo-homozygous donors; we used them to evaluate the matching properties of the corresponding iPSC banks for populations with different ancestry backgrounds and their benefit for patients, leading to a probabilistic model for an international repository of iPSC lines.
The mathematical model developed here to guide the construction and evaluation of haplobanks of iPSC lines is based on HLA haplotype frequencies. A haplotype describes the co-occurrence of genetic polymorphisms of HLA antigens/alleles along a single chromosome. HLA haplotype frequencies were estimated using a maximum likelihood method implemented in an Expectation Maximization algorithm. The same method was applied to four large samples of unrelated individuals spanning genetic ancestry backgrounds commonly encountered in North America, namely, African American (AFA; 2N = 4,778), Asian American (ASI; 2N = 3,516), European American (EU; 2N = 15,734), and Hispanic American (HIS; 2N = 3,968). HLA typing was performed at high resolution by sequence-based techniques, which means that the result is defined as a set of alleles that specify and encode the same protein sequence for the peptide binding region of an HLA molecule and that excludes alleles that are not expressed as cell-surface proteins. Publicly available aggregated frequency data were used. Additional details about recruitment of the participants, HLA typing, haplotype estimation, and ancestry assessment can be found in Maiers et al.
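To make the estimation step concrete, the following is a minimal Python sketch of an Expectation Maximization loop for haplotype frequencies. It is a toy version restricted to two loci and is not the software used for the registry data; the function name `em_haplotype_frequencies`, the allele labels, and the fixed iteration count are all assumptions made for illustration.

```python
from collections import defaultdict


def em_haplotype_frequencies(genotypes, n_iter=50):
    """Toy EM for two-locus haplotype frequencies (illustrative only).

    genotypes: list of ((a1, a2), (b1, b2)) unphased genotypes, one per individual.
    Returns a dict mapping each haplotype (a, b) to its estimated frequency.
    """
    # List the distinct haplotype pairs compatible with each unphased genotype.
    resolutions = []
    for (a1, a2), (b1, b2) in genotypes:
        pairs = {
            tuple(sorted([(a1, b1), (a2, b2)])),
            tuple(sorted([(a1, b2), (a2, b1)])),
        }
        resolutions.append(sorted(pairs))

    haplotypes = sorted({h for res in resolutions for pair in res for h in pair})
    freqs = {h: 1.0 / len(haplotypes) for h in haplotypes}  # uniform start
    n_chromosomes = 2 * len(genotypes)

    for _ in range(n_iter):
        counts = defaultdict(float)
        for res in resolutions:
            # E-step: weight each compatible pair by the product of its current
            # haplotype frequencies (Hardy-Weinberg proportions assumed).
            weights = [freqs[h1] * freqs[h2] for h1, h2 in res]
            total = sum(weights)
            for (h1, h2), w in zip(res, weights):
                counts[h1] += w / total
                counts[h2] += w / total
        # M-step: new frequencies are the expected haplotype counts per chromosome.
        freqs = {h: counts[h] / n_chromosomes for h in haplotypes}
    return freqs


# Hypothetical data: two unambiguous homozygotes and one double heterozygote.
example = [
    (("A1", "A1"), ("B1", "B1")),
    (("A2", "A2"), ("B2", "B2")),
    (("A1", "A2"), ("B1", "B2")),
]
print(em_haplotype_frequencies(example))
```

In this hypothetical example the ambiguous double heterozygote is resolved almost entirely as A1-B1 / A2-B2, because those haplotypes are supported by the unambiguous individuals. A three-locus, high-resolution analysis such as the one described above follows the same logic with many more compatible pairs per individual.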
The detailed formulae to compute the five parameters of the mathematical model are given in Table 1. These are indexed by haplotype frequencies sorted in descending order. “Ni” represents “the discovery sample size,” that is, how many individuals should be screened to find an individual haplo-homozygous for a given haplotype. All other parameters measure the ability of a bank of iPSC lines to match patients. “Ki” represents “the raw benefit,” that is, the proportion of the patient population that may benefit from the iPSC line of the ith most frequent haplotype. “Oi” is “the overlap,” that is, the proportion of the patient population benefiting from the iPSC line of the ith most frequent haplotype that would also benefit from an iPSC line of a more frequent haplotype (and, therefore, should not be counted twice). Because patients with two frequent haplotypes would benefit from either of the two corresponding cell lines (Ci), the usefulness of a haplobank is less than the sum of the usefulness of each of its iPSC lines. Oi is only used to compute a related cumulative parameter, “Ci,” the “multiple choice,” which is the percentage of the patient population that would benefit from more than one of the iPSC lines for the i most frequent HLA haplotypes. Illustrated examples are presented in Supporting Information Figure S2 for the three most frequent European American haplotypes. Finally, “Si” is the “cumulated adjusted benefit,” that is, the proportion of the patient population that would benefit from at least one iPSC line contained in a bank encompassing the i most frequent HLA haplotypes.
Table 1. Definition of the main parameters modeling the human leukocyte antigen component of haplo-homozygous induced pluripotent stem cell line banks
The model assumes that matching is achieved when at least one haplotype is shared between the iPSC line and the recipient. The HLA haplotype comprises HLA-A and HLA-B in the class I region and HLA-DRB1 in the class II region. Alleles at other HLA genes were not considered (Supporting Information Fig. S1). All estimations are expected numbers of donors or expected percentages of the patient population. All computations were performed in R statistical software; modeling was achieved using functions that compute the model from any haplotype distribution.
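The quantities described above can be approximated with a few lines of Python. This is a sketch under Hardy-Weinberg assumptions and does not reproduce the formulae of Table 1, which are not shown here; the function names and the example frequencies are hypothetical.

```python
def discovery_sample_size(f):
    """Expected number of individuals screened per homozygote: 1 / f^2."""
    return 1.0 / (f * f)


def carrier_fraction(f):
    """Fraction of patients carrying at least one copy of a haplotype of frequency f."""
    return 1.0 - (1.0 - f) ** 2


def cumulative_coverage(freqs):
    """Fraction of patients carrying at least one of the banked haplotypes."""
    total = sum(freqs)
    return 1.0 - (1.0 - total) ** 2


def two_line_fraction(freqs):
    """Fraction of patients carrying two different banked haplotypes."""
    total = sum(freqs)
    return total ** 2 - sum(f * f for f in freqs)


# Hypothetical haplotype frequencies, sorted in descending order.
bank = [0.07, 0.03, 0.02]
for f in bank:
    print(f, round(discovery_sample_size(f)), round(carrier_fraction(f), 4))
print("coverage:", round(cumulative_coverage(bank), 4))
print("two lines available:", round(two_line_fraction(bank), 4))
```

Under these assumptions, the sum of the per-line carrier fractions exceeds the cumulative coverage by exactly the fraction of patients who carry two different banked haplotypes, which is why the benefit of a bank is smaller than the sum of the benefits of its individual lines.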
Figure 1 presents the computations of the discovery sample size Ni, that is, the size of the donor population of each of the four ancestry backgrounds that should be screened to find haplo-homozygous donors for the 20 most frequent haplotypes (see Supporting Information Table S1 for numerical details). Estimations clearly differ from one population to another. The EU population was altogether the most genetically homogeneous. A donor homozygous for the most frequent haplotype (HLA-A*01:01–HLA-B*08:01–HLA-DRB1*03:01) was expected to be found in a screening sample as small as approximately 182 individuals. For the 10 most frequent haplotypes, the screening sample size was 10,952 and for the 20 most frequent haplotypes, 26,528. In the population of Asian ancestry (ASI), potential donors homozygous for the 10 most frequent haplotypes were predicted to occur in a screening sample of 15,838 individuals; by contrast, samples almost twice as large were needed to identify donors homozygous for the 10th to 20th most frequent haplotypes. In the population of Hispanic ancestry (HIS), the distribution was again different: the 10 most frequent haplotypes required screening 32,640 individuals, and the next 10 haplotypes in decreasing order of frequency required a comparable number. Finally, the largest screening sample size at all points was predicted for the population of African origin (AFA): 111,326 unrelated individuals would be needed to identify haplo-homozygous donors for the 20 most frequent haplotypes.
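These screening sizes are expected values, so it may help to see what they imply in terms of probability. The Python sketch below assumes that a discovery sample size is the reciprocal of the homozygote frequency and that screened individuals are independent; both are assumptions of this illustration, and the function names are hypothetical.

```python
import math


def prob_at_least_one(n_screened, p_homozygous):
    """Probability of finding at least one homozygote among n_screened individuals."""
    return 1.0 - (1.0 - p_homozygous) ** n_screened


def sample_size_for_confidence(p_homozygous, confidence=0.95):
    """Smallest sample size giving the requested chance of at least one homozygote."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_homozygous))


# Assumed homozygote frequency for a haplotype with a discovery sample size of 182.
p = 1.0 / 182
print(round(prob_at_least_one(182, p), 2))  # roughly 0.63
print(sample_size_for_confidence(p))        # about three times the expected size
```

Under these assumptions, screening exactly the expected number of individuals succeeds only about two times out of three, so a recruitment plan that needs a donor with high confidence would have to screen roughly three times as many.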
The Ki parameter, that is, the proportion of patients who would benefit from the haplo-homozygous cell line for the ith most frequent haplotype, provides a different view of the interpopulation genetic heterogeneity (Fig. 2A). Although all Ki distributions are exponentially decreasing, there is a striking difference between the results for the EU population and all others, mainly for the 10 most frequent haplotypes, which are much more represented in the former population than in the others. The most frequent haplotype would, by itself, give single haplotype matching to 14.8% of the EU population and the second 6.93%, with the following ones showing progressively decreasing values. The curves for the other three populations are relatively similar to one another, with HIS and AFA being almost superimposed. In all three of those cases, none of the iPSC lines would serve 5% of the population. As a consequence, cumulative values (Fig. 2B) show that 20 iPSC lines would match more than half the EU population (53%), whereas a similar number of cell lines would match only 34% in the population of Asian ancestry, 25% in Hispanics, and 22% in African Americans. Beyond the range of the graph, the model suggests that a bank comprising the 100 iPSC lines with the most abundant haplotypes would leave without a match only 22% of the EU but 37% of the ASI, 48% of the HIS, and more than 55% of the AFA patients (Supporting Information Table S1).
All computations of Si in Figure 2B were obtained after introducing a correction factor termed the overlap factor, Oi, which is shown cumulatively as Ci in Figure 2C. Whereas some patients will not have any cell line with a single haplotype match (the corresponding percentage can be computed from Fig. 2B as 1 − Si), a fraction Ci of patients can benefit from two different cell lines, one corresponding to each of their two haplotypes. Figure 2C suggests that this effect mainly affects estimations for the EU population because 8.85% of that population will show such an overlap with the bank of the 20 most abundant cell lines. Because of their wider genetic diversity, overlap is less consequential in populations with other ancestry backgrounds (for 100 iPSC lines, 10.44% [HIS], 15.20% [ASI], and 7.40% [AFA]; Supporting Information Table S1).
In the four ancestry backgrounds, Figure 3 plots the relationship between Ni, the number of individuals screened to identify haplo-homozygous individuals, and Si, the proportion of the population that would benefit from the iPSC lines derived from the haplo-homozygous donors found. While 50% of the European population would be served by a haplobank with 17 iPSC lines resulting from the screening of 22,000 European donors, screening 100,000 individuals would not provide a match for 50% of the other populations (35% [HIS], 45% [ASI], and 22% [AFA]; Supporting Information Table S1).
How well the best-ranked cell lines from one ancestry background matched those of another was considered next. The extent of haplotype sharing, taking the 20 most frequent EU haplotypes as the reference, is given in Table 2. Complementary computations taking as a reference the 20 most frequent haplotypes in the Hispanic, Asian, and African populations are provided in Supporting Information Tables S2, S3, and S4, respectively. The most striking result of these computations is the population specificity of the most frequent haplotypes. Only 13 of the 20 most frequent EU haplotypes were present in the 50 most frequent haplotypes in Hispanics, eight in AFAs, and three in Asians. Six of the 20 best EU haplotypes were altogether not present in Asians and most others were very rare. Similar results were obtained when Hispanics were used as the reference, with six haplotypes shared with AFAs and only two with Asians (and 12 altogether not observed), or when Asians were used as the reference, in which case there was no match with AFAs (and 12 altogether not observed).
Table 2. Extent of sharing of the most frequent HLA-A, -B, and -DRB1 haplotypes between European American ancestry and other ancestry backgrounds
Looking at the haplotypes in detail, HLA-A*01:01–HLA-B*08:01–HLA-DRB1*03:01, HLA-A*03:01–HLA-B*07:02–HLA-DRB1*15:01, and HLA-A*29:02–HLA-B*44:03–HLA-DRB1*07:01 were frequent in Americans of European, Hispanic, and African ancestries. HLA-A*33:01–HLA-B*14:02–HLA-DRB1*01:02 and HLA-A*23:01–HLA-B*44:03–HLA-DRB1*07:01 were additionally shared by Americans of European and Hispanic ancestries. The only matches with the Asian population were HLA-A*01:01–HLA-B*57:01–HLA-DRB1*07:01 and HLA-A*30:01–HLA-B*13:02–HLA-DRB1*07:01, which were also frequent in Americans of European and Hispanic ancestry.
We took the opportunity to test our predictions on a fully independent dataset from another study in which 500 individuals of EU ancestry were genotyped for HLA (four loci at high resolution). All results were consistent with the model. Four haplo-homozygous individuals were found (0.8%), representing three of the five most frequent haplotypes predicted by our analyses in a population of European descent, namely, two for HLA-A*01:01–B*08:01–(–C*07:02) DRB1*03:01 (2.7 expected by the model among 500 individuals); one for HLA-A*03:01–B*07:02–(–C*07:02) DRB1*15:01 (0.6 expected); and one for HLA-A*29:02–B*44:03–(–C*16:02) DRB1*07:01 (0.2 expected).
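The first expected count can be checked by simple arithmetic from the discovery sample size reported earlier (approximately 182 for the most frequent EU haplotype). The Python sketch below assumes that the expected number of homozygotes in a sample is the sample size divided by the discovery sample size, and uses a Poisson approximation to gauge how surprising an observed count is; the function names and the Poisson assumption are introduced here for illustration only.

```python
import math


def expected_homozygotes(n_individuals, discovery_sample_size):
    """Expected homozygote count, assuming one homozygote per discovery sample size."""
    return n_individuals / discovery_sample_size


def poisson_prob_at_least(k, mean):
    """P(X >= k) for a Poisson count with the given mean."""
    below = sum(math.exp(-mean) * mean ** j / math.factorial(j) for j in range(k))
    return 1.0 - below


print(round(expected_homozygotes(500, 182), 2))  # 2.75, in line with the 2.7 quoted
print(round(poisson_prob_at_least(2, 2.7), 2))   # chance of seeing two or more
print(round(poisson_prob_at_least(1, 0.2), 2))   # chance of seeing one or more
```

Under this approximation, observing one homozygote where 0.2 is expected is unusual but not implausible, which is compatible with describing this small validation as consistent rather than conclusive.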
The main result of this article is the determination of the size and specific content of a bank of iPSC lines that would best cover the needs for hemi-similar HLA matching in cell therapy for multiple genetic origins likely to represent a large proportion of the world population. Using an innovative probabilistic model in four ancestry backgrounds, calculations were based on the paradigm of hemi-similarity HLA matching achieved by a single haplotype, as provided by iPSC lines from donors who are homozygous for HLA-A, HLA-B, and HLA-DRB1. Our estimations indicate that such a paradigm may indeed provide a basis for constructing cell banks of fewer than 100 lines that would meet the needs of the majority of the population in different ancestry backgrounds, confirming the theoretical models proposed previously [4, 7]. However, our comparative analysis of the haplotypes that would be the most abundant in the four populations studied underscores the need for active international collaboration to meet the challenge of the extreme immunogenetic diversity of humans.
Haplotype frequencies have been used for managing various aspects of bone marrow donor registries [11, 12]. We have derived our analysis of HLA haplotype frequencies from those established applications to develop a population genetics-based model for estimating the need for banks of iPSC lines. The presented probabilistic model can use any source of HLA haplotype distributions. In this study, we favored the comparison of a single source of data; all samples were recruited within the U.S. National Marrow Donor Program. HLA typing was performed by the same methods, sample sizes were ≥2,000 individuals, and estimation of haplotype frequencies was achieved by the same maximum likelihood software. There are no analytical methods to provide confidence intervals for parameters derived from haplotype frequencies estimated by maximum likelihood with an Expectation Maximization algorithm; a Gaussian approximation of standard errors for proportions would neglect both the variability due to maximum likelihood estimation and sampling fluctuations [13, 14]. Nevertheless, additional sources of haplotype frequencies are available and could be used as initial parameters by our model. HLA haplotype frequencies have been analyzed to provide fine views of HLA diversity worldwide, for European or Asian populations [16–18] as well as other ethnic backgrounds that were not considered in the present study. As shown in bone marrow donor recruitment, regional diversity may occur within a given area of donor recruitment, and it can be taken into account to optimize recruitment strategies [20, 21]. Our observation of the very limited sharing of HLA polymorphism between ancestry backgrounds is fully consistent with numerous previous descriptions. Although not analyzed in the same way, this polymorphism is still used for tracing movements of human populations within the framework of anthropologic studies [23, 24].
Although the most frequent haplotypes are almost all specific to a given ancestry, it is worth noting that these haplotypes are often found in several populations with very different and contrasting frequencies. Our study and the existing HLA population genetic data support a strategy of targeted recruitment in regions with ancestry-specific HLA haplotype enrichment. This would efficiently contribute to overall haplobank content optimization and would be particularly beneficial to patients with mixed ancestry background.
Our mathematical model has technical limits, particularly related to the frequencies of HLA haplotypes. The model is particularly robust for the most frequent haplotypes, and the exponentially decreasing distribution of haplotype frequencies is common to all ancestry backgrounds, but estimations of the lowest-frequency HLA haplotypes are affected by sampling fluctuations. As a consequence, estimations of the model for rarer haplotypes may be less accurate and more difficult to validate. Our limited empirical validation of the model-based estimated counts additionally indicates that the use of haplotype frequencies at high resolution implies a match for additional loci of the major histocompatibility complex (MHC) region (6p21.3); in our data, all the HLA-A, HLA-B, and HLA-DRB1 haplo-homozygous donors are also homozygous for HLA-C.
To our knowledge, hemi-similarity matching was first presented in the context of the use of PSC lines in regenerative medicine by Bradley et al. in 2002. This concept highlights the fact that a bank of human pluripotent HLA homozygous stem cells may help reduce the number of donor cell lines needed to get around the issue of HLA mismatch with any recipient (Supporting Information Fig. S1). Indeed, in those conditions only half the HLA alleles will constrain the matching, which drastically increases the odds of finding a donor cell line for any recipient. For example, in the previously published population of EU ancestry that was used in the present study, the 4,801 most frequent phenotypes (resulting from 5,050 different pairs of the 100 most frequent haplotypes) only represented 30.3% of the population. By contrast, hemi-similarity by single haplotype for cell lines from the same 100 most frequent haplotypes matched approximately 120,000 phenotypes, representing 78% of the entire population. However, it is important to mention that this concept does not apply to HSC transplantations because of the graft-versus-host effect; the effector cells of the immune system derived from an HLA homozygous donor cell line would recognize as nonself the HLA alleles of the recipient that are not involved in the hemi-similarity match.
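The figure of 5,050 pairs follows from counting unordered pairs of 100 haplotypes, homozygous combinations included, that is, 100 × 101 / 2. A short Python check is given below; the function name is chosen here for illustration.

```python
from itertools import combinations_with_replacement


def count_unordered_pairs(n_haplotypes):
    """Number of unordered haplotype pairs, homozygous pairs included."""
    return sum(1 for _ in combinations_with_replacement(range(n_haplotypes), 2))


print(count_unordered_pairs(100))  # 5050
print(100 * 101 // 2)              # closed form, same value
```

The number of distinct phenotypes (4,801) is smaller than the number of pairs because different haplotype pairs can carry the same set of alleles and therefore produce the same phenotype.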
Building on their conceptual proposal that triple-homozygous PSC lines may be instrumental to overcome the problem of immune rejection, Bradley and coworkers published a seminal paper in 2005 that revealed large biases introduced by population effects. Accordingly, the number of homozygous haplotypes required to meet the need of a large proportion of the British population appeared relatively small and altogether not beyond reach of a dedicated effort. A similar result was presented later by Nakatsuji et al. for the Japanese population.
However, these pioneering works remained speculative in the absence of a viable technique to choose donors with a specified genotype of interest to create PSC lines of the desired haplotypic phenotypes. Potential techniques to obtain a selected set of triple-homozygous cell lines, for example, parthenogenesis or nuclear transfer, were then still unproven. Nakatsuji et al. rightfully underlined a major change in that situation as soon as the groups of Yamanaka and Thomson obtained human iPSCs by genetic reprogramming, since iPSC lines can indeed be produced experimentally from any selected donor. Whether iPSCs may rapidly be used for cell therapy and regenerative medicine is under discussion: there is still a debate about the extent of changes in gene expression introduced by the reprogramming process, including some that may elicit immune responses, and there are calls for additional bench work before iPSCs enter a clinical trial phase. In addition, HLA haplotypes themselves may contribute to modifications in gene expression, as a recent study using hybrid tiling and splice junction microarrays suggests that haplotype-specific variation occurs in the spliceo-transcriptome of the MHC. Nevertheless, there is general agreement that iPSCs eventually will reach the clinic, because reprogramming techniques are evolving toward nonintegrative genetic and even nongenetic methods. Therefore, it appears timely to establish how a bank could be built in the most economically sustainable manner, highlighting how the optimal use of the concept of HLA “hemi-similarity” would meet the overall need of the world population.
Our results clearly highlight the feasibility of a haplobank approach. However, our results also underline both a very different intragroup frequency of HLA haplotypes, depending on the ancestry background under study, and a wide intergroup HLA haplotype diversity between the four large ancestry backgrounds analyzed. Therefore, our results clearly call for a concerted international effort of regional recruitment for creating a common haplobank that would be a clinically relevant biologic resource eventually allowing worldwide implementation of regenerative medicine.
We acknowledge Jorge Oksenberg for critical reading of the manuscript and providing HLA data to empirically validate the model generated through grant NIH-U19AI067152.