The HLA‐B landscape of Africa: Signatures of pathogen‐driven selection and molecular identification of candidate alleles to malaria protection

Human leukocyte antigen (HLA) genes play a key role in the immune response to infectious diseases, some of which are highly prevalent in specific environments, like malaria in sub‐Saharan Africa. Former case–control studies showed that one particular HLA‐B allele, B*53, was associated with malaria protection in Gambia, but this hypothesis was not tested so far within a population genetics framework. In this study, our objective was to assess whether pathogen‐driven selection associated with malaria contributed to shape the HLA‐B genetic landscape of Africa. To that aim, we first typed the HLA‐A and ‐B loci in 484 individuals from 11 populations living in different environments across the Sahel, and we analysed these data together with those available for 29 other populations using several approaches including linear modelling on various genetic, geographic and environmental parameters. In addition to relevant signatures of populations’ demography and migrations history in the genetic differentiation patterns of both HLA‐A and ‐B loci, we found that the frequencies of three HLA alleles, B*53, B*78 and A*74, were significantly associated with Plasmodium falciparum malaria prevalence, suggesting their increase through pathogen‐driven selection in malaria‐endemic environments. The two HLA‐B alleles were further identified, by high‐throughput sequencing, as B*53:01:01 (in putative linkage disequilibrium with one HLA‐C allele, C*04:01:01:01) and B*78:01 in all but one individuals tested, making them appropriate candidates to malaria protection. These results highlight the role of environmental factors in the evolution of the HLA polymorphism and open key perspectives for functional studies focusing on HLA peptide‐binding properties.


| INTRODUCTION
The genetic history of Africa is one of the most fascinating in evolutionary genetics. First, this continent is assumed to be the homeland of modern humans according to both fossil records going back between 195,000 and 300,00 years (Aubert et al., 2012;Hublin et al., 2017;McDougall, Brown, & Fleagle, 2005;Richter et al., 2017) and genetic evidence (Campbell, Hirbo, Townsend, & Tishkoff, 2014; for a review); second, it is currently inhabited by 1.2 billion people speaking more than 2,000 languages (Lewis, Simons, & Fennig, 2016) and living in highly diverse environments, from arid deserts to savannahs, tropical rain forests and extended highlands. The genetic structure of African populations is thus expected to reveal highly heterogeneous genetic signals due both to a complex demographic history characterized by thousands of years of populations' migrations, assimilations and/or replacements (Excoffier, Pellegrini, Sanchez-Mazas, Simon, & Langaney, 1987;Poloni et al., 2009;Tishkoff et al., 2009), and to various kinds and/or intensities of natural selection operating in response to peculiar pathogenic environments and resources conditioning human survival and lifestyles (Gomez, Hirbo, & Tishkoff, 2014;Triska et al., 2015).
In this context, the analysis of the major histocompatibility complex (MHC)-encoded molecules, named human leukocyte antigens (HLA) in our species, is particularly stimulating as they assume vital functions in the organism by presenting pathogen-derived peptides to T cells as the main initial step of the adaptive immune response (Parham & Janeway, 2015). This is the reason why the huge level of polymorphism observed at HLA genes is generally regarded as a consequence of balancing selection favouring heterozygous individuals in pathogen-rich environments, an early hypothesis (Doherty & Zinkernagel, 1975;Hughes & Nei, 1988) that was further sustained, at the population level, by neutrality tests (Buhler & Sanchez-Mazas, 2011;Solberg et al., 2008) and correlation studies (Prugnolle et al., 2005;Qutob et al., 2012;Sanchez-Mazas, Lemaitre, & Currat, 2012).
However, single HLA (more generally, MHC) loci may also display reduced levels of diversity that would not necessarily reduce the fitness of individuals thanks to the high polymorphism existing at other loci (Buhler, Nunes, & Sanchez-Mazas, 2016;de Groot & Bontrop, 2013). In addition, specific HLA alleles are known or supposed to confer resistance or susceptibility to various auto-immune (e.g., IDDM, coeliac disease) and infectious (e.g., HIV, tuberculosis, malaria) diseases (reviewed by Naranbhai & Carrington, 2017;Sollid, 2017;and Trowsdale & Knight, 2013;among others), indicating that directional selection may also drive the evolution of HLA genes. As an example, the HLA-DRB1*12:02:01 allele has been suggested to have increased in frequency in a Mongolian population that migrated from the north to the southwest of China about 700 years ago as a consequence of its settlement in a new environment where this allele would have been protective (Sun et al., 2015).
In Africa, a study conducted some 25 years ago in Gambia showed that one particular allele of the HLA-B locus, HLA-B*53, exhibited higher frequencies in a healthy control sample than in patients suffering from malaria and concluded that this allele was protective against Plasmodium falciparum (P. falciparum hereafter) infection (Hill et al., 1991). However, probably because HLA data for African populations remained very scarce since then, it is not known whether this putative protection had significant effects on HLA-B frequencies in this continent. Actually, the patterns of HLA genetic variation observed either worldwide or within continents are generally well predicted by geography (Buhler & Sanchez-Mazas, 2011;Cao et al., 2004;Currat, Poloni, & Sanchez-Mazas, 2010;Di & Sanchez-Mazas, 2011, 2014Di, Sanchez-Mazas, & Currat, 2015;Mack et al., 2000;Sanchez-Mazas, 2001; Sanchez-Mazas, , suggesting that natural selection did not blur the signatures of human migration history. However, because pathogens are also geographically distributed, demographic history and natural selection may generate confounding effects that may be difficult to disentangle on simple patterns of interpopulation differentiation (Meyer, Single, Mack, Erlich, & Thomson, 2006). This motivates the use of more powerful approaches where various kinds of genetic, geographic and environmental parameters are simultaneously taken into account.
In this study, our objective was to assess whether directional selection driven by P. falciparum infection could have shaped the HLA genetic landscape of Africa, particularly at the HLA-B locus where one allele, HLA-B*53, was suggested to be protective to malaria. To address this question, we first typed the HLA-A and -B loci in almost 500 individuals from 11 populations located in distinct areas of the Sahel (transitional ecoregion lying between the Sahara Desert to the north and the wooded savannas to the south) where HLA variation had seldom been investigated-thus largely improving the data set suitable for this study-and where malaria exhibits contrasted levels of endemicity. We then analysed these new data together with those published for 29 other populations from different regions of Africa through several approaches including linear modelling, which allowed to take into account the prevalence of P. falciparum malaria across the whole continent in addition to other genetic, geographic and environmental parameters. We analysed HLA-A simultaneously to HLA-B in the same population samples because the former locus is assumed to be less intensely submitted to selection than the latter (Buhler & Sanchez-Mazas, 2011;Di et al., 2015;Sanchez-Mazas et al., 2013;Solberg et al., 2008) and is thus expected to reveal contrasted results in case of substantial selection acting on HLA-B. Finally, because our results uncovered several HLA alleles as putative targets of P. falciparum-driven selection, we determined the precise nucleotide sequence of these alleles using a highthroughput sequencing technology.

| DNA extraction and HLA typing (Sahel set)
All individuals from the Sahel set were healthy unrelated volunteers who gave their informed verbal consent for the study and whose sampling was authorized by the appropriate authorities of their respective country, as described elsewhere (Podgorn a et al., 2015).
DNA was extracted from saliva samples using Oragene TM DNA (OG 500) collection kits as described in the protocols provided by the supplier of the technology used (DNA Genotek, Ottawa, ON, Canada) and typed for HLA-A and -B exons 2 and 3 (peptide-binding region) by reverse PCR-SSO typing on microbeads arrays (LABType, One Lambda Inc., Canoga Park, CA, USA). Whenever necessary, ambiguities were resolved by PCR-SSP (Olerup, Stockholm, Sweden).
In a further step, B*53-, B*78-and A*74-positive individuals (except some B*78 and A*74 carriers for which initial typings already revealed the precise allele) were retyped by a high-throughput sequencing technology (Holotype HLA X2, Omixon Biocomputing Ltd, Budapest, Hungary and NXType NGS, One Lambda Inc., Canoga Park, CA, USA) to identify the exact molecular variant.

| Environmental parameters
A detailed map of P. falciparum malaria prevalence in Africa provided by the Malaria Atlas Project (http://www.map.ox.ac.uk Gething et al., 2011) was used in this study. This map was drawn from a database of P. falciparum parasite rates (PfPR) corresponding to the proportions of individuals sampled in each location for which the pathogen was detected in the blood, standardized for an age range comprised between 2 and 10 years old. The data were then interpolated to provide these values in a grid of 5 9 5 km illustrated by a continuous coloured map for the whole continent. To analyse the relationship between HLA allele frequencies and P. falciparum malaria prevalence in Africa, we assigned PfPR values to the locations of the 40 population samples used in this study (Table S1). This assignment step was made using the QGIS  (i.e., before malaria control efforts were undertaken, Bhatt et al., 2015) to account for the fact that the samples analysed in this study were composed (to few exceptions) of individuals born before year 2000. The map is shown in Figure 1, where the 40 HLA-typed populations are also located based on their geographic coordinates.
To assess a possible relationship between the HLA diversity of each population and its potential exposure to pathogens, we also used the number of infectious diseases (NID) observed in each African country (as provided by the GIDEON online database, http://www. gideononline.com, accessed on March 2016) as a measure of pathogen richness and we assigned these values to the populations tested in this study (Table S1).

| Statistical analyses
Due to the scarcity of African population samples typed for HLA at resolution levels higher than the first-field (which corresponds to allele groups (or lineages) according to the official HLA nomenclature (Robinson et al., 2015)), we used that same resolution level for the newly typed populations, thus allowing to compare them with as many other data as possible.  Litwin, 1989) using standardized residuals.
Interpopulation analyses were then performed on the total set of 40 populations. We first plotted nonmetrical multidimensional scaling (NMDS, Kruskal, 1964) analyses based on Prevosti, Ocaña, and F I G U R E 1 Map of Africa showing the prevalence of Plasmodium falciparum malaria (estimated by P. falciparum parasite rates PfPr, coloured areas) according to the Malaria Atlas Project (http://www.map.ox.ac.uk) and the location of the populations analysed in this study. Population names (see Table S1 for  All statistical analyses were performed with R (R Core Team 2013) (also described by Dalgaard, 2008) environment packages RE-SHAPE2 (Wickham, 2007), GGPLOT2 (Wickham, 2009), MASS (Venables & Ripley, 2002), VEGAN (Oksanen et al., 2016), NCF (Bjornstad, 2015) and GEOSPHERE (Hijmans, 2015), except the population substructure (ANOVA) analyses for which we used the ARLEQUIN software (Excof-

| Overall HLA-A and -B genetic landscape of Africa
The HLA frequencies of the total set of 40 populations, that is, the 11 typed in this study and the 29 taken from published studies (reported in Table S2) were used to compute genetic distances and plot the two-dimensional (2D) NMDS shown in Figure 2a  and that the geographic regions are more often discriminated (F CT > F SC ) according to HLA-A, which slightly increases the overall geographic structure at this locus.

| Geographic variation of HLA-A and -B alleles
We performed additional analyses to assess whether particular HLA alleles were more significantly associated than others to this overall genetic pattern. Spatial autocorrelation was found to be significant for many alleles at both loci (Table S5) (Di & Sanchez-Mazas, 2011, 2014Di et al., 2015), we also checked for patterns potentially associated with latitude and longitude. Spearman's q and Pearson's r tests revealed high (above 0.5) and significant correlations (for at least one of the two coefficients q and r, see

| Associations of HLA diversity to environmental parameters
Because HLA genes are known to be targets of natural selection through heterozygous advantage and resistance or susceptibility to many diseases (see Section 1), we further explored possible effects of environmental parameters on the observed HLA genetic patterns.
We first checked whether the heterozygosity estimated in the stud-    Table S1 (Table S8).  (Table S9), although the amount of linkage disequilibrium across the HLA region appears to be low in Africa (Testi et al., 2015) compared to Europe (Bugawan, Klitz, Blair, & Erlich, 2000;Sanchez-Mazas et al., 2000), as generally observed for other genetic markers (Campbell & Tishkoff, 2008 Allele names denote the estimated allele frequencies, kB is the number of alleles observed at locus HLA-B, the coefficients denote how much a 1% increase in frequency changes PfPR (e.g., in the retained model, an increase of 1% of the B*78 frequency is associated with an increase of 5.366% of PfPR, with a standard error of 1.405, the ratio of the estimate to its standard deviation being a t test statistic, highly significant in this case). R 2 is the coefficient of determination and the adjusted version compensates for the number of parameters of the model. The initial and retained models are not significantly different. Significant coefficients and F statistics of the retained model are in bold and marked with stars. *p < .1; **p < .05; ***p < .01; df = degrees of freedom.

| Candidate HLA alleles to malaria protection
F I G U R E 3 Plot of allele frequency as a function of Plasmodium falciparum malaria prevalence (estimated by P. falciparum parasite rates PfPR). The slope of the straight lines corresponds to the regression coefficients of the three alleles B*53, A*74 and B*78 (see Table S8 Eritrea and Port Sudan (Young, 1996), migrated from Saudi Arabia at the end of the 19th century and remained isolated from the Cushitic populations (in particular the Beja) dwelling in the same territory. The BEJ (whose name is related to the Arabic word "Bedawi" meaning Bedouin, and who speak their own language) are composed of several endogamous groups, among which the Hadendowa (camel herders) typed in this study (Gudrun, 2006;Manger, 1996). Interestingly, the FUL (included in the 29 population samples taken from published studies), who also appear as outliers in our analyses and whose genetic divergence has been observed for other genetic markers . Interestingly, it has also been stated that the FUL are less susceptible to malaria than neighbouring populations of Burkina Faso (Modiano et al., 1996), showing a higher immune reactivity which would not be due to the "classical" malaria-resistance genes (Modiano et al., 2001) but to a functional deficit of T regulatory cells (Torcia et al., 2008) and higher levels of cytokines (Bostrom et al., 2012). This would explain why studies (Poloni et al., 2009;Tishkoff et al., 2009). In contrast, the populations located in the WSFAR (except the FUL) and SSAFR regions, here all represented by Niger-Congo (NC) speakers, form two distinct population groups. This is fully consistent with past genetic differentiations of Western and Southern NC populations resulting from two extensive migrations westwards (between 8,000 and 5,000 years ago) and southwards (between 5,000 and 3,000 years ago, Bantu expansion) from an original location situated around Nigeria and Cameroon, as advocated by archaeological and linguistic studies (Diamond & Bellwood, 2003;de Filippo, Bostoen, Stoneking, & Pakendorf, 2012;Li, Schlebusch, & Jakobsson, 2014 polymorphisms, as previously indicated by computer simulations Di et al., 2015).

| Signatures of natural selection on African HLA genetic patterns
Two kinds of natural selection were addressed in this study, that is, balancing selection due to heterozygous advantage (in the general sense), which is expected to maintain high levels of populations' heterozygosity, and directional selection due to higher fitness of genotypes more protective to particular diseases, which is expected to increase the frequencies of specific alleles (see, e.g.  (Table S2) is consistent with the hypothesis that locus B and is more prone to balancing selection than locus A (Buhler & Sanchez-Mazas, 2011;Sanchez-Mazas et al., 2013;Solberg et al., 2008). However, we did not find any correlation between heterozygosity and pathogen richness at any of the two loci; therefore, our results do not support a model where heterozygous individuals would, in general, be favoured (Prugnolle et al., 2005;Qutob et al., 2012;Sanchez-Mazas et al., 2012). One possible reason is that our analysis lacked power to reveal such a correlation, either because the number of infectious diseases (NID) used in our study as a proxy of pathogen richness is a poor estimator of pathogen diversity, or because we used HLA data defined at the 1st-field level of resolution, which corresponds to specificities defining sets of molecules with possibly distinct peptidebinding properties, likely lowering the sensibility to detect relevant associations. Another likely explanation with regards to our results, however, is that either pathogen-driven positive selection or genetic drift, by increasing the frequencies of specific alleles in peculiar environments (e.g., where malaria is endemic) or populations (e.g., isolated (semi-)nomadic pastoralists), respectively, blurred the signatures of heterozygous advantage (in the general sense) at the continent level. Actually, at local geographic levels heterozygotes for both a malaria-protective allele and an allele that would protect to another prevalent disease in the same environment would probably better survive, and such regional heterozygote advantages would not necessarily be related to the overall pathogen richness prevailing in the corresponding location.
Malaria is due to infection by one of the five known Plasmodium species, P. falciparum being however largely predominant and the most virulent in sub-Saharan Africa (Snow, Guerra, Noor, Myint, & Hay, 2005). Different kinds of resistance to malaria (reviewed by Hedrick, 2011;Lopez, Saravia, Gomez, Hoebeke, & Patarroyo, 2010) are assumed to be conferred by diverse genetic markers related to haemoglobin disorders (HbS, HbE, HbC, thalassaemia), erythrocyte polymorphisms (ovalocytosis, DARC (see also Demogines, Truong, & Sawyer, 2012)), enzymopathies (e.g., G6pD), and immunogenetic variants (e.g., CR1, HLA). Although attention has recently been given to a putative role of HLA-G in the regulation of the response to P. falciparum infection (Garcia et al., 2013;Sabbagh et al., 2013), the strongest relationships between HLA and malaria protection are found for HLA-A and -B (Garamszegi, 2014), and the main effect is explained by the recognition of a nonamer peptide derived from P. falciparum liver stage antigen-1 by HLA-B*53 restricted T cells (Hill et al., 1991(Hill et al., , 1992. In addition, it has been suggested that B*53:01, carrier of the Bw4 epitope, has the capacity of interacting with NK cells, notably through the receptor KIR3DL1, which would contribute to its protective effects (Norman et al., 2013).  Ghosh (2008), and B*35(:01), which has been suggested to be protective to malaria in Ghana (Yamazaki et al., 2011) and Sardinia (Contu et al., 1998), share a strong affinity for amino acid F at peptide position 9 (accommodated by pocket F of the peptide-binding groove), this binding being then also putatively significant against this disease. Overall, the observation that several HLA-B alleles, with either similar or distinct peptide-binding profiles, would be protective to malaria suggests that pathogen-driven selection acting on the HLA-B locus would be of the kind defined as soft selective sweep (Hermisson & Pennings, 2005;Messer & Petrov, 2013) by which several alleles from the standing genetic variation (here B*53:01:01 and B*78:01 proposed in this study, and possibly others) would have become beneficial at the onset of a specific infectious disease (in this case P. falciparum malaria). Together with different forms of balancing selection (i.e., heterozygous advantage, frequency-dependent selection and selection fluctuating in time and/or space) to which the HLA loci are probably submitted (Hedrick, 2002;Meyer & Thomson, 2001), this mechanism would have also contributed to maintain a high level of polymorphism at this locus, rather than to decrease its diversity as expected from the "classical" form of selective sweep (Smith & Haigh, 1974).
Finally, one HLA-A allele, A*74, was also found to be positively associated with PfPR, speaking in favour of this allele being an additional protective factor. This finding is supported by the study of Migot-Nabias et al. (2001) showing that specific immune responses operating at the pre-erythrocytic stage of development of P. falciparum would be regulated by HLA-A19, the latter being the broad antigen serotype to which A*74 belongs. By performing additional sequencing of A*74-positive individuals in our new samples, we also found that among the two identified alleles, A*74:01 was the only one observed in malaria-endemic areas (Western Africa) where it could have been positively selected, whereas A*74:03 was predominantly observed in the (semi-)nomadic RAS population from Sudan (six individuals on a total of nine A*74:03 carriers), where its frequency could have increased by genetic drift. On the other hand, the peptide-binding profiles predicted for these two alleles (Fig. S5) show a strong affinity for amino acid R at position 9, that is, the same affinity as that predicted for A*33:01 (and A*33:03) which has been proposed as a susceptibility, rather than a protective factor to a severe form of malaria (cerebral malaria) in Malian children (Lyke et al., 2011). Also, our linear models indicated that the frequency of A*74 was correlated not only with P. falciparum malaria prevalence but also with latitude (which was not the case for B*53 and B*78), unveiling a potential confounding effect possibly related to the amalgamation of the two different alleles A*74:01 and A*74:03. The hypothesis that A*74 would be protective to malaria is thus currently less supported than for the two candidate alleles proposed at the HLA-B locus.
To conclude, the present study emphasizes the multiple interests of performing a large-scale population genetic analysis of HLA genes taking jointly into account various genetic, geographic and environmental parameters. In addition to provide a better understanding of the evolutionary mechanisms that shaped the HLA polymorphism, in this case the HLA-B genetic landscape of Africa, our results disclosed two main candidate alleles to malaria protection at this locus, B*53:01:01 and B*78:01, the immunological role of which would now need to be investigated by direct functional analyses focusing on HLA peptide-binding properties. In a broader perspective, we expect further analyses focusing on the numerous highly polymorphic HLA genes to unveil other crucial associations between their molecular diversity and the wide range of environmental conditions constraining the subsistence of human populations.

ACKNOWLEDG EMENTS
This work was supported by grant #31003A_144180 of the Swiss National Science Foundation to ASM and by the Grant Agency of the Czech Republic (13-37998S-P505) to VC. We sincerely thank two anonymous reviewers for their help to improve a first version of this manuscript.

DATA ACCESSIBILI TY
All the data used in this study are available within the article and in the Supporting Information.