Evolutionary structure of Plasmodium falciparum major variant surface antigen genes in South America: Implications for epidemic transmission and surveillance

Abstract Strong founder effects resulting from human migration out of Africa have led to geographic variation in single nucleotide polymorphisms (SNPs) and microsatellites (MS) of the malaria parasite, Plasmodium falciparum. This is particularly striking in South America where two major founder populations of P. falciparum have been identified that are presumed to have arisen from the transatlantic slave trade. Given the importance of the major variant surface antigen of the blood stages of P. falciparum as both a virulence factor and target of immunity, we decided to investigate the population genetics of the genes encoding “Plasmodium falciparum Erythrocyte Membrane Protein 1” (Pf EMP1) among several countries in South America, in order to evaluate the transmission patterns of malaria in this continent. Deep sequencing of the DBLα domain of var genes from 128 P. falciparum isolates from five locations in South America was completed using a 454 high throughput sequencing protocol. Striking geographic variation in var DBLα sequences, similar to that seen for SNPs and MS markers, was observed. Colombia and French Guiana had distinct var DBLα sequences, whereas Peru and Venezuela showed an admixture. The importance of such geographic variation to herd immunity and malaria vaccination is discussed.


| INTRODUCTION
There is convincing evidence that Plasmodium falciparum originated in Africa and spread to the rest of the world by human migration (Anderson et al., 2000;Duval et al., 2010;Joy et al., 2003;Liu et al., 2010;Prugnolle et al., 2010Prugnolle et al., , 2011Yalcindag et al., 2012). Strong founder effects, resulting from global migration, have led to P. falciparum geographic variation in single nucleotide polymorphisms (SNPs) and microsatellites (MS) with the greatest diversity being observed within Africa. This spatial variation is particularly striking in South America where two main genetic clusters, previously shown through SNPs and MS variation, suggest independent introductions of P. falciparum from Africa through the transatlantic slave trade about 500 years ago (Yalcindag et al., 2012). Yalcindag and colleagues provided evidence for structuring of the parasite population into a northwestern cluster (Colombia) and a southeastern cluster (French Guiana/ Brazil/Bolivia; Yalcindag et al. 2012). This structure is believed to have originated through distinct human population movements related to the subdivision of the continent into the Portuguese and Spanish empires and is maintained today through the admixture of populations (Venezuelan and Peruvian) between these two clusters. The genetic markers used in the study of P. falciparum population structure in South America by Yalcindag et al. were putatively neutral allowing for inference of population history as well as demographic events (Yalcindag et al., 2012). They, however, do not define characteristics of parasite fitness (Kirk & Freeland, 2011). By comparison, both drug resistance markers and variant antigen encoding loci provide information about parasite behavior in relation to drug and/or immune selection forces.
These biomarkers are the key to pathogen diagnostic surveillance as they allow for the prediction of epidemics with different behaviors over space and time.
We decided to investigate the evolution of var genes, encoded by the major blood stage variant surface antigen, "Plasmodium falciparum Erythrocyte Membrane Protein 1" (PfEMP1), among several countries in South America to evaluate the population structure of these genes at the scale of a continent, and thus describe the transmission patterns of malaria. The P. falciparum genome is composed of up to 60 var genes with each representing a different antigenic form. To achieve clonal antigenic variation within the host, PfEMP1 is expressed sequentially in a mutually exclusive manner (Dzikowski, Frank, & Deitsch, 2006;Scherf, Lopez-Rubio, & Riviere, 2008;Voss et al., 2006). This remarkable biological feature enables P. falciparum to evade the human immune response and establish chronic infections linked to specific cellular interactions. Indeed, PfEMP1 also binds to host endothelial tissues with different variants exhibiting specific adherence characteristics for tissues, which in turn are associated with distinct disease manifestations (Avril et al., 2012;Claessens et al., 2012;Kraemer & Smith, 2003). PfEMP1 is thereby considered a virulence factor. The var multigene family is highly diverse among parasite genomes in natural parasite populations as well as in clinical cases (Mugasa et al., 2012;Rorick et al., 2013;Sulistyaningsih et al., 2013;Warimwe et al., 2013).
The var gene family contains several semi-conserved domains with specific structural characteristics called Duffy Binding-Like (DBL) domains. Among the different DBL domains characterized, DBLα is the most conserved and is involved in the adherence of infected red blood cells (RBCs) to uninfected RBCs (Chen et al., 1998;Vogt et al., 2003).
Bioinformatic analyses have shown that var gene diversity is of ancient origin and maintained by balancing selection, and that through the process of shuffling homology blocks during recombination var genes are able to diversify (Rask et al., 2010;Zilversmit et al., 2013).
Moreover, it has been demonstrated that var gene repertoire diversity within and between parasite genomes is also generated by meiosis during sexual recombination (Chen et al., 2011;Zilversmit et al., 2013) and by mitotic recombination during asexual division (Bopp et al., 2013;Claessens et al., 2014;Duffy et al., 2009). The high levels of mitotic recombination observed within cloned lines have led to a view that var genes may not be a stable marker for molecular surveillance.
However, population genetic studies from malaria endemic regions like Africa, Papua New Guinea, and South America have demonstrated that sequencing the highly conserved DBLα domain of var genes is an effective approach to both characterize and monitor P. falciparum diversity spatially and longitudinally (Barry et al., 2007;Chen et al., 2011;Day et al., 2017;Scherf et al., 2008;Tessema et al., 2015).
To date, limited population sampling of var genes from Venezuela and Brazil in South America has revealed restricted diversity both locally and regionally, as compared to African populations (Albrecht et al., 2006(Albrecht et al., , 2010Chen et al., 2011;Tami et al., 2003). Structuring of var genes across local populations on the South American continent, however, has not been established. In this study, using next-generation 454 sequencing, we have successfully genotyped the var DBLα domains of 128 P. falciparum clinical field isolates collected from four countries in South America between 2002 and 2008 and for which SNP and MS data were available for comparison (Yalcindag et al., 2012). Population genetic analysis using var genes has allowed us to address the following questions: (i) What are the limits of var gene diversity in these South American populations? (ii) Does population and Venezuela showed an admixture. The importance of such geographic variation to herd immunity and malaria vaccination is discussed.

K E Y W O R D S
evolutionary structure, Plasmodium falciparum, Plasmodium falciparum Erythrocyte Membrane Protein 1, population genomics, var genes structuring at var loci exist between/among the South American populations?, and (iii) Are var gene population genetics a reflection of geographic population structure on the South American continent when compared to SNPs and MS markers? The answers to these molecular epidemiological questions would allow us to predict the potential for epidemic transmission of P. falciparum clones not previously observed across South America.

| Ethical statement
Ethical clearance was obtained from the local ethics committees in each country sampled. The informed consent procedure for the study consisted of a presentation of the aims of the study to the community followed by invitation of individuals (or their parents/guardians) for enrollment. At the time of sample collection, the purpose and design of the study was explained to each individual and verbal informed consent was collected by a minimum of two people. The verbal consent process was consistent with the ethical expectations for each country at the time of enrollment, and the ethics committees approved these procedures.
All the samples collected from French Guiana and analyzed in this study were from blood collections that were required as standard medical care for any patient presenting with a fever on admission to the hospital. According to French legislation (Article L.1211-2 and related, French Public Health Code), biobanking and the secondary use of remaining human clinical samples for scientific purposes is possible if the corresponding patient is informed and has not objected to such use. This requirement was fulfilled for the present study; each patient was informed via the hospital brochure entitled "Information for Patients," and no immediate or delayed patient opposition was reported to the Malaria NRC by the clinicians.

| Study samples
The 128 P. falciparum isolates typed in this study (Table 1)

| Genotyping
Single nucleotide polymorphisms (SNPs) and microsatellite (MS) genotyping were previously performed in the study of Yalcindag et al. (2012). These data were used in the present analyses for comparison. and eluted in 100 μl of elution buffer per 200 μl of whole blood or per dried filter blot. The conserved var DBLα domain has previously been used as a marker for var gene diversity in other global studies (Barry et al., 2007;Chen et al., 2011;Day et al., 2017;Tessema et al., 2015).

| Var DBLα sequence analysis
A custom pipeline was developed to demultiplex, denoise, and remove PCR and sequencing artefacts from the DBLα domain reads.
The first part of the pipeline is available as the Multipass web server: http://www.cbs.dtu.dk/services/MultiPass-1.0, and the following cleaning steps described below are implemented in a python script available here: https://github.com/454data/postprocess. The sfffiles obtained from each region on the 454-plate were divided into smaller isolate-specific sff-files by identification of reads with exact matching MID sequences in both ends using BioPython v1.57 (Cock et al., 2009). Ambiguous primer sites were then identified (exact match) and trimmed off the flowgrams, reverse reads were reverse complemented, and a dat-file (AmpliconNoise format) with the resulting flowgrams was created for each isolate, using BioPython v1.57 (Cock et al., 2009). By combining the forward and reverse reads, this method takes advantage of bidirectional amplicon sequencing, since the forward reads will have highest quality in the 5′-end of the target sequence, and the reverse reads will improve the 3′-end quality. where h is the length of the homopolymer that gave rise to the flow value s. So the log-normal probability density function is: Parameter extrapolation was performed to obtain expected flow distributions for homopolymer lengths that were rare in the control isolates (Fig. S1).
2σ 2 total number of types in a population using frequency data on types seen once only and types seen twice only (Chao, 1984). In the setting of equal probability of distribution and sampling of types, the Chao1 estimator yields a point estimate of richness. These estimators cannot predict a probable maximum richness.

| Similarity indices
Ecological indicators of similarity were calculated to quantify the relatedness between the var gene repertoires identified from two isolates and among the South American populations using the DBLα domain.

| Pairwise type sharing
Pairwise type sharing (PTS), analogous to Sørensen's Index (quotient of similarity, or QS), is a useful statistic to analyze diversity and determine the proportion of DBLα types shared between isolate repertoires and among the South American populations (Barry et al., 2007;Chao, Chazdon, & Shen, 2005 Additionally, the PTS statistic was used to generate a distance matrix with genetic distance being defined as pairwise type distance (PTD). PTD was calculated as follows: PTD AB = 1 − PTS AB . PTD values range from 0 to 1, where a PTS score of 0 signifies that two isolate repertoires or populations are genetically identical, while 1 signifies that they are genetically distinct. The relationships were visualized using a neighbor joining tree constructed using Clearcut v1.0.9 (Barry et al., 2007;Sheneman, Evans, & Foster, 2006). A tree was also constructed using Jaccard distances (Jaccard, 1912); however, the tree based on PTD captured the geographic structure more clearly as it accounted for (i) the large number of DBLα types in the population and (ii) was weighted based on the abundance of DBLα types (i.e., DBLα types shared between isolates have more weight).

| Chao-Sørensen's Index
Chao-Sørensen's Index provides another estimate of similarity among the South American populations. This index adjusts for the abundance of each DBLα type (not just presence or absence of the DBLα type) in the population, as well as adjusting for the effect of unseen shared DBLα types in conditions of under sampling (Chao et al., 2006). These calculations were performed using EstimateS v9.1 (Colwell, 2013).
Similar to PTS (or QS), the Chao-Sørensen's Index was also used to determine genetic distance and the level of genetic differentiation between the South American isolates/populations using DBLα types.

| Genetic diversity and genetic differentiation
The distribution of genetic diversity of DBLα types among the South American populations was investigated using the same methods and tools previously described in Yalcindag et al. (2012). For these analyses, each DBLα type was considered a locus and each isolates' multilocus genotype was the sum of the presence (coded as 1) or absence (coded as 2) of each DBLα type. For each isolate, all unique DBLα types identified were included.

| Genetic differentiation
Pairwise Weir and Cockerham's F ST estimates (Weir & Cockerham, 1984) between the South American population were computed for the SNPs, MS, and DBLα types using the FSTAT V.3.7 software (updated from (Goudet, 2001)). To explain patterns of isolation by dis-

| Principal component analyses
Principal component analyses (PCA) were performed on the matrix of binary allele profiles using the R-package "Adegenet" (Jombart, 2008). These analyses were completed to obtain further understanding on the genetic structure of the South American isolates and populations.

| STRUCTURE analyses
We used the Bayesian clustering method implemented in STRUCTURE v.2.1 (Pritchard, Stephens, & Donnelly, 2000) to identify population structure. We ran models allowing for admixture, with the number of clusters or populations (K), ranging from K = 1 to the number of South American populations included (K = 5) ( (Raymond & Rousset, 1995;Anderson et al., 2000), depending on the dataset). All simulations used 100,000 Markov chain Monte Carlo (MCMC) generations in the burn-in phase and 100,000 generations in the data collection phase. Ten independent runs were performed for each specified K to verify convergence in the estimates of posterior probabilities. The optimal number of clusters was estimated using the method proposed in Evanno, Regnaut, and Goudet (2005) (Evanno et al., 2005).

| Summary of sequencing results
The var DBLα domains of the 128 clinical field isolates collected previously in South America (Yalcindag et al., 2012) were successfully sequenced using the next-generation 454 sequencing (Roche) approach.
After the application of the custom pipeline for DBLα sequence analy-

| Assembly of reads into var DBLα sequences
Within each isolate, sequences were clustered into nonredundant DBLα sequences using the flowgram clustering method (Section 2).
The method resulted in 5,699 DBLα sequences among the 128 South American isolates with 29.8 sequence reads per DBLα sequence.
Among the South American populations, between 352 and 2,048, nonredundant DBLα sequences were identified (Table 1) and represent the dataset on which the analyses were performed.

| Definition of var DBLα types, frequency distribution, and richness estimates
To determine the number of unique DBLα types shared between iso-

| Analysis of isolate var DBLα repertoires
At the isolate level, the median repertoire size (number of unique DBLα types per isolate) was 42 (range = 13-92); however, the majority of isolates (N = 125, 97.7%) had repertoires composed of ≥20 DBLα types (Fig. S4); that is, they were sufficiently well sampled to permit comparison of DBLα types between isolate repertoires. To Guiana clustered together (i.e., higher median PTS scores, darker shading on the PTS heat map) and were distinct from those isolates sampled in Colombia (i.e., lower median PTS scores, lighter shading on the PTS heat map) (Figure 4b). The Peruvian and Venezuelan isolates' median PTS scores were distributed between these two geographic clusters, as they showed transitional DBLα repertoire overlap (i.e., moderate shading on the PTS heat map) with both the French Guiana and Colombian isolates (Figure 4b).

| Geographic structuring of the South American P. falciparum isolates
The study of the structural genetic organization of the South American   (Table 2).
To further understand this pattern of DBLα type sharing, comparisons were made to evaluate the effects of isolation by distance (IBD) between the South American populations. Figure  nature of the Peruvian and Venezuelan populations was further evident from the neighbor joining tree (Fig. S6).

| DISCUSSION
Molecular surveillance of diverse pathogens like P. falciparum that are constantly evolving is critical to achieve control. In this context, detection of genetic variation in the genes encoding the major surface antigens is important for disease surveillance as immunity to such antigens drives the dynamics of the transmission system and allows for epidemics to be antigenically characterized. For example, the single copy hemagglutinin gene of the influenza virus is used to predict spatial and temporal patterns of disease (Munster et al., 2007;Nelson & Holmes, 2007;Plotkin, Dushoff, & Levin, 2002;Rambaut et al., 2008). In contrast, for P. falciparum monitoring antigenic diversity is very complex. Numerous polymorphic surface antigen encoding genes in different life cycle stages (e.g., rif, stevor, msp1, msp2) have been utilized as diagnostic markers of diversity (Baruch et al., 1995;Bull et al., 2005;Cheng et al., 1998;Florens et al., 2002;Gardner et al., 2002;Sam-Yellowe et al., 2004;Scherf et al., 2008;Smith et al., 1995;Su et al., 1995;Woehlbier et al., 2010). Specifically, the major P. falciparum variant surface antigen of the blood stages, PfEMP1, is a key marker as it is a virulence factor and immunity to this antigen determines the dynamics of infection within and between hosts (e.g., (Artzy-Randrup et al., 2012)). To date few malaria surveillance studies have used var genes encoding PfEMP1 due to the extreme diversity and the complexity of undertaking population genetics with this multigene family (Albrecht et al., 2010;Artzy-Randrup et al., 2012;Chen et al., 2011;Day et al., 2017;Tessema et al., 2015). Here, using the 454 high throughput sequencing approach to obtain well-sampled populations, we show that the conserved DBLα domain of var genes constitutes a promising biomarker to infer population structure, and more generally for epidemiological disease surveillance. Indeed, we demonstrate that DBLα types were spatially variable and geographically structured in South America.
In contrast to P. falciparum in Africa, we observed "limited" var DBLα type diversity within the local South American populations, which was consistent with previous surveys (Albrecht et al., 2010;Chen et al., 2011). One hypothesis to explain this "limited" var DBLα type diversity could be linked to relatively lower transmission and hence fewer coinfections/superinfections (i.e., multiple genotypes within an isolate) existing in the South American human population. This would lead to less frequent outcrossing (mating between two genetically distinct parasites) during meiotic recombination in the mosquito phase of the parasite life cycle. The observed limited local diversity of DBLα types in South America raises the possibility that the highly immunogenic PfEMP1 could be a vaccine target in this location.

The South American east/west geographic differentiation in var
DBLα types mirrors the population structure reported previously using SNPs and MS markers in the same populations (Yalcindag et al., 2012). Like the neutral/non-selected markers, var DBLα type structuring was also consistent with an isolation-by-distance model. The significant genetic differentiation obtained could be explained by the "limited" genetic diversity observed in the populations under study (Hedrick, 2005), the epidemic characteristic and small effective size of South American P. falciparum populations (Anderson et al., 2000) as well as multiple independent introductions of P. falciparum into South America (Yalcindag et al., 2012).
Whether the underlying driver of geospatial structuring of the var loci is due to regional adaptation of P. falciparum genomes to unique mosquito vectors and/or to human demography/population history remains to be further examined. If the latter, then we suggest that epidemic transmission of P. falciparum could occur across South America through the importation of novel variants not previously seen in the region. Indeed, the low shared diversity (few var DBLα types) and/or mixing among the South American countries suggests the necessity to develop local strategies for vaccination vs.
a pan-continental approach. To conclude, the use of var population genomics in South America allowed for the description of the genetic complexity in the reservoir of infection as well as a better understanding of P. falciparum epidemiology in relation to differences in parasite antigenic variation.

ACKNOWLEDGMENTS
We would like to thank the participants for their willingness to be involved, as well as the field teams for their expertise and coordina-

CONFLICT OF INTEREST
None declared.

DATA ACCESSIBILITY
All DBLα sequences analyzed in this study are publicly available for download in GenBank: KX845707-KX851405.