Genome‐wide mosaicism in divergence between zoonotic malaria parasite subpopulations with separate sympatric transmission cycles

Abstract Plasmodium knowlesi is a significant cause of human malaria transmitted as a zoonosis from macaque reservoir hosts in South‐East Asia. Microsatellite genotyping has indicated that human infections in Malaysian Borneo are an admixture of two highly divergent sympatric parasite subpopulations that are, respectively, associated with long‐tailed macaques (Cluster 1) and pig‐tailed macaques (Cluster 2). Whole‐genome sequences of clinical isolates subsequently confirmed the separate clusters, although fewer of the less common Cluster 2 type were sequenced. Here, to analyse population structure and genomic divergence in subpopulation samples of comparable depth, genome sequences were generated from 21 new clinical infections identified as Cluster 2 by microsatellite analysis, yielding a cumulative sample size for this subpopulation similar to that for Cluster 1. Profound heterogeneity in the level of intercluster divergence was distributed across the genome, with long contiguous chromosomal blocks having high or low divergence. Different mitochondrial genome clades were associated with the two major subpopulations, but limited exchange of haplotypes from one to the other was evident, as was also the case for the maternally inherited apicoplast genome. These findings indicate deep divergence of the two sympatric P. knowlesi subpopulations, with introgression likely to have occurred recently. There is no evidence yet of specific adaptation at any introgressed locus, but the recombinant mosaic types offer enhanced diversity on which selection may operate in a currently changing landscape and human environment. Loci responsible for maintaining genetic isolation of the sympatric subpopulations need to be identified in the chromosomal regions showing fixed differences.

Multilocus microsatellite genotyping analysis of P. knowlesi infections revealed that human infections in Malaysian Borneo comprise two major genetic subpopulations that are, respectively, associated with long-tailed and pig-tailed macaque reservoir hosts (Divis et al., 2015), with significant divergence confirmed by whole-genome sequence analyses of parasites in human infections (Assefa et al., 2015). In most areas of Malaysian Borneo, the number of human clinical infections of the parasite subpopulation type associated with longtailed macaques (Cluster 1) is higher than those having the type associated with pig-tailed macaques (Cluster 2) . Further analyses of additional samples have subsequently revealed a third divergent subpopulation of P. knowlesi (Cluster 3) on the mainland of South-East Asia which includes Peninsular Malaysia Yusof et al., 2016). So far, only P. knowlesi parasites of Cluster 3 have been studied in infections of laboratory monkeys (Assefa et al., 2015), and one strain of this type has been adapted to efficiently invade human erythrocytes in culture (Lim et al., 2013;Moon et al., 2013).
To develop laboratory studies on the other two major zoonotic populations will require establishment of parasite isolates in controlled monkey infections, or ideally into culture with erythrocytes. Analysis of P. knowlesi samples from human clinical infections is relatively straightforward, as most of these are not mixed with other species, whereas most natural P. knowlesi infections in macaques occur together with other primate malaria parasite species (Lee et al., 2011).
The first large-scale whole-genome sequence analysis of P. knowlesi infections contained clinical samples that were mostly of the Cluster 1 type (N = 38), yielding results indicating that this has undergone long-term population growth, with additional evidence of selection on particular loci (Assefa et al., 2015). There were only 10 Cluster 2 type infections sequenced in the study, which limited investigation of the demographic history of that subpopulation, but these were sufficient to indicate that the level of intercluster divergence varied across the genome, some loci having a concentration of apparently fixed differences and others showing more shared polymorphism (Assefa et al., 2015). A separate simultaneous study reported data from another six infections, confirming the divergence between sympatric subpopulations (Pinheiro et al., 2015), but this did not cumulatively give a much deeper sample. In agreement with the initial study (Assefa et al., 2015), a recent secondary analysis of the previously published data confirmed the existence of genomic regions with shared polymorphisms (Diez Benavente et al., 2017), but did not include any new data.
For a more informed comparison of these important zoonotic parasite subpopulations, a much larger sample of Cluster 2 type P. knowlesi genome sequences was obtained in this study. Combining the new data with samples sequenced previously (Assefa et al., 2015;Pinheiro et al., 2015) yielded a total of 34 Cluster 2 genome sequences that enables a more comprehensive analysis of genomic polymorphism and divergence between the subpopulations. This provides new understanding of the genome-wide variation in divergence of these two sympatric P. knowlesi subpopulations, essential for understanding their long-term maintenance and potential for future adaptation. Hygiene and Tropical Medicine. Leucocytes were removed by allowing 10 ml of blood to pass through a CF11 cellulose column, to enrich for erythrocytes and thereby increase the proportion of parasite compared to host DNA. Genomic DNA was extracted using QIAamp DNA Mini kits (Qiagen, Germany), and all infections were confirmed to contain only P. knowlesi by nested PCR assays testing for all locally known malaria parasite species (Lee et al., 2011). Determination of the genetic subpopulation cluster of each DNA sample was conducted by microsatellite genotyping , and 21 samples of the Cluster 2 type that had sufficient DNA were selected for whole-genome sequencing. These were mostly single genotype infections as determined by microsatellite typing   (Li, 2013). This generated file in the SAM (sequence alignment/map) format, and followed by the conversion into a BAM (binary alignment/map) format using the SAMTOOLS package version 0.1 (Li et al., 2009). Due to the possible effect of PCR amplification bias introduced during the DNA library preparations, read duplications were removed using the "MarkDuplicates" command from the Picard toolkit (https://github.com/broadinstitute/picard). The average depth coverage was analysed by the BEDTOOLS version 2 package using the "genomeCoverageBed" command (Quinlan & Hall, 2010).
Re-mapping of short read genome sequences generated from previous studies (Assefa et al., 2015;Pinheiro et al., 2015) against the version 2.0 of P. knowlesi strain H reference genome was also performed in the analysis ( and five laboratory isolates ("Nuri" SRA numbers ERR019406, "Hackeri" SRR2221468, "Malayan" SRR2225467, "MR4-H" SRR2225571 and "Philippines" SRR2225573). The reference H strain sequence belongs to Cluster 3 (Assefa et al., 2015), which is approximately equally divergent from Clusters 1 and 2, so no bias is expected in the efficiency of mapping of the sequences to this reference.

| Single nucleotide polymorphism calling and filtration
The calling of high-quality single nucleotide polymorphisms (SNPs) was performed using several steps, following procedures described previously (Assefa et al., 2015). For each isolate, SNPs were first identified from the BAM file using SAMTOOLS/BCFTOOLS with the fol- Further filtration involved the removal of positions that contained ambiguous sequences (represented as a long stretch of unknown nucleotides "N") in the reference genome. The SICAVar, KIR, and pkfam-a to pk-fam-e multigene families (Pain et al., 2008) and the subtelomeric regions were also filtered out to avoid ambiguous alignments, which may cause false-positive SNP calls. Subtelomeric regions were here determined by visually inspecting the whole-genome synteny mapping of P. knowlesi with the P. vivax homolog using the PlasmoDB GBrowse v2.48 (plasmodb.org/cg-bin/gbrowse/plasmodb/), with the boundaries of subtelomeric regions defined as sequences adjacent to the first conserved protein-coding gene (Table S2). After exclusion of subtelomeric regions and the large multigene families, 21.2 Mb (92%) of the 23.0 Mb corresponding to the reference nuclear genome was analysed from each sample.

| Genomic diversity and population structure
To measure the amount of polymorphism within the parasite population, the average pairwise nucleotide diversity (p) among the sequences from the individual infection samples was calculated. The skewness in allele frequency distributions was estimated by Tajima's D index. Both indices were calculated using the same genome-wide SNP data set in nonoverlapping window sizes of 10 kb and performed using the DIVSTAT software (Soares, Moleirinho, Oliveira, & Amorim, 2015). To illustrate the population substructure, the matrix of pairwise DNA distance among individuals was calculated and the Neighbour-Joining tree was constructed using the APE package version 3.4 in the R environment (Paradis, Claude, & Strimmer, 2004). An independent population structure evaluation was also conducted using principal coordinate analysis (PCoA) with SNPs having no missing data, using the APE package.
To estimate the divergence between the subpopulations, the genome-wide distribution of the fixation index (F ST ) between the two-subpopulation clusters was computed with SNPs having minor allele frequencies (MAFs) above 0.1, and above 0.3, using cus- and z-scores > 0.5) were taken into consideration in determining the junctions. Each candidate region was demarcated by first and last SNPs that fell within the merged windows, except for HDRs where SNPs with elevated F ST values were used as start and endpoints.
Patterns of polymorphisms (nucleotide diversity summarized by p and allele frequency spectrum summarized by Tajima's D) in all genomic regions were evaluated using DIVSTAT software. Test runs were performed in nonoverlapping window sizes of 10 kb for each subpopulation. Nonparametric Kruskal-Wallis tests were used to compare among the genomic regions as well as against the genome-wide background.

| Extra-chromosomal genomes
Population structure and relationships of the sympatric P. knowlesi subpopulations were further analysed using the extranuclear DNA, consisting of the nonrecombining genomes of mitochondria and plastid-like apicoplast. The 5.9-kb mitochondrial DNA sequences were obtained from the present whole-genome sequence data and previously published sequences (Assefa et al., 2015;Jongwutiwes et al., 2005;Lee et al., 2011;Pinheiro et al., 2015). Complete mitochondrial sequences were obtained from GenBank database, consisting of 26 haplotypes from human isolates (Accession nos. EU880446-EU880470) and 20 haplotypes from macaque isolates (EU880471-EU880474, EU880477-EU880486, EU880489-EU880493 and EU880499) in Kapit of Malaysian Borneo, and one human isolate from Thailand (AY598141). Three species, P. coatneyi (AB354575), P. cynomolgi (AB434919) and P. vivax (AY791551), that have close evolutionary relationships with P. knowlesi were included in the analysis as out-groups. For the apicoplast genome of P. knowlesi, 30.6 kb of the DNA sequences that had clear alignment was extracted from the present whole-genome data set as well as from previous data (Assefa et al., 2015;Pinheiro et al., 2015) following mapping and base quality checks as mentioned above.
The derived mitochondrial and apicoplast genome sequences were separately aligned using the CLUSTALX programme version 2 (Larkin et al., 2007), following which nucleotide diversity (p) and haplotype diversity (Hd) was determined using the DNASP version 5 software (Librado & Rozas, 2009). A maximum-likelihood tree was inferred with 1,000 bootstrap replicates and gaps treated as missing data using the PHANGORN packages in R (Schliep, 2011), with the ModelTest algorithm used to determine the best-fit nucleotide substitution model, which was GTR+I+G (General Time Reversible model with a proportion of invariable sites and gamma distribution). For the mitochondrial sequences, major haplotypes were determined with gaps treated as missing data, and the statistical parsimony haplotype network was constructed using the TCS version 1.21 software (Clement, Posada, & Crandall, 2000).

| Generation of new whole-genome sequences and SNP genotyping
Paired-end Illumina sequencing of 21 new P. knowlesi clinical infection samples, selected on the basis of microsatellite genotyping as belonging to Cluster 2 (the type previously associated with pig-tailed macaque as well as human infections), yielded a mean of 6.95 million high-quality reads per sample, which were mapped against the P. knowlesi H strain version 2.0 reference genome sequence (Table S3). The mean depth of sequence coverage genome-wide was 52.3-fold (range from 28.7-to 80.3-fold) per sample. In addition, Illumina short read sequence data from another 59 P. knowlesi isolates obtained previously (Assefa et al., 2015;Pinheiro et al., 2015) were remapped against the P. knowlesi H strain version 2.0 reference genome using the same assembly parameters (Table S1), followed by SNP calling. In the combined data set of 80 infection sequences, a total of 2,109,937 SNPs were identified in the nuclear genome. Following exclusion of those in subtelomeric regions or in the KIR or SICAVAR multigene families, or that had more than two alleles, 1,669,533 SNPs remained, of which 1,186,073 high-quality SNPs with less than 10% missing calls in all isolates were used for population genomic analyses.  (Figures 1a and S1). Together with previous data, this yielded an overall sample of 34 Cluster 2 isolate sequences, to achieve a similar sample size as previously available for Cluster 1. As is visually apparent from the Neighbour-Joining tree based on the pairwise genetic distances (Figure 1a), the Cluster 2 infections are less genetically diverse (p = 3.43 9 10 À3 ) than the Cluster 1 infections (p = 5.78 9 10 À3 ). Furthermore, the Cluster 1 subpopulation demonstrated a homogenous pattern of sequence diversity across the 14 chromosomes (Kruskal-Wallis, p = .23), in contrast with Cluster 2 that showed heterogeneous levels of diversity across the chromosomes (Kruskal-Wallis p < 10 À16 ) ( Figure S2). In Cluster 2, nucleotide diversity of entire chromosomes ranged from 2.25 9 10 À3 (for chromosome 7) to 4.38 9 10 À3 (for chromosome 5), but all had a lower diversity than in Cluster 1 (Wilcoxon signed rank p < 10 À16 ). In a majority of nonoverlapping 10-kb windows genome-wide, nucleotide diversity (p) indices were lower in Cluster 2 ( Figure 1b). Large regions of chromosomes showed contiguous stretches in which diversity was much higher in Cluster 1, and also contiguous stretches in which the diversity was more similar (Figure 1c).

| Genomic regions of high and low divergence
The genome-wide variation in diversity in Cluster 2 suggested that there might be variation in levels of intercluster divergence.
Analysing SNPs with overall minor allele frequencies above 10% (193,068 SNPs), the mean genome-wide fixation index indicated substantial divergence between the two subpopulations (mean The relative level of population differentiation of all windows of 500 contiguous SNPs across the genome was evaluated by considering standard deviations from the mean genome-wide F ST value (zscore). Genomic regions were identified that contained contiguous windows defining low divergence regions (LDR with z-score < À0.5) and high divergence regions (HDR with z-score > 0.5). This revealed large genomic blocks of high or low divergence (Figure 2c; Table S4).

| Intracluster diversity in genomic regions with contrasting levels of divergence
The relationship of intercluster divergence with the varying nucleotide diversity (p) in Cluster 2 across the genome (Figure 1c) was investigated. Comparing between the two subpopulations, the differences in nucleotide diversity were higher in the HDRs than in the LDRs or in the rest of the genome (Figure 3; Mann-Whitney U p < 10 À16 for both comparisons). Most of the highly differentiated F I G U R E 1 Population structure of Plasmodium knowlesi indicated by whole-genome sequence data. (a) Neighbour-Joining tree based on a pairwise single nucleotide polymorphism (SNP) difference matrix of 80 P. knowlesi isolates. The 21 new genome sequences are indicated with stars, yielding a total sample size for Cluster 2 (N = 34) that is similar to that of Cluster 1 (N = 41). The scale bar indicates proportions of all SNPs differing between samples. (b) Scatterplot of nucleotide diversity (p) in individual nonoverlapping 10-kb windows of the genome, comparing data for Cluster 1 and Cluster 2 subpopulations. (c) Differences in nucleotide diversity between Cluster 1 and Cluster 2 subpopulations (p-diff) in each of the 10-kb windows of the genome [Colour figure can be viewed at wileyonlinelibrary.com] regions were those in which nucleotide diversity was substantially lower in Cluster 2 (Figure 3).
Reduced nucleotide diversity in HDRs compared to the rest of the genome was specifically seen in Cluster 2 (mean p in HDRs = 2.08 9 10 À3 ; Mann-Whitney p < 2.2 9 10 À16 ), and not in Cluster 1 (mean p in HDRs = 5.80 9 10 À3 ; Mann-Whitney p = 0.25). Similarly, higher nucleotide diversity in LDRs compared to the rest of the genome was seen specifically within Cluster 2 (Mann-Whitney p = 2.2 9 10 À16 ), and not in Cluster 1 (Mann-Whitney p = .77).
Both subpopulations showed strong skew towards low-frequency variants, with mean Tajima's D values of 10-kb windows of the genome for the Cluster 2 subpopulation being even lower than for the

| Phylogeny and introgression of extrachromosomal genomes
The analyses of population structure were extended using the maternally inherited extra-chromosomal genomes. Combination of the 5.9kb mitochondrial sequences generated in this study with previously published sequences yielded a sample size of 129 in total and identification of 77 SNPs. These mitochondrial sequences had a global average nucleotide diversity (p) of 7.9 9 10 À4 , with higher values in samples from parasites in Cluster 1 (p = 6.8 9 10 À4 , n = 74) than in Cluster 2 (p = 4.9 9 10 À4 , n = 46). The genealogical network of mitochondrial genomes contained 56 different haplotypes ( Figure 5). The most common and central core haplotype was detected mainly in parasites of the Cluster 1 subpopulation (25 of 28 isolates). A second common haplotype that was more peripheral in the network was seen mostly in the Cluster 2 subpopulation (15 of 21 isolates), while the third common haplotype was distantly related to this and detected only in Cluster 1 (nine isolates). Most of the closely related haplotypes to each of these were also seen only in the corresponding subpopulation clusters, but there is a group of closely related haplotypes internal in the network seen in parasites of Cluster 1 (13 isolates) which is embedded in part of the network that is otherwise only seen in Cluster 2 parasites ( Figure 5). Conversely, a few Cluster 2 isolates have haplotypes that are related to those only seen in Cluster 1. A separate branch of haplotypes was seen in laboratory isolates that had mostly been collected from Peninsular Malaysia. Maximum-likelihood phylogenetic analysis yielded a similar pattern, with haplotype clades being associated but not completely fixed between the Cluster 1 and Cluster 2 subpopulations ( Figure S3). Polymorphism in 30.6 kb of the apicoplast genome could be characterized using the Illumina short read sequence data to identify 520 polymorphic SNPs. With these data, 65 of the 80 isolates were analysed in detail as they had less than 20% missing SNPs, while the remaining 15 samples with more missing SNP data were excluded. The overall nucleotide diversity (p) was 1.79 9 10 À3, and this was higher among the Cluster 1 samples (p = 1.77 9 10 À3 ) than Cluster 2 samples (p = 1.12 9 10 À3 ). Two major lineages were seen, one of which consisted predominantly of Cluster 1 samples, and the other mainly of Cluster 2 samples ( Figure S4), although there were several isolates that had haplotypes of the opposite type to that expected for each cluster.

| DISCUSSION
This study analyses the largest ecological sample of sequences representing different subpopulations of a zoonotic eukaryotic parasite F I G U R E 3 Sequence diversity of Plasmodium knowlesi Cluster 2 is lowest in regions of the genome that have highest fixation indices in comparison with Cluster 1. Scatterplots show nucleotide diversity in discrete 10-kb windows genome-wide, with red points (left) showing windows in high divergence regions (HDR) and blue points showing windows in low divergence regions (LDR). For HDR, mean p = 5.80 9 10 À3 for Cluster 1 and 2.08 9 10 À3 for Cluster 2. For LDR, mean p = 5.60 9 10 À3 for Cluster 1 and 4.14 9 10 À3 for Cluster 2 [Colour figure can be viewed at wileyonlinelibrary.com] species. Whole-genome sequencing of new samples from one of the major genetic subpopulations of P. knowlesi has clearly revealed the genome-wide patterns of divergence between the sympatric subpopulations, which illuminates aspects of their population history and is essential for understanding their adaptive potential. This provides the most informative overall analysis of population structure of P. knowlesi to date, extending the understanding of defined subpopulation clusters that were previously described (Assefa et al., 2015;Divis et al., 2017). These results confirm the distinctness of the two sympatric divergent P. knowlesi subpopulations in Malaysian Borneo, supporting the occurrence of independent zoonotic cycles associated with different macaque reservoir host species (Divis et al., 2015;Muehlenbein et al., 2015).
The high differentiation between these two sympatric subpopulations indicates minimal or no ongoing gene flow occurring between them, and a large number of SNPs showed complete fixation of alternative alleles. However, the pattern of divergence was heterogeneous and bimodally distributed, with large regions of exceptionally high or low divergence interspersed throughout the genome.
Reduced genetic diversity of the Cluster 2 subpopulation in highly diverged regions suggests there may have been an initial bottleneck in the formation of this subpopulation. The overall allele frequency spectra were negativly skewed for both subpopulations, signifying long-term population growth, although this was more extreme for the Cluster 2 subpopulation. This gives a more detailed perspective than that previously obtained by analysis of mitochondrial genome sequences, which had already indicated a historical population expansion (Lee et al., 2011). The mitochondrial and apicoplast gen- Despite the differences at the genomic level, it is not yet known whether these two major sympatric subpopulations exhibit significant phenotypic differences, apart from the previously described association with different macaque reservoir host species (Divis et al., 2015Lee et al., 2011). Human P. knowlesi infections have been associated with a wide spectrum of disease (Cox-Singh et al., 2010;Daneshvar et al., 2009;Rajahram et al., 2012;William et al., 2011), and there is recent evidence that asymptomatic infections may be more common than previously expected (Fornace et al., 2015;Lubis et al., 2017;Siner et al., 2017), so conducting detailed clinical studies on individuals infected with each parasite subpopulation type is now a priority.
A recent study suggests a link between local deforestation and incidence of P. knowlesi infections in an area of Sabah state within Malaysian Borneo (Fornace et al., 2016). Of relevance to this, long-tailed macaques and pig-tailed macaques show different habitat ranges in forested and nonforested areas (Moyes et al., 2016), suggesting that there may be micro foci of infection for each subpopulation cluster, and highlighting the need to examine changes over time. It is clear that future research should include monitoring the proportions of the different P. knowlesi subpopulations over time, and potential changes in their genetic composition.
Sequencing of P. knowlesi genomes from natural macaque infections would be more challenging, given that these are usually coinfections together with other primate malaria parasite species (Lee et al., 2011), although new methods of sequencing genomes from single parasites could be adapted to address the issue (Trevino et al., 2017). This would ideally be done alongside sampling of infections in local mosquito vector species that could potentially be maintaining the separate zoonotic transmission cycles.
The genome-wide mosaicism, showing bimodal levels of divergence as well as limited discordant occurrence of extra-chromosomal genome lineages, indicate that introgression is likely to have occurred recently between these parasite subpopulations. The recombinant genomes that are now circulating offer a great diversity on which selection may operate, but there is no evidence yet of specific adaptation at introgressed loci. A recent re-analysis of previously published data identified a common shared haplotype in a chromosomal region with low divergence between the subpopulations (Diez Benavente et al., 2017), although an observation that the region had a slightly higher than background proportion of genes predicted to be expressed at a particular developmental stage may not be relevant, as an extended haplotype may result from selection on a single locus rather than on multiple genes.
In contrast, it is likely that at least one of the chromosomal regions showing fixed differences between the clusters contains a locus responsible for maintaining genetic isolation of the sympatric subpopulations, potentially due to transmission in different mosquito vectors, as well as likely adaptation to the different reservoir macaque hosts. Parasites from these sympatric subpopulations have not yet been studied in laboratory infections or adapted to culture, which will be necessary to define phenotypes and enable experimental analyses of differences between them. Despite major

DATA ACCESSIBI LITY
Paired-end short read genome sequence data for the new parasite infection isolates listed in Table S3 have been deposited in the European Nucleotide Archive, Accession nos. ERS2037781-ERS2037801.