Whole genome sequencing shows sleeping sickness relapse is due to parasite regrowth and not reinfection

Abstract The trypanosome Trypanosoma brucei gambiense (Tbg) is a cause of human African trypanosomiasis (HAT) endemic to many parts of sub‐Saharan Africa. The disease is almost invariably fatal if untreated and there is no vaccine, which makes monitoring and managing drug resistance highly relevant. A recent study of HAT cases from the Democratic Republic of the Congo reported a high incidence of relapses in patients treated with melarsoprol. Of the 19 Tbg strains isolated from patients enrolled in this study, four pairs were obtained from the same patient before treatment and after relapse. We used whole genome sequencing to investigate whether these patients were infected with a new strain, or if the original strain had regrown to pathogenic levels. Clustering analysis of 5938 single nucleotide polymorphisms supports the hypothesis of regrowth of the original strain, as we found that strains isolated before and after treatment from the same patient were more similar to each other than to other isolates. We also identified 23 novel genes that could affect melarsoprol sensitivity, representing a promising new set of targets for future functional studies. This work exemplifies the utility of using evolutionary approaches to provide novel insights and tools for disease control.


Introduction
Human African trypanosomiasis (HAT), also known as sleeping sickness, is endemic in many parts of sub-Saharan Africa. The chronic and most prevalent form of the disease is caused by the protist Trypanosoma brucei gambiense (Tbg) in central and west Africa. There is no mammalian vaccine, and the disease is almost always fatal, if untreated. Few drugs are available to treat HAT, particularly once the trypanosomes have crossed the blood-brain barrier in the late stage of disease, and each comes with significant drawbacks. Melarsoprol is an arsenic-based drug that, in the absence of any alternative, has been used to treat late stages of the disease for decades, despite causing encephalopathy in 6% of individuals, which is fatal 40% of the time (Blum et al. 2001). Today, the recommended first line therapy for late-stage gambiense HAT is a combination of nifurtimox and eflornithine (Priotto et al. 2009;Simarro et al. 2012). Although these drugs are less toxic than melarsoprol, this combination therapy (NECT) is difficult to administer , and resistance might soon be an issue, as up to 2% of cases relapse after NECT treatment .
Given this, and the absence of new drugs, melarsoprol is likely to remain in use for the foreseeable future. However, similar to many other drugs, melarsoprol resistance has become problematic. Beginning in the 1990s, several foci in HAT endemic regions have reported a high rate of melarsoprol treatment failure (Matovu et al. 2001a; Moore and Richer 2001;Stanghellini and Josenando 2001;Maina et al. 2007;Mumba Ngoyi et al. 2010). While melarsoprol treatment failure rate is typically less than 10%, areas in the Democratic Republic of the Congo (DRC) have reported rates as high as 58.8% (Moore and Richer 2001;Mumba Ngoyi et al. 2010). It is therefore critical to understand the reason for these failures. While several factors may contribute to the treatment failure rate, trypanosome resistance to melarsoprol is likely to be key.
Recently, a collection of 85 Tbg isolates from a hospital in the DRC reporting a high percentage of melarsoprol treatment failure was established (Mumba Ngoyi et al. 2010;Pyana et al. 2011). This collection includes strains isolated before and after treatment with melarsoprol from the same patients that had relapsed infections. These paired strains therefore present a unique opportunity to: (i) gain insights on the dynamics of relapses and (ii) systematically identify genetic determinants of melarsoprol resistance in natural Tbg strains. We sequenced the whole genomes of 19 Tbg isolates from this collection, including 4 pairs of strains derived from relapsed patients before and after melarsoprol treatment (Table 1). This enabled us to determine whether the strains isolated after treatment failure represent regrowth of the initial infecting strain, or alternatively, if they represent infections by a new strain. Taking advantage of this replicated design and a comparative approach, we analyze the patterns of genetic diversity among pre-and post-treatment strains to make inferences on the evolutionary dynamics of the system, discuss the role of genetic variation in genes previously reported to be associated with melarsoprol sensitivity (Maser et al. 1999;Shahi et al. 2002;Alsford et al. 2012;Baker et al. 2012), and identify novel candidate genes that could affect melarsoprol sensitivity, thereby providing direction for future functional studies.

Samples
Information about the Tbg strains used in this study is summarized in Table 1. All the clinical Tbg samples came from the cryobank of the World Health Collaboration Center for Research and Training on Human African Trypanosomiasis Diagnostics and originated from patients enrolled in a longitudinal study of HAT at Dipumba Hospital in Mbuji-Mayi, DRC. Details of this study are in Mumba Ngoyi et al. (2010) and the methods of trypanosome isolation and adaptation to mouse hosts are described in detail in Pyana et al. (2011). To summarize the methods found in these studies, blood or cerebrospinal fluid (CSF) containing trypanosomes was taken from the patient when they were first admitted, before treatment (BT). Patients were then given three types of treatment: a 10-day melarsoprol regimen (M10), a combination of melarsoprol and nifurtimox, or a 14-day eflornithine regimen (E14). If the patient subsequently relapsed and was readmitted, another sample was taken (AT). Trypanosomes were preserved by immediately freezing blood or CSF mixed with a cryomedium in liquid nitrogen. To establish a stable, rodent-adapted trypanosome line, samples were thawed and inoculated in one of three rodent species: Grammomys surdaster, immuno-suppressed Mastomys natalensis, or immuno-deficient Mus musculus. Trypanosome strains were maintained in hosts, by subpassaging if necessary, until parasite density reached 10 6.9 /mL . At this point, a blood sample was drawn and frozen in liquid nitrogen. This cryostabilized sample was then thawed and injected into a M. musculus host. Once parasite density reached 10 7.8 /mL, the trypanosome isolate was considered adapted to the mouse host. All but one isolate used in this study (346 BT) were adapted to mice. Mouse-adapted strains were given four doses of melarsoprol (10 mg/kg). If the mouse relapsed after this treatment, a blood sample was drawn and trypanosomes isolated. These samples were labeled AT-RE or BT-RE, depending on if the original isolate came from a patient before (BT) or after (AT) treatment. DNA from all isolates was extracted using a Qiagen DNA blood kit (Venlo, Netherlands).

Sequence and clustering analysis
DNA extracted from trypanosome isolates was submitted to the Yale Center for Genome Analysis for sequencing on the Illumina Hi-Seq 2000 platform (San Diego, CA, USA). Quality of the reads was checked through FastQC (Andrews 2010). In addition to the 19 Tbg clinical samples, we included in this study raw sequencing data from the Tbg strain 1829-Aljo (Sistrom et al. 2014), which was used for comparison, as it was isolated in the 1970s from a patient in the DRC.
Raw reads were aligned to the large chromosomes of the Tbg DAL 972 reference genome (Jackson et al. 2010), using Bowtie2 (Langmead and Salzberg 2012). Single nucleotide polymorphisms (SNPs) were called using the samtools (v1.1) mpileup, using the default quality cutoff (Li et al. 2009). Only SNPs with a minimum read depth of 10 were retained. SNP calling was also performed using the Genome Analysis Toolkit (GATK) (McKenna et al. 2010). Realignment around indels and base quality recalibration was performed following the best practices guidelines (Van der Auwera et al. 2013). For SNPs obtained from both pipelines, SNPs with minor allele frequency less than 0.05 were discarded. RepeatMasker (Smit et al. 2013(Smit et al. -2015 was used to identify repetitive elements in the Tbg DAL972 reference genome, and any SNPs found in these elements were eliminated. Coding sequences of variant surface glycoprotein (VSG) genes in the DAL972 genome were identified using gene text search function of the tritrypdb.org website . SNPs appearing in these sequences were eliminated, as the highly repetitive nature of VSG genes would make accurate SNP calling difficult. The ape (3.0-11) package in R (v.3.0.3) was used to conduct a cluster analysis and construct a neighbor-joining tree with 10 000 bootstrap replicates from the filtered set of SNPs (Jombart et al. 2010;Jombart and Ahmed 2011). Chi-square tests comparing the frequency of heterozygous SNPs between patient pairs were performed using the chisq.test command in R. F ws was calculated following Manske et al. (2012). The filtered set of SNPs was placed in 10 bins, according to their MAF. The heterozygosity at each SNP (equal to 1 À (F 2 1 þ F 2 2 ), where F 1 and F 2 are the allele frequencies) for the population as a whole and for each individual sample was calculated. For each bin, the average heterozygosity of the sample was plotted against the average heterozygosity of the population at each bin, and a linear regression was performed in R. The F ws for each sample was calculated as 1 minus the slope of the regression.
To complement the clustering analyses, we also carried out a multivariate analysis, discriminant analysis of principal components (DAPC) using the R package adegenet (1.3-9.2) (Jombart et al. 2010). This method constructs principal components (synthetic combinations of variables, in this case SNP genotypes) that best differentiate two or more groups of individuals. As a first step in the DAPC analysis, we used k-means clustering with the SNP data set transformed by principal component analysis for a range of values of k (the putative number of clusters). Examination of these results indicated 4 to be a reasonable value of k. To avoid over-fitting of data, we retained two principal components following optimization procedures as recommended by the package developers (Jombart et al. 2010).

Candidate gene identification
Genes previously implicated in melarsoprol resistance were identified in Alsford et al. (2012). Each of the eight genes listed as a 'primary' hit in Supplementary data file 1 for melarsoprol resistance was used as a query to search for homologs in the Tbg genome, using the BLAST tool at the tritrypdb.org website . We also investigated four genes, TbAT1, TbAQP2, TbAQP3, and TbMRPA, which have been shown previously to affect melarsoprol sensitivity (Maser et al. 1999;Shahi et al. 2002;Baker et al. 2012).
To identify novel candidate melarsoprol resistance genes, the adegenet package in R was used for discriminant analysis of principal components (DAPC) on the set of SNPs used in the clustering analysis above. We performed the DAPC analyses on two data sets. One analysis grouped the strains using their isolation status (BT or AT, Table 1) and included the four pairs of strains isolated from the same patient as a prior grouping. The other analysis was conducted only on the 9 BT strains placing them in cured (four strains) and noncured (five strains) groups. For each data set (patient pairs and cured versus noncured), DAPC was performed on 100 randomized versions of the data set to estimate the frequency of SNPs expected at various loading values by chance. In these simulations, SNP genotypes at each loci were randomly sampled without replacement. This analysis generated a distribution of loading values of SNPs that differentiated two groups of strains with arbitrary genotypes and allowed us to estimate the frequency of SNPs expected at various loading values by random chance. A false discovery rate was estimated by comparing the number of SNPs at a given loading value identified by the real and randomized DAPC analyses. SNPs with a loading value equal to or above the loading value that minimized the FDR were used to identify genes that could be involved in melarsoprol resistance. Increasing the number of randomized data sets to 120 did not change the estimate of the minimized FDR.
The program SNPeff (Cingolani et al. 2012) was used to identify any Tbg genes with candidate SNPs occurring in their coding sequence, and the corresponding protein changes caused by it (nsSNPs), if any. We searched for homologs of these genes in Tbb through the BLAST function at TriTrypDB.org . We used data from Jensen et al. (Jensen et al. 2009), accessed through the TriTrypDB.org ) website, to check whether a Tbb homolog of the candidate gene is expressed in the bloodstream form of Tbb. The same homologous gene names were searched for in the Tbg bloodstream-form expression data from Veitch et al. (2010), as these data were mapped to the Tbb genome (Veitch et al. 2010). The web interfaces of PolyPhen-2 (Adzhubei et al. 2010) and SNAP (Bromberg and Rost 2007) were used to predict effects of changes on protein function. We report Poly-Phen2 results based on the algorithm trained on the Hum-Div data set, which is considered to be more accurate than that based on the algorithm trained on the HumVar data set (Adzhubei et al. 2010). We used MESSA, the meta server for sequence analysis to identify functional domains in genes of unknown function (Cong and Grishin 2012).

Sequencing of strains isolated from HAT patients in DRC
We recovered an average of 18.2 million reads from each Tbg isolate, with an average of 80.5% of reads mapping to the Tbg reference strain, DAL972 (Jackson et al. 2010), and an average coverage of 40.5 AE 7.6 (SD). To enable genomewide comparisons of these strains, we identified SNPs with a minimum read depth of 10, using both the samtools mpileup command and the GATK Haplotype Caller tool (Li et al. 2009;McKenna et al. 2010). The samtools mpileup command called 22 201 SNPs, and the Haplotype Caller identified 47 054 SNPs. For both sets, we filtered out SNPs with a minor allele frequency less than 0.05, located in a repetitive region of the genome, or in the coding sequences of VSG genes. This left a final set of 5938 SNPs called by Samtools and 10 499 SNPs called by Haplotype Caller. These SNP sets overlap at 4236 positions (Fig. 1). As the samtools set appears to be more conservative, and 71% of the SNPs in this set were also called by the GATK Haplotype Caller, we report the analysis from the samtools set, unless otherwise noted.
Strains isolated from the same patient are most similar Figure 2A shows a neighbor-joining network based on clustering of the 5938 SNPs found in the 19 Tbg patient derived strains (Table 1) and, for comparison, 1829 Aljo, a Tbg strain isolated in DRC in the 1970s. The pairs of strains isolated from the same patients before (BT) and after treatment (AT) are sister taxa to each other, implying that they are more closely related to each other than to any of the other strains. These pairs do not cluster together, but instead are scattered throughout the tree. High bootstrap support at these nodes strengthens this result. The only relatively minor and expected exception is the relationship of the three 346 strains. From this patient, a sample was taken after treatment (346 AT) and subsequently tested for melarsoprol resistance in a mouse. Another isolate was taken from the mouse (346 AT-RE) when found to be resistant. In the cluster analysis, 346 AT and 346 AT-RE are sister taxa, with the 346 BT isolate being the first one to link to the clade with these two strains.
In addition to the neighbor-joining analysis, we performed a DAPC on the same set of 5938 SNPs. Typically, the k value with the lowest Bayesian information criterion (BIC) is thought to most accurately reflect the number of groups present in the sample. In our case, k values from 1 through 6 have similar BIC values ( Figure S1). Group assignments from k-means clustering of SNP genotype principal components (Fig. 2B) are similar to the patterns observed in the neighbor-joining network ( Fig. 2A). Similar to the clustering analysis shown in Fig. 1, each paired strain from the same patient belongs to the same cluster in the first two discriminant functions of the DAPC at k = 4 (Fig. 3). As expected, the 1829 Aljo strain is distinct from the others. This same general pattern holds for k values from 2 to 8 (Fig. 2B). Both neighbor-joining and DAPC methods indicate that strains derived from the same patient are genetically more similar to each other than to other strains. Qualitatively similar results were obtained using the set of SNPs obtained by GATK ( Figure S2). The one exception is that strain 346 BT is sister to 348 BT, although the node containing these strains is sister to the node containing 346 AT.

Strains isolated from the same patient do not differ in heterozygosity
As multiple trypanosome strains can co-occur, with a multiple infection frequency as high as 47% in some areas (MacLeod et al. 1999;Balmer and Caccone 2008), exposure to melarsoprol could select for resistant strains among those infecting the patient. This could result in the apparent reduction of heterozygosity when comparing samples before and after treatment from the same patient, because a sample comprised of multiple genetic strains presumably has a higher frequency of apparent heterozygosity relative to a sample comprised of a single strain. Table S1 reports the results of chi-square tests for a difference in heterozygosity between the BT and AT sample for each patient pair. There was no statistically significant difference in the proportion of heterozygous SNPs between the BT and AT pairs. In addition, we computed F ws , a measure of a sample's heterozygosity relative to the population as a whole, while controlling for minor allele frequency (Manske et al. 2012). The F ws is largely similar between strains isolated before and after treatment ( Figure S3). This suggests AT strains are just as diverse as their BT counterparts. If the level of heterozygosity is due to the presence of multiple strains, the same mix of strains must have been maintained throughout treatment.
Variation in genes previously shown to affect melarsoprol resistance Table 2 summarizes variation found in 14 genes known to be associated with melarsoprol resistance from previous studies (Maser et al. 1999;Shahi et al. 2002;Alsford et al. 2012;Baker et al. 2012). As reported in Pyana Pati et al. (2014), these isolates contain a chimeric gene, TbAQP2/3, in place of the TbAQP2 and TbAQP3 loci, and are wild type at the TbAT1 locus. Our results support this finding. Also, strain 1829 Aljo was found to be wild-type at TbAQP2, TbAQP3, and the TbAT1 loci. Three genes, TbMRPA, Tbg972.10.1740, and upstream binding protein 1 (UBP1), have nonsynonymous SNPs (nsSNPs). All of the patient-paired strains are heterozygous for the nsSNPs in UBP1 and Tbg972.4.240. Only one gene (Tbg972.10.1740) shows a heterozygous difference between a patient pair (148). The function of this gene is unknown, but the mutation is rated as 'possibly damaging' by the mutation effect prediction program PolyPhen2 (Adzhubei et al. 2010). Apart from this gene, there is very little variation among the strains in the previously known candidate melarsoprol sensitivity genes.

Identification of novel candidate genes that could influence melarsoprol resistance
We used two methods of identifying candidate genes that affect melarsoprol sensitivity. First, we used DAPC to identify SNPs that distinguish BT and AT strains in the WGS data. We restricted this analysis to the four BT/AT pairs, as we were interested in looking for parallel variation that occurred in multiple pairs. We performed the DAPC dividing these strains into two groups (four AT and four BT strains) using the set of 5,938 SNPs called by samtools, described above. The discriminant function completely distinguishes the two groups (Fig. 4A), and Fig. 4B shows the loading values of each SNP contributing to this discriminant function. To evaluate these results, we calculated a false discovery rate (FDR) by performing DAPC analyses on data sets with randomly assigned genotypes to estimate the distribution of loading values contributing to the discriminant function by chance. This enabled us to calculate the FDR by comparing the number of SNPs at a given loading value in real versus simulated data. We identified 69 SNPs with a loading value over 0.00229 (Fig. 4B), minimizing the false discovery rate at 13% (Fig. S4). We chose this cutoff value to ensure we report the largest proportion of SNPs likely to be true positives. While this loading value is low, this may simply reflect that many genes contribute melarsoprol resistance, and/or that other nongenetic factors also play a significant role.
In addition to these 69 SNPs, we also identified 53 SNPs with a homozygous, fixed difference between at least one of the patient pairs (e.g. A/A in a BT strain changed to G/G in the AT strain), as these differences are likely to have more severe phenotypic consequences than heterozygous ones. Six of these 53 SNPs were also identified in the DAPC screen. In total, we identified 122 unique SNPs with genotypes differentiating between BT and AT strains (Table S2).
Out of these 122 differentiating SNPs, we found 35 nsSNPs in 23 genes (Table 3). Although none of these genes was previously known to be involved in melarsoprol sensitivity ( Table 2), one of them, Tbg972.10.18680, is highly similar to, and approximately 4 kb upstream of, a previously known candidate gene, Tbg972.10.18650. Sixteen   Table 3). Expression could not be determined for three of them, because they do not have an obvious homolog in Trypanosoma brucei brucei (Tbb), which was used as the reference genome in Veitch et al. (2010) (listed as 'unknown' in Table 3). To gain insights in to the potential functional relevance of SNPs in these genes, we used two software programs that provide predictions of the effects of the mutation on the encoded protein, Poly-phen2 (Adzhubei et al. 2010) and SNAP (Bromberg and Rost 2007). The results are shown in Table 3, columns 8 and 9. Seven genes, expression site-associated gene 2 (ESAG2), Tbg972.3.4800, Tbg972.9.9130, Tbg972.8.7620, Tbg972.10.1360, Tbg972.10.19330, and Tbg972.11.4390, contain a SNP with a 'nonbenign' or non-neutral effect on the encoded protein.
To identify genetic variants present prior to treatment that could have affected melarsoprol sensitivity and thus allow some BT strains to resist treatment, we carried out a DAPC analysis on all nine strains classified as BT in Table 1. We divided the strains into two groups: one consisting of 5 strains isolated from patients that were ultimately cured, and the other consisting of the 4 BT strains from the patient pairs, which were not cured. Although the discriminant function did not distinguish the cured and relapsed BT strains well ( Figure S5), we found 76 SNPs

/G T/G T/G T/G T/G T/G T/G T/G
Genetic variation found in the 14 genes previously known to affect melarsoprol sensitivity. Any nsSNPs observed are listed, along with the protein change caused by the nsSNP and its predicted effect on protein function. PolyPhen2: result based on the HumDiv data set, SNAP: results from the SNAP web interface. Genotypes of patient-pair strains are listed in the final eight columns. Italicized genotypes indicate a heterozygous difference between strains isolated from the same patient. -Indicates same entry as line above.
with a loading value greater than 0.0015 corresponding to a minimized FDR of 0.337 ( Figure S6). These SNPs are summarized in Table S4.

Discussion
Reconstructing the historical path of adaptation and the genes involved in it is a difficult exercise. Selection can affect patterns of variation in the genome in different ways, and it is difficult to distinguish from background variation, random drift, and gene flow. Further, evidence continues to mount for the importance of standing variation to adaptation in natural populations, but detecting selection on standing variation is even harder than on de novo mutations (Messer and Petrov 2013). In such cases, one way to facilitate the distinction between neutral vs. adaptive changes is to take advantage of natural replicates and to look at the parallel distribution of changes among them. Neutral changes brought about by gene drift or gene flow should be affecting all gene regions similarly, while adaptive ones should result in parallel patterns in different replicates. This evolutionary approach has been very effective in identifying selective patterns and candidate for association studies in a variety of organisms (Schoville et al. 2012). We took advantage of this comparative approach combined with whole genome sequencing to study the evolutionary history underlying the occurrence of melarsoprol treatment failure among Tbg strains, comparing variation in replicate pairs to identify candidate gene regions that could be associated with this trait. By looking at the clustering patterns of the BT/AT patient pairs in relation to each other and to other strains in the region, we could reconstruct the evolutionary history of the emergence of melarsoprol resistance and evaluate if trypanosome isolates from the same patient obtained before and after melarsoprol treatment represent the same strain, and not multiple independently acquired infections. Multiple clustering methods clearly support this hypothesis, as isolates from the same patient are more similar to each other than to other isolates (Fig. 2). This suggests that melarsoprol resistance has been acquired independently by the different AT strains. If this is correct, this finding has two important epidemiological implications: (i) relapsing patients were infected with a strain initially sensitive to melarsoprol that subsequently accumulated the functional mutations necessary for resistance in the patient and (ii) the evolution of resistance for melarsoprol could occur through multiple paths involving different genomic regions as the likelihood of two identical mutations occurring independently in four different replicates is quite small.
Admittedly, the plausibility of this scenario is difficult to assess without an accurate estimate of mutation rates in trypanosomes, which is not currently available. However, we argue this is the most likely possibility given the alternatives. If patients were reinfected after initial treatment, it seems unlikely they would coincidentally be reinfected by a strain most similar to the originally infecting strain. Another alternative is that the initial infection could have included multiple strains, providing the standing variation necessary to select for melarsoprol resistance. We also argue that this is unlikely because, although multiple infections are relatively common in other vertebrate hosts and in the tsetse fly vector, they are much less common in humans (Balmer and Caccone 2008;Aksoy et al. 2013). This is possibly due to a severe bottleneck during the parasite transfer through the tsetse bite or interstrain competition in the human blood. Regardless of its likelihood, if melarsoprol resistance was due to standing variation and infection by multiple strains, we would expect a decrease in heterozygosity, reflecting a loss of variation as one strain out-competes the rest. This is not the pattern we observe, as strains isolated BT have similar levels of heterozygosity as the strains isolated after treatment (Table S1), making this scenario unlikely.
By looking at the clustering pattern of the BT strains, our results also shed light on the diversity of strains that can survive melarsoprol treatment as we have five strains from the same geographic area from patients that were ultimately cured after treatment and others that relapsed. The clustering analysis shows that the BT strains isolated from patients that were ultimately cured (Table 1) do not tend to cluster together (Fig. 2). This suggests that the emergence of resistance can occur from different genetic backgrounds, as there is not a single lineage that is resistant or sensitive. Instead, strains from multiple genetic backgrounds in this region have the potential to survive treatment, as opposed to a single resistant strain explaining Hypothetical protein drug treatment failures. This has important epidemiological implications and can also explain why we find multiple genetic variants in different genes associated with melarsoprol resistance in the replicate BT/AT pairs (see below). Whole genome sequencing and the replicate pair design also enabled us to investigate patterns of variation in loci known to be involved in melarsoprol resistance and to search for additional candidate loci. A number of genes have already been identified as candidate melarsoprol resistance genes, which impact several pathways. Laboratory strains of Tbg resistant to melarsoprol have been found to be defective in the P2 adenosine transport system (Carter and Fairlamb 1993). The gene encoding the adenosine transporter was identified as TbAT1 and mutations in this gene resulted in reduced uptake of melarsoprol (Maser et al. 1999). An RNAi screen identified additional candidate genes conferring resistance to antitrypanocidal drugs when knocked down, and established a role for aquaglyceroporins in both melarsoprol resistance and melarsoprolpentamidine cross-resistance (pentamidine is a drug commonly used to treat early stages of HAT).

--A/A A/A A/A A/A A/A G/G A/A A/A
Tbg972.11.10190 Hypothetical protein Tbg972.11.11950 Hypothetical protein C ? T P2365S Unknown Neutral Yes C/T C/T C/T C/T C/T T/T C/C T/T Tbg972.11.12410 Hypothetical protein

/A T/A T/A T/T T/A T/T T/A T/T
Genetic variation found in 23 genes containing nsSNPs identified by our screen. Gene Number: gene number in DAL972 genome. Protein name/description: name of protein or brief description, if known. Observed nsSNPs are listed, along with the protein change caused by the nsSNP and its predicted effect on protein function. PolyPhen2: result based on the HumDiv data set, SNAP: results from the SNAP web interface. Expressed in Tbg bloodstream form: data from Veitch et al. (2010). Genotypes of patient-pair strains are listed in the final eight columns. Italicized genotypes indicate a heterozygous difference between strains isolated from the same patient. Bold italicized genotypes indicate a fixed, homozygous difference between strains isolated from the same patient. -Indicates same entry as line above.
Naturally, occurring mutations in TbAT1 have been found in strains infecting patients in Uganda that relapsed after melarsoprol treatment (Matovu et al. 2001b). However, among the melarsoprol resistant strains, some had wild-type TbAT1 alleles, suggesting that other factors are also involved (Matovu et al. 2001b). Furthermore, strains isolated from a region in Sudan with high melarsoprol treatment failure rate did not appear to have resistant TbAT1 alleles (Maina et al. 2007). Similar to TbAT1, mutations in aquaglyceroporins occur in natural populations, but are not sufficient to predict relapse after melarsoprol treatment (Graf et al. 2013;Pyana Pati et al. 2014). Many potential targets of mutation appear to exist that can alter a strain's sensitivity, increasing the likelihood an infection will persist through treatment. However, no single genotype appears to confer resistance to all strains. Consistent with this observation, each of the 19 Tbg isolates in our study, regardless of patient treatment outcome, contain a chimeric TbAQP2/3 gene, which is associated with increased pentamidine and melarsoprol cross-resistance Pyana Pati et al. 2014).
Our results support the hypothesis that melarsoprol resistance is polygenic in nature (Alsford et al. 2012). In our own set of candidate genes, only the genotypes at Tbg972.10.1150 completely distinguish BT and AT strains (Table 3, line 21). The BT strains are heterozygous at this position, while the AT strains are homozygous for the reference allele. It therefore seems unlikely that variation in this gene completely explains melarsoprol sensitivity in the strains examined. Further, the genotypes at no SNPs completely distinguish strains isolated from patients that were ultimately cured from those that relapsed. We were thus unable to find a consistent genotypic pattern that predicted melarsoprol sensitivity, even within the region of the DRC where our trypanosome samples were isolated.
The 23 genes identified by our comparative genomic approach (Table 3) represent novel genes that could play a role in melarsoprol resistance. Currently, very little phenotypic data are available for these genes. Even though expression data for Tbg is lacking, most of the candidate genes have some evidence of expression in the bloodstream form of a group 2 Tbg strain (Table 3, Veitch et al. 2010). While many candidate genes are lacking annotations, the fact that the genes are expressed, even in a strain outside the main Tbg lineage, suggests the genes have some functionality (and are not, e.g., pseudogenes). The Veitch et al. (2010) experiment mapped reads to only 81% of the genes in the Tbb genome, possibly explaining why some genes in Table 3 do not appear to be expressed (Veitch et al. 2010). Three of the candidates in Table 3 make especially compelling candidates for further study.
ESAG2 is of interest because it contains three nsSNPs, resulting in a potentially disruptive change to the expressed protein. The Tbg972.3.4800 locus contains a cluster of 7 SNPs identified by our screen as important in distinguishing BT from AT strains, and may encode a trypanosomespecific retrotransposon hot spot protein (Bringaud et al. 2002;Fig. 5). Tbg972.10.18680, and the nearby locus Tbg972.10.18650, also warrant further examination. Tbg972.10.18650 is homologous to the Tbb gene Tb927.10.15310, which was identified in an RNAi screen for candidate genes that affect melarsoprol sensitivity (Alsford et al. 2012). Tbg972.10.18680 is approximately 4 kb upstream of Tbg972.10.18650 and consists of an open reading frame 366 base pairs long, which is similar to part of the open reading frame of Tbg972.10.18650. Although the exact function of these genes is unknown, the identification of two highly similar loci as candidates affecting melarsoprol sensitivity suggests a more detailed investigation of these genes is warranted (See Appendix S1 for further discussion).
Because relapsing strains tend to cluster with the corresponding initially infecting strains, and few mutations are shared between the relapsed strains, it is likely that a combination of factors which may be parasite and/or patient specific determine the failure or success of melarsoprol treatment. Although our screen of candidate genes was fairly extensive for the strains at our disposal, it should be noted that it was not exhaustive, as we sampled a subset of the naturally circulating strains from one geographic region and only used strains that could be cultured in rodents (Pyana et al. 2011; Table 1). While whole genome sequencing of additional strains from relapsed patients may reveal novel loci influencing melarsoprol sensitivity in those strains, this study identified several novel candidate genes which should be targeted for further functional investigations to better understand the mechanistic basis of melarsoprol resistance emergence.
Although it is important to stress that many factors (e.g., patient's genotype and behavior, disease stage, variation in gene expression, interaction between host and parasite genomes) in addition to genetic variation in one or more parasite genes involved in determining resistance to melarsoprol, our results highlight the utility of using next-gene Gray diamonds represent SNPs identified in our screen, and the blue diamond represents a SNP that eliminates a stop codon found in the DAL972 reference strain, allowing read through of the RHS-like orf.

Supporting Information
Additional Supporting Information may be found in the online version of this article: Appendix S1. Candidate genes. Figure S1. Cluster identification. Figure S2. Cluster analysis of Tbg isolates. Figure S3. SNP Heterozygosity. Figure S4. False discovery rate of SNPs distinguishing patient Pairs. Figure S5. Significant SNPs Identified by DAPC. Table S1. Proportion of heterozygous SNPs in patient pairs. Table S2. List of SNPs that differentiate BT and AT strains. Table S3. List of SNPs that differentiate cured and relapsing strains.