SNP‐based genotyping and whole‐genome sequencing reveal previously unknown genetic diversity in Xanthomonas vasicola pv. musacearum, causal agent of banana xanthomonas wilt, in its presumed Ethiopian origin

Abstract For decades, Xanthomonas vasicola pv. musacearum (Xvm) has been an economically important bacterial pathogen on enset in Ethiopia. Since 2001, Xvm has also been responsible for significant losses to banana crops in several East and Central African countries, with devastating consequences for smallholder farmers. Understanding the genetic diversity within Xvm populations is essential for the smart design of transnationally reasoned, durable, and effective management practices. Previous studies have revealed limited genetic diversity in Xvm, with East African isolates from banana each falling into one of two closely related clades previously designated as sublineages SL 1 and SL 2, the former of which had also been detected on banana and enset in Ethiopia. Given the presumed origin of Xvm in Ethiopia, we hypothesized that both clades might be found in that country, along with additional genotypes not seen in Central and East African bananas. Genotyping of 97 isolates and whole‐genome sequencing of 15 isolates revealed not only the presence of SL 2 in Ethiopia, but additional diversity beyond SL 1 and SL 2 in four new clades. Moreover, SL 2 was detected in the Democratic Republic of Congo, where previously SL 1 was the only clade reported. These results demonstrate a greater range of genetic diversity among Xvm isolates than previously reported, especially in Ethiopia, and further support the hypothesis that the East/Central Africa xanthomonas wilt epidemic has been caused by a restricted set of genotypes drawn from a highly diverse pathogen pool in Ethiopia.

The pathogen's taxonomic position was recently revised, placing it within the species X. vasicola (Studholme et al., 2020). The pathogen has been reported since 2001 on banana in banana-producing regions of Eastern and Central Africa (ECA), first in Uganda (Tushemereirwe et al., 2004), then the Democratic Republic of Congo (D. R. Congo; Ndungo et al., 2006), Tanzania (Carter et al., 2010), Rwanda (Reeder et al., 2007), Burundi (Carter et al., 2010), and Kenya (Carter et al., 2010). Despite these country-level reports, there is no detailed geographical/temporal spread information available for most countries in the ECA region. Genetic relationships between bacterial isolates from different times and places can be used to infer chains of pathogen transmission.
However, previous studies have revealed very limited genetic diversity in Xvm Odipio et al., 2009).
Whole-genome sequencing revealed two closely related clades, or sublineages (SL), that appeared to be geographically separated (Wasukira et al., 2012): isolates from Ethiopia, D. R. Congo, and Rwanda fell into SL 1, whereas isolates from Burundi, Kenya, Tanzania, and Uganda belonged to SL 2. Because Xvm was reported in Ethiopia decades before it was detected in Uganda and thereafter other ECA countries, it was widely assumed that the East African epidemic originated from Ethiopia. However, to date only SL 1, and not SL 2, has been reported in Ethiopia. Furthermore, if Ethiopia were the centre of origin, then it should harbour an even greater genetic diversity of Xvm in addition to the two previously known clades.
To test this hypothesis, we genotyped a large collection of isolates from a range of geographical locations in Ethiopia and Central and East Africa, using eight single-nucleotide polymorphisms (SNPs) previously reported to distinguish SL 1 from SL 2 (Wasukira et al., 2012). The genetic diversity of Xvm populations in East and Central Africa has been explored previously using the multilocus variable number of tandem repeat analysis (MLVA; Nakato et al., 2019). In that study, the discriminatory powers and congruence of MLVA and SNP typing techniques were assessed and compared. Results showed that the MLVA haplotypes corresponded to the SNP haplotypes and were consistent with the SNP-based SLs (Nakato et al., 2019). It was determined that both MLVA and SNPs can be used together within a hierarchical typing procedure. However, the phylogenetic relationship of the SNP haplotypes remains to be determined and clarified by additional genomic sequence analysis. We therefore sequenced the genomes of 15 Xvm isolates representing the full range of haplotypes, and carried out a phylogenetic reconstruction from genome-wide data.

| Isolation of Xvm
Three grams of the sample was homogenized in 3 ml of sterile distilled water containing Tween 80 (0.02% vol/vol) using a sterile mortar and pestle and 1 ml of the homogenate was serially diluted with sterile distilled water. A 20 µl aliquot from the 10 −2 dilution was spread on three 9 cm Petri plates containing modified YPGA (containing per L of distilled water: yeast extract 5 g, peptone 5 g, glucose 10 g, agar 15 g, 5-fluorouracil 15 mg, cephalexin 45 mg ;Mwangi et al., 2007). Petri plates were sealed with Parafilm and incubated at 28 °C for 4 days. Colonies with Xvm-like characteristics were transferred to Petri plates containing YDCA (containing per L of distilled water: yeast extract 2.5 g, dextrose 5 g, calcium carbonate 15 g, agar 14 g; Mwangi et al., 2007

TGATTCCTACCGCAGTCGAG
Note.: The WAS primers were described previously (Wasukira et al., 2012). The EW4 and VN primers were developed in this study.

| Bacterial DNA extraction
Total DNA was extracted from Xvm-like colonies using a small-scale protocol described by Mahuku (2004

| PCR-RFLP genotyping
Among the 86 SNPs discriminating SL 1 from SL 2 (Wasukira et al., 2012), eight were used to classify Xvm isolates into sublineages. Restriction fragment length polymorphism (RFLP) assays for four SNP markers (Was1 to Was4) were published previously (Wasukira et al., 2012), while four others were developed in the present study. These SNPs fell within genes encoding a hypothetical protein, urocanate hydratase, and within intergenic regions. PCR amplifications of target DNA were conducted in 20 μl reaction volumes containing 50 ng genomic DNA, 1 U Taq DNA polymerase (Bioneer Corporation), 6 pmol of each of the primers (Table 1), 0.2 mM dNTPs, 2 mM MgCl 2 , and 1× reaction buffer (Promega). PCR amplifications were performed on a Techne thermocycler with initial denaturation for 5 min at 95 °C; 35 cycles of 30 s denaturation at 95 °C, 30 s annealing (see Table 1 for Tm of primers), 30 s extension at 72 °C; and a final extension for 10 min at 72 °C. Each set of primers specified a target ranging from 237 to 590 bp. The exception was the EW4F/EWR primer pair, which targeted a 1,000-bp sequence that contained the 500-bp target of the WAS4F/WAS4R primers. Primers EW4F/EW4R were designed on these extended loci using Geneious (Biomatters) (Kearse et al., 2012). The designed primers were tested in silico for primer efficiency and ability to amplify using Geneious v. 9.1 software (Kearse et al., 2012), and then purchased from BIONEER Inc. or Eurogentec (Angers, France). TA B L E 2 Naming of the haplotypes observed in this study and their correspondence to Wasukira's sublineages Note.: The entire collection (n = 97) was genotyped using the WAS1-WAS4 markers, and a subset of 27 isolates was also genotyped using the four new VN markers. For each PCR-RFLP assay listed in Table 1, the result was either cleavage by the restriction enzyme 'C', or no cleavage 'Nc'. Haplotypes 3a and 3b share the same WAS1-WAS4 pattern, and strains genotyped with WAS1-WAS4 only and showing this pattern were thus assigned to Haplotype 3. Haplotypes 1 and 2 correspond to the previously described SL 1 and SL 2 sublineages (Wasukira et al., 2012), while haplotypes 3 and 4 have not previously been observed and are inconsistent with both SL 1 and SL 2. Haplotypes were determined based on the SNP using a PCR-RFLP assay and clades were determined based on genome basedphylogenetic analysis. NT, not tested.
a Restriction was achieved at two restriction sites, yielding three bands.
Amplified PCR products were separated on 2% (wt/vol) agarose gels in 1× TAE buffer at 150 V for 40 min. The gels were stained with ethidium bromide (0.5 μg/ml) and gel images captured using the GBOX gel documentation system (Syngene). Subsequently, 5 µl aliquots of PCR products were digested with the restriction endonucleases (New England Biolabs) detailed in Table 1. Restricted DNA was separated by electrophoresis in 1.5% agarose gels and visualized as previously described.

| DNA sequencing and phylogenomic analysis
Genomic DNA was sequenced using the Illumina MiSeq according to the manufacturer's instructions. Sequence reads were filtered and trimmed using the TrimGalore wrapper for CutAdapt (Martin, 2011) before analysis using REALPHY (Bertels et al., 2014) with RaxML (Stamatakis, 2014) as the tree-construction method. Genome sequences were assembled using SPAdes v. 3.11.1 (Bankevich et al., 2012) and annotated with the Prokaryotic Genomes Annotation Pipeline (PGAAP) at the NCBI (Tatusova et al., 2016;Haft et al., 2018). Genome sequence data are available via BioProject accession number PRJNA454153. Summary statistics and a full list of GenBank and Sequence Read Archive accession numbers are listed in Table S1.
We identified open reading frames encoding Type III effectors by performing TBLASTN searches against each genome assembly using the effector amino acid sequences from the Xanthomonas Resource website (http://xanth omonas.org/t3e.html). We considered an effector gene to be present if the TBLASTN alignment covered at least 85% of the query length with an amino acid identity of at least 70%.

| Collection and identification of isolates
All isolates were identified as Xvm by PCR amplification of five Xvmspecific coding sequences (Nakato et al., 2018) and preserved as glycerol stocks at −80 °C. The entire collection (n = 97) was genotyped using the four Wasukira's WAS1-WAS4 markers, and a subset of 27 isolates was also genotyped using the four new RFLP markers (VN2, VN5, VN11, VN12) ( Table 2).

| Genotyping of Xvm isolates by PCR-RFLP
Most of the isolates yielded unambiguous PCR-RFLP results with all eight primer pairs. The exceptions were two isolates from D.
R. Congo that failed to amplify with the WAS4 primers (D13L and D24L; Table 3). In order to resolve the genotype at the WAS4 SNP, we designed a new primer pair (EW4F/R) that successfully amplified a product of the expected size for these isolates.
Of the 2 8 (256) haplotypes theoretically possible over four biallelic SNPs, four were observed in the present study. These haplotypes are summarized in Table 2 and haplotypes of each isolate are listed in Table 3. Haplotypes 1 and 2 were identical to those described for sublineages SL 1 and SL 2, respectively (Wasukira et al., 2012). However, Haplotypes 3 and 4 did not match the haplotypes of any of the previously sequenced Xvm genomes (Studholme et al., 2010;Wasukira et al., 2012;  which included all the isolates from Uganda, Tanzania, and one from D. R. Congo (Figure 1). Haplotype 1 was observed in all isolates from Rwanda and some of the isolates from Ethiopia and D. R.
Congo (Figure 1). Three previously unknown haplotypes, 3a, 3b, and 4, were discovered among Ethiopian isolates (Figure 1). Of the five haplotypes observed, Haplotype 1 was isolated from banana and enset, Haplotype 2 was only isolated from banana, Haplotype 3 was isolated only from enset, while Haplotype 4 was isolated from enset and maize (Table 3).

| Genotyping by whole-genome sequencing
The results of the PCR-RFLP assays revealed the existence of previously unknown Haplotypes 3 and 4. To position these unknown haplotypes within the Xvm phylogeny, we sequenced the genomes of 15 isolates, including representatives of each haplotype. We used REALPHY (Bertels et al., 2014) to perform phylogenetic analysis of these 15 genomes along with the 12 previously sequenced Xvm genomes (Wasukira et al., 2012) and a related X.
vasicola pv. vasculorum genome as an outgroup. The resulting phylogenetic tree revealed six well-defined genetic clusters that we called clades (Figure 2). A total of 1,170 SNPs, all within proteincoding genes, differentiated the six clades (Table S2; Figure S1). Of these, 249 were silent, resulting in synonymous codon substitutions, and 617 were nonsilent, resulting in a change in the amino acid composition (Table S2)  perfectly matched with Clades 1, 2, and 4, respectively ( Table 2).
The case of Haplotype 3 and its related 3a and 3b was different: Haplotype 3a correlated to Clade 2, whereas Haplotype 3b corresponded to Clades 3 and 5. Haplotype 3 corresponded to Clade 6 but is probably also composite. The geographical origins of the six clades reconstructed from the genome-wide data of the 15 recently sequenced genomes are shown in Figure 3.

TA B L E 3
Details of geographical location, year, host of isolation, and haplotype for the 27 Xanthomonas vasicola pv. musacearum isolates that were characterized using the eight single-nucleotide polymorphism (SNP) markers Only Clades 1 and 2 contained isolates from both enset and banana. Interestingly, the enset isolates in both clades (NCPPB 2005 in Clade 1, BCC210 and BCC274 in Clade 2) were significantly divergent (as judged by the bootstrap values) from the banana isolates.

| Genome structure and gene content of Xvm clades
Analysis using Roary revealed that the Xvm pan-genome comprises 4,467 gene clusters, of which 3,764 (84.26%) are core, that is, present in all analysed genome assemblies. A further 703 (15.74%) gene clusters were variable, that is, were absent from at least one genome assembly. Fifty-six of these variable genes reside on the 49-Mb plasmid pXCM49 (GenBank CP034656.1) found in most previously sequenced Xvm genomes but absent from several of the newly sequenced isolates ( Figure 2); it is important to note that this plasmid carries no known virulence-associated gene, such as adhesins and

| D ISCUSS I ON
Banana xanthomonas wilt is a relatively new epidemic, having emerged in East and Central Africa only in the 21st century.
Understanding the origins and routes of transmission requires knowledge of genetic relationships between pathogen populations. Genetic variation in Xvm, the causative agent, is very limited (Odipio et al., 2009), thereby posing an obstacle to decipher population genetics of this organism. However, whole-genome sequencing previously revealed several SNP loci that can distinguish several sublineages or clades. The reference study of Wasukira et al. (2012) described two sublineages within Xvm-one being hypothesized from Ethiopia (SL 1) and one of unknown origin (SL 2)-and proposed a list of diagnostic SNPs. We used some of these polymorphic loci to genotype a collection of isolates broadly spanning the geographical and temporal range of the epidemic. While our results are fully consistent with the two clades of Xvm on banana outside of Ethiopia being geographically separated as reported previously (Wasukira et al., 2012) The SNP-derived RFLP typing system, despite not fully convergent with the genomic clades, has proven its usefulness in this study.
This is definitely a simple, fast, and relatively cheap approach for identifying the different Xvm clades circulating within a country or a region. As previously reported (Nakato et al., 2019), the SNP typing system could be used in combination with the MLVA-19 scheme within a hierarchical typing procedure, with the SNP markers being used to define the higher evolutionary groups at the clade level,

F I G U R E 3
Geographical origin of the six clades in Ethiopia and eastern-central Africa. Phylogenetic reconstruction from whole-genome SNP data revealed six well-defined genetic clades, with Clades 1 and 2 perfectly matching previously described sublineages SL 1 and SL 2 and four new clades (Clades 3-6). The Clade 1 Ethiopian isolates (NCPPB 2005, NCPPB 2251) had no GPS coordinates, and thus were not placed on the map and the MLVA-19 scheme being used for outbreak investigations, regional surveillance, amount and directions of gene flows. The whole-genome new clades were also correlated with the DAPC clusters identified using MLVA-19, with Clade 1 grouping DAPC4, 5, and 11; Clade 2 grouping DAPC2, 3, and 8; and Clade 5 grouping DAPC 7 and 9. Clades 3, 4, and 6 corresponded to DAPC10 (Nakato et al., 2019, and data not shown).
It remains now to be determined whether these clades do differ in virulence towards different enset cultivars, and in host range towards enset, banana, and other Poaceae (specifically maize, which can be a natural host of Xvm, as observed in 2017 in Ethiopia). This study, and these genomic resources, pave the way for future studies addressing the evolutionary history of X.
vasicola pv. musacearum, projects in functional genetics addressing the molecular basis of Xvm virulence on banana and enset, as well as breeding efforts for identifying efficient and durable banana and enset resistance sources.

ACK N OWLED G EM ENTS
We are grateful for the expert technical support of Karen Moore,

DATA AVA I L A B I L I T Y S TAT E M E N T
Data that supports the findings of this study are available from the corresponding author on reasonable request.