Shifting national surveillance of Shigella infections toward geno‐serotyping by the development of a tailored Luminex assay and NGS workflow

Abstract The phylogenetically closely related Shigella species and enteroinvasive Escherichia coli (EIEC) are responsible for millions of episodes of bacterial dysenteriae worldwide. Given its distinct epidemiology and public health relevance, only Shigellae are subject to mandatory reporting and follow‐up by public health authorities. However, many clinical laboratories struggle to differentiate non‐EIEC, EIEC, and Shigella in their current workflows, leading to inaccuracies in surveillance and rising numbers of misidentified E. coli samples at the National Reference Centre (NRC). In this paper, we describe two novel tools to enhance Shigella surveillance. First, we developed a low‐cost Luminex‐based multiplex assay combining five genetic markers for species identification with 11 markers for serotype prediction for S. sonnei and S. flexneri isolates. Using a test panel of 254 clinical samples, this assay has a sensitivity of 100% in differentiation of EIEC/Shigella pathotype from non‐EIEC strains, and 68.7% success rate in distinction of Shigella and EIEC. A novel, and particularly successful marker was a Shigella‐specific deletion in the spermidine acetyltransferase gene speG, reflecting its metabolic decay. For Shigella serotype prediction, the multiplex assay scored a sensitivity and specificity of 96.6% and 98.4%, respectively. All discrepancies were analyzed with whole‐genome sequencing and shown to be related to causative mutations (stop codons, indels, and promoter mutations) in glycosyltransferase genes. This observation spurred the development of an in silico workflow which extracts the Shigella serotype from Next‐Generation Sequencing (NGS) data, taking into account gene functionality. Both tools will be implemented in the workflow of the NRC, and will play a major role in the shift from phenotypic to genotyping‐based surveillance of shigellosis in Belgium.


| INTRODUC TI ON
Shigellae are facultative intracellular pathogens and the etiological agents of bacillary dysentery or shigellosis (Croxen et al., 2013;Gomes et al., 2016). Shigellosis affects annually 164.7 million people, and results in a high mortality among children aged 1-4 years in low-and middle-income countries (Kotloff, Riddle, Platts-Mills, Pavlinac, & Zaidi, 2017). In western countries, Shigella infections were traditionally mostly travel-related, but recent surveillance data from the United Kingdom indicate a shift to domestically circulating strains (Aragón et al., 2007;Baker et al., 2015), some of which are increasingly resistant to ciprofloxacin and azithromycin.
The Shigella genus is subdivided into four species based on their antigenic properties: S. sonnei, S. boydii, S. dysenteriae, and S. flexneri, each having different subtypes based on variations in the O-antigen of the LPS layer (Edwards & Ewing, 1986). This classification does not reflect its evolutionary history as phylogenetic analyses clearly cluster Shigella species within the Escherichia coli species (Chen et al., 2014;Edwards, Logan, Langham, Swift, & Gharbia, 2012;Escobar-Páramo, Giudicelli, Parsot, & Denamur, 2003;Pettengill, Pettengill, & Binet, 2016). In particular, enteroinvasive E. coli (EIEC) lineages have been identified as the direct evolutionary ancestor of Shigella, by having acquired a large F-type plasmid (pINV) that encodes the molecular machinery required for invasion, survival, and diffusion of the bacterium within the host (Sansonetti, Kopecko, & Formal, 1982;Yang et al., 2005).
Phylogenetic studies suggest this acquisition occurred multiple times in independent events (Hazen et al., 2016;Pettengill et al., 2016), upon which Shigella spp. evolved to a strictly human pathogen because of intense gene decay. This is reflected by decreased metabolic activity, increased disease severity, and decreased infectious dose (DuPont et al., 1971;Prosseda et al., 2012). Specific surveillance and differentiation of Shigella spp. from non-EIEC remains therefore warranted from a medical and public health perspective.
National surveillance in Belgium is performed by the National Reference Centre for Shigellosis (NRCS), which receives annually approximately 400 Shigella cultures on a voluntary basis from peripheral laboratories (Figure 1a). Of the 2,066 confirmed Shigella strains received in the period 2013-2018, 72.1% were S. sonnei, 21.9% S. flexneri, 4.3% S. boydii, and 1.7% S. dysenteriae with a serotype distribution that has been stable for more than a decade ( Figure 1b). Notably, the number of false-positive Shigella cultures has increased substantially since 2015 as clinical laboratories increasingly rely on MALDI-TOF for bacterial identification, which fails to properly differentiate Shigella from E. coli (Figure 1c, Khot & Fisher, 2013). Shigella spp. are traditionally typed using biochemical, mobility and serological assays, which are time consuming and error prone through possible cross-reactions of O-antigens between E. coli and Shigella (Liu et al., 2008;Sun et al., 2011). Molecular PCR methods have been described for identification and geno-serotyping of Shigella spp. (Dutta et al., 2001;Gentle, Ashton, Dallman, & Jenkins, 2016;Li, Cao, et al., 2009;Sun et al., 2011), but either have limited resolution or are not cost-effective to be implemented in routine surveillance. Some western countries have introduced whole-genome sequencing (WGS) for Shigella surveillance, delivering SNP-level discriminatory power (Chattaway et al., 2017;Dallman et al., 2016;McDonnell et al., 2013). However, wide implementa-| 3 of 20 VENTOLA ET AL.

| Bacterial strains, traditional typing, and genomic DNA extraction
In Belgium, peripheral clinical laboratories collect Shigella isolates from human patients and send them voluntarily to the NRCS for identification using Triple Sugar Iron Agar (TSI, Biotrading, NL) and serotyping by slide agglutination using commercially available monovalent antisera (Denka Seiken CO, UK; Appendix 2). Confirmed EIEC isolates were acquired from the "Centre for Infectious Disease Control" (RIVM, The Netherlands). Bacterial cultures were grown overnight at 37°C on Mueller-Hinton agar (Bio-Rad). For DNA extraction, either a single colony was added to 200 μl of InstaGene™ Matrix (Bio-Rad) and placed in a thermal cycler (56°C for 25 min, 99°C for 8 min, cooled to 4°C). The mixture was spun (14,000 g, 1 min) and the supernatant was used immediately or stored at −20°C. Alternatively, gDNA was extracted semi-automatically using the MgC Bacterial DNA Kit™ with 60 μl elution volume (Atrida, NL), according to the manufacturer's instructions for gram-negative bacteria.

| MOL-PCR using Luminex xTAG beads
For all targeted genes, upstream and downstream probes were designed targeting 35-45 bp conserved regions with maximal conservation and accessibility using OligoAnalyzer 3.1 (Table 1). Upstream probes are equipped with an internal anti-TAG sequence compatible with the anti-TAG of the MagPlex™ beads, while universal T7 and T3 primer sequences were added to the 5′ and 3′ ends of upstream and downstream probes, respectively. Downstream probes were 5′-phosphorylated.
Our MOL-PCR approach has been described in detail elsewhere (Wuyts et al., 2015). Briefly, all reactions were assembled in cooled 96-well plates in a 10 μl reaction volume containing 2 nM of each probe, 2 U of Taq DNA Ligase (New England Biolabs, Ipswich, MA), 1× Taq DNA ligase buffer, 2 μl of DNA template, and nuclease-free water. Ligation was performed by initial denaturation (95°C, 10 min), Hybridization of the PCR product to colored microspheres was performed in a volume of 20 μl per reaction, with MagPlex™-TAG microspheres (750 beads/target) added to 0.1 M Tris-HCl, pH 8.0/0.2 M NaCl/0.08% Triton-X. To this mixture, 5 μl of PCR product was added, followed by a denaturation step (90 s at 96°C) and 30 min of hybridization at 37°C. Subsequently, 100 μl of a reporter mix containing 4 μg/ml streptavidin-R-phycoerythrin (Life Technologies) was added, and the samples were incubated for 15 min at 37°C in the dark. Subsequent read-out was performed at 37°C using 100 μl of these samples, on a MAGPIX device with a minimal bead count of 50 microspheres/target (Wuyts et al., 2015). For each marker, the signal-to-noise (S/N) ratios were calculated by dividing the median fluorescence intensity (MFI) by the corresponding MFI of the NC. During assay design, an S/N ratio ≥2.0 indicated positive identification.

| Whole genome sequencing and in silico serotyping
Genomic DNA was prepared using MgC Bacterial DNA Kit™ with 60 μl elution volume (Atrida, NL), following the manufacturer's instructions. Sequencing libraries were constructed using the Illumina Nextera XT DNA sample preparation kit and subsequently sequenced on an Illumina MiSeq instrument with a 250-bp pairedend protocol (MiSeq v3 chemistry) according to the manufacturer's instructions.
The resulting SAM file was then converted into an indexed BAM file using SAMtools view v1.3.1, followed by SAMtools sort and SAMtools index (Li, Handsaker, et al., 2009). Afterward, a pileup was generated using SAMtools mpileup with output format set to "VCF," followed by variant calling by BCFtools call v1.6 with the following options: "-consensus-caller," "-variants-only," and "-ploidy 1" (Li, 2011). Variants that were covered by <10 reads or variants that were not covered by at least one forward and one reverse read were removed using BCFtools filter (Danecek & McCarthy, 2017). Indels were normalized and duplicates removed using BCFtools norm with the option "-rm-dup both." Finally, the functional effect of the mutations was determined using BCFtools csq v1.9.30 (commit: g983f7da) with the option "-local-csq" enabled. Genes that contained a stop codon and/or a frameshift were considered to be not expressed for the determination of the serotype.
Mutations in the gtr promotor were detected similarly by first mapping trimmed reads against a 381 bases-long region covering the gtr promotor and initial coding sequence (accession number KT988057.1).
Read mapping and variant calling were done as described before but variant filtering was slightly more strict: minimal depth 10×, minimal forward depth 1×, minimal reverse depth 1×, minimal SNP quality 25, minimal mapping quality 30, minimal Z-score of 1.96, and minimal Ymultiplier of 10 as described elsewhere (Kaas, Leekitcharoenphon, Aarestrup, & Lund, 2014). The promotor was considered to be wild type if there were no filtered mutations inside the -35 box or the -10 TA box. Otherwise the gtrX promoter was considered as not wild type and the gtrX gene as not expressed for the determination of the serotype. The profiles described in Sun et al. (2011) were then used as a decision system to classify the serotype.

| Multiplex target selection and design
To introduce molecular testing in national Shigella surveillance, we designed a specific multiplex assay for identification, differentiation, and subtyping of Shigella spp. from cultured strains. Our strategy was based on converting known molecular markers into a MOL-PCR assay with read-out on a Luminex MAGPIX ® platform, allowing multiplex detection of up to 50 genes in a single well (Table 1). For identification of the EIEC/Shigella pathotype, we targeted the invasive plasmid antigen H (ipaH) and the plasmid invC (Ojha, Yean, Ismail, & Singh, 2013;Venkatesan, Buysse, & Hartman, 1991). To distinguish EIEC from Shigella, we inferred the presence of lacY (Pavlovic et al., 2011), cadA (Prosseda et al., 2007) and a Shigella-specific deletion 19_20delGT in speG (Prosseda G, personal communication). Next, we included probes targeting wbgZ and rfc for identification of S. sonnei and S. flexneri, respectively (Ojha et al., 2013). Finally, we adapted a previously described multiplex PCR assay for serotyping of S. flexneri that targets genes for O-antigen synthesis or modification (Gentle et al., 2016;Sun et al., 2011) into to a Luminex-compatible format (Table 1). A decision tree to interpret the results of the final assay can be found in Figure 2a. A probe targeting the opt gene, responsible for addition of phosphoethanolamine to L-rhamonse II or III, leading to Flexneri variants 4av, Xv, and Yv (Sun et al., 2012), was not included as no positive control samples were present in our collection. Genetic serotyping of S. boydii and S. dysenteriae was omitted from the current assay as this would have required the inclusion of 31 additional targets, substantially increasing the reaction cost to cover only a minority of samples submitted in Belgium (<5%, Figure 1b).
Serotypes 1c (n = 7), 4b (n = 1), 5a/b (n = 1), X (n = 8), and Y (n = 4) are underrepresented in the NRCS collection in comparison to other serotypes, and all available isolates were included in this study. To this collection, we added 33 isolates which had a negative identification for Shigella spp., 16 confirmed EIEC strains, and six untypable isolates exhibiting nonspecific agglutination reactions.
TA B L E 1 Luminex probes designed using published targets TA B L E 1 (Continued) F I G U R E 2 In-house molecular (MOL)-PCR-based Luminex assay for Shigella typing. (a) Decision tree for the developed MOL-PCR assay for detection and subtyping of Shigella spp. (b) Graphical representation of raw Luminex data for tested species and serotypes during test validation. The read-out is scored as the median fluorescence intensity, which is converted to signal-to-noise ratios (S/N) for allele calling. The single available isolate of S. flexneri 5 was confirmed as Escherichia coli based on whole-genome sequencing During the test phase of the assay, we detected false-positive signals in 5.1% (13/254) of the tested isolates, due to an elevated background (Appendix 2), which disappeared upon re-extraction of their gDNA (data not shown).  Figure 2).
As expected, isolates belonging to Sonnei Phase II (5/5) could not be detected. We employed NGS to evaluate the 10 discordant Flexneri isolates in more detail (Table 2), which allowed to characterize the genes responsible for O-antigen synthesis or modification at a much higher resolution. We identified explanatory indels and frameshift mutations in oac, gtrI, and gtrIV in six strains, impeding their function (Gentle et al., 2016). Moreover, we detected promoter mutations upstream of the gtr operon in four strains, suggesting decreased expression levels resulting in the S. flexneri 3b serotype. A peculiar result was observed for strain S16BD02240, which was previously typed as the only S. flexneri 5 isolate in Belgium. While the species identification panel detected speG and not ipaH or invC, the serotype probes rfc and gtrV were positive (Appendix 2). Closer inspection of sequencing results revealed the insertion of a phage-encoded gtrV protein in an E. coli background, leading to the E. coli O13/O135:H11 serotype (Knirel et al., 2016).

| NGS-based serotyping
To enhance future workflows, we designed a WGS-based workflow for automated extraction of Shigella serotypes from NGS data that includes detection of opt, wzx, wzy, and other known glycosyltransferase genes, enabling the detection of all currently described variants of the O-antigen from S. boydii, S. sonnei, S. dysenteriae, and S. flexneri (Li, Cao, et al., 2009). To account for observed differences between phenotypes and genotypes described previously, we included the detection of TAG stop codon and frameshifts in all analyzed genes, and promotor mutations in the gtr operon ( Figure 3).
The algorithm was tested on publicly available NGS data from 135 globally collected S. flexneri strains (Connor et al., 2015), leading to identical serotype predictions in 127 of 135 (94.1%) of tested strains (Appendix 3). Interestingly, frameshifts (17%) and amber mutations (2.9%) were regularly detected among the 127 correctly predicted serotypes, thus showing frequent inactivation of glycosyltransferase genes. Next, we analyzed the eight deviating results using the CLC Bio Genome Workbench. In two samples, we observed low (<5×) coverage of the opt gene, hinting at plasmid loss in a subpopulation of the culture. Given the minimal coverage set at 10×, these genes were below our detection limit. In 3 of 8 cases, our NGS workflow failed to call gtr operon promoter mutations (n = 1), and indels in gtrX (n = 1) and oac (n = 1). In the three remaining cases, no obvious explanation of the discrepancy could be detected.  (Barbagallo et al., 2011), indicating that the loss of speG function is an emerging trait. As predicted, EIEC have an intermediate position as active N-acetylspermidine is still present in most EIEC strains (68.7% in our dataset), yet intracellular spermidine tends to be higher as compared to commensal E. coli (Campilongo et al., 2014). Interestingly, among all non-Shigella strains that were sent to the NRCS by peripheral Belgian laboratories, not a single strain with defective speG was detected (Appendix 2), strongly suggesting that the large majority are non-EIEC strains.

| D ISCUSS I ON AND CON CLUS I ON
A weakness of the current assay is the low positive predictive value for EIEC strains. A first option to cope with this is to expand the biochemical typing of ipaH positive strains, as described by van den Beld et al. (2018). Alternatively, the discriminatory power of the molecular assay can be increased by incorporating additional markers published by Australian researchers during the review process of this article (Dhakal, Wang, Lan, Howard, & Sintchenko, 2018).
Their large-scale genome comparisons identified six genetic loci separating Shigella from EIEC, which combined presence/absence led to 95.1% sensitivity. Due to the flexibility of the Luminex-based MOL-PCR methodology, the expansion of our assay from a 17-to a 23-plex is expected to go swiftly with minimal impact on cost and handling time.
In addition to species identification, the presented Luminex assay simultaneously detects S. sonnei Phase I and S. flexneri serotypes. Two published reports on molecular geno-serotyping report 92.6% and 97.8% concordant results between phenotypic serotyping and PCR (Gentle et al., 2016;Sun et al., 2011), comparable with the 95.1% observed in our MOL-PCR based assay.
As reported also in these studies, we also observed a robust correlation between the phenotypes and genotypes for S. sonnei and S. flexneri serotypes 1b, 1c, 2a, 2b, 3a, F6, and Y. Discrepancies are commonly caused by amber mutations, insertions, and deletions in O-antigen synthesis or modification genes, rendering these phenotypically inactive. In our test set, these accounted for 5.4% of deviating results among tested S. flexneri. In the global collection of Flexneri strains analyzed by Connor et al. (2015), 19.9% of strains contained such mutations, making a strong case for using WGS data in serotype prediction instead of PCR-based methods that only take a part of the gene into account. As a note,

ACK N OWLED G EM ENTS
The authors thank Zahra Boukhouchi and Frédéric Fux for excellent technical assistance. The NRC is partially supported by the Belgian

Ministry of Social Affairs through a fund within the Health Insurance
System. The research was also partially supported by project RF_16/6302 (AMR-ARRAY) sponsored by the Contractual Research (RCO) of the Federal Public Service in Belgium.

CO N FLI C T O F I NTE R E S T
All authors report no conflict of interest. designed the experiments and wrote the paper. All authors read and approved the manuscript.

E TH I C S S TATEM ENT
None required.
Accession numbers of publicly available NGS data used in this study, and raw Luminex data are listed in Appendix 3.

R E FE R E N C E S
| 11 of 20 VENTOLA ET AL.

APPENDIX 1
Schematic representation of the MOL-PCR method. When a target allele or SNP is present in the DNA extract of a bacterial strain under study, two sequence-specific probes will be ligated in 25 consecutive cycles of denaturation and annealing. The upstream probes carry a 24 nucleotide sequence (shown in red) which binds to the anti-TAG covalently attached to the MagPlex-TAG microspheres (Luminex). Using universal T7 and 5′-biotin labeled T3 primers (shown in blue), the ligated oligonucleotides are amplified, denatured, and hybridized to the corresponding beads. After short exposure to streptavidin-R-phycoerythrin, the luminescent signal is read by the MAGPIX ® machine, and S/N ratios are calculated using negative control samples. This figure is modified from Ceyssens et al. (2016).

APPENDIX 3
In silico workflow results