We hypothesized that ultra-deep sequencing of total DNA from a diagnostic sample may generate extremely high resolution genetic fingerprints from multiple bacterial taxa. We tested this by sequencing a Chlamydia trachomatis-positive diagnostic specimen and attempting to identify robust single nucleotide polymorphisms (SNPs) by alignment with reference genomes.
A total nucleic acid preparation, purified from a vaginal swab using the MagNA Pure system (Roche Applied Science, Roche Diagnostics Australia Pty Limited, Castle Hill, NSW, Australia), was provided by the Pathology Department at Royal Darwin Hospital (RDH Pathology). RDH Pathology had previously shown the sample to be C. trachomatis positive using the Roche COBAS® TaqMan® system, with a cycle threshold (CT) of 25. All stages of sample collection and analysis were performed according to the manufacturer's instructions. The material supplied to us was a remnant of the preparation used for the diagnostic testing procedure.
The nucleic acid was subjected to whole genome amplification using the REPLI-g® UltraFast kit (Qiagen, Hilden, Germany) according to the manufacturer's instructions. The amplified material was subjected to paired-end sequencing on an Illumina HiSeq 2000 device (Illumina Inc., San Diego, CA, USA), with half a lane devoted to this sample. The sequencing was performed by Macrogen Inc. (Seoul, South Korea). The sequencing service included QC of template DNA, library preparation and filtering of low quality reads post-sequencing.
Sequence data were aligned with reference genomes using Bowtie 2. The default setting of allowing up to three mismatches per 101 bp read was used for classing a read as alignable to a reference genome. First, the sequence data were tested against the human genome (hg19; http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/) using the no-mixed mode, which only accepts appropriately spaced pairs of reads as valid. By this means, 53 954 660 reads (5 449 420 660 bp) were identified as human. These reads were removed from the dataset using BAMtools. To obtain an overview of microbial composition, the remaining reads were aligned against bacterial and viral genome databases (see Supplementary Data for details). In the no-mixed mode, 16 088 118 reads (1 624 899 918 bp) were identified as bacterial and 158 766 reads (16 035 366 bp) were identified as viral.
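The base-pair totals quoted above follow directly from the read counts and the 101 bp read length. The short Python sketch below reproduces that arithmetic; the function name and the dictionary layout are illustrative choices, not part of the original analysis.

```python
# Minimal sketch: convert read counts to base-pair totals, assuming every
# read is exactly 101 bp long (the read length stated in the text).
READ_LENGTH_BP = 101


def total_bases(read_count: int, read_length: int = READ_LENGTH_BP) -> int:
    """Return the number of sequenced bases for a given number of reads."""
    return read_count * read_length


# Read counts and base totals as reported in the text.
reported = {
    "human": (53_954_660, 5_449_420_660),
    "bacterial": (16_088_118, 1_624_899_918),
    "viral": (158_766, 16_035_366),
}

for label, (reads, bases) in reported.items():
    # Print the computed total and whether it matches the reported figure.
    print(label, total_bases(reads), total_bases(reads) == bases)
```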
The sequences were aligned against a subset of the available complete non-LGV-associated C. trachomatis genomes [3-6]. SNPs were identified using the mpileup command in SAMtools followed by variant calling using BCFtools within SAMtools. Default parameters were used apart from coverage, which was set to a minimum of five and a maximum of 100. SNPs were regarded as robust if they were identified by BCFtools with an associated PHRED quality score of ≥50, which equates to a probability of ≤10⁻⁵ that the SNP is spurious. We also filtered the SNPs to include only those for which there was no significant evidence that the sequence reads define more than one allele. Alignable C. trachomatis sequence comprised approximately 0.04% of the total alignable sequence and 0.17% of the sequence alignable with bacterial genomes. Approximately 85% of the C. trachomatis chromosome was sequenced at least once, and the average coverage was 2.6×. Only 8% (c. 80 kbp) of the chromosome was covered at 5× depth and so met the coverage criterion for SNP calling. However, the plasmid was present at 6× higher copy number than the chromosome, allowing SNP calling over 100% of the plasmid. Of the C. trachomatis genomes, isolate F-SW4 most closely resembled the sequences in the diagnostic sample: the plasmid sequences were identical, and 23 robust SNPs in the aligned 8% of the chromosome separated our sample from F-SW4 (Table 1). Four additional SNPs had quality scores <50, so 23 SNPs is a lower bound. The numbers of SNPs identified are completely consistent with whole genome-based C. trachomatis phylogeny, with the two T1 genomes (F-SW4 and E-SW2) being closely related to the specimen strain, and the T2 clade genomes D-SOTON-5, B-Jali20 and A-HAR13 equally and more distantly related.
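The quality and coverage thresholds described above can be illustrated with a small Python sketch that works on tab-separated VCF records. It is not the authors' pipeline: the function names are hypothetical, the example records are invented, depth is assumed to be read from the `DP` key of the INFO column, and the test for mixed alleles is not reproduced (the single-base REF/ALT check only restricts the output to simple SNPs).

```python
# Minimal sketch of a "robust SNP" filter on VCF text records.
# Assumptions: depth is taken from the INFO key DP; indels carry an INDEL
# flag in INFO; the mixed-allele test described in the text is NOT included.


def phred_to_error_probability(qual: float) -> float:
    """Convert a PHRED-scaled quality to an error probability."""
    return 10 ** (-qual / 10)


def parse_info(info: str) -> dict:
    """Split a VCF INFO string into a key/value dictionary (flags map to True)."""
    fields = {}
    for item in info.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            fields[key] = value
        elif item:
            fields[item] = True
    return fields


def is_robust_snp(vcf_line: str, min_qual: float = 50.0,
                  min_depth: int = 5, max_depth: int = 100) -> bool:
    """Return True if a VCF data line passes the quality and depth thresholds."""
    cols = vcf_line.rstrip("\n").split("\t")
    ref, alt, qual, info = cols[3], cols[4], cols[5], cols[7]
    fields = parse_info(info)
    if "INDEL" in fields or qual == ".":
        return False
    # Keep only single-base substitutions with exactly one alternative allele.
    if len(ref) != 1 or len(alt) != 1 or alt == ".":
        return False
    depth = int(fields.get("DP", 0))
    return float(qual) >= min_qual and min_depth <= depth <= max_depth


# Invented example records (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO).
example_pass = "chr\t100\t.\tA\tG\t62.0\t.\tDP=7"
example_low_qual = "chr\t200\t.\tC\tT\t31.0\t.\tDP=9"
example_low_depth = "chr\t300\t.\tG\tA\t80.0\t.\tDP=3"

print([is_robust_snp(r) for r in (example_pass, example_low_qual, example_low_depth)])
```

With the default thresholds, a quality score of 50 corresponds to an error probability of 10⁻⁵, which is the cut-off quoted in the text.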
| Serovar-strain | Length (bp) | Aligned reads | Mean coverageᵃ / % covered at least once | SNPs, ≥5× coverage | SNPs: ≥5× coverage, Qual score ≥50, only two allelesᵇ |
|---|---|---|---|---|---|
| A-HAR13 | 1 044 459 | 26 488 | 2.6× / 84.8 | 777 | 622 |
| B-Jali20 | 1 044 352 | 26 480 | 2.6× / 84.8 | 783 | 625 |
| D-Soton5 | 1 042 140 | 25 452 | 2.6× / 85.0 | 679 | 560 |
| E-SW2 | 1 042 839 | 26 850 | 2.6× / 85.6 | 175 | 149 |
| F-SW4 | 1 042 736 | 26 932 | 2.6× / 85.8 | 27 | 23 |
| pA-HAR13 | 7 510 | 1 316 | 17.7× / 99.9 | 38 | 38 |
| pB-JALI20 | 7 506 | 1 314 | 17.7× / 99.9 | 38 | 38 |
| pD-Soton5 | 7 492 | 1 310 | 17.7× / 99.9 | 42 | 42 |
| pE-SW2 | 7 169 | 1 316 | 17.7× / 99.9 | 2 | 2 |
| pF-SW4 | 7 493 | 1 316 | 17.7× / 99.9 | 0 | 0 |
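The mean coverage values in the table are consistent with a simple estimate: aligned reads multiplied by read length, divided by reference length. The Python sketch below applies that estimate to the F-SW4 chromosome and plasmid rows. It assumes every aligned read contributes a full 101 bp, which may not be exactly how the reported values were derived, and the function name is illustrative.

```python
# Minimal sketch: estimate mean coverage from aligned read counts, assuming
# each aligned read contributes its full 101 bp to the reference.
READ_LENGTH_BP = 101


def mean_coverage(aligned_reads: int, reference_length: int,
                  read_length: int = READ_LENGTH_BP) -> float:
    """Return aligned bases divided by reference length."""
    return aligned_reads * read_length / reference_length


# Aligned reads and reference lengths taken from the F-SW4 rows of the table.
chromosome = mean_coverage(26_932, 1_042_736)
plasmid = mean_coverage(1_316, 7_493)

print(round(chromosome, 1), round(plasmid, 1))
# Ratio of plasmid to chromosome coverage, a rough proxy for relative copy number.
print(round(plasmid / chromosome, 1))
```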
We used the same SNP-based approach, but with a raised lower limit of 10× coverage, to identify and characterize the most abundant bacterial taxa in the sample. Alignment against the bacterial genome database showed that sequences from four bacterial species were present in sufficient quantity for high genome coverage (Table 2). Interestingly, Prevotella sp., Gardnerella vaginalis and Mycoplasma hominis are vaginosis-associated [8, 9]. The Clostridiales genospecies is uncultured and has previously been found only in a vaginal microbiome study (GenBank accession NC_013895), so the significance of its presence is unknown. All reference genomes used were published by direct GenBank submission (Table 2) except for Mycoplasma hominis PG21 (ATCC 23114). High coverage did not translate into alignment of the complete reference genomes with our sequences. For example, with Prevotella melaninogenica there was c. 400× coverage across only c. 70% of the reference genome. This is almost certainly due to many reads having >3 mismatches with the orthologous position in the reference sequence. The mean divergence of >2% between the alignable reads and the reference suggests that this is inevitable. The SNP information is provided as supplementary data in PDF format. The corresponding Microsoft Excel and VCF files are available from the authors upon request.
| Reference sequence | Length (bp) | Aligned reads | Mean coverageᵃ / % covered at least once | SNPs: all | SNPs: ≥10× coverage, Qual score ≥50, only two allelesᵇ |
|---|---|---|---|---|---|
| | 1 796 408 | 7 714 804 | 433.6× / 75.4 | 31 434 | 29 623 |
| | 1 371 874 | 5 535 288 | 407.3× / 67.9 | 19 656 | 18 338 |
| G. vaginalis 49-05 | 1 617 545 | 1 846 112 | 115.2× / 88.7 | 32 938 | 29 700 |
| Clostridiales genomosp. BVAB3 str. UPII9-5 | 1 809 746 | 617 134 | 34.4× / 95.7 | 9 244 | 8 177 |
| M. hominis PG-21 | 665 445 | 98 984 | 14.0× / 93.8 | 5 510 | 5 008 |
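The argument that a mean divergence above 2% makes incomplete alignment inevitable can be made concrete with a rough calculation. The Python sketch below estimates the fraction of 101 bp reads expected to carry more than three mismatches. It assumes mismatches occur independently and at a uniform rate along the genome (a binomial model), which is a simplification: real divergence is clustered, so the true loss of reads in divergent regions will be higher. The function name is illustrative.

```python
# Minimal sketch: probability that a read exceeds the mismatch limit, assuming
# independent, uniformly distributed mismatches (binomial model).
from math import comb


def prob_more_than_k_mismatches(read_length: int, divergence: float,
                                max_mismatches: int) -> float:
    """Return P(X > max_mismatches) for X ~ Binomial(read_length, divergence)."""
    p_within_limit = sum(
        comb(read_length, k)
        * divergence ** k
        * (1 - divergence) ** (read_length - k)
        for k in range(max_mismatches + 1)
    )
    return 1.0 - p_within_limit


# 101 bp reads, 2% divergence, at most three mismatches tolerated.
fraction_lost = prob_more_than_k_mismatches(101, 0.02, 3)
print(f"{fraction_lost:.3f}")
```

Under these assumptions about 14% of reads would fail the three-mismatch limit at 2% divergence, and the fraction rises steeply in regions where local divergence is higher than the mean.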
The C. trachomatis CT of 25 for the vaginal swab diagnostic procedure is midrange for C. trachomatis-positive samples analysed at RDH Pathology (Dr R. Baird, personal communication). Because this midrange sample yielded only 2.6× mean coverage of the chromosome, this approach to C. trachomatis genotyping will only be successful with a subset of conventional diagnostic samples. However, it is clear that this experiment also yielded information that could be used for very high resolution epidemiological inference from other vaginosis-associated bacterial species. There is accumulating evidence that bacteria associated with bacterial vaginosis may be transmitted, so data of this nature have considerable potential to be used in contact tracing [11, 12].
The large number of well-supported SNPs identified in this study indicates both the high quality of the sequences obtained and the plethora of epidemiologically informative sites in these taxa. Typically, metagenome data are used in ecological studies. As recently foreshadowed, this study illustrates that metagenome data could also contribute to infectious disease surveillance, epidemiological inference and forensic investigation by providing extremely high resolution SNP-based typing information from multiple microbial taxa.