Sequences of multiple bacterial genomes and a Chlamydia trachomatis genotype from direct sequencing of DNA derived from a vaginal swab diagnostic specimen

Authors

  • P. Andersson,

    1. Division of Global and Tropical Health, Menzies School of Health Research, Charles Darwin University, Darwin, NT, Australia
  • M. Klein,

    1. Menzies School of Health Research, Corporate Services, Charles Darwin University, Darwin, NT, Australia
    Current affiliation:
    1. Information Technology Management and Support Division, Charles Darwin University, Darwin, NT, Australia
  • R. A. Lilliebridge,

    1. Division of Global and Tropical Health, Menzies School of Health Research, Charles Darwin University, Darwin, NT, Australia
  • P. M. Giffard

    Corresponding author
    1. Division of Global and Tropical Health, Menzies School of Health Research, Charles Darwin University, Darwin, NT, Australia
    • Corresponding author: P. M. Giffard, Division of Global and Tropical Health, Menzies School of Health Research, Charles Darwin University, PO Box 41096, Casuarina NT 0811, Australia

      E-mail: phil.giffard@menzies.edu.au


Abstract

Ultra-deep Illumina sequencing was performed on whole genome amplified DNA derived from a Chlamydia trachomatis-positive vaginal swab. Alignment of reads with reference genomes allowed robust SNP identification from the C. trachomatis chromosome and plasmid. This revealed that the C. trachomatis in the specimen was very closely related to the sequenced urogenital, serovar F, clade T1 isolate F-SW4. In addition, high genome-wide coverage was obtained for Prevotella melaninogenica, Gardnerella vaginalis, Clostridiales genomosp. BVAB3 and Mycoplasma hominis. This illustrates the potential of metagenome data to provide high resolution bacterial typing data from multiple taxa in a diagnostic specimen.

We hypothesized that ultra-deep sequencing of total DNA from a diagnostic sample may generate extremely high resolution genetic fingerprints from multiple bacterial taxa. This was tested using a Chlamydia trachomatis-positive diagnostic specimen, and attempting to identify robust single nucleotide polymorphisms (SNPs) by alignment with reference genomes.

A total nucleic acid preparation, purified from a vaginal swab using the MagNApure system (Roche Applied Science, Roche Diagnostics Australia Pty Limited, Castle Hill, NSW, Australia), was provided by the Pathology Department at Royal Darwin Hospital (RDH Pathology). RDH Pathology had previously shown it to be C. trachomatis positive using the Roche COBAS® TaqMan® system, with a cycle threshold (CT) of 25; all stages of sample collection and analysis were performed according to the manufacturer's instructions. The material supplied to us was a remnant of the preparation used for the diagnostic testing procedure.

The nucleic acid was subjected to whole genome amplification using the REPLI-g® UltraFast kit (Qiagen, Hilden, Germany) according to the manufacturer's instructions. The amplified material was subjected to paired-end sequencing on an Illumina HiSeq 2000 device (Illumina Inc., San Diego, CA, USA), with half a lane devoted to this sample. The sequencing was performed by Macrogen Inc. (Seoul, South Korea). The sequencing service included quality control (QC) of template DNA, library preparation and filtering of low quality reads post-sequencing.

Sequence data were aligned with reference genomes using Bowtie 2 [1]. The default setting of allowing up to three mismatches per 101 bp read was used for classing a read as alignable to a reference genome. Firstly, the sequence data were tested against the human genome (Hg19 (http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/)) using the no-mixed mode, which only accepts appropriately spaced pairs of reads as valid. By this means, 53 954 660 reads (5 449 420 660 bp) were identified as human. These reads were removed from the dataset using BAMtools [2]. To obtain an overview of microbial composition, the remaining reads were aligned against bacterial and viral genome databases (see Supplementary Data for details). In the no-mixed mode, 16 088 118 reads (1 624 899 918 bp) were identified as bacterial and 158 766 reads (16 035 366 bp) identified as viral.
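The read and base-pair counts above are mutually consistent if every aligned read is 101 bp long. The following minimal Python sketch checks that arithmetic and expresses each category as a share of all alignable reads. The dictionary and function names are illustrative assumptions, and the shares are computed only from the three counts quoted in the text.

```python
READ_LENGTH_BP = 101  # read length stated in the text

# Read counts reported in the text for each reference set (no-mixed mode).
aligned_reads = {
    "human": 53_954_660,
    "bacterial": 16_088_118,
    "viral": 158_766,
}


def reads_to_bases(n_reads, read_length=READ_LENGTH_BP):
    """Total aligned bases, assuming every read has the same length."""
    return n_reads * read_length


aligned_bases = {name: reads_to_bases(n) for name, n in aligned_reads.items()}
total_reads = sum(aligned_reads.values())
fractions = {name: n / total_reads for name, n in aligned_reads.items()}

for name in aligned_reads:
    print(
        f"{name}: {aligned_reads[name]:,} reads, "
        f"{aligned_bases[name]:,} bp, "
        f"{fractions[name]:.2%} of alignable reads"
    )
```

The computed base counts match the figures given in parentheses in the text, which suggests that the reported bp values are simply read counts multiplied by the read length.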

The sequences were aligned against a subset of the available complete non-LGV-associated C. trachomatis genomes [3-6]. SNPs were identified using the mpileup command in SAMtools [7] followed by variant calling using BCFtools within SAMtools [7]. Default parameters were used apart from coverage, which was set to a minimum of five and a maximum of 100. SNPs were regarded as robust if they were identified by BCFtools, with an associated PHRED quality score of ≥50, which equates to a ≤10⁻⁵ probability that the SNP is spurious. We also filtered the SNPs to include only those for which there was no significant evidence that the sequence reads define more than one allele. Alignable C. trachomatis sequence comprised approximately 0.04% of the total alignable sequence and 0.17% of the sequence alignable with bacterial genomes. Approximately 85% of the C. trachomatis chromosome was sequenced at least once, and the average coverage was 2.6×. Only 8% (c. 80 kbp) of the chromosome was covered at 5× depth and so met the coverage criterion for SNP calling. However, the plasmid was present at 6× higher copy number than the chromosome, allowing SNP calling over 100% of the plasmid. Of C. trachomatis genomes, the sequences in the diagnostic sample most closely resembled isolate F-SW4, with the plasmid sequences being identical, and 23 robust SNPs in the aligned 8% of the chromosome separating our sample from F-SW4 (Table 1). Four additional SNPs had quality scores <50, so 23 SNPs is a lower boundary. The numbers of SNPs identified are completely consistent with whole genome-based C. trachomatis phylogeny [6], with the two T1 genomes (F-SW4 and E-SW2) being closely related to the specimen strain, and the T2 clade genomes D-SOTON-5, B-Jali20 and A-HAR13 equally and more distantly related.
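The robustness criteria described above can be illustrated with a short Python sketch. It converts a PHRED-scaled quality score into an error probability and applies the stated thresholds (coverage between 5 and 100, quality ≥50, a single allele in the reads). This is not the SAMtools/BCFtools implementation: the function names, the record layout and the example candidates are hypothetical, and the "no significant evidence of more than one allele" test is reduced here to a simple allele count for illustration.

```python
def phred_to_error_probability(qual):
    """Convert a PHRED-scaled quality score to an error probability."""
    return 10 ** (-qual / 10)


def is_robust_snp(depth, qual, alleles_in_reads,
                  min_depth=5, max_depth=100, min_qual=50):
    """Apply the coverage, quality and single-allele criteria from the text.

    alleles_in_reads is a simplified stand-in (an assumption) for the
    test that the reads show no significant evidence of a second allele.
    """
    return (
        min_depth <= depth <= max_depth
        and qual >= min_qual
        and alleles_in_reads == 1
    )


# Hypothetical candidate calls, invented purely to exercise the filter.
candidates = [
    {"pos": 1201, "depth": 7, "qual": 62.0, "alleles_in_reads": 1},
    {"pos": 5330, "depth": 3, "qual": 80.0, "alleles_in_reads": 1},   # too shallow
    {"pos": 9914, "depth": 12, "qual": 41.0, "alleles_in_reads": 1},  # low quality
    {"pos": 15002, "depth": 9, "qual": 75.0, "alleles_in_reads": 2},  # mixed alleles
]

robust_positions = [
    c["pos"]
    for c in candidates
    if is_robust_snp(c["depth"], c["qual"], c["alleles_in_reads"])
]

print(phred_to_error_probability(50))
print(robust_positions)
```

In this sketch only the first hypothetical candidate passes all three criteria, and a quality score of 50 corresponds to an error probability of 10⁻⁵, in line with the threshold quoted in the text.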

Table 1. C. trachomatis chromosome and plasmid reference sequences used for SNP identification, and the numbers of SNPs identified that discriminate the sequence data from the reference genome

Serovar-strain  Accession           Length (bp)  Aligned reads  Mean coverage(a) /       SNPs,         SNPs, ≥5× coverage,
                                                                % covered at least once  ≥5× coverage  Qual score ≥50, only two alleles(b)

Chromosome
A-HAR13         Genbank NC007429.1  1 044 459    26 488         2.6×/84.8                777           622
B-Jali20        EMBL-EBI FM872308   1 044 352    26 480         2.6×/84.8                783           625
D-Soton5        Genbank HE601799.1  1 042 140    25 452         2.6×/85.0                679           560
E-SW2           Genbank NC017441    1 042 839    26 850         2.6×/85.6                175           149
F-SW4           Genbank HE601804    1 042 736    26 932         2.6×/85.8                27            23

Plasmid
pA-HAR13        EMBL-EBI CP000052   7 510        1 316          17.7×/99.9               38            38
pB-JALI20       EMBL-EBI FM865438   7 506        1 314          17.7×/99.9               38            38
pD-Soton5       Genbank HE603230.1  7 492        1 310          17.7×/99.9               42            42
pE-SW2          Genbank NC012630    7 169        1 316          17.7×/99.9               2             2
pF-SW4          Genbank NC012625    7 493        1 316          17.7×/99.9               0             0

(a) Only sequences covered at least once were included in the calculation.
(b) There is a single SNP allele in the sequence data, with the other allele in the reference genome.

We used the same SNP-based approach, but with a raised lower limit of 10× coverage, to identify and characterize the most abundant bacterial taxa in the sample. Alignment against the bacterial genome database showed that sequences from four bacterial species were present in sufficient quantity for high genome coverage (Table 2). Interestingly, Prevotella sp., Gardnerella vaginalis and Mycoplasma hominis are vaginosis associated [8, 9]. The Clostridiales genospecies is uncultured and has previously been found only in a vaginal microbiome study (Genbank accession NC_013895), so the significance of its presence is unknown. All reference genomes used were published by direct Genbank submission (Table 2) except for Mycoplasma hominis PG21 (ATCC 23114) [10]. High coverage did not translate into alignment of the complete reference genomes with our sequences. For example, with Prevotella melaninogenica there was c. 400× coverage across only c. 70% of the reference genome. This is almost certainly due to many reads having >3 mismatches with the orthologous position in the reference sequence. The mean divergence of >2% between the alignable reads and the reference suggests that this is inevitable. The SNP information is provided as supplementary data in PDF format. The corresponding Microsoft Excel and VCF files are available from the authors upon request.
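The argument that sequence divergence limits alignable coverage can be made concrete with a simple model. The Python sketch below takes the three-mismatch threshold and 101 bp read length stated earlier at face value, and assumes that mismatches occur independently and uniformly along a read (a binomial model). That assumption is ours, not the paper's; real divergence is clustered, so the sketch is only a rough guide. The function name is illustrative.

```python
from math import comb


def prob_read_alignable(divergence, read_length=101, max_mismatches=3):
    """Probability that a read has at most max_mismatches mismatches.

    Assumes each base mismatches independently with probability
    `divergence` (binomial model), which is a simplification.
    """
    return sum(
        comb(read_length, k)
        * divergence ** k
        * (1 - divergence) ** (read_length - k)
        for k in range(max_mismatches + 1)
    )


for d in (0.01, 0.02, 0.03, 0.05):
    share = prob_read_alignable(d)
    print(f"divergence {d:.0%}: {share:.3f} of reads within 3 mismatches")
```

Under these assumptions, on the order of one read in seven would already fail the threshold at 2% divergence, and the loss grows quickly as divergence rises, so regions that differ more strongly from the reference would be expected to receive little or no aligned coverage.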

Table 2. Non-C. trachomatis reference DNA sequences used for alignment with the sequencing reads, and the numbers of SNPs identified that discriminate the sequence data from the reference genome

Reference sequence                                   Accession         Length (bp)  Aligned reads  Mean coverage(a) /       SNPs: all  SNPs: ≥10× coverage,
                                                                                                   % covered at least once             Qual score ≥50, only two alleles(b)

P. melaninogenica ATCC25845, chromosome 1            Genbank NC014370  1 796 408    7 714 804      433.6×/75.4              31 434     29 623
P. melaninogenica ATCC25845, chromosome 2            Genbank NC014371  1 371 874    5 535 288      407.3×/67.9              19 656     18 338
G. vaginalis 49-05, genome                           Genbank NC013721  1 617 545    1 846 112      115.2×/88.7              32 938     29 700
Clostridiales genomosp. BVAB3 str. UPII9-5, genome   Genbank NC013895  1 809 746    617 134        34.4×/95.7               9 244      8 177
M. hominis PG-21, genome                             Genbank NC013511  665 445      98 984         14.0×/93.8               5 510      5 008

(a) Only sequences covered at least once were included in the calculation.
(b) There is a single SNP allele in the sequence data, with the other allele in the reference genome.

The C. trachomatis CT of 25 for the vaginal swab diagnostic procedure is midrange for C. trachomatis-positive samples analysed at RDH Pathology (Dr R. Baird, personal communication). Thus, this approach to C. trachomatis genotyping will only be successful with a subset of conventional diagnostic samples. However, it is clear that this experiment also yielded information that could be used for very high resolution epidemiological inference from other vaginosis-associated bacterial species. There is accumulating evidence that bacteria associated with bacterial vaginosis may be transmitted, so data of this nature have considerable potential to be used in contact tracing [11, 12].

The huge number of well-supported SNPs identified in this study serves as an indication of the high quality of sequences obtained, and also the plethora of epidemiologically informative sites in these taxa. Typically, metagenome data are used in ecological studies. As recently foreshadowed [13], this study illustrates that metagenome data could also contribute to infectious disease surveillance, epidemiological inference and forensic investigation by providing extremely high resolution SNP-based typing information from multiple microbial taxa.

Acknowledgements

The authors thank Rob Baird, Kevin Freeman and Peter Fagan from the Royal Darwin Hospital Pathology Department for contributing clinical material and associated data. These results have previously been presented in preliminary form at the 2012 Annual Scientific Meeting of the Australian Society for Microbiology, Brisbane, Queensland, 3 July 2012.

Transparency Declaration

This study was supported by project grant 1004123 from the National Health and Medical Research Council (Australia).
