The recent advances in sequencing technologies have given all microbiology laboratories access to whole genome sequencing. Provided that tools for the automated analysis of sequence data and databases for associated meta-data are developed, whole genome sequencing will become a routine tool for large clinical microbiology laboratories. Indeed, the continuing reduction in sequencing costs and the shortening of the ‘time to result’ make it an attractive strategy in both research and diagnostics. Here, we review how high-throughput sequencing is revolutionizing clinical microbiology and the promise that it still holds. We discuss major applications, which include: (i) identification of target DNA sequences and antigens to rapidly develop diagnostic tools; (ii) precise strain identification for epidemiological typing and pathogen monitoring during outbreaks; and (iii) investigation of strain properties, such as the presence of antibiotic resistance or virulence factors. In addition, recent developments in comparative metagenomics and single-cell sequencing offer the prospect of a better understanding of complex microbial communities at the global and individual levels, providing a new perspective for understanding host–pathogen interactions. Being a high-resolution tool, high-throughput sequencing will increasingly influence diagnostics, epidemiology, risk management, and patient care.
In recent years, there has been a major transformation in the way that clinicians and researchers extract genomic information from patient samples. The development of ultra-high-throughput sequencing (UHTS) technologies has been instrumental in advancing research in all scientific areas, but particularly in microbiology, where genomes are small. As shown by the impressive increase in genomic data output (Fig. 1), whole genome sequencing (WGS) has entered all research laboratories, and will soon become an integrated tool in diagnostic laboratories.
In clinical bacteriology, it is critical to rapidly characterize the pathogen present in a clinical sample, to improve patient care. Identification at the species level and antibiotic susceptibility testing are of major importance in guiding antibiotic treatment and the management of infectious diseases. Although matrix-assisted laser desorption ionization time-of-flight (MALDI-TOF) mass spectrometry (MS) has been a revolution in clinical microbiology, and may have significant applications in typing, identification, and even toxin detection, there is still no high-throughput approach with which to fully and rapidly characterize any bacterial strain. Generally, such detailed characterizations involved multiple analyses, and were only performed by research laboratories specializing in a given pathogen. These detailed analyses take days to months, depending on the type of bacterium and the complexity of the question.
UHTS offers the possibility of reducing the number of steps needed for full characterization of the pathogen, and of optimizing the ‘time to result’ (Fig. 2). Bacterial whole genome shotgun sequence data can be obtained from a pure culture, directly from a clinical sample, or even from a single bacterium present in a given sample [2, 3]. Genome finishing, the most costly and time-consuming step, is often not necessary, and the release of unfinished genomes has become a major trend in the area [4, 5]. Unfinished genomes, also called raw, draft or dirty genomes, provide enough data for extraction of the required information (Fig. 3), such as the presence of toxins or of genes or mutations conferring antibiotic resistance. Unfinished genomes can also be used directly to develop new diagnostic tests such as ELISA and PCR.
WGS has become rapid and cheap enough to replace some older techniques previously used to characterize a pathogen at the genomic level. We review here the main UHTS techniques, and discuss their main applications in clinical and diagnostic laboratories. Finally, we highlight substantial challenges that remain in the development of innovative pipelines for genome analysis and data storage, to gather information in an effective, accurate and harmonized way.
Sequencing Technologies—in Short
In 2005, new high-throughput sequencing technologies appeared on the market, and were referred to as ‘next-generation sequencing’ technologies, as they replaced Sanger's dideoxy chain termination sequencing method. Their development was quick and remarkable, and they rapidly turned out to be essential tools for microbial genomics. Next-generation sequencing technologies have been the subject of excellent reviews [9, 10], and we will only highlight their main advantages and limitations with regard to their use in clinical microbiology (Table 1).
Table 1. Main sequencing technologies and their characteristics. Columns: Read length (bp); Throughput (Gb/h run); Best suited for.
The 454 Genome Sequencer, the first to be commercialized, rapidly established itself as a standard for de novo sequencing and metagenomics, thanks to its long reads (up to 700 bp). Shortly thereafter, Solexa (now Illumina) sequencing by synthesis became the most widely used system among the research community. It has major applications in resequencing and RNA sequencing, thanks to its high throughput, allowing a lower cost per base, although with shorter reads (36–150 bp). The SOLiD sequencing system, based on two-base sequencing by ligation, is insensitive to homopolymer errors, and is principally used for resequencing, transcriptomics, or epigenomics. Arriving later in the field, Heliscope remained marginally used by the community. Both 454 and Solexa offer the possibility of obtaining paired-read information, which is of great help for assembly. In addition, samples can easily be tagged with short (6–8 bp) barcoding sequences and pooled into a single run. Finally, thanks to its higher throughput, Illumina enables the multiplexing and sequencing of nearly 100 bacterial samples at a time, making it a cost-effective platform for sequencing large collections of bacteria.
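The barcoding and pooling step mentioned above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each read begins with its 6 bp barcode and that barcodes must match exactly; the sample names, barcode sequences and the function name `demultiplex` are hypothetical and are not taken from any vendor's software.

```python
from collections import defaultdict


def demultiplex(reads, barcode_to_sample, barcode_len=6):
    """Group pooled reads by sample using an exact-match barcode prefix.

    Hypothetical helper for illustration: production pipelines also
    tolerate sequencing errors in the barcode and handle paired reads.
    """
    by_sample = defaultdict(list)
    for read in reads:
        barcode, insert = read[:barcode_len], read[barcode_len:]
        # Reads whose barcode is not in the sample sheet are set aside.
        sample = barcode_to_sample.get(barcode, "unassigned")
        by_sample[sample].append(insert)
    return dict(by_sample)


# Illustrative data only (not real barcodes or reads).
barcodes = {"ACGTAC": "isolate_A", "TTGCAA": "isolate_B"}
pooled = ["ACGTACGGGTTT", "TTGCAACCCAAA", "ACGTACTTTGGG", "NNNNNNAAAA"]
result = demultiplex(pooled, barcodes)
```

In this toy run, two reads are assigned to `isolate_A`, one to `isolate_B`, and the read with an unreadable barcode is kept under `unassigned` rather than silently dropped.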
More recently, a third generation of technologies arrived on the market. Life Technologies launched the Ion Torrent PGM and the Ion Proton Sequencer, which are based on the sensing of proton release during base incorporation. The major advantage of this technology is the short run time, enabling sequences to be obtained in only a few hours vs. several days for previous high-throughput sequencers. Finally, biology-based systems, such as PacBio and Nanopore, directly sequence DNA at the single-molecule level, respectively by detecting the fluorescence of dye-labelled nucleotides added by an immobilized polymerase, or by sensing the ionic current of DNA that passes through a pore. They both yield very long reads, which should facilitate the mapping and assembly of repeated and complex DNA.
The various systems have different characteristics, preventing the development of a universal approach for data analysis. Software developers face challenges to cope with the increasing number of reads and to integrate their particular characteristics in terms of length, type of error, error rate and particular weak points. Whereas 454 and Ion Torrent are more prone to artificial insertions or deletions in long mononucleotide repeats, Solexa and PacBio result in more substitution errors. In addition, 454, Solexa, SOLiD, Heliscope and Ion Torrent are based on a first PCR amplification step that results in biased sequence representation, artificially generating deep-coverage or low-coverage regions. Even though PacBio and Nanopore avoid this bias, these technologies are still young, and may also have other drawbacks, such as the error rate, which is notably high.
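As an illustration of how analysis software can account for such platform-specific weak points, the following minimal sketch lists the long mononucleotide repeats in a sequence, i.e. the regions where insertion or deletion errors are most expected with 454 or Ion Torrent data. The function name `homopolymer_runs` and the default length threshold of 5 are assumptions chosen for the example, not values from the literature.

```python
from itertools import groupby


def homopolymer_runs(seq, min_len=5):
    """Return (start, base, length) for each run of identical bases >= min_len.

    Positions are 0-based. The threshold is an illustrative assumption.
    """
    runs = []
    pos = 0
    for base, group in groupby(seq):
        length = sum(1 for _ in group)
        if length >= min_len:
            runs.append((pos, base, length))
        pos += length
    return runs


# Toy sequence built explicitly so that run lengths are unambiguous.
seq = "ACG" + "A" * 6 + "TT" + "C" + "G" * 5 + "C"
runs = homopolymer_runs(seq)
```

Indels called inside the reported intervals could then be down-weighted or inspected manually, depending on the sequencing platform used.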
In diagnostic bacteriology, the processing of clinical samples has changed little over the years, and most analyses still depend on isolating a viable microorganism (Fig. 2), although antigen detection and PCR have been strongly developed in recent years, allowing faster diagnosis as well as accurate detection of fastidious organisms, strict intracellular bacteria, and viruses. The clinical applications of genomics can be divided into two categories: (i) those requiring a bacterial isolate, such as bacterial typing, outbreak monitoring, or the determination of biological properties, such as the presence of virulence factors; and (ii) others that may be applied directly on the sample, such as metagenomics and community profiling.
The future of clinical microbiology certainly lies in the development of methods to obtain full microbial genomes by the direct sequencing of clinical samples. This strategy was successfully applied to sequence the eukaryote Plasmodium falciparum from a blood cell-depleted sample of blood, and the bioterrorism agent Francisella tularensis from abscess pus. Direct sequencing could be applied to physiologically sterile samples or to samples with high bacterial concentrations (millions of copies/mL). If required, the bacterial concentration could be increased by using an initial antibody-based bacterial purification step. Direct sample sequencing provides major advantages, as it permits a gain in time equivalent to the time needed for bacterial culture, and also allows the detection of unculturable bacteria.
Diagnostics, species identification, and taxogenomics
Determining the bacterial species is often crucial for making accurate clinical decisions, and it provides direct information on pathogenic potential. Historically, bacterial identification was based principally on colony growth time and morphology, Gram staining, sugar assimilation/fermentation, and biochemical tests. MALDI-TOF MS has recently been introduced successfully for routine use, thanks to its rapid turn-around time (minutes) and very low running cost. However, MALDI-TOF MS fails to identify unusual species in approximately 50% of cases. These species may then be identified by sequencing of a few phylogenetically informative core genes, such as 16S rRNA or rpoB. In rare situations, when precise classification of novel species to a new clade is required, dirty genome sequencing might be useful for helping with species description.
Genome sequences have been widely used to develop new diagnostic tests for known pathogens [21, 22] and emerging pathogens such as Tropheryma whipplei, or mutant variants such as Chlamydia trachomatis harbouring a deletion on its cryptic plasmid. Furthermore, raw genome sequences may help in identifying new targets for diagnostic PCRs or for the development of serological tests, e.g. ELISA. An interesting initiative was the establishment of a webserver, ssGeneFinder, that is able to provide gene targets for the identification of specific microorganisms based on automated analysis of their pan-genome.
At present, species are defined by both phenotypic and genetic characteristics, and represented by type strains deposited in two international strain collections. With the advent of WGS, Didelot et al. proposed that genomic reference sequences should constitute a reference standard. Newly sequenced microorganisms would then be classified among all previously sequenced reference genomes by phylogenetic reconstructions. Nevertheless, we think that a polyphasic taxonomy that includes both phenotypic and phylogenetic analyses is still needed.
Jolley et al. proposed the use of multilocus sequence typing (MLST) based on ribosomal protein-encoding genes to classify the available 1900 bacterial genomes. These genes, which are essential for the bacterial cell, have the advantage of being present in all bacteria, as well as providing more resolution than 16S rRNA, being under strong selection for functional conservation. Any new genome could easily be placed in the current classification by analysis of the same 53 conserved ribosomal genes. In another study, Larsen et al. developed a web-based method for MLST based on preassembled genomes or directly on short sequence reads. The available MLST schemes based on various alleles are currently restricted to 66 bacterial species, but cover the most important pathogens.
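The principle of MLST from genome data can be sketched as a two-step lookup: each locus sequence is matched against known alleles to obtain an allele number, and the resulting profile is matched against a table of sequence types. The sketch below is a simplified illustration under stated assumptions; the locus names, allele sequences, sequence type labels and function names are all hypothetical, only exact matches are considered, and it does not reproduce the method of Jolley et al. or of Larsen et al.

```python
def call_alleles(genome_loci, allele_db):
    """Map each locus sequence to its allele number by exact match.

    Returns {locus: allele_number}, with None when the locus is missing
    from the genome or carries an allele absent from the database.
    """
    profile = {}
    for locus, known_alleles in allele_db.items():
        seq = genome_loci.get(locus)
        profile[locus] = known_alleles.get(seq)
    return profile


def sequence_type(profile, st_table, loci):
    """Look up the sequence type for an allele profile (None if unknown)."""
    key = tuple(profile[locus] for locus in loci)
    return st_table.get(key)


# Hypothetical three-locus scheme; real schemes use more and longer loci.
loci = ("locusA", "locusB", "locusC")
allele_db = {
    "locusA": {"ATGAAA": 1, "ATGAAG": 2},
    "locusB": {"ATGCCC": 1, "ATGCCT": 2},
    "locusC": {"ATGGGG": 1},
}
st_table = {(1, 1, 1): "ST-1", (2, 1, 1): "ST-2"}

isolate = {"locusA": "ATGAAG", "locusB": "ATGCCC", "locusC": "ATGGGG"}
profile = call_alleles(isolate, allele_db)
st = sequence_type(profile, st_table, loci)
```

A profile containing `None` signals a novel allele or a missing locus, which in practice would be flagged for curation rather than forced into an existing sequence type.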
With the increase in available sequences, such comparative methods will improve statistical power and will be highly discriminative, allowing a precise bacterial species classification. As comprehensive approaches, they certainly constitute a major improvement over current classification cut-offs based on DNA–DNA hybridization (>70%), 16S rRNA identity (>95%), or average nucleotide identity. However, new ambiguities may be revealed, e.g. the apparent polyphyly of Bacillus cereus and Bacillus thuringiensis or Neisseria polysaccharea. Thus, they will challenge the definition of bacterial species, and eventually lead to deep modifications of the current taxonomy.
An ideal typing tool for epidemiology
The typing of bacterial strains is essential for investigating transmission pathways and supporting the monitoring of outbreaks. Some reference laboratories use restriction fragment length polymorphism, pulsed-field gel electrophoresis or MLST to type bacterial strains. Until recently, these techniques were considered to be reference standards for epidemiology, but they have limited resolution in differentiating strains evolving from a single bacterial clone [34-36].
WGS represents the ultimate tool in epidemiological typing, as it allows the identification of single genomic changes between two isolates. With the advent of inexpensive high-throughput sequencing, it is now becoming fast and cheap enough to be widely used in routine epidemiological investigations. The first attempt to sequence a whole organism for epidemiological investigation was made after the anthrax letter attack in the USA in 2001. The whole genome of Bacillus anthracis was determined by Sanger sequencing and compared with other genotypes, and this indicated that the morphological variants present in the letters were not prevalent in the environment.
Multiple epidemiological studies are now exploiting the single-nucleotide polymorphisms that differentiate strains to investigate their relatedness by simple phylogenetic analyses. Sanger and Illumina sequencing were applied to investigate the spread of the leprosy agent, Mycobacterium leprae, through human migration and trade routes. When the resolution is optimal, i.e. when the rate of genomic variation is sufficiently high, it is even possible to determine the transmission route between hospital centres or even between patients in the same ward. For example, Harris et al. used Illumina sequencing to investigate the micro-evolution and propagation of a methicillin-resistant Staphylococcus aureus strain through five continents. McAdam et al. performed Bayesian phylogenetic reconstructions to identify the probable spread of the pandemic EMRSA-16 clone from large hospital centres to regional healthcare settings in the UK. Similarly, Lewis et al. investigated the spread of a multidrug-resistant Acinetobacter baumannii strain in a hospital outbreak by using 454 sequencing technology. Recent studies on methicillin-resistant Staphylococcus aureus and Clostridium difficile [43, 44] highlighted the benefit for healthcare of bench-top high-throughput sequencing strategies to detect outbreaks and person-to-person transmission.
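The starting point of such SNP-based comparisons is a matrix of pairwise differences between isolates. The following minimal sketch assumes that the genomes have already been aligned to the same coordinates (for example by mapping reads to a common reference), counts only positions where both isolates carry an unambiguous base, and uses invented isolate names and sequences; it is not the pipeline of any of the studies cited above.

```python
from itertools import combinations


def snp_distance(seq_a, seq_b):
    """Count positions that differ between two aligned sequences.

    Positions where either sequence has an ambiguous base (anything
    other than A, C, G or T) are ignored.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for a, b in zip(seq_a, seq_b)
        if a != b and a in "ACGT" and b in "ACGT"
    )


def pairwise_snp_matrix(alignment):
    """Return {(name1, name2): distance} for every pair of isolates."""
    return {
        (n1, n2): snp_distance(alignment[n1], alignment[n2])
        for n1, n2 in combinations(sorted(alignment), 2)
    }


# Toy alignment; real comparisons span whole genomes.
alignment = {
    "patient_1": "ACGTACGTAC",
    "patient_2": "ACGTACGTAT",
    "patient_3": "ACGAACGTNT",
}
distances = pairwise_snp_matrix(alignment)
```

Such a distance matrix can feed a simple clustering or a phylogenetic reconstruction; isolates separated by very few SNPs are candidates for recent transmission, although the threshold depends on the mutation rate of the organism.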
Until now, WGS has been of no help to doctors in fighting epidemics, owing to the delay in detecting the outbreak and sequencing the strain. However, it was of great help in determining the transmission route and gaining insights into the history of pathogen spread. Moreover, as shown by the German Escherichia coli outbreak (see below), WGS is key to understanding the determinants and modelling the evolutionary events that may lead to a hypervirulent strain. Thus, monitoring and investigation of past and recent outbreaks is of particular importance to improve our understanding of pathogen transmission and our management of risk and crisis situations.
Pathogen monitoring during outbreaks: the case of E. coli O104:H4
The recent outbreak of the virulent E. coli O104:H4 is an excellent illustration of the speed of data acquisition and major outcomes that can be obtained by WGS (Fig. 4). In May 2011, an outbreak of diarrhoea and haemolytic–uraemic syndrome (HUS) caused by a Shiga toxin-producing E. coli strain started in Germany. In 2 months, over 3100 non-HUS cases and 900 HUS cases were reported, causing 53 deaths. Several independent centres rushed to sequence some outbreak and historical reference isolates, most of them combining different sequencing technologies [45-47]. Within <2 weeks from the identified onset of the outbreak, and within <62 h from strain isolation, the first draft genomes were available and deposited in public databases. Rohde et al. provided an interesting approach in setting up an open-source genomics program for the analysis of their strain.
Several studies reached the conclusion that the outbreak was caused by the unusual E. coli serotype O104:H4, which contained genes from both enteroaggregative E. coli (EAEC) and enterohaemorrhagic E. coli (EHEC) [45-47]. Although it is similar to EAEC, the outbreak clone harbours a prophage encoding Shiga toxin, as well as additional virulence and antibiotic resistance factors [45, 46]. Mellmann et al. suggested that a highly pathogenic hybrid of EAEC and EHEC emerged by gain and loss of chromosomal and plasmid-encoded factors. This story illustrates how the plasticity of bacterial genomes with regard to genetic exchanges can give rise to new virulent pathogens, and this multicentre work demonstrated the feasibility and usefulness of rapid draft genome sequencing.
Biology and virulence
Genome sequences provide some information for virulence determination, and open new prospects for large-scale research on genotype–phenotype associations. In principle, it is possible to predict resistance to antibiotics by identifying known genetic elements, e.g. mecA, which confers methicillin resistance to S. aureus, or mutations in drug target genes such as rpoB and others in Mycobacterium tuberculosis [49, 50]. However, although full genomes are excellent resources for identifying known mechanisms of antibiotic resistance, susceptibility testing is inexpensive and fast for rapidly growing bacteria, and also allows the detection of resistance not associated with gene mutations, such as decreased permeability of bacterial cell walls to antibiotics. Furthermore, genetic systems for resistance may be complex and are not necessarily well characterized, making phenotypic–genotypic comparisons difficult.
In this case, UHTS may only be used as a complement to routine susceptibility testing to investigate abnormal results and new mechanisms of resistance. Comparative genomics is a strategic approach for investigating the resistome of bacterial strains, and was successfully applied to A. baumannii and Burkholderia dolosa. Similarly, comparing strains that have become resistant after laboratory selection may provide insights into the evolutionary mechanisms for the acquisition of resistance, such as for Bacillus anthracis.
Likewise, many bacteria encode well-characterized toxins that are known to cause severe diseases, such as HUS caused by EHEC, toxic shock syndrome caused by Streptococcus pyogenes, or diphtheria caused by Corynebacterium diphtheriae. In this setting, full genome sequencing allows us to confirm the presence of toxins and to identify new or mutated toxins that may be missed by diagnostic PCRs specifically targeting toxin-encoding genes. During the recent outbreak of cholera in Haiti, PacBio sequencing was used on a number of strains to characterize the phage-encoded toxin.
The profiling of microbial communities was historically based on the cloning and Sanger sequencing of 16S rRNA, followed by phylogenetic analyses. This first culture-free method improved our knowledge of bacterial diversity. Today, high-throughput sequencing strategies have taken over, avoiding the need for cloning, and providing good sensitivity to rare DNA, thanks to their very high coverage [59, 60].
454 pyrosequencing rapidly emerged as a method of choice, and was shown to reveal a greater variety of species and to give a more reliable estimate of their relative abundance, although it tends to inflate estimates of microbial diversity. Dethlefsen et al. pyrosequenced tagged hypervariable regions of the 16S rRNA gene from the gut microbiome of patients before and after ciprofloxacin treatment. The antibiotic therapy reduced the taxonomic diversity in the gut, and influenced the abundance of one-third of bacterial taxa. Similarly, Armougom et al. [63, 64] showed the differences in microbial communities in obese, normal and anorexic patients. Other studies reported the use of Illumina-based sequencing to unravel the oral microbiota or the gut microbiota, for which over 3 million non-redundant genes were characterized.
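A change in taxonomic diversity of the kind described above is commonly summarized with a diversity index computed from per-taxon read counts. The sketch below uses the Shannon index as one possible choice; this is an assumption made for illustration, as the index used in the cited studies is not stated here, and the taxon names and counts are invented.

```python
import math


def shannon_index(counts):
    """Shannon diversity H = -sum(p_i * ln p_i) over taxa with reads."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    h = 0.0
    for n in counts.values():
        if n > 0:
            p = n / total
            h -= p * math.log(p)
    return h


def relative_abundance(counts):
    """Convert read counts per taxon into fractions of the total."""
    total = sum(counts.values())
    return {taxon: (n / total if total else 0.0) for taxon, n in counts.items()}


# Invented 16S read counts for one sample before and after treatment.
before = {"taxon_A": 25, "taxon_B": 25, "taxon_C": 25, "taxon_D": 25}
after = {"taxon_A": 90, "taxon_B": 10, "taxon_C": 0, "taxon_D": 0}

h_before = shannon_index(before)
h_after = shannon_index(after)
```

With four equally abundant taxa the index equals ln(4); it drops when reads concentrate in fewer taxa, which is the pattern expected after a treatment that depletes part of the community.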
An interesting initiative is the Human Microbiome Project, which was funded to characterize the microbiomes of 250 volunteers and to identify associations with potential diseases. As the project progressed, terabases of sequences were released, and they provide a comprehensive picture of the healthy human body microbiome and its metabolome.
Recently, there was a proposal to use culturomics, i.e. the use of a large panel of different culture conditions, as a complement to metagenomics [71, 72]. This approach led to the identification of new species, and showed only a partial redundancy of both methods (<20%), with culturomics being more sensitive and avoiding a certain amount of coverage bias, and metagenomics allowing the detection of so-called unculturable bacteria.
The use of single-cell genomics is emerging as a new strategy for the investigation of microbial communities and uncultured microorganisms present in clinical samples [2, 3]. The amplification of femtograms of bacterial DNA is based on a method called multiple displacement amplification [73, 74], which avoids the need for cultivation. Raghunathan et al. first provided the proof-of-principle that DNA from a single bacterium could be sequenced. Further technical improvements in cell isolation and DNA amplification, sequencing and assembly have made it possible to assemble almost complete genomes from single cells.
Single-cell sequencing has already provided draft genome sequences of major bacterial taxa that were not previously available in the databases. It has been used principally to sequence environmental bacteria such as Beggiatoa and Poribacteria, but it has broad applications for clinical investigations as well. As shown by Grindberg et al., single-cell sequencing can be efficiently combined with metagenomics to provide an inventory of the microbial community and the genetic linkage of sequences in a single organism. Their analysis enabled discovery of the gene cluster required for the synthesis of apratoxin, a promising cytotoxin for cancerous cells. Such combined metagenomics–single-cell sequencing approaches should provide a deeper understanding of human microbial communities and target novel unculturable pathogens.
Challenges in the Pipeline
With the decrease in cost, high-throughput sequencing is now available to a large number of diagnostic laboratories and clinical research groups. The question is no longer when and where to sequence, but how to rapidly transform genomic data into biological and clinically useful knowledge. Although it is easy to perform on a laboratory scale, the technique will result in maximal benefit to the scientific community only if a global system for sharing genomic data and mining information on microorganisms is created.
Indeed, clinicians and researchers need to not only assemble the genome of the pathogen but also to interpret the data, which requires the incorporation of a century of knowledge into databases and the comparison of multiple strains. For example, when the genome of Vibrio cholerae from the epidemic in Haiti was sequenced, some evidence pointed to UN peacekeepers importing it from Nepal. However, it was only after months that a strain from Nepal was analysed and the link could be ultimately confirmed.
Such information is currently delivered by a skilled workforce, but the process needs to be automated to allow mainly computer-driven processing of genomes. In the first stage, global national servers will probably be preferred to store pathogen information, resistance profiles, and disease details, especially for pathogens considered to be possible bioterrorism agents. It is thus a challenge to create a global resource from which the hundreds of thousands of genomes sequenced annually and their associated so-called meta-data could be provided to the whole scientific community. Such an effort requires reinforcing collaboration and breaking down barriers between doctors, biologists, and bioinformaticians. Ideally, information should be shared between countries as soon as it is available, and scientists should stop retaining information until a paper is published. Finally, harmonization of procedures for data sequencing and analysis will facilitate the comparison and exchange of results obtained in different laboratories.
The Future of Genomics in Clinical Microbiology
High-throughput sequencing is sweeping through clinical microbiology, every day bringing biological knowledge and novel ideas for clinical applications of microbial genomics. Bacterial genomics has proved to be an excellent tool for investigating strain particularities that may explain atypical syndromes. In the near future, investigations of unusual bacterial infection cases will likely include a description of the corresponding bacterial genome and the analysis of strain particularities. Similarly, UHTS represents a high-resolution typing tool for epidemiology. The recent epidemic of Shiga toxin-producing E. coli in Germany demonstrated the role of genomics in developing diagnostic tools and in increasing our understanding of the dynamics of bacterial origin and spread.
UHTS represents a milestone towards the simplification of bacterial diagnostic procedures and possibly towards single-step analysis (Fig. 2). Indeed, the isolation of bacterial strains is common to generations of microbiologists, but can be labour-intensive, especially for some fastidious intracellular bacteria and slow-growing species, despite increasing automation and a variety of new chromogenic media that facilitate strain isolation. Also, culture approaches remain of limited value when a patient has been pretreated with antibiotics, and they cannot detect unculturable bacteria. New developments that avoid the need to first culture the bacterium are welcome, especially for slow-growing bacteria, as they enable significant gains in time (Fig. 2B).
Direct raw sequencing of clinical samples will become a major asset, and may be used for diagnostics. Moreover, single-cell sequencing promises to accelerate our understanding of the vast numbers of microorganisms that affect our health. Although we have only begun to understand the microbial diversity in healthy and unhealthy patients, we can clearly see that detailed human microbiome analysis could become part of routine patient clinical management.
It is increasingly appealing to not only consider investigating the pathogen, but also to perform host genomics simultaneously, at least for bacteria whose pathogenicity has been clearly associated with specific genetic susceptibilities, such as T. whipplei, M. tuberculosis, or C. trachomatis. The principle of dual host–bacterium genomics may also be applied to RNA sequencing to obtain a comprehensive view of changes in gene expression in both interacting partners.
As a whole, current and future massively parallel sequencing technologies provide a profusion of opportunities that may expand our understanding of the complex host–pathogen interaction and improve patient health management. However, these new technologies will not reach maximal efficacy until genomics is integrated into microbiological diagnostic laboratories, within the healthcare system. Substantial challenges remain for the development of methods for data analysis and the harmonization of laboratory procedures. Only under these circumstances can knowledge developed in the past be expanded to new genome-based information.
None of the authors has received any financial support for this work or has any conflict of interest with this manuscript.