Advantages and limitations of genomics in prokaryotic taxonomy

Abstract

Taxonomic classification is an important field of microbiology, as it enables scientists to identify prokaryotes worldwide. Although the current classification system is still based on the one designed by Carolus Linnaeus, the currently available genomic content of several thousands of sequenced prokaryotic genomes represents a unique source of taxonomic information that should not be ignored. In addition, the development of faster, cheaper and improved sequencing methods has made genomics a tool that has a place in the workflow of a routine microbiology laboratory. Thus, genomics has reached a stage where it may be used in prokaryotic taxonomic classification, with criteria such as the genome index of average nucleotide identity being an alternative to DNA–DNA hybridization. However, several hurdles remain, including the lack of genomic sequences of many prokaryotic taxonomic representatives, and consensus procedures to describe new prokaryotic taxa that do not, as yet, accommodate genomic data. We herein review the advantages and disadvantages of using genomics in prokaryotic taxonomy.

Introduction

Taxonomy, the study of organism classification, is a part of systematics, the study of the diversity and relationships among organisms. Prokaryotic taxonomy is traditionally regarded as consisting of three separate, but interrelated, areas: classification, nomenclature, and characterization. Classification is the arrangement of organisms into taxonomic groups on the basis of similarities; nomenclature is the assignment of names to the taxonomic groups identified in the classification; and characterization is the determination of whether an isolate is a member of a taxon defined in the classification and named in the nomenclature [1]. The influence of prokaryotic taxonomy is tremendous: attaching a name to a microbial strain conveys assumptions and implications associated with that organism, such as routine identification from clinical samples, pathogenicity potential, safety of handling, and cost [2]. However, there is no universal agreement on the rules and criteria used for microorganism classification.

Taxonomic classification was long based solely on phenotypic characteristics, genetic data having been used only since the 1960s. However, the sequencing of the first bacterial genome in 1995 [3] substantially changed microbiology by giving access to the whole genetic repertoire of a strain. It is now possible to generate whole prokaryotic genome sequences in a very short period of time, offering the possibility of using the whole genomic sequence of a prokaryote for its taxonomic description. In this review, we explore the benefits and shortcomings of using genomic data in prokaryotic taxonomy.

Historical Overview and Current Practice in Prokaryotic Taxonomy

Although Carolus Linnaeus laid the foundations of modern taxonomy in the 18th century by studying plants, it was not until the late 19th century that Ferdinand Cohn classified bacteria into genera and species. Cohn and his contemporaries used morphology, growth requirements, chemical reactions and pathogenic potential as the basis for bacterial classification [4]. Later, biochemical and physiological properties were also used by the Society of American Bacteriologists (which later became the American Society for Microbiology) in a report on bacterial characterization and classification that became the basis for the first edition of Bergey's Manual of Determinative Bacteriology in 1923. In 1947, a Code of Bacteriological Nomenclature was approved at the 4th International Congress for Microbiology [5]. In the 1960s, the technique of DNA–DNA hybridization (DDH) was introduced to measure genetic relatedness [6], but it was only widely accepted for classification purposes more than 20 years later [7]. In the 1980s, the development of PCR and sequencing of the 16S rRNA gene led to major changes in prokaryotic taxonomy [8], and this tool, although already commonly used for the description of new species in the 1990s, was recommended in 2002 as a key parameter in taxonomic classification [9, 10].

Although prokaryotic nomenclature is regulated in the International Code of Nomenclature of Prokaryotes or the ‘Bacteriological Code’ [11], which is the latest edition of the Code of Bacteriological Nomenclature and is overseen by the International Committee on Systematics of Prokaryotes (ICSP), there has been no officially recognized system for the characterization and classification of prokaryotes until now. However, the most widely used system of characterization relies on a polyphasic approach, which is also used in the most widely accepted classification presented in Bergey's Manual of Systematic Bacteriology [4, 12].

The term ‘polyphasic taxonomy’ was introduced in 1970 to refer to a taxonomy that brings together and incorporates many levels of information, from ecological to molecular, and includes several distinct types of information to yield a multidimensional classification. Currently, polyphasic taxonomy refers to a taxonomy that aims to utilize all available data [13]. These data include both phenotypic information, such as chemotaxonomic features (cell wall compounds, quinones, polar lipids, etc.), morphology, staining behaviour, and culture characteristics (medium, temperature, incubation time, etc.), and genetic properties, such as G+C content, DDH value, and 16S rRNA gene sequence identity with other closely related species with validated names [14].

Currently, the most commonly used tool for evaluating the phylogenetic position of a prokaryote is 16S rRNA gene sequence comparison. Likewise, the latest whole taxonomic schema for prokaryotic diversity presented in Bergey's Manual uses 16S rRNA phylogeny as its main basis [15]. However, there is growing interest in the use of other genes (protein-encoding genes) to resolve issues that are not solved by 16S rRNA gene sequencing. For example, some housekeeping genes (e.g. dnaJ, dnaK, gyrB, recA, and rpoB) have been used instead in multilocus sequence typing/multilocus sequence analysis (MLSA) [16]. One limitation of 16S rRNA is that it is rather conserved, and hence is not universally reliable for determination of taxonomic relationships at the species level. Furthermore, both nucleotide variations within multiple rRNA operons in a single genome and the possibility of 16S rRNA genes being derived from horizontal gene transfer may distort relationships between taxa in phylogenetic trees [17]. Nevertheless, 16S rRNA is currently the first-line tool for evaluating the taxonomic status of a prokaryotic strain at the genus or species level. It is currently assumed that two strains are members of the same species if their 16S rRNA gene sequence identity is >99%, whereas an identity of <98.7% may provide the first indication that a novel species has been isolated [18]. Similarly, a 16S rRNA identity of <95% with the phylogenetically closest species with a validated name may suggest that the isolate is a representative of a new genus.
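The 16S rRNA identity thresholds quoted above can be summarized as a simple decision rule. The following is a minimal sketch only: the function name and the returned labels are hypothetical, and treating the interval from 98.7% to 99% as inconclusive is an assumption, because the text does not assign an interpretation to that range.

```python
def interpret_16s_identity(identity_percent: float) -> str:
    """Map a pairwise 16S rRNA gene sequence identity (%) to the hint quoted in the text.

    The identity is assumed to be measured against the phylogenetically
    closest species with a validated name. The labels are illustrative only
    and are no substitute for a polyphasic description.
    """
    if not 0.0 <= identity_percent <= 100.0:
        raise ValueError("identity must be a percentage between 0 and 100")
    if identity_percent < 95.0:
        # <95%: may suggest a representative of a new genus
        return "possible new genus"
    if identity_percent < 98.7:
        # <98.7%: first indication of a novel species
        return "possible new species"
    if identity_percent > 99.0:
        # >99%: strains assumed to be members of the same species
        return "same species assumed"
    # 98.7% to 99%: not covered by the thresholds above (assumption)
    return "inconclusive"
```

For instance, under these assumptions an isolate with 97.5% identity to its closest validated relative would be flagged as a possible new species, pending confirmation by other criteria.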

Another widely used taxonomic criterion is DDH. A DDH value of ≥70% has been recommended as a threshold for the definition of members of a species, and DDH is deemed necessary when strains share >98.7% 16S rRNA gene sequence identity [12, 14]. However, the DDH cut-off used is not applicable to all prokaryotic genera. For example, when applied to Rickettsia species, a DDH of 70% would not discriminate Rickettsia rickettsii, Rickettsia conorii, Rickettsia sibirica, and Rickettsia montanensis [19]. In addition, DDH protocols are considered to be tedious and complicated, with inherently large degrees of error, and only a few laboratories are equipped for this method, which remains expensive and is clearly not adapted to routine microbiology [2, 20]. Furthermore, DDH studies can provide only a rough measurement of average genetic relationship, only closely related species or subspecies can be distinguished, and incremental databases cannot be developed for this method [4].

The Prokaryotic Genomic Era

The sequencing of the Haemophilus influenzae genome in 1995 by conventional Sanger sequencing was a landmark in modern biology, as it marked the beginning of the genomic era [3]. However, in the next decade, bacterial genome sequencing remained time-consuming and expensive, and was restricted to a few sequencing centres worldwide. Thanks to the next-generation sequencing (NGS) technologies introduced from 2005, the number of sequenced prokaryotic genomes has rapidly increased, as the new platforms are much faster and cheaper [21]. As of 18 September 2012, the Genomes OnLine Database listed 3381 prokaryotic genomes available as either full genome sequences, scaffolds, or contigs, and 11 789 other prokaryotic genome projects were ongoing (http://www.genomesonline.org/cgi-bin/GOLD/index.cgi).

The current commercially available NGS platforms can be divided into two categories: the high-end instruments and the bench-top instruments [21]. The high-end instruments can produce long reads and deliver dozens to thousands of prokaryotic genomes per run, but are too expensive for the average research laboratory; the bench-top instruments are modestly priced, and have lower throughput, but are also fast and considered to be better for most applications in microbiology [22]. The 454 GS FLX+, Illumina's HiSeq 2000/2500, Life Technologies' 5500xl SOLiD and Pacific Biosciences' PacBio RS are the latest high-end instruments, one of which has an output of up to 600 Gb per run, whereas 454 GS Junior, Life Technologies' Ion PGM and Ion Proton and Illumina's MiSeq are bench-top instruments that are able to sequence a complete prokaryotic genome in a few days.

NGS technology has already transformed microbiology and the way in which people study prokaryotes. Genome sequencing has made possible the development of specific culture media for several prokaryotes, and enabled us to more easily identify bacterial pathogens, test their antibiotic resistance and virulence, and track their emergence and spread [22, 23]. Sequencing is now replacing microarrays as the method of choice for studying gene expression (with RNA sequencing), mutant libraries (with Tn-seq and transposon-directed insertion site sequencing), and protein–DNA interactions (with chromatin immunoprecipitation followed by sequencing) [21]. Finally, it is no longer an absolute requirement to obtain large quantities of highly purified DNA for sequencing of a prokaryotic genome, as full genome sequencing from a complex microbial community and sequencing from a single cell are also possible, although the former method provides only an average sequence of a closely related but not necessarily clonal population [24, 25].

Can Genome Sequences be Used in Prokaryotic Taxonomy?

Over the past 10 years, scientists have attempted to use genomes to assess the phylogenetic relationships between organisms, with a variety of techniques being used, including examination of the order of the genes, analysis of core genes (presence or absence or sequence alignment), indels or single-nucleotide polymorphisms in core genes, and the construction of super-trees (phylogenetic trees assembled from a combination of smaller phylogenetic trees) [17, 26]. As argued by Klenk and Göker [27], genome-scale data for phylogenetic reconstruction are advantageous, as genome sequences provide more characters to be analysed, and this, in general, improves the phylogenetic signal/noise ratio. Moreover, genomic information such as gene content, gene order and rare genomic rearrangements is complementary to the data provided by the nucleotide sequence. It was also argued that, although horizontal gene transfer might be very widespread in prokaryotes, it has not been proven to hinder phylogenetic reconstruction from genomic data. The vast majority of genes and genetic markers that are distinctive of higher prokaryotic taxa are vertically inherited, and a solid foundation for microbial systematics can be developed on the basis of these [28]. Indeed, Zhi et al. argued that trees based on the comparison of orthologous genes have reasonably good congruence with those built by comparison of 16S rRNA sequences, and, to some extent, with trees based on the presence and absence of genes [17]. As recent examples, Thompson et al. demonstrated that the phylogenetic tree of vibrios obtained with the 16S rRNA gene is similar to that obtained with MLSA [29], and Bennett et al. found similar results for Neisseria when using multilocus sequence typing of 53 ribosomal protein subunits [30]. However, there is also an opposing view that a phylogenetic tree based on a single gene does not necessarily reflect the history of prokaryotes, as pointed out by Doolittle and Bapteste [31].

Whereas genome-based phylogeny has been the subject of a substantial number of publications, data on genome-based taxonomy remain scarce. In 2011, Whitman [32] recommended the routine description of prokaryotic species on the basis of their genomic sequences. In this way, type strains would be uniquely and unambiguously identified, and redundancy of nomenclature would be impossible. The genomic sequences would not only establish the genetic identity, but would also provide a diagnosis of the species with a precision unimaginable at the time when the Code was written. However, Kämpfer and Glaeser argued that genes and genomes do not function on their own, and can only display their potential within the cell as the basic unit of evolution and hence taxonomy [13]. Therefore, the ‘minimalist’ and/or genomic approach to descriptions of novel taxa must not abandon fundamental principles of taxonomy, including the incorporation of phenotypic data and requirements for strain deposition in culture collections.

Current genetic taxonomic criteria include several numerical cut-offs, notably DDH. Therefore, several authors studied the correlation between the percentage of nucleotide sequence similarity at the core genome level and DNA–DNA reassociation results. In particular, the average nucleotide identity (ANI) and MLSA have been suggested to be valid alternatives to DDH [33, 34]. ANI, defined as the mean percentage of nucleotide sequence identity of orthologous genes shared by two genomes, seems to reproduce DDH results more accurately than MLSA. Two prokaryotic strains may be considered as belonging to the same species if they share a ≥96% ANI value, this cut-off being equivalent to the 70% DDH value. In addition, ANI studies can be performed in silico with public databases, and Richter and Rosselló-Móra even proposed that reliable ANI values may be obtained from the comparison of sequences covering ~20% of each genome [35]. In addition to ANI, other parameters, such as the maximal unique matches index, defined as a genomic distance index based on both DNA conservation of the core genome and the proportion of DNA shared by two genomes [36], and ‘tetranucleotide regression’, defined as the differences between observed and expected values of the frequencies of all 256 possible tetranucleotide (A, T, G, C) combinations [35], have been proposed to help evaluate the species status of a strain based on genome data. Furthermore, the genome-to-genome distance calculator can be used to calculate the genomic distance on the basis of the total length of all high-scoring segment pairs identified by a BLAST search of the genome [37, 38]. The results of ANI, the maximal unique matches index and the genome-to-genome distance calculator have been suggested to have a high correlation with DNA–DNA relatedness. However, ANI is, at present, regarded as the most valuable of these indices, because it most probably reflects what occurs experimentally when two DNAs are hybridized in DDH experiments [39]. In 2010, Tindall et al. [14] suggested, in a ‘taxonomic note’ on the characterization of prokaryote strains published in the International Journal of Systematic and Evolutionary Microbiology, the official publication of the ICSP, that ANI may substitute for DDH analyses in the near future. With the rapid development and decreasing cost of high-throughput prokaryotic genome sequencing technology (with the imminent possibility of having a $1 bacterial genome sequence [21]), this proposition seems reasonable. ANI has been used recently, for instance, to describe new species of Burkholderia, Geobacter, and Vibrio, as well as to help characterize a new subspecies of Francisella, a new genus of Sphaerochaeta, and a new class of Dehalococcoidetes [40-45].
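To make the definition of ANI given above concrete, the following minimal sketch computes the mean percentage of nucleotide identity over a set of orthologous gene pairs and applies the 96% cut-off quoted in the text. It is a toy illustration under stated assumptions, not the procedure of any published ANI tool: the gene pairs are assumed to be already identified as orthologous and already aligned to equal length with '-' as the gap character, alignment columns containing a gap are skipped, and all function names are hypothetical.

```python
from statistics import mean
from typing import Iterable, Tuple

# Cut-off quoted in the text as equivalent to the 70% DDH value.
ANI_SPECIES_CUTOFF = 96.0


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percentage of identical positions between two pre-aligned sequences.

    Columns with a gap ('-') in either sequence are ignored (assumption).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable positions in the alignment")
    return 100.0 * matches / compared


def average_nucleotide_identity(pairs: Iterable[Tuple[str, str]]) -> float:
    """Unweighted mean identity (%) over aligned orthologous gene pairs."""
    return mean(percent_identity(a, b) for a, b in pairs)


def same_species_by_ani(ani_percent: float) -> bool:
    """Apply the >=96% ANI cut-off quoted in the text."""
    return ani_percent >= ANI_SPECIES_CUTOFF
```

With two toy gene pairs, ("ACGT", "ACGT") and ("ACGT", "ACGA"), the per-gene identities are 100% and 75%, giving a mean of 87.5%, which falls below the species cut-off. Real genomes would of course involve hundreds to thousands of orthologous genes, identified by a sequence similarity search.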

However, several current drawbacks limit the use of genomics for systematics. First, Klenk and Göker pointed out that completely sequenced genomes for many of the major lineages of prokaryotes are lacking [27]. The currently available genome sequences have been obtained mostly from three phyla (Proteobacteria, Firmicutes, and Actinobacteria). Thus, many phyla are poorly represented in genomics (http://www.genomesonline.org/cgi-bin/GOLD/index.cgi). Furthermore, the same authors noted that, even if the genome sequences of the species of interest are available, in many cases they are not from type strains, and therefore must be used with caution, as prokaryote taxonomy is based on type strains only [14]. However, efforts such as the phylogeny-driven Genomic Encyclopedia of Bacteria and Archaea programme, which aims to sequence all type strains [46], should help to fill the gaps, even though Zhi et al. argued that the increasing number of available genomes currently remains highly biased towards organisms of biotechnological and medical importance [17]. Another problem is that existing genomic sequences vary greatly in their finished quality, often being available only as unfinished draft assemblies that, according to Ricker et al. and Klassen et al., may be less informative than finished whole genome sequences [47, 48]. For that reason, minimal sequencing quality should be defined for genomes to be included in taxonomic analyses. For example, the guidelines developed by the Next-generation Sequencing: Standardization of Clinical Testing work group might be utilized for this purpose [49]. Moreover, Ozen et al. argued that the results obtained with whole genome-based tools such as ANI do not consistently agree with current taxonomy, and different methods should be used for the different levels of taxonomy, as they stated that there is not one universal method with which to naturally classify prokaryotes [50]. Indeed, Sutcliffe et al. emphasized that the current principles and practice of prokaryotic systematics have not yet fully accommodated genomic data, and that significant revision of the procedures used to describe novel prokaryotic taxa is needed, including the likely introduction of new publication formats [51]. Furthermore, Figueras et al. pointed out that consensus genome comparison criteria that are acceptable in prokaryotic taxonomic classification remain to be defined [52].

In our laboratory, we recently included genome sequence analysis in a polyphasic strategy to describe new bacterial species, together with phenotypic data including their matrix-assisted laser desorption ionization time-of-flight mass spectrum, and main phenotypic characteristics (habitat, Gram stain reaction, culture and metabolic characteristics, and, when applicable, pathogenicity) [53]. In our scheme, the degree of nucleotide sequence similarity of orthologous genes between the genome of a putative new bacterial species and the genomes of its most closely related and validly published species should be similar to that observed among these validly published species. Our method differed from the ANI calculation, as we first determined the orthologous protein set between two genomes by BLASTP, using a coverage of ≥50% and a degree of amino acid identity of ≥30%, and then calculated the mean percentage of nucleotide sequence identity between these orthologous genes (Fig. 1). In contrast, orthologous genes used for ANI determination are identified by a BLASTN search. As an example, the genome from Peptoniphilus senegalensis sp. nov., isolated from a Senegalese patient's stool, shared 976, 977 and 1195 orthologous genes (86.9%, 87.08% and 86.48% mean orthologous gene nucleotide similarity) with Peptoniphilus lacrimalis, Peptoniphilus indolicus, and Peptoniphilus harei, respectively [54]. These values were similar to those observed among validly published Peptoniphilus genomes, as P. indolicus shared 942 and 1078 orthologous genes (87.06% and 86.78% mean similarity) with P. lacrimalis and P. harei, respectively, and P. harei and P. lacrimalis shared 1095 orthologous genes and 87.35% mean similarity. Therefore, both genomic and phenotypic data were consistent with the new species status of P. senegalensis sp. nov.
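The two-step calculation described above (selection of orthologous proteins by coverage and amino acid identity, followed by averaging of the nucleotide identity of the corresponding genes) may be sketched as follows. This is an illustration under stated assumptions rather than the actual pipeline: the `ProteinHit` record and both function names are hypothetical, each record is assumed to represent the single retained BLASTP hit for a gene, and the nucleotide identity of each gene pair is assumed to have been computed beforehand.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Iterable, List, Tuple

# Thresholds quoted in the text for defining the orthologous protein set.
MIN_COVERAGE_PERCENT = 50.0
MIN_AA_IDENTITY_PERCENT = 30.0


@dataclass(frozen=True)
class ProteinHit:
    """One protein-level hit between two genomes (hypothetical record)."""
    query_gene: str
    subject_gene: str
    coverage_percent: float      # alignment coverage of the protein
    aa_identity_percent: float   # amino acid identity of the hit
    nt_identity_percent: float   # nucleotide identity of the two genes


def select_orthologs(hits: Iterable[ProteinHit]) -> List[ProteinHit]:
    """Keep hits with coverage >= 50% and amino acid identity >= 30%."""
    return [
        hit for hit in hits
        if hit.coverage_percent >= MIN_COVERAGE_PERCENT
        and hit.aa_identity_percent >= MIN_AA_IDENTITY_PERCENT
    ]


def mean_ortholog_nt_identity(hits: Iterable[ProteinHit]) -> Tuple[int, float]:
    """Return (number of orthologous genes, mean nucleotide identity in %)."""
    orthologs = select_orthologs(hits)
    if not orthologs:
        raise ValueError("no hit passes the coverage and identity thresholds")
    return len(orthologs), mean(hit.nt_identity_percent for hit in orthologs)
```

The function returns both the number of orthologous genes and their mean nucleotide identity, mirroring the pairs of values reported for the Peptoniphilus genomes; the resulting figure for a putative new species would then be compared with those obtained among its validly published relatives.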

Figure 1.

Current strategy used in our laboratory to describe novel prokaryotes. MALDI-TOF MS, matrix-assisted laser desorption ionization time-of-flight mass spectrometry; ORF, open reading frame.

Conclusions

The current availability of >3000 prokaryotic genome sequences, including those from most of the major human pathogens, offers the opportunity to make use of the total genetic content of prokaryotes for their taxonomic classification. However, although ANI or other genomic comparison markers may replace DDH as a standard to circumscribe prokaryotic species in the very near future, several challenges remain, in particular the need to define a genome-based method that is agreed upon by microbiologists, and cut-offs that either apply to most prokaryotes or vary according to taxonomic groups. In addition, although the integration of genomic data into prokaryotic taxonomic classification seems unavoidable in the near future, genome sequences should always be included in a polyphasic strategy in combination with phenotypic data. Thus, procedures to describe new prokaryotic taxa need to be reassessed to accommodate genomic data, while genome sequences of more prokaryotic taxonomic representatives and of under-represented taxa are sought.

Transparency Declaration

None of the authors of the present manuscript have a commercial or other association that might pose a conflict of interest (e.g. pharmaceutical stock ownership or consultancy). This work was supported by a grant from the Méditerranée Infection Foundation.
