The evolutionary relationships of proteobacteria, which comprise the largest and phenotypically most diverse division among prokaryotes, are examined based on the analyses of available molecular sequence data. Sequence alignments of different proteins have led to the identification of numerous conserved inserts and deletions (referred to as signature sequences), which either are unique characteristics of various proteobacterial species or are shared only by members of certain subdivisions of proteobacteria. These signature sequences provide molecular means to define the proteobacterial phyla and their various subdivisions and to understand their evolutionary relationships to the other groups of eubacteria as well as the eukaryotes. Based on signature sequences that are present in different proteins, it is now possible to infer that the various eubacterial phyla evolved from a common ancestor in the following order: low-G+C Gram-positive⇒high-G+C Gram-positive⇒Deinococcus-Thermus (green nonsulfur bacteria)⇒cyanobacteria⇒Spirochetes⇒Chlamydia-Cytophaga-Aquifex-green sulfur bacteria⇒Proteobacteria-1 (ε and δ)⇒Proteobacteria-2 (α)⇒Proteobacteria-3 (β)⇒Proteobacteria-4 (γ). An unexpected but important aspect of the relationship deduced here is that the main eubacterial phyla are related to each other linearly rather than in a tree-like manner, suggesting that the major evolutionary changes within Bacteria have taken place in a directional manner. The identified signatures permit placement of prokaryotes into different groups/divisions and could be used for determinative purposes. These signatures generally support the origin of mitochondria from an α-proteobacterium and provide evidence that the nuclear cytosolic homologs of many genes are also derived from proteobacteria.
Proteobacteria comprise one of the largest divisions within prokaryotes and account for the vast majority of the known Gram-negative bacteria [1–8]. This group of organisms, formerly known and still often referred to as ‘purple bacteria and relatives’ [1,3,9–11], encompasses a very complex assemblage of phenotypic and physiological attributes including many phototrophs (responsible for the purple characteristics), heterotrophs and chemolithotrophs [4–7,12,13]. The proteobacterial group is of great biological significance as it includes a large number of known human, animal and plant pathogens [4–7]. In addition, this group of organisms has made major contributions toward the origin of eukaryotic cells and their organelles [14–17]. Photosynthesis or purple coloration is limited to only a small number of organisms belonging to this phylum. Therefore, members of the International Committee for Systematic Bacteriology concluded that the name ‘purple bacteria and their relatives’ was inappropriate for this group of organisms. In its place, they proposed a new name for this taxon at the class level, Proteobacteria classis nov., after the Greek god Proteus who could assume many different shapes, to reflect the enormous diversity of shape and physiology seen within this group [1,18].
The proteobacteria (or purple bacteria) group was first circumscribed by Woese and coworkers based on the information derived from 16S rRNA/rDNA analyses [3,19,20]. Different species have been placed in this group based on 16S rRNA oligonucleotide catalogs, phylogenetic analysis based on full and partial sequences, rRNA cistron similarities, and the results of DNA–rRNA hybridization [3,19,21–23]. However, the main basis of defining the proteobacterial group thus far has been the formation of a distinct clade by these organisms in the phylogenetic trees based on 16S rRNA/rDNA sequences [1–3,9,10,19,21,24]. While a number of other eubacterial phyla can be distinguished from all others by means of distinctive signatures (nucleotide substitutions, etc.) that have been identified in the 16S rRNA [3,9], no signature in the 16S rRNA that could serve to define the proteobacterial group was found. The phylogenies based on 16S and 23S rRNA have led to the division of the proteobacterial group into five subdivisions or subclasses that have been arbitrarily designated α, β, γ, δ and ε [3,25–27]. In a classification used by De Ley and coworkers [23,28,29], the α, β and γ groups are referred to as rRNA superfamilies IV, III, and I+II, respectively. However, the relative branching orders of the various proteobacterial subdivisions and how they are related to other eubacterial divisions remain to be determined [3,9,24].
As the proteobacteria consist of more than 200 genera and encompass a major proportion of the known Gram-negative organisms [1,4–7], a good understanding of the evolutionary relationships among this group of organisms, and how they are related to the other groups of prokaryotes, is central to understanding the phylogeny of prokaryotes. In our recent work, a new approach employing conserved inserts and deletions (referred to as signatures or signature sequences) in different protein sequences was described for deducing the phylogenetic relationships among prokaryotes [30,31]. Using this approach, it was shown that the various main phyla comprising the eubacteria evolved from a common ancestor in the following order: low-G+C Gram-positive⇒high-G+C Gram-positive⇒Deinococcus-Thermus (green nonsulfur bacteria)⇒cyanobacteria⇒Spirochetes⇒Chlamydia-Cytophaga-Flavobacteria-green sulfur bacteria⇒Proteobacteria (α, δ and ε)⇒Proteobacteria (β and γ) [30,32]. In our earlier work, the evolutionary relationship among proteobacteria was not studied in detail and only a limited number of signatures that were useful in this regard were described. In the present followup review, which focuses primarily on the evolutionary relationships of the proteobacteria, a large number of signatures in different proteins are described that are helpful in understanding the branching order and the evolutionary relationships of this group of prokaryotes. These signatures provide the means to define the proteobacterial division as well as the various proposed subdivisions within it. Additionally, they provide evidence that the various subgroups which form this division have evolved from their most recent common ancestor (related to the Chlamydia-Cytophaga group) in the following order: δ, ε subdivisions⇒α subdivision⇒β subdivision⇒γ subdivision.
The branching patterns of proteobacteria and their subdivisions in phylogenetic trees based on different gene/protein sequences (16S rRNA and various proteins) have been examined and these results are consistent with and strongly support the inferences deduced based on the protein sequence signatures. The observed branching order of proteobacterial subdivisions raises important questions concerning the nomenclature and the rank assignment for these groups. The protein signatures described here also provide insight into the relationships of the proteobacterial species to the origin of mitochondria and eukaryotic cells.
2 Reliability of signature sequences for evolutionary studies
Signature sequences are defined in our work as conserved inserts or deletions (i.e., indels) in proteins that are restricted to specific taxa [30,31]. When a conserved indel of defined length and sequence, flanked by conserved regions that ensure that the observed changes are not due to misalignment or sequencing errors, is found at the same position in homologs from certain groups of species, then the simplest and most parsimonious explanation for this observation is that the indel was introduced only once during the course of evolution and then passed on to all descendants [30,31]. Thus, based upon the presence or absence of a signature, the species containing or lacking it can be divided into two distinct groups which bear a specific evolutionary relationship to each other. Since indels in different genes or proteins could be introduced at different time points in evolution, they provide useful milestones for evolutionary events, and based on these the order of evolution of different groups of species can be deduced [30,32].
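The presence/absence logic described above can be illustrated with a minimal Python sketch. It assumes the homologs are already aligned and that the indel position is known; the function name `split_by_signature`, the species labels and the toy sequences are hypothetical and are not taken from any real alignment.

```python
def split_by_signature(aligned, start, end):
    """Partition species by whether alignment columns [start, end) are all gaps.

    aligned: dict mapping a species label to its aligned sequence
             (all sequences the same length, '-' marks a gap).
    """
    groups = {"insert_present": [], "insert_absent": []}
    for species, seq in aligned.items():
        region = seq[start:end]
        # A species lacks the insert if the whole region is gap characters.
        key = "insert_absent" if set(region) == {"-"} else "insert_present"
        groups[key].append(species)
    return groups


# Hypothetical toy alignment with a 1-aa indel at column 5.
toy_alignment = {
    "species_A": "MKVLG-AGTTD",
    "species_B": "MKVLG-AGTSD",
    "species_C": "MKVLGSAGTTD",
    "species_D": "MKVLGSAGTTD",
}

groups = split_by_signature(toy_alignment, 5, 6)
print(groups)
```

The two lists returned correspond to the two groups of species distinguished by a signature; deciding which state is ancestral requires additional information (for example an outgroup), which this sketch does not attempt.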
The signature sequence-based approach employed in the present work has certain advantages over the traditional approach involving tree construction for understanding the evolutionary relationships among distantly related taxa [10,33–35]. In the phylogenetic approach, the branching pattern of a species in a tree is dependent upon a large number of variables. These include: reliability of the sequence alignment; regions of sequences that are retained or excluded; number and range of species included; differences in the evolutionary rates among species; base compositional differences between species, phylogenetic methods employed, etc. [33,34,36–47]. Since many of these factors are difficult to control in different studies involving even the same gene or protein, their influence on the branching orders of species in phylogenetic trees cannot be predicted. As a consequence, the results of phylogenetic studies are often variable and in a large number of cases they fail to resolve the relationship among distantly related taxa. In contrast, in the signature sequence-based approach, the assignment of a species to a particular group is based on the presence or absence of a well-defined character (i.e., indel), and hence it generally presents no problem [30,31,48–51]. Another distinct advantage of the signature sequence approach is that the entire information on which a given inference is based is contained in the signatures shown and hence their accuracy and reliability can be easily assessed. Although the signature sequence approach is simple in both principle and practice, it is important to examine whether the inferences deduced using this approach are reliable. There are three potential situations which if not recognized could lead to incorrect inferences by this approach. Each of these situations and their possible effects are discussed below.
The first potential problem is the possibility of lateral gene transfer (LGT) among species. If certain genes present in a given species are acquired by means of LGT from another species, rather than through common ancestry, then the presence of these shared gene sequences or derived characteristics (e.g., signature sequences) will be misleading. LGT is indicated to occur very frequently among species [52–67], particularly where strong selection pressure may exist for the transfer and retention of certain genes (e.g., antibiotic resistance) [68–73]. An important issue, therefore, is to develop criteria by which the reliability of a given signature could be assessed and to determine to what extent a given signature has been affected/corrupted by the LGT problem.
To infer that the presence of a particular gene in a given species is due to LGT, it is necessary at first to assume a model for the evolutionary relationships among the organisms, which provides a kind of normal or standard pattern against which the other results obtained could be compared. The phylogenies based on 16S rRNA provide the currently accepted model for the evolutionary relationships among the prokaryotes [3,9,10,19,24,74–76]. These studies have led to the recognition of a number of main phyla or groups among eubacteria. These groups include: Thermotogales, Deinococcus-Thermus group, green nonsulfur bacteria, cyanobacteria, low-G+C and high-G+C Gram-positive bacteria, Cytophaga/Bacteroides group, Spirochaetes, Chlamydia/Planctomyces group and the five main divisions of Proteobacteria (α, β, γ, δ and ε) [3,9,10,24,76]. Although the 16S rRNA phylogenies indicate these groups to be distinct, they do not provide reliable information regarding the branching orders of these groups from the common ancestor [3,9,10,24]. Based on the 16S rRNA model, one expects that different species belonging to these groups/phyla should form distinct clades in phylogenetic trees based on other gene/protein sequences or be distinguished from each other by specific protein signatures. In cases where a given phylogenetic tree or signature shows the same kind of relationship as observed or predicted by the 16S model, that pattern or signature should be considered reliable. Examined in this light, a large number of protein signatures described in our earlier work, which are specific for the particular groups of eubacteria (e.g., low-G+C and high-G+C Gram-positive bacteria, Deinococcus-Thermus group, cyanobacteria, Cytophaga/Bacteroides group, Spirochaetes, Chlamydia/Planctomyces group, etc.) and distinguish them from the other eubacterial groups, are reliable.
However, it is not uncommon for one or more species to show aberrant behavior in phylogenetic trees or with regard to a given signature sequence (e.g., an archaebacterium or Gram-positive species branching within a clade consisting of different proteobacteria). In such cases, the aberrant branching of the particular species could be a consequence of LGT or other anomalies and is generally so recognized, but these instances do not invalidate the existence of the clade or the signature. However, in cases where there is a major disagreement between the 16S rRNA model and the relationship shown by certain signatures (and gene phylogenies) as well as other important cellular characteristics, other possibilities or models to explain these results need to be considered [30,31,77–80].
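As a rough illustration of how isolated exceptions might be flagged against a reference grouping (such as one taken from the 16S rRNA model), the following Python sketch marks species whose signature state differs from the majority state of their assigned group. The majority rule is a simplification introduced here for illustration, not a criterion stated in the text, and the function name, group names and species labels are hypothetical.

```python
from collections import Counter


def flag_aberrant(states, reference_group):
    """Return species whose signature state disagrees with their group's majority.

    states: dict species -> bool (True if the signature is present).
    reference_group: dict species -> group name from the reference model.
    Ties within a group are resolved by whichever state was seen first.
    """
    by_group = {}
    for species, has_signature in states.items():
        by_group.setdefault(reference_group[species], []).append(has_signature)
    majority = {g: Counter(v).most_common(1)[0][0] for g, v in by_group.items()}
    return sorted(
        sp for sp, has_signature in states.items()
        if has_signature != majority[reference_group[sp]]
    )


# Hypothetical example: one species in group_2 carries a signature that
# is otherwise confined to group_1.
reference_group = {
    "g1_sp1": "group_1", "g1_sp2": "group_1", "g1_sp3": "group_1",
    "g2_sp1": "group_2", "g2_sp2": "group_2", "g2_sp3": "group_2",
    "g2_sp4": "group_2",
}
states = {
    "g1_sp1": True, "g1_sp2": True, "g1_sp3": True,
    "g2_sp1": False, "g2_sp2": False, "g2_sp3": False,
    "g2_sp4": True,
}

aberrant = flag_aberrant(states, reference_group)
print(aberrant)
```

A flagged species is only a candidate for LGT or another anomaly; as noted above, widespread disagreement with the reference model calls for reconsidering the model rather than discarding the signature.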
Another potential problem in using conserved indels to deduce evolutionary relationships is that the indels in different species could arise independently rather than being introduced only once in a common ancestor. It could be argued that the indels are introduced in proteins mainly at sites which are interstructural junctions with high tolerance for changes . In this scenario, the indels in a given protein at a particular position will be introduced by chance in different species and the species containing or lacking the indels will show no specific evolutionary relationship to each other. However, the signature sequences in different proteins that we have identified argue strongly against such a possibility for several reasons. First, most of the signatures used and described in our work are present in highly conserved regions, which because of their high degree of sequence conservation are sites of high structural constraints. Second, a signature sequence, as defined in our work [30,31], is present in only specific and well-defined groups of prokaryotes and is not randomly present in different organisms. Third, in nearly all cases, the indicated signatures (i.e., inserts or deletions) which are of defined length and conserved sequence are found in all species from particular groups of prokaryotes but generally in none of the species from other groups. These observations are inconsistent with the independent origin of the signatures (indels) in different species and instead strongly suggest that they are due to common ancestry.
Two examples further illustrate and emphasize these points. In the highly conserved Hsp70 protein found in all eubacteria, a large insert of 21–23 amino acids (aa) is present in all Gram-negative bacteria but is not found in any of the homologs from Gram-positive bacteria [30,31,77]. All true Gram-negative bacteria are known to contain an outer membrane as a distinctive structural characteristic not found in any Gram-positive bacteria [30,31,77,78]. The formation of the outer membrane, which created the periplasmic compartment, due to its very complex nature likely occurred only once during the course of evolution. Since the presence or absence of this large conserved insert in the Hsp70 protein shows a perfect correlation with this important structural characteristic of the prokaryotes, it gives confidence that this molecular signature is biologically meaningful and that the inferences based on it are reliable [30,31,77,78,80].
Another example that is useful and instructive is the Hsp60 or GroEL protein. Similar to the Hsp70 protein, Hsp60 homologs are found in all eubacteria without any exception and they have been cloned and sequenced from more than 130 prokaryotic organisms representing various divisions of eubacteria as well as a large number of eukaryotic organisms [83,84]. In this protein, a 1-aa insert in a highly conserved region is found in all Gram-negative bacteria and eukaryotic homologs but this insert is not present in the Deinococcus-Thermus group of species or in any of the species belonging to the Gram-positive groups of bacteria. A partial alignment of Hsp60 sequences containing the signature region for 132 eubacterial homologs is shown in Fig. 1. In addition to the boxed indel, this protein contains a number of specific amino acid substitutions which are distinctive of different subdivisions of eubacteria. The simplest explanation for the boxed insert, as suggested in our earlier work , is that it was introduced in a common ancestor of various other eubacteria after the branching of Gram-positive bacteria and the Deinococcus-Thermus group. What is the probability that this signature sequence occurred by chance? If this insert originated independently in different species, one would have to postulate that this random insertion event took place in various Gram-negative bacteria (>80) but did not occur at any time in any of the Gram-positive bacteria (>50 sequences). Since the region surrounding this indel is highly conserved in all species, the structural constraints on the protein should be comparable in all cases. If the presence or absence of the insert in Hsp60 could be compared to the two outcomes of a tossed coin, then the probability that these results could be obtained by chance would be comparable to observing all tails in the first 50 tosses (i.e., minus insert) and all heads (i.e., plus insert) in the next 80 tosses. 
The probability of such an event is $2^{-50} \times 2^{-80} = 2^{-130}$ (roughly $7 \times 10^{-40}$), which is vanishingly small, indicating that an independent origin of the insert is highly unlikely. A similar argument could be made for most other signatures described in our work.
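The coin-toss estimate can be checked with a few lines of Python. This is only a sketch of the arithmetic under the analogy used above: each of the 50 Gram-positive and 80 Gram-negative sequences is treated as an independent toss of a fair coin, which is the simplifying assumption of the analogy rather than a model of indel evolution.

```python
from fractions import Fraction
import math

n_minus_insert = 50   # Gram-positive sequences lacking the insert
n_plus_insert = 80    # Gram-negative sequences containing the insert

# Exact probability of one specific outcome in 130 independent fair tosses.
p = Fraction(1, 2) ** n_minus_insert * Fraction(1, 2) ** n_plus_insert

# Order of magnitude, computed from logarithms.
log10_p = -(n_minus_insert + n_plus_insert) * math.log10(2)

print(p == Fraction(1, 2 ** 130))
print(round(log10_p, 2))
```

The exact value is 1/2^130, and its base-10 logarithm is about -39.1, which is why the probability is on the order of 10^-39 to 10^-40.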
Another potential problem that can confound the interpretation of signature sequence data could result from unidentified gene duplication events. It is possible that in some cases, two types of homologs may be present in various species (resulting from an ancient gene duplication event) and of these only one may contain a given sequence signature. In this case, if the sequences are known for one kind of homolog from some species, and the second type of homolog from a different group of species, then the failure to recognize the two types of homologs will lead to incorrect inferences. However, for all proteins where signatures have been identified in this work, only single homologs are found in most eubacteria. Only in a small number of cases (viz. Hsp60 and Hsp70, DNA gyrase B) are two or more homologs sometimes present within isolated species or in closely related organisms [30,85–87]. However, these homologs are the results of recent gene duplications and they present minimal problems in the interpretation of the signature sequence data.
The criteria discussed here for identifying LGT and other potential problems will be helpful in assessing the significance of different protein signatures that will be described in this work. It should be mentioned that the indels in proteins generally tend to be short, with the most common indel length being one amino acid. The length of the insert in a protein is governed by the structural constraints imposed by its function, with smaller inserts more readily accommodated by the protein structure and function than larger ones. However, the preponderance of the smaller indels in proteins does not necessarily mean that they are evolutionarily less significant. As seen above for the Hsp60 protein, a 1-aa insert in a highly conserved region of a protein could be as meaningful and important as a longer insert in a different protein. In terms of evolution, both larger and shorter inserts involve a single genetic event requiring either addition or deletion of nucleotides in multiples of three, which can occur with equal probability for smaller or larger indels.
The signatures described below were identified by aligning orthologs of different proteins in the NCBI database as described in earlier work [30,31]. The alignments were inspected manually to identify well-defined indels in conserved regions which are of potential use for evolutionary studies. The indels which were not flanked by conserved sequences were adjudged unreliable and generally not further investigated. Most of the analyses reported here were completed by March 1999.
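The alignments in this work were inspected manually. Purely as an illustration of the stated criterion (a well-defined indel flanked on both sides by conserved sequence), the following Python sketch scans a toy alignment automatically. The conservation threshold, the flank width, the function names and the sequences are all assumptions made for this example and are not parameters from the original analysis.

```python
def is_conserved(column, threshold=0.8):
    """True if the column has no gaps and one residue fills >= threshold of rows."""
    if "-" in column:
        return False
    top_count = max(column.count(residue) for residue in set(column))
    return top_count / len(column) >= threshold


def find_flanked_indels(seqs, flank=3, threshold=0.8):
    """Return (start, end) column ranges of gap-containing runs that are
    bounded on both sides by `flank` consecutive conserved columns."""
    ncol = len(seqs[0])
    cols = ["".join(s[i] for s in seqs) for i in range(ncol)]
    has_gap = ["-" in c for c in cols]
    hits = []
    i = 0
    while i < ncol:
        if not has_gap[i]:
            i += 1
            continue
        j = i
        while j < ncol and has_gap[j]:
            j += 1                      # extend to the end of the gap run
        left = cols[max(0, i - flank):i]
        right = cols[j:j + flank]
        if (len(left) == flank and len(right) == flank
                and all(is_conserved(c, threshold) for c in left + right)):
            hits.append((i, j))
        i = j
    return hits


# Hypothetical toy alignment; column 5 holds a 1-aa indel.
toy = ["MKVLG-AGTTD", "MKVLG-AGTSD", "MKVLGSAGTTD", "MKVLGSAGTTD"]
indels = find_flanked_indels(toy, flank=3)
print(indels)
```

Indels that fail the flank test are simply not reported, which mirrors the practice described above of setting aside indels not flanked by conserved sequences. A real screen would still need manual review of each candidate.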
3 Evolutionary relationships among proteobacteria
3.1 Signature sequences indicating a close relationship of the Chlamydia-Cytophaga group to the proteobacteria
In earlier work, we presented evidence that the evolutionary relationship within eubacteria is a continuum with different eubacterial groups evolving from a common ancestor in the following order: low-G+C Gram-positive→high-G+C Gram-positive→Deinococcus-Thermus→cyanobacteria→Spirochetes-Chlamydia and relatives→Proteobacteria-1 (α, δ and ε)→Proteobacteria-2 (β and γ). The branching orders of different groups were determined by means of signatures in different proteins which were introduced either prior to or following the branching of these groups. In addition, for a number of groups (viz. low-G+C Gram-positive, high-G+C Gram-positive and cyanobacteria) unique group-specific signatures were identified [30,89]. In these studies, a specific relationship of the proteobacterial group to the Spirochetes-Chlamydia divisions was indicated by the presence of a 1-aa insert in a conserved region in the FtsZ protein [30,32]. The homologs of the FtsZ protein play an essential role in the cell division and cell septation processes in prokaryotes and they also show limited sequence and functional similarity to the eukaryotic tubulin [90–92]. The FtsZ homologs have been found in all completed bacterial genomes [86,93–107], except for the mycoplasma species [108,109]. The identified insert in the FtsZ protein was present in all organisms belonging to the Spirochetes-Chlamydia-Cytophaga and Aquifex groups of species as well as in different subdivisions of proteobacteria, but was not found in any homologs from the other groups of prokaryotes (e.g., cyanobacteria, Deinococcus-Thermus group, low- and high-G+C Gram-positives, archaebacteria) [30,32]. This signature provided evidence that in comparison to the other prokaryotic groups, proteobacteria are more closely related to the Spirochetes-Chlamydia-Cytophaga groups of organisms. The signature sequences described below now further clarify this relationship.
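The way nested signature distributions translate into a branching order can be sketched in Python. The example encodes, in simplified form, four signature distributions reported in this review (the Hsp60 1-aa insert, the FtsZ 1-aa insert, the alanyl-tRNA synthetase 4-aa insert and the proteobacteria-specific Hsp70 2-aa insert) and assumes that each signature arose once and was inherited by all later-branching groups. The function name, the coarse group labels and the tier-by-count rule are illustrative assumptions, not the authors' procedure.

```python
def branching_order(groups, signatures):
    """Order groups from earliest to latest branching.

    groups: list of group names.
    signatures: dict name -> set of groups carrying the derived state.
    Assumes the sets are strictly nested (each a subset of the next larger);
    raises ValueError otherwise. Groups with equal counts share a tier.
    """
    sets = sorted(signatures.values(), key=len, reverse=True)
    for bigger, smaller in zip(sets, sets[1:]):
        if not smaller <= bigger:
            raise ValueError("signature distributions are not nested")
    tiers = {}
    for group in groups:
        count = sum(group in s for s in sets)   # derived states carried
        tiers.setdefault(count, []).append(group)
    return [tiers[k] for k in sorted(tiers)]


groups = ["Gram-positive", "Deinococcus-Thermus", "cyanobacteria",
          "spirochetes", "Chlamydia-Cytophaga", "proteobacteria"]

# Simplified encoding of distributions described in the text.
signatures = {
    "Hsp60 1-aa insert": {"cyanobacteria", "spirochetes",
                          "Chlamydia-Cytophaga", "proteobacteria"},
    "FtsZ 1-aa insert": {"spirochetes", "Chlamydia-Cytophaga",
                         "proteobacteria"},
    "AlaRS 4-aa insert": {"Chlamydia-Cytophaga", "proteobacteria"},
    "Hsp70 2-aa insert": {"proteobacteria"},
}

order = branching_order(groups, signatures)
for tier in order:
    print(tier)
```

With only these four signatures, the Gram-positive and Deinococcus-Thermus groups fall in the same tier because none of the encoded signatures separates them; resolving them requires further signatures of the kind cited in the text. Exceptions attributed to LGT (for example, an isolated species carrying a signature outside its group) would break the nestedness check and have to be handled before applying such a rule.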
The relative branching order of the species within the Spirochete-Chlamydia-Cytophaga groups is not clear from earlier phylogenetic studies [3,10,110] and it was also not resolved in our earlier work . However, a signature contained in alanyl-tRNA synthetase now provides evidence that this group can be divided into two groups. Alanyl-tRNA synthetase, which plays an essential role in protein synthesis by charging the alanyl-tRNA with its cognate amino acid [111–113], has been found in all completed genomes [86,93–106,108,109,114]. This protein contains a 4-aa insert (Fig. 2) that is uniquely shared by all members belonging to different divisions of proteobacteria as well as by various organisms belonging to the Chlamydia, Bacteroides-Cytophaga and green sulfur bacteria (Chlorobium tepidum) groups. Interestingly, this insert is also present in Aquifex aeolicus, indicating that this species may also be related to the above groups of prokaryotes. This insert, however, is not present in the spirochete species Borrelia burgdorferi and Treponema pallidum, or in species belonging to various other divisions of Gram-negative bacteria, Gram-positive bacteria and archaebacteria. This signature suggests that the spirochete division of prokaryotes branched off prior to those belonging to the Chlamydia-Bacteroides-Cytophaga and green sulfur bacteria groups. The observed insert in alanyl-tRNA synthetase was likely introduced in a common ancestor of the Chlamydia-Cytophaga-green sulfur bacteria and proteobacteria after the branching of various other groups of prokaryotes including spirochetes (Fig. 2, top diagram).
A close relationship of the species belonging to the Chlamydia, green sulfur bacteria and Aquifex groups to the proteobacterial division is also suggested by a signature identified in the citric acid cycle enzyme succinyl-CoA synthetase (β subunit) (Fig. 3). Succinyl-CoA synthetase carries out cleavage of the thioester bond in succinyl-CoA in a coupled reaction to generate succinate, at the same time producing GTP from GDP. It is the only step in the citric acid cycle that directly leads to the formation of a high-energy phosphate bond. This protein contains a conserved insert of 7 aa that is shared by all homologs from different groups of proteobacteria as well as by the Chlamydia-green sulfur bacteria and Aquifex groups of species. The shared presence of this insert in Aquifex aeolicus, Chlamydia and green sulfur bacteria is again suggestive of a close relationship between these groups of organisms. However, this insert is not present in any homolog belonging to the cyanobacteria, Deinococcus-Thermus group, high-G+C Gram-positive bacteria or archaebacterial species. A number of genomes of intracellular pathogens, viz. Tre. pallidum, B. burgdorferi, Helicobacter pylori, Mycoplasma genitalium and M. pneumoniae, contained no homolog for this protein [95,99,105,108,109]. Besides proteobacteria and the Chlamydia-Cytophaga group, this insert is also present in the two low-G+C Gram-positive species, Bacillus subtilis and Staphylococcus aureus. Based on signature sequences in a number of different proteins, the low-G+C Gram-positive bacteria are indicated to be very distinct and ancestral to the above groups of prokaryotes [30,31]. Therefore, the shared presence of this insert in the two low-G+C Gram-positive species is likely due to LGTs from the above group of organisms containing the insert. The alternative possibility that the ancestral enzyme contained the insert and that it was subsequently lost independently in various other groups of eubacteria (viz. high-G+C Gram-positive, Deinococcus-Thermus, cyanobacteria) is considered less likely. The boxed insert in succinyl-CoA synthetase is also present in all eukaryotic (mitochondrial) homologs, which supports the view that mitochondria originated from a species belonging to a group (α-proteobacteria) that contained the insert [14,15,17,84,116–120].
3.2 Protein signatures defining the proteobacterial group
A number of proteins contain signatures that appear specific for the groups of organisms generally classified as proteobacteria. The first of these signatures is a 2-aa insert in the highly conserved Hsp70 or DnaK family of proteins [121,122]. The members of the Hsp70/DnaK family are present in all eubacteria and thus far no exception has been found [86,93–109]. The 2-aa signature in Hsp70 (boxed, Fig. 4) is present in all sequenced homologs from organisms comprising the α, β, γ, δ and ε subdivisions of proteobacteria, but it is not found in any homologs from other groups of prokaryotes, including the Chlamydia-Cytophaga group, Aquifex, Spirochetes, cyanobacteria, Deinococcus-Thermus group, different groups of Gram-positive bacteria and various archaebacteria. We have also carried out BLAST searches on a fragment of the Hsp70 protein containing this signature sequence on the NCBI Unfinished Microbial Genome Sequences database. The sequences retrieved from these searches are denoted by asterisks in the various alignments. In all cases, the signature described here correctly distinguished the various proteobacterial species from other groups of prokaryotes. In cyanobacterial species, which contain multiple Hsp70 homologs, this insert was not present in any of the homologs. In Escherichia coli, in addition to the normal Hsp70 protein, a second homolog Hsc66 is found which is distantly related to Hsp70 and carries out an unrelated function [124,125]. This homolog lacks the insert, but based on its limited sequence similarity and unrelated function, it is readily recognized as a paralog, and the absence of the insert does not confuse or affect the interpretation. As noted in earlier work [30,87], in addition to proteobacteria this insert is also present in the Hsp70 homolog from Thermomicrobium roseum, a species which based on 16S rRNA phylogeny is placed in the same group as the green nonsulfur bacterium Chloroflexus aurantiacus [3,10].
However, signature sequences in Hsp60 and Hsp70 proteins indicate that, in contrast to the close relationship observed between The. roseum and the proteobacterial group, the species Cf. aurantiacus branches very deeply and seems related to the Deinococcus-Thermus group of species [30,32]. Further studies on The. roseum should be helpful in clarifying its phylogenetic position. The indicated signature in Hsp70 was likely introduced in a common ancestor of proteobacteria as indicated in Fig. 4 (top diagram). This insert in the Hsp70 protein is also present in all eukaryotic mitochondrial as well as cytosolic homologs [30,127,128], providing evidence that both these groups of homologs are derived from proteobacterial species.
The enzyme CTP synthetase catalyzes conversion of UTP into CTP by transferring an amino group to the 4-oxo group of the uracil ring. A prominent signature, distinctive of the proteobacterial group, has been identified in this protein (Fig. 5). Except for the mycoplasma species, the gene encoding CTP synthetase is present in all other completed microbial genomes. The signature consists of a 10-aa insert (boxed), which is present in all proteobacterial species including different organisms from α, β, γ, δ and ε subdivisions for which the sequence information is currently available, including sequences in the NCBI Unfinished Microbial Genomes database. Other prokaryotes including species belonging to the Chlamydia-green sulfur bacteria and Aquifex, spirochetes, cyanobacteria, Deinococcus-Thermus group, low- and high-G+C Gram-positive bacteria and various archaebacteria, did not contain the indicated insert. Interestingly, all of the high-G+C Gram-positive species contained a smaller 4-aa insert in the same position, instead of the 10-aa insert found in the proteobacteria. This insert in the high-G+C Gram-positive bacteria was likely introduced independently of the insert in the proteobacteria and it could serve as a distinguishing signature for this group. As in the case of Hsp70 protein (Fig. 4, top), the large insert in CTP synthetase was likely introduced in a common ancestor of the proteobacterial group and it provides a molecular marker to distinguish and define this group of organisms.
Inorganic pyrophosphatases are important in metabolism because many biosynthetic reactions which are thermodynamically unfavorable, and where pyrophosphate (PPi) is one of the products, are made energetically favorable and irreversible by the coupled hydrolysis of PPi [129,130]. A signature sequence consisting of an insert of 2 aa which is shared by the proteobacterial species is present in this protein (Fig. 6). As in the case of Hsp70 and CTP synthetase, the indicated insert in the inorganic pyrophosphatases is present in various proteobacterial species but not in other groups of eubacteria except Aquifex aeolicus. Interestingly, a similar insert of 2 aa in the same position is also present in the two Crenarchaeota archaebacteria Sulfolobus acidocaldarius and Aeropyrum pernix for which sequence information is presently available, but not in any other archaebacteria (i.e., Euryarchaeota). The insert in the Crenarchaeota species was likely introduced independently and it could possibly serve as a molecular marker to distinguish between these two groups of archaebacteria. The insert in Aquifex aeolicus likely originated by LGT from one of the above two groups containing the insert.
It may be noted that, as in the case of succinyl-CoA synthetase, inorganic pyrophosphatase is also not found in the genomes of the spirochete species Tre. pallidum and B. burgdorferi [99,105]. The gene for this protein is also absent from the genome of B. subtilis, which is surprising given that homologs of this protein are present in the related species B. stearothermophilus as well as in the mycoplasma species (M. genitalium and M. pneumoniae), which contain only a minimal complement of genes [108,109].
3.3 Protein signatures defining a clade of α-, β- and γ-proteobacteria
A number of proteins contain signatures that provide evidence for the existence of a clade consisting of α-, β- and γ-proteobacteria, but exclusive of the δ and ε subdivisions. These proteins are the following.
Lon protease is an ATP-dependent protease, found in both eubacteria and eukaryotes, which is involved in the regulation and energy-dependent degradation of several short-lived proteins [131,132]. The homologs of Lon protease are present in all sequenced bacterial genomes with the exception of Synechocystis PCC 6803 and Mycobacterium tuberculosis [86,93–109]. In this protein, which shows a high degree of sequence conservation, a 1-aa deletion is found in various species representing the α, β and γ subdivisions of proteobacteria, but not in any of the other divisions of eubacteria including those from the δ and ε subdivisions of proteobacteria (Fig. 7). The indicated signature thus appears to be a unique characteristic of the α, β and γ subdivisions of proteobacteria. The simplest explanation for this observation is that the indicated deletion occurred in a common ancestor of the α, β and γ subdivisions, after the branching of the δ and ε subgroups (Fig. 7, top diagram).
In eukaryotic cells, Lon protease is synthesized with an N-terminal mitochondrial targeting presequence and it is localized within mitochondria. In view of this, it is surprising to find that none of the eukaryotic homologs of this protein contains the signature (1-aa deletion) which is common to the various α, β and γ subdivision members for which sequence information is presently known. Since the origin of mitochondria from a member of the α-proteobacterial subdivision is supported by several lines of evidence [14,16,116,117,119,120], it is conceivable that the indicated deletion is absent from certain α-proteobacterial species (presently unknown) from which mitochondria probably originated.
3.3.2 DNA gyrase A subunit
DNA topoisomerases are enzymes that play essential roles in DNA replication, transcription, recombination and repair [81,133]. DNA gyrase in bacteria is a type II topoisomerase which uses the energy of ATP hydrolysis to introduce negative supercoils into DNA. The enzyme is made up of two subunits designated A and B. The DNA gyrases are present in all sequenced eubacterial genomes [86,93–109]. A shared sequence signature common to only the members of the α, β and γ subdivisions of proteobacteria is present in the A subunit of DNA gyrase. The signature consists of an insert of 34 aa in a conserved region, which is present in most sequenced homologs from the α, β and γ subdivisions of proteobacteria, but is not found in any of the species corresponding to the δ and ε subdivisions or in homologs from the other divisions of eubacteria (Fig. 8). Two species show anomalous behavior with regard to this signature. First, the insert is absent in the reported sequence from Salmonella typhimurium. Although this observation could be explained by the selective loss of the insert in this species, the insert is present in all other β- and γ-proteobacteria, including the closely related species S. typhi, so its absence in S. typhimurium is likely to be anomalous. It would be helpful to independently clone and confirm the sequence from this species to exclude trivial possibilities. Second, the insert is also absent from the α-proteobacterium Caulobacter crescentus, although it is present in all other studied α-proteobacteria. This observation may or may not be anomalous. Since the evolutionary relationship within eubacteria is indicated to be a continuum, it is possible that within the α subdivision, C. crescentus is an earlier branching species and the indicated insert in the DNA gyrase A subunit was introduced in a common ancestor of β, γ and other α subdivision members after the branching of C. crescentus.
In addition to the common signature for α-, β- and γ-proteobacteria, the boxed insert in DNA gyrase A contains another interesting feature that appears to be a unique characteristic of the α-proteobacterial homologs. Within the large insert present in this protein, a deletion of 4 aa (boxed) is seen in various organisms belonging to the α subdivision members (Fig. 8). One possible interpretation of these results is that after the introduction of the large insert in a common ancestor of α, β and γ subdivisions, a further deletion occurred in the branch leading to the α subdivision members. Alternatively, the original insert in these groups was only 30 aa long and subsequently a 4-aa insert was introduced in a common ancestor of β- and γ-proteobacteria.
In bacteria, SecA homologs are involved in the export of proteins to the periplasmic compartment [136,137]. The SecA homologs have been found in all completed bacterial genomes, with the exception of Chlamydia trachomatis [86,93–106,108,109]. A conserved insert of 7 aa shared by various species belonging to the α, β and γ subdivisions of proteobacteria is present in the SecA protein (Fig. 9). The indicated insert in SecA protein is not present in H. pylori or Campylobacter jejuni, which are members of the ε subdivision, or in members from other eubacterial divisions such as cyanobacteria, Aquifex, Deinococcus and different groups of Gram-positive bacteria. However, a smaller unrelated insert of 5 aa is also present in this position in the two spirochete species, B. burgdorferi and Tre. pallidum. Based upon the signature sequences described earlier, it is unlikely that the spirochetes and the α-, β- and γ-proteobacteria shared a common ancestor exclusive of the δ- and ε-proteobacteria and the Chlamydia-Cytophaga groups of species. The simplest explanation to account for these results is that the large insert in proteobacterial species was introduced in a common ancestor of the α, β and γ subdivisions, whereas the smaller insert in the spirochete species originated independently.
The enzyme biotin synthetase, which is encoded by the bioB gene in E. coli, catalyzes the last step in the biosynthesis of biotin involving conversion of dethiobiotin to biotin [138,139]. A 2-aa insert in this protein has been identified which is commonly shared by homologs from the α, β and γ subdivisions of proteobacteria, but is not present in other divisions of eubacteria (Fig. 10). The only exception observed is that of Chl. pneumoniae, which could have acquired this insert either independently or by means of LGT. Although biotin synthetase is widely distributed among proteobacterial species, the gene for this protein was not identified in the genomes of a number of bacterial species, particularly those which are intracellular pathogens (viz. Rickettsia prowazekii, Chl. trachomatis, B. burgdorferi, M. genitalium and M. pneumoniae). The biotin requirement in these species is likely met either by dietary sources or by production of the vitamin by the intestinal resident bacteria. The gene for biotin synthetase has been cloned and sequenced from yeast as well as higher plants, and the boxed insert (which is a characteristic of the α, β and γ subdivisions) is present in these homologs, providing evidence that they are derived from a species belonging to this clade.
3.3.5 DNA gyrase B subunit
The DNA gyrase protein described above contains another useful signature sequence within its B subunit. In this case, a 1-aa insert in a highly conserved region is present in all of the known homologs from β- and γ-proteobacteria and some α subdivision members (Fig. 11). The α subdivision members where this insert has thus far been identified are C. crescentus and Ri. prowazekii. Interestingly, both these species also contain another DNA gyrase B homolog which lacks the insert. Based on these observations, the explanation we favor is that there was a gene duplication event for DNA gyrase B in a specific lineage of α-proteobacteria, where the observed insert was first introduced. Subsequently, while these α-proteobacterial species retained both the genes, the gene lacking the insert was lost in the common ancestor of the β- and γ-proteobacteria which evolved from α-proteobacteria (see Section 3.4). The alternative possibility that the insert was introduced in a common ancestor of β- and γ-proteobacteria and that subsequently these homologs were acquired by some α-proteobacterial species cannot be excluded.
3.4 Protein signatures defining a clade of β- and γ-proteobacteria
A close relationship among the β- and γ-proteobacteria has been noted by Woese based on the oligonucleotide catalogs and the presence of specific nucleotides in 16S rRNA sequences that appear distinctive of these subdivisions. The signature sequences in a large number of proteins described below strongly support this inference and provide evidence for the existence of a clade consisting of the β and γ subdivisions of proteobacteria.
The members of the Hsp70/DnaK family of proteins, described earlier, contain another signature consisting of a 4-aa insert in a highly conserved region which is uniquely shared by members of the β and γ subdivisions of proteobacteria. Since our original description of this signature, sequence information for this protein has become available from a large number of additional eubacterial species and an update of this signature is presented in Fig. 12. Due to space constraints, only limited information for the Gram-positive bacteria is shown and no information for the archaebacterial and eukaryotic homologs is presented. As seen, the indicated signature in Hsp70 is present in all known homologs representing numerous genera of the β and γ subdivisions of proteobacteria. In contrast to the β- and γ-proteobacteria, this insert is not present in any of the homologs from different divisions of eubacteria, or the various homologs from archaebacteria and eukaryotes (and unpublished data). This insert was likely introduced in a common ancestor of the β- and γ-proteobacteria, after the branching of the δ, ε and α subdivisions. In view of the highly conserved and well-defined nature of this signature, it could be used to define a clade consisting of the β and γ subdivisions of proteobacteria. This signature will be referred to as the Hsp70 (β,γ) insert in our work. This insert in Hsp70 protein is not present in any of the eukaryotic homologs (both mitochondrial and nuclear cytosolic), providing evidence against their origin from the β and γ subdivisions of proteobacteria.
Hashimoto et al. have previously described a 37-aa insert in valyl-tRNA synthetase as being commonly shared by the γ-proteobacteria and eukaryotic homologs. However, sequence information available now from many additional species indicates that this insert is a characteristic of both the β and γ subdivisions of proteobacteria as well as of the eukaryotic homologs (Fig. 13). Presently, no sequence information is available for this protein from any α-proteobacterial species. In Ri. prowazekii, the only α-proteobacterial species whose complete genome has been sequenced, the gene for valyl-tRNA synthetase was not found.
In eukaryotic cells, a single gene is known to encode both mitochondrial and cytosolic valyl-tRNA synthetases [142,143]. Since the origin of mitochondria from an α-proteobacterium is supported by a large body of evidence [14,16,17,116,117,119,120,144,145], it is likely that this insert will be eventually found in some species belonging to the α-proteobacterial subdivision.
The enzyme PRPS catalyzes the transfer of a pyrophosphate group from ATP to ribose 5-phosphate leading to the formation of phosphoribosyl pyrophosphate, required for the biosynthesis of purine, pyrimidine and pyridine nucleotides as well as the amino acids histidine and tryptophan. A signature consisting of a 1-aa insert in a conserved region has been identified in this protein, which is specific for the β- and γ-proteobacteria (Fig. 14). The indicated insert is not present in species representing the α and ε subdivisions and presently no sequence is known from the δ subgroup species. The gene for PRPS has been sequenced from a broad range of prokaryotes and, with the exception of Chlamydia and Rickettsia [102,103], it is present in all of the completed eubacterial genomes. The identified signature is highly specific for the β- and γ-proteobacteria and it is not found in any other prokaryotic homologs. This insert, like that of Hsp70 (Fig. 12) and valyl-tRNA synthetase (Fig. 13), was likely introduced in a common ancestor of the β- and γ-proteobacteria.
3.4.4 Ribosomal L24 protein
The ribosomal L24 protein contains a 1-aa deletion that is present in all of the homologs from the β and γ subdivisions of proteobacteria as well as in the two Chlamydia species (Fig. 15). This deletion, however, is not found in any other group of prokaryotes including species from the α and ε subdivisions of proteobacteria. The simplest explanation for this observation is that the deletion was introduced in a common ancestor of the β and γ subdivisions of proteobacteria after the branching of other subdivisions. The deletion in Chlamydia species could be explained either by an independent deletion event or by LGT from the β- and γ-proteobacteria.
3.4.5 Ribosomal protein L18
Similar to the L24 protein, ribosomal L18 protein also contains a 1-aa deletion mainly restricted to the β and γ subdivisions (Fig. 16). The only exception in this case is Thermus aquaticus, which could have acquired this deletion independently or by LGT. The deletion is not present in other groups of Gram-negative and Gram-positive bacteria. It should be noted that the gene for L18 protein was not found in a number of completed microbial genomes including Hel. pylori, Chl. trachomatis, Tre. pallidum, M. genitalium and M. pneumoniae [95,102,105,108,109].
UDP-glucose epimerase, which is encoded by the galactose operon, is involved in the conversion of galactose 1-phosphate into glucose 1-phosphate. This reaction is essential for the entry and metabolism of galactose via the glycolytic pathway. A 1-aa insert is present in this protein in all of the species belonging to the β and γ subdivisions of proteobacteria except Burk. pseudomallei (Fig. 17). The insert is not present in the homologs from the α and ε subgroup members, in other divisions of Gram-negative and Gram-positive bacteria or in archaebacterial homologs.
Cysteine synthase catalyzes the formation of l-cysteine from O-acetyl-l-serine and hydrogen sulfide, which is the terminal step in cysteine biosynthesis [148,149]. The gene for cysteine synthase has been cloned from diverse groups of bacterial and plant species. A signature consisting of a 1-aa insert has been identified in this protein, which is commonly shared by all β and γ subdivision members, but is generally not present in other groups of prokaryotes or plant homologs (Fig. 18). Besides the β- and γ-proteobacteria, the insert is also present in Aquifex aeolicus and a cyanobacterial species, which could have acquired this insert either independently or by means of LGT. The insert in cysteine synthase is absent in all eukaryotic homologs providing evidence that they are not derived from the β- and γ-proteobacterial species.
3.5 Protein signatures specific for the γ-proteobacteria
PAC transformylase, an enzyme involved in purine nucleotide biosynthesis, contains a 2-aa deletion (Fig. 19), which is shared by all species belonging to the γ subdivision of proteobacteria, but is not present in other proteobacterial subdivisions or other groups of prokaryotes.
3.5.2 Ribosomal L16 protein
A 1-aa deletion common to γ-proteobacterial species is present in the L16 protein (Fig. 20). The bacterial endosymbiont of pea aphid (Acyrthosiphon kondoi) also contained this deletion indicating that it originated from this group of proteobacteria. The signature sequences in both PAC transformylase and ribosomal protein L16 were likely introduced in a common ancestor of the γ-proteobacteria.
The eukaryotic homologs of L16 lack the indicated deletion providing evidence that they are not derived from γ-proteobacteria. One exception is the protist species Reclinomonas americana, which has the largest mitochondrial genome [119,151]. However, phylogenetic analysis based on L16 sequences indicates that Rec. americana shows the expected closer relationship to the α-proteobacterial group in comparison to the γ subdivision members (unpublished results) indicating that the deletion in the Rec. americana homolog likely originated independently.
3.6 Signature sequence specific for the ε-proteobacteria (RecA protein)
Presently, very limited sequence information is available for the species corresponding to the δ and ε subdivisions of proteobacteria. Due to this, no signature which clearly shows the relative branching orders of these two subdivisions with respect to the other groups of proteobacteria has been identified. However, in the RecA protein, involved in DNA repair and recombination processes, we have identified a 1-aa insert that appears specific for the ε subdivision members (Fig. 21). The indicated insert is present in Hel. pylori, Camp. jejuni, as well as Camp. fetus, but it is not found in Myxococcus xanthus (δ) or other proteobacterial subdivisions, or in any other divisions of eubacteria. The observed results are best explained by postulating that the indicated insert was introduced in the branch leading to ε-proteobacteria. Although, in comparison to the ε subdivision members, the species corresponding to the δ subdivision show a closer relationship to the α, β and γ subdivisions in some phylogenies (Section 4), in view of the very limited sequence information that is available they are presently kept in a single group.
4 Branching patterns of proteobacteria in phylogenetic trees
As indicated above, the phylum ‘purple bacteria and relatives’, now designated proteobacteria, was originally described based on 16S rRNA oligonucleotide catalogs and phylogenies and these still remain the sole basis for defining it [1–3,9,10,12,19–21,23,24,76]. In view of the central role of the 16S rRNA phylogenies in defining this group as well as other divisions within prokaryotes [3,9,10,19,23], it is important to examine the branching order and evolutionary relationships among proteobacterial subdivisions within the rRNA trees. A comprehensive phylogenetic tree for 16S rRNA based on 253 prokaryotic species covering all major groups has been presented by Olsen et al. The tree was rooted between archaebacteria and eubacteria based on duplicated elongation factors (EF-1α/Tu and EF-2) and ATPase gene sequences [153–155]. In this rooted tree, the proteobacterial species form a monophyletic group and their closest relatives are indicated to be species corresponding to the Chlamydia-Planctomyces group. The spirochetes and cytophaga groups of species are the closest relatives of the latter group. Within proteobacteria, deepest branching is observed for a clade consisting of the δ and ε subdivision members, followed by members of the α subdivision of proteobacteria. The γ subdivision species show polyphyletic branching, with members of the β subdivision branching between them. A similar branching order for the different proteobacterial subdivisions, and with Chl. trachomatis as their closest relative, is seen in a 16S rRNA phylogenetic tree reported by Eisen. In this latter work, the bootstrap scores for different nodes are also provided, which are helpful in terms of understanding the reliability of the branching orders and the robustness of different groups. In the tree reported by Eisen, a clade consisting of δ, α, β and γ subdivision members is observed with a bootstrap value of 84%, which is significant. The branch leading to the ε subdivision species and Chl. trachomatis is seen immediately prior to the above clade, but the bootstrap scores for these nodes are not shown because they were not significant. The clades consisting of α, β and γ subdivisions, and β- and γ-proteobacteria, are also clearly resolved with bootstrap scores of 96 and 95%, respectively. Further, similar to the phylogenetic tree reported by Olsen et al., the β and γ subdivision members are not clearly distinguished from each other and the former group branches in between members of the latter group. Thus, the branching pattern of the proteobacterial subgroups based on 16S rRNA trees (i.e., Chlamydia→ε, δ subdivisions→α subdivision→β and γ subgroups) is strikingly similar to that deduced based on signature sequences in different proteins.
The evolutionary relationships among proteobacterial subdivisions have also been examined based on a number of proteins. Several investigators have reported detailed phylogenetic analyses based on RecA protein sequences employing different methods [110,156–158]. In the phylogenetic trees based on RecA sequences, the branching patterns of different proteobacterial species are in general very similar to those seen in the 16S rRNA trees described above, with only minor differences in the branching positions of certain species (e.g., Camp. jejuni, Hel. pylori, Myxo. xanthus, etc.) [110,156,157]. Karlin et al. have analyzed RecA sequences based on pairwise comparisons of significant segment alignment scores. Similar to the RecA phylogenetic trees, their analysis indicated that the α-, β- and γ-proteobacteria formed a coherent group, with β and γ subdivisions consistently more closely related to each other than to the α subgroup sequences. In phylogenetic trees based on the highly conserved Hsp60 or GroEL protein, the proteobacterial species again formed a monophyletic group with the Chlamydia group of species as their closest relatives, and with different subdivisions branching in the order: Chlamydia→δ, ε group→α, β and γ groups [83,84,116,144]. Although the α, β and γ subdivisions are well resolved in the Hsp60 trees, the relative branching order of these groups is unclear [10,83,84,116,144]. The evolutionary relationships among eubacteria have also been examined based on sigma factor σ70 sequences. In phylogenetic trees based on σ70 sequences, the Chlamydia species again are the closest relatives of proteobacteria and the different subdivisions of proteobacteria branched in the following order: Chlamydia→ε, δ groups→α→β and γ groups.
Recently, Klenk et al. have reported phylogenetic analysis based on RNA polymerase β and β′ subunits. Although the representation of proteobacterial species in this study was limited, the organisms corresponding to various proteobacterial subdivisions branched in the following order: ε→α→β and γ. In both cases, the species Aquifex aeolicus was indicated to be the closest relative of the proteobacterial group. A close relationship of Aquifex aeolicus to Hel. pylori and Chlamydia species is also seen in phylogenetic trees based on group 1 sigma factor sequences. These results are of interest because, based on signature sequences in various proteins described here and in our earlier work, the Aquifex group of species appears closely related to the Chlamydia-Cytophaga-green sulfur bacteria group of organisms. These observations are at variance with the deep branching of Aquifex observed in the 16S rRNA and EF-Tu trees [10,162]. The basis for this discrepancy is not clear at present. However, it should be noted that Aquifex is reported to have received a large number of genes from archaebacteria by means of LGT. Further, contrary to the general belief that the genes for rRNA and information transfer processes are immune to horizontal transfer because of their interaction with a large number of other components [52,65,75], Asai et al. and Yap et al. have recently provided evidence that lateral transfer of rRNA genes between distantly related species can occur and is not precluded. Thus, it is possible that some of the genes showing deeper phylogenetic branching of Aquifex may be of archaebacterial origin.
Detailed phylogenetic studies have been carried out with the Hsp70/DnaK family of sequences in which a number of signature sequences that are useful in defining proteobacteria and some of their subdivisions are present [30,87,121,127]. Fig. 22 shows a neighbor-joining phylogenetic tree based on Hsp70 sequences. The tree is based on 362 aligned amino acid positions for which sequence information was available from all of the species. The tree reveals a monophyletic grouping of α, β, γ and δ subdivision species with a good bootstrap score (85%). The Chlamydia-Flavobacteria-Spirochetes groups of species are the closest relatives of this clade. The α subdivision members, which branch after Myxo. xanthus (δ group), also form a well-resolved monophyletic clade. The β and γ groups of species form a monophyletic clade 100% of the time; however, the relationship between these two groups was not clearly resolved. The various species corresponding to the γ subdivision, except for Francisella tularensis, formed a monophyletic clade with a 100% bootstrap score. F. tularensis, on the other hand, branched earlier than the β-proteobacteria, making the γ subgroup polyphyletic. In contrast to other proteobacteria, the two species belonging to the ε subdivision, Hel. pylori and Camp. jejuni, branched very deeply in the tree between Thermotoga maritima and the low-G+C Gram-positive group. The observed branching position of the ε group of species in the Hsp70 tree is clearly anomalous in view of the phylogenies and signature sequences in a large number of different genes/proteins [10,30,84,110,116,144,159,160]. This branching pattern is also inconsistent with the signature sequence in the Hsp70 protein (Fig. 3), which strongly suggests that the proteobacterial species form a coherent group. In earlier phylogenetic studies, where Hel. pylori and Camp. jejuni sequences were not included [30,87,122,127,165], a clear distinction between Gram-positive (monoderm) and Gram-negative (diderm) prokaryotes was strongly supported by different phylogenetic methods. The anomalous branching of Hel. pylori and Camp. jejuni in the Hsp70 tree is very likely a consequence of the long branch length phenomenon, where the fast-evolving lineages tend to branch deeply and artificially in the phylogenetic trees [34,166].
Phylogenetic analysis was also carried out based on alanyl-tRNA synthetase sequences, which show a specific relationship of the Chlamydia-Cytophaga-Aquifex group to the proteobacteria. In a phylogenetic tree based on Ala-tRNA synthetase sequences, the α- and γ-proteobacterial subdivisions are well resolved, with a β-group species (Thiobacillus ferrooxidans) lying in between them (Fig. 23). The species Chl. trachomatis shows a closer relationship to the α, β, γ clade in comparison to Hel. pylori, which branches in between Chl. trachomatis and Aquifex aeolicus. Although the branching orders in this tree are not reliable, owing to the long branch lengths and low bootstrap scores for many nodes [34,166,167], a close relationship of Aquifex to the Chlamydia species and a strong affinity of the Chlamydia-Aquifex groups of species to the proteobacteria, as deduced from the signature sequence, is clearly evident (Fig. 23). In addition to the proteins discussed here, Brown and Doolittle have reported phylogenetic trees for a large number of other proteins. However, in most of their phylogenetic trees the representation of the proteobacterial group was limited to only a few species, so that no clear inferences concerning their branching positions or evolutionary relationships could be drawn.
5 Evolutionary relationships among proteobacteria: summary
Based on the signature sequences in different proteins described above, a model for the relationship of the proteobacterial group to other eubacterial phyla as well as the branching order of different subdivisions within the proteobacteria can now be deduced. These studies indicate that the proteobacterial group shared a common ancestor with the Chlamydia-Cytophaga-green sulfur bacteria-Aquifex groups and that the different subdivisions within proteobacteria evolved from this common ancestor in the following order: Chlamydia-Cytophaga group→ε, δ group→α→β→γ (Fig. 24). Although, in comparison to the ε subdivision, species belonging to the δ subdivision (e.g., Myxococcus xanthus) are more closely related to the α-, β- and γ-proteobacteria, in view of the paucity of sequence data on these two subdivisions, they are presently kept in the same group. The branching orders of each of the above groups are clearly marked by signature sequences in a number of different proteins, which serve to define as well as circumscribe the resulting groups of species. The Chlamydia-Cytophaga group in the present work includes a number of species which based on rRNA phylogenies are placed in several divisions of eubacteria (viz. Chlorobiaceae, Bacteroides and Cytophaga group, Chlamydia and Aquificaceae). Most of these divisions are small, comprising only one or a few genera [4,5,168–171], and the evolutionary relationships among them are not resolved in any of the phylogenies. Although, based on signature sequences, the species belonging to these divisions have been shown to branch between Spirochetes and Proteobacteria, the exact relationships among the species comprising the Chlamydia-Cytophaga group are presently unclear. It is likely that as additional sequence information becomes available, new signatures will be identified that may clarify the relationships among this group of species.
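The logic by which nested signatures yield a linear branching order can be made explicit with a small Python sketch. The clade memberships are taken from Sections 3.3 to 3.5 and from the proteobacterial signatures in Hsp70 and CTP synthetase; the dictionary layout, the ASCII subdivision names and the function name `branching_order` are assumptions introduced for illustration, and the sketch treats each signature as a simple set of subdivisions while ignoring the isolated exceptions discussed in the text.

```python
# Subdivisions that share each signature, as described in the text.
SIGNATURES = {
    "Hsp70 / CTP synthetase insert": {"alpha", "beta", "gamma", "delta", "epsilon"},
    "Lon protease 1-aa deletion":    {"alpha", "beta", "gamma"},
    "SecA 7-aa insert":              {"alpha", "beta", "gamma"},
    "Hsp70 (beta,gamma) insert":     {"beta", "gamma"},
    "L16 1-aa deletion":             {"gamma"},
}


def branching_order(signatures):
    """Order subdivisions from earliest to latest branching.

    Each signature marks a clade. If the clades are strictly nested, the
    subdivisions excluded at each successive step branched off in that order.
    """
    clades = sorted({frozenset(groups) for groups in signatures.values()},
                    key=len, reverse=True)
    if not clades:
        return []
    for outer, inner in zip(clades, clades[1:]):
        if not inner < outer:  # proper-subset test
            raise ValueError(f"signatures conflict: {set(inner)} vs {set(outer)}")
    order = [sorted(outer - inner) for outer, inner in zip(clades, clades[1:])]
    order.append(sorted(clades[-1]))
    return order


if __name__ == "__main__":
    print(branching_order(SIGNATURES))
```

Two signatures that defined overlapping but non-nested clades would raise an error here, which is the situation that would force an explanation in terms of LGT or independent events.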
It would be useful to consider at this stage whether the inferences derived from signature sequences are reliable and whether LGT among species poses a serious problem in the interpretation of the data. From the data presented here, it is clear that the various identified signatures in proteins are present only in species belonging to certain specific taxa and not in other groups of prokaryotes. For example, the indicated inserts in the Hsp70 protein (Figs. 4 and 12) and CTP synthetase (Fig. 5) are unique for either the entire proteobacterial group or certain subdivisions within it, but are not found in homologs from any other divisions of eubacteria. Similarly, for a large number of other signatures, the indicated indels are present in only certain well-defined taxa and are not distributed randomly among different groups of prokaryotes. Of the large number of signatures described here, only in a few cases has there been an indication of LGT among species (Figs. 2, 5, 9, 14–16). In most such cases, the postulated LGT has taken place from members of certain specific taxa to isolated species and was not random among different groups of prokaryotes. The cases where LGTs (or independent deletion/insertion events) have probably occurred are readily identifiable because of the unusual or anomalous behavior exhibited by certain species, which is inconsistent with other signatures and phylogenies. These isolated cases of LGT (or independent deletion/insertion events) thus do not confuse or invalidate the consistent picture that is emerging from consideration of signature sequences in a large number of different proteins. Further, the relative branching orders of different proteobacterial subdivisions as well as many other eubacterial taxa, as deduced from the signature sequences, are in good agreement with their branching in the phylogenetic trees based on 16S rRNA and several highly conserved proteins [30,84,87,144]. 
Some differences from 16S rRNA in the branching orders of higher eubacterial taxa are in areas which are not resolved in the 16S rRNA trees [3,9,10,24]. Hence, they should not be considered as disagreements. These results provide evidence that the phylogenetic inferences deduced using the signature sequence approach are reliable and that the problem of lateral gene transfer is not so serious as to preclude determination of the evolutionary relationships among the prokaryotes.
6 Proteobacteria: relationships to the eukaryotic homologs
The origin of mitochondria from a bacterial endosymbiont related to the α-proteobacterial subdivision is now widely accepted [14,16,17,116,117,119,120]. There is increasing evidence that the eukaryotic nuclear cytosolic homologs of many genes are also of proteobacterial origin and that the ancestral eukaryotic cell itself may have originated by the symbiotic association and ultimate fusion of an archaebacterium and a proteobacterium [30,122,128,172–181]. The signature sequences in various proteins identified in the present work provide useful information in this context. Many of the proteins studied here are also present in eukaryotic cells and by examining whether the indicated signatures are present or absent in the eukaryotic homologs, the origin of mitochondria or eukaryotic homologs from specific subdivisions of proteobacteria can be inferred. In the Hsp70 protein, which has been studied extensively, two signatures are present, one defining the proteobacterial group (Fig. 4) and the second specific for the β and γ subdivisions (Fig. 12). The former signature is present in all mitochondrial as well as nuclear cytosolic homologs, whereas the latter is not found in any of them. These signatures provide evidence that both the mitochondrial and the nuclear cytosolic homologs of this protein have originated from proteobacterial species other than those belonging to the β and γ subdivisions. This inference is supported by signature sequences in other proteins. In succinyl-CoA synthetase, a mitochondrially localized protein, a prominent signature sequence is commonly shared by the eukaryotic homologs and a clade comprised of proteobacteria and the Chlamydia-Cytophaga group of species (Fig. 3). A signature sequence in biotin synthetase (Fig. 10) also shows the specific relationship of the eukaryotic homologs to the α-, β- and γ-proteobacteria.
At the same time, signature sequences in a number of proteins, namely Hsp70, cysteine synthetase and ribosomal protein L16, provide evidence that the eukaryotic homologs are not related to the β and γ subdivisions of proteobacteria. Taken together, these observations support the view that the mitochondrial homologs are specifically related to the α-proteobacterial group and that the nuclear cytosolic homologs are also derived from proteobacterial species exclusive of the β and γ subdivisions.
While the above observations support the origin of mitochondria from α-proteobacteria, the signature sequences in two proteins are presently intriguing and raise interesting questions. In Lon protease, which is localized in mitochondria, a 1-aa deletion is present in all α-, β- and γ-proteobacteria, which is not found in any of the eukaryotic homologs (Fig. 7). Likewise, in valyl-tRNA synthetase, a large insert is commonly shared by the eukaryotic homologs and those from the β and γ subdivisions of proteobacteria (Fig. 13). Thus far no homolog for valyl-tRNA synthetase has been identified from any α-proteobacterial species, including Ri. prowazekii, whose entire genome has been sequenced. These signatures raise the possibility that the mitochondrial homologs of these proteins may be derived from a group other than the α-proteobacteria. However, a more likely possibility to explain these observations is that these signatures will be eventually found in some other α-proteobacterial species, from which mitochondria would have originated.
7 Nomenclature and taxonomic ranks for proteobacterial subdivisions
The evolutionary relationships among the proteobacterial species deduced here are supported by all available evidence and raise certain issues concerning the classification and nomenclature for this group. As indicated earlier, the different clades observed within the proteobacterial class were originally arbitrarily named α, β, γ, δ, and ε, without any knowledge at the time of the evolutionary relationship or branching order of these subclasses [3,12,23,25–28,182]. However, now that the relationships among these subgroups have become clearer, i.e., →ε, δ→α→β→γ, the present nomenclature for the different subdivisions is confusing and obsolete. Although the question as to how the various proteobacterial groups should be named should be decided by the International Committee for Systematic Bacteriology (ICSB), in the present work I have referred to these groups as Proteobacteria-1 (ε, δ), Proteobacteria-2 (α), Proteobacteria-3 (β) and Proteobacteria-4 (γ), indicative of the order in which these groups have evolved from a common ancestor. As pointed out earlier, currently there is a paucity of sequence information for members of the δ and ε subdivisions. Although the members of these two subdivisions can be distinguished from each other based on a signature sequence found in the RecA protein, and the δ-subdivision species appear more closely related to the α, β and γ subdivisions, the relationship between these two subphyla needs to be further studied. In view of this and the fact that these two subdivisions contain far fewer genera than the α, β and γ subdivisions, they are presently kept in a single group and could be regarded as two subphyla, viz. 1A (ε) and 1B (δ), of the Proteobacteria-1 group.
A second related issue is the taxonomic rank at which these proteobacterial groups should be classified [1,20,183]. The ICSB committee which previously considered the problem of nomenclature for this group only recognized the proteobacterial group at the ‘class level’ and proposed no formal ranks or nomenclature for the different subdivisions or subphyla within this class [1,18]. As was indicated in the report of this Committee, the proteobacterial group is comprised of more than 200 genera, encompassing a large proportion of the known Gram-negative bacteria. The α, β, γ, δ and ε subdivisions within proteobacteria contain respectively at least 50, 30, 80, 15 and 4 genera. Some of these subdivisions are much larger than many of the other recognized divisions (or classes) within prokaryotes, e.g., Spirochetes, Chlorobiaceae, Chlamydiae, Deinococcaceae and Thermus, Chloroflexaceae and relatives, etc. Now that each of these subphyla within the proteobacteria can be clearly defined and distinguished from others based on signature sequences in a large number of proteins, and the temporal order in which they branched off from the common ancestor can be established, it would be more appropriate to recognize them at a rank comparable to that of the other main divisions within eubacteria.
8 Determination of prokaryotic taxonomy based on signature sequences
Based on the signature sequences in different proteins described here and in our earlier work, all main divisions within prokaryotes can now be distinguished from each other (Fig. 24). Archaebacteria are distinguished from eubacteria based on a large number of signature sequences [30,31]. Although the exact relationship of archaebacteria to Gram-positive bacteria, to which they are related structurally and by many gene phylogenies, remains uncertain [30,31,77–80], this uncertainty should not prevent or affect our understanding of the evolutionary relationships among eubacteria. Based on the established rooting of the prokaryotes between archaebacteria and eubacteria (Gram-positive bacteria) [30,153,155,184], signature sequences in different proteins permit us to infer the order in which different divisions within eubacteria have split off or evolved from the common ancestor. It is important to note that the evolutionary pattern that has emerged from consideration of different signature sequences is internally highly consistent, giving confidence in its reliability [30,79,185]. The branching order of different eubacterial divisions as deduced from these signatures is as follows: (common ancestor)⇒low-G+C Gram-positive⇒high-G+C Gram-positive⇒Deinococcus-Thermus (green nonsulfur bacteria)⇒cyanobacteria⇒Spirochetes⇒Chlamydia-Cytophaga-Flavobacteria-Aquifex-green sulfur bacteria⇒Proteobacteria-1 (ε and δ)⇒Proteobacteria-2 (α)⇒Proteobacteria-3 (β)⇒Proteobacteria-4 (γ). Each of these groups or divisions is defined and circumscribed by signature sequences present in one or more proteins and in many cases additional group-specific signature sequences have also been identified (Fig. 24). Presently, the division of Bacteria into different phyla is based on criteria which are ill-defined, arbitrary and somewhat obsolete [3,20,76,186].
In this context, the protein signatures described here and in our earlier work could serve as stable and well-defined criteria for the identification and definition of the main phyla within Bacteria. Based on these sequence signatures, it should now be possible to classify any unknown prokaryotes into one of these groups. A flow-chart based on signature sequences that could be used for such classification is shown in Fig. 25. The signature sequence-based approach described here was recently successfully used to determine the branching orders of various photosynthetic eubacterial phyla.
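The logic of such a classification can be illustrated with a minimal Python sketch. It is not the flow-chart of Fig. 25; it only assumes the linear branching order given above and the simplifying rule that a signature introduced at one branch point is carried by every group that branched later. The four signatures listed are those mentioned in this review (Figs. 3, 4, 7 and 12), the group at which each is taken to first appear is a simplified reading of the text, and the names BRANCHING_ORDER, SIGNATURE_FIRST_GROUP and classify are hypothetical. The sketch further assumes that every listed signature has been scored, so that a signature missing from the input means it was examined and found absent.

```python
from typing import List, Set

# Linear branching order of the main eubacterial groups, earliest first,
# as given in the text.
BRANCHING_ORDER = [
    "low-G+C Gram-positive",
    "high-G+C Gram-positive",
    "Deinococcus-Thermus (green nonsulfur bacteria)",
    "cyanobacteria",
    "Spirochetes",
    "Chlamydia-Cytophaga group",
    "Proteobacteria-1 (ε, δ)",
    "Proteobacteria-2 (α)",
    "Proteobacteria-3 (β)",
    "Proteobacteria-4 (γ)",
]

# Illustrative subset of signatures and the earliest group assumed to carry
# each one (a simplification; the full set is in Figs. 24 and 25).
SIGNATURE_FIRST_GROUP = {
    "succinyl-CoA synthetase signature (Fig. 3)": "Chlamydia-Cytophaga group",
    "Hsp70 insert (Fig. 4)": "Proteobacteria-1 (ε, δ)",
    "Lon protease 1-aa deletion (Fig. 7)": "Proteobacteria-2 (α)",
    "Hsp70 insert (Fig. 12)": "Proteobacteria-3 (β)",
}


def classify(present: Set[str]) -> List[str]:
    """Return the groups consistent with the signatures found in a species.

    A signature that is present means the species branched at or after the
    group where that signature first appears; an absent one means it
    branched earlier. A contradictory profile raises ValueError; such a
    profile is the kind of anomaly the text attributes to LGT or to
    independent insertion/deletion events.
    """
    rank = {group: i for i, group in enumerate(BRANCHING_ORDER)}
    lowest, highest = 0, len(BRANCHING_ORDER)
    for signature, group in SIGNATURE_FIRST_GROUP.items():
        if signature in present:
            lowest = max(lowest, rank[group])
        else:
            highest = min(highest, rank[group])
    if lowest >= highest:
        raise ValueError("signature profile is inconsistent with a linear order")
    return BRANCHING_ORDER[lowest:highest]
```

With these four signatures alone, a species carrying the first three but lacking the second Hsp70 insert is placed in Proteobacteria-2 (α), whereas a species carrying all four can only be narrowed to Proteobacteria-3 (β) or Proteobacteria-4 (γ), because no γ-specific signature is included in this subset.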
9 Bacterial evolution: does it occur in a directional manner?
A surprising but very important aspect of the evolutionary relationship deduced here is that the major eubacterial phyla identified by signature sequences are related to each other linearly rather than in a tree-like manner (Fig. 24). This means that each new major eubacterial phylum has evolved from the preceding one rather than at random from any of the previously existing taxa. If these results are correct, an important implication is that major evolutionary changes have occurred in a directional manner. The reasons why this should be so are unclear at present. However, one could speculate that each new major group of species or phylum evolves from the preexisting ones in response to a major change (i.e., selective pressure) in the environment. In time, species from the newly evolved phylum, which are better adapted to the changed environment, become the predominant species filling in most of the habitats. The species from previously existing phyla, which have reduced fitness in the new environment, become greatly reduced in number and may survive in only specialized niches. When a further major change in the environment occurs, the species from the most recently evolved phylum, because of their large numbers and more varied gene pools, are better suited for the challenge. Hence any major innovation or development is more likely to occur in members of this last group rather than in any of the ancient lineages. This view is consistent with the fact that the phyla which are at the leading edge of the evolutionary diagram (i.e., α-, β- and γ-proteobacteria in Fig. 24) represent the most abundant groups of species among prokaryotes.
Thus, the observed apparent linear evolutionary relationship among the major eubacterial phyla could be a consequence of the progressive and episodic nature of evolutionary development, wherein each newly evolved phylum shows greater fitness for the present environment but also retains or has access (via lateral gene transfer) to the gene pools from the past.
I thank Vanessa Johari, Charu Chandrashekhar, Brian Le, Thuyan Le, Claire Osepchook and David Smyth for helping me with database searches and sequence alignments. Thanks and appreciation are also due to Dr. Bohdan Soltys and to an anonymous reviewer for critical reading of the manuscript and providing many helpful comments. The work from the author's laboratory was supported by a research grant from the Medical Research Council of Canada.