Survey of the number of two-component response regulator genes in the complete and annotated genome sequences of prokaryotes

Authors

  • Mark K. Ashby

    Corresponding author
    1. Department of Basic Medical Sciences, Biochemistry Section, University of the West Indies, Mona Campus, Kingston 7, Jamaica

*Tel.: +1 (876) 935 8789; Fax: +1 (876) 977 7852, E-mail address: mark.ashby@uwimona.edu.jm

Abstract

The numbers of potential response regulator genes were determined from the complete and annotated genome sequences of Archaea and Bacteria. The numbers of each class of response regulators are shown for each organism, determined principally from BLASTP searches, but with reference to the gene category lists where available. The survey shows that for Bacteria there is a link between the total number of potential response regulator genes and both the genome complexity (number of potential protein-coding genes) and the organism's lifestyle/habitat. Increasingly complex lifestyles and genome complexities are matched by an increase in the average number of potential response regulator genes per genome, indicating that a higher degree of complexity requires a higher level of control of gene expression and cellular activity. Detailed results of this study are available online at http://www.mona.uwi.edu/biochem/courses/bc31m/table2.xls and http://www.mona.uwi.edu/biochem/courses/bc31m/table2genelist.pdf.

1 Introduction

Bacteria and Archaea live in many different habitats. One of the main mechanisms that these organisms use to sense changes in their environment is the two-component sensory transduction system [1–3]. It allows the cell to respond appropriately to changes in its external environment by changing its motility or by altering the expression of specific genes or other cellular activities. Two recent reviews have surveyed the structural and functional relationships within two-component systems [1], or specifically within histidine kinases [4], by analysing genomic data. One would expect the number of two-component genes in the genome of a given organism to depend on four factors: the changeability of the environment in which the organism lives, its interactions with other organisms (e.g. symbiosis or pathogenesis), cellular differentiation and the complexity of the genome. Reviews often link the genetic complexity or lifestyle of a prokaryote to its number of regulatory genes.

This study uses the increasing amount of genome sequence information available to test this link. Galperin et al. [5] performed a thorough survey of the different signalling domains found in two-component systems. The present work, however, is restricted to counting the number of response regulators rather than the total number of potential two-component genes in a genome, because the identity of many potential histidine kinases is not always clear, whereas the presence of a receiver domain in a deduced protein sequence is much easier to determine. The primary aim is to determine the total number of response regulators for each organism rather than to classify each regulatory protein. The presence of a receiver domain was always used for the initial identification of a response regulator. The survey was restricted to completed genome sequences that have been annotated, so that an accurate count could be made of putative response regulator genes. A correlation is shown between the number of putative response regulator genes and both the genome complexity and the lifestyle of the organism.

2 Materials and methods

Complete genome sequences were accessed at the following websites: The National Center for Biotechnology Information (NCBI) (http://www.ncbi.nlm.nih.gov/), The Institute for Genomic Research (http://www.tigr.org/tdb/), DOE Joint Genome Institute (http://www.jgi.doe.gov/JGI_microbial/html/), Genomes OnLine Database (http://www.genomesonline.org/) and Cyanobase (http://www.kazusa.or.jp/cyano/). The NCBI accession number for each organism and a list of all the genes surveyed are given in the online data supplement (http://www.mona.uwi.edu/biochem/courses/bc31m/table2.xls and http://www.mona.uwi.edu/biochem/courses/bc31m/table2genelist.pdf). Protein to protein BLASTP 2.2.6/3 searches [6] were performed at the NCBI and Cyanobase websites, while BLASTP 2.0MP-WashU searches were performed at The Institute for Genomic Research website.

The number of response regulator genes was compiled by performing BLASTP searches of the deduced protein set of each completed genome using the protein sequences listed in Table 1 (data presented in Table 2). Reference was also made to the annotated list of deduced proteins for each complete genome when available. For BLASTP 2.2.6, the lowest score (bits) and the highest E value of a hit that still proved to be a recognisable response regulator were 20 and 2.9, respectively. For BLASTP 2.0MP-WashU, the lowest ‘High Score’ and the highest ‘Smallest Sum Probability P(N)’ of a hit that still proved to be a recognisable response regulator were 52 and 0.9995, respectively. In addition, published material was used for Synechocystis PCC 6803 [7], Bacillus subtilis [8] and Anabaena PCC 7120 [9]. The first BLASTP search was always with CheY, to identify all the potential response regulators by the presence of the receiver domain. Subsequent BLASTP searches with the other sequences listed in Table 1 were then used to classify the response regulator sequences. The number of response regulators previously reported for Escherichia coli K12 [10] is lower than the 39 reported here. Some putative response regulators that had potential DNA-binding domains similar to those found in OmpR, NarL or NtrC, but that were clearly truncated compared to the length of the test sequence, were listed in the ‘others’ column of Table 2, as were those that included other ‘novel’ signalling domains [5]. The NarL column lists response regulators of the NarL and FixJ family. The CheY column lists all of the response regulators that appeared to have only a receiver domain. The NtrC column lists only those response regulators with homology to NtrC that have a receiver domain [2].
To test the validity of the sequences used for the BLASTP searches (Table 1), searches were also performed on five organisms (Bacillus subtilis 168, Clostridium tetani, Caulobacter crescentus, Bradyrhizobium japonicum and Shewanella oneidensis) using the CheY, OmpR, NarL and NtrC polypeptide sequences from E. coli K12.
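The two-step search described above (a CheY search to find receiver domains, followed by further searches to assign a class) can be sketched in code. The Python sketch below is illustrative only. It assumes that the BLASTP hits have already been parsed into simple records, it uses the loosest BLASTP 2.2.6 values quoted in the text (20 bits and an E value of 2.9) as cut-offs, and it assigns each candidate to the probe that gives its highest bit score. The `Hit` record, the function names, the locus tags and the best-score assignment rule are all assumptions made for illustration; the survey itself relied on inspection of the hits and on the genome annotation.

```python
from dataclasses import dataclass

# Loosest BLASTP 2.2.6 values quoted in the text for a hit that was still a
# recognisable response regulator. Using them as hard cut-offs is an assumption.
MIN_BITS = 20.0
MAX_EVALUE = 2.9


@dataclass(frozen=True)
class Hit:
    query: str     # probe name from Table 1, e.g. "CheY" or "OmpR"
    subject: str   # hypothetical locus tag of a protein in the genome
    bits: float    # BLASTP bit score
    evalue: float  # BLASTP E value


def passes(hit):
    """True if the hit is at least as good as the quoted cut-offs."""
    return hit.bits >= MIN_BITS and hit.evalue <= MAX_EVALUE


def classify(hits):
    """Map each candidate response regulator to its best-scoring probe.

    Step 1: candidates are the proteins hit by the CheY probe, that is,
            proteins with a recognisable receiver domain.
    Step 2: each candidate is assigned to the probe with the highest bit
            score (an assumed rule, not the survey's stated procedure).
    """
    candidates = {h.subject for h in hits if h.query == "CheY" and passes(h)}
    best = {}
    for h in hits:
        if h.subject in candidates and passes(h):
            if h.subject not in best or h.bits > best[h.subject].bits:
                best[h.subject] = h
    return {subject: h.query for subject, h in best.items()}


# Hypothetical hits for four proteins of one genome.
example_hits = [
    Hit("CheY", "orf1", 85.0, 1e-20),
    Hit("OmpR", "orf1", 150.0, 1e-40),  # receiver plus OmpR-like domain
    Hit("CheY", "orf2", 60.0, 1e-12),   # receiver domain only
    Hit("OmpR", "orf3", 200.0, 1e-50),  # no CheY hit, so never a candidate
    Hit("CheY", "orf4", 18.0, 5.0),     # fails both cut-offs
]
print(classify(example_hits))
```

With these made-up hits, only orf1 and orf2 are reported: orf1 is assigned to OmpR and orf2 to CheY. In practice, hits close to the cut-offs would still need to be checked by eye, as was done in the survey.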

Table 1.  Protein sequences used for BLAST searches
Protein | Species
RpaB OmpR | Anabaena PCC 7120
NtrC | Pseudomonas stutzeri
FixJ | Bradyrhizobium japonicum
NarL | Pseudomonas aeruginosa
CheY | Salmonella enterica

Table 2.  Compilation of potential response regulators (RR) and total coding genes from the complete and annotated Archaeal and Bacterial genome sequences (June 2003)
Organism | Group | OmpR | NarL | NtrC | CheY | Hybrid | Others | Total RR | Total genes
Domain Archaea
Aeropyrum pernix | Desulfurococcales | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1840
Sulfolobus solfataricus | Sulfolobales | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2977
Sulfolobus tokodaii | Sulfolobales | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2826
Pyrobaculum aerophilum | Thermoproteales | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2605
Archaeoglobus fulgidus | Archaeoglobus | 0 | 0 | 0 | 9 | 1 | 1 | 11 | 2420
Halobacterium sp. NRC-1 | Halobacteriales | 0 | 0 | 0 | 1 | 1 | 2 | 4 | 2075
Methanobacter thermoautotrophicus | Methanobacteriales | 0 | 0 | 0 | 0 | 3 | 6 | 9 | 1873
Methanocaldococcus jannaschii | Methanococcales | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1729
Methanopyrus kandleri AV19 | Methanopyrales | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1687
Methanosarcina acetivorans | Methanosarcinales | 0 | 0 | 0 | 12 | 5 | 3 | 20 | 4520
Methanosarcina barkeri | Methanosarcinales | 0 | 1 | 0 | 7 | 2 | 3 | 13 | 3770
Methanosarcina mazei | Methanosarcinales | 0 | 1 | 0 | 9 | 3 | 4 | 17 | 3371
Pyrococcus abyssi | Thermococcales | 0 | 0 | 0 | 1 | 0 | 1 | 2 | 1769
Pyrococcus furiosus | Thermococcales | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2065
Pyrococcus horikoshii | Thermococcales | 0 | 0 | 0 | 1 | 0 | 1 | 2 | 1801
Ferroplasma acidarmanus | Thermoplasmales | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1816
Thermoplasma acidophilum | Thermoplasmales | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1402
Thermoplasma volcanium | Thermoplasmales | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1499
Domain Bacteria
Bifidobacterium longum | Actinobacteria | 6 | 2 | 0 | 0 | 0 | 3 | 11 | 1729
Corynebacterium efficiens | Actinobacteria | 9 | 5 | 0 | 0 | 0 | 2 | 16 | 2950
Corynebacterium glutamicum | Actinobacteria | 7 | 5 | 0 | 0 | 0 | 1 | 13 | 3040
Mycobacterium leprae | Actinobacteria | 3 | 2 | 0 | 0 | 0 | 0 | 5 | 1605
Mycobacterium tuberculosis CDC1551 | Actinobacteria | 11 | 3 | 0 | 1 | 0 | 0 | 15 | 4187
Mycobacterium tuberculosis H37Rv | Actinobacteria | 11 | 3 | 0 | 1 | 0 | 0 | 15 | 3927
Streptomyces avermitilis | Actinobacteria | 25 | 37 | 0 | 0 | 3 | 7 | 72 | 7573
Streptomyces coelicolor | Actinobacteria | 26 | 49 | 0 | 1 | 4 | 17 | 97 | 7769
Tropheryma whipplei TW08/27 | Actinobacteria | 1 | 0 | 0 | 0 | 0 | 1 | 2 | 783
Tropheryma whipplei str. TWIST | Actinobacteria | 1 | 0 | 0 | 0 | 0 | 1 | 2 | 808
Aquifex aeolicus | Aquificales | 1 | 0 | 3 | 0 | 0 | 0 | 4 | 1529
Bacteroides thetaiotaomicron VPI-5482 | Bacteroides | 6 | 1 | 5 | 2 | 42 | 13 | 69 | 4779
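Chlorobium tepidum | Green sulfur | 2 | 0 | 0 | 2 | 6 | 0 | 10 | 2252
Chlamydia muridarum | Chlamydiales | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 909
Chlamydia trachomatis | Chlamydiales | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 895
Chlamydia pneumoniae AR39 | Chlamydiales | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1112
Chlamydia pneumoniae CWL029 | Chlamydiales | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1054
Chlamydia pneumoniae J138 | Chlamydiales | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1069
Synechocystis PCC 6803 | Cyanobacteria | 10 | 7 | 0 | 8 | 16 | 14 | 55 | 3267
Anabaena PCC 7120 | Cyanobacteria | 17 | 14 | 0 | 23 | 55 | 22 | 131 | 5368
Thermosynechococcus elongatus | Cyanobacteria | 7 | 5 | 0 | 3 | 6 | 6 | 27 | 2475
Bacillus anthracis Ames | Bacillus | 28 | 5 | 0 | 2 | 1 | 8 | 44 | 5738
Bacillus anthracis A2012 | Bacillus | 27 | 5 | 0 | 4 | 0 | 12 | 48 | 5544
Bacillus halodurans | Bacillus | 16 | 8 | 2 | 2 | 4 | 19 | 51 | 4066
Bacillus cereus 14579 | Bacillus | 28 | 5 | 0 | 2 | 2 | 9 | 46 | 5477
Bacillus subtilis 168 | Bacillus | 13 | 10 | 0 | 4 | 0 | 12 | 39 | 4112
Clostridium acetobutylicum | Clostridium | 24 | 7 | 0 | 5 | 3 | 9 | 48 | 3672
Clostridium perfringens | Clostridium | 13 | 0 | 0 | 1 | 2 | 5 | 21 | 2660
Clostridium tetani | Clostridium | 16 | 0 | 1 | 1 | 1 | 8 | 27 | 2373
Enterococcus faecalis | Enterococcus | 12 | 2 | 0 | 1 | 0 | 8 | 23 | 3182
Mycoplasma genitalium | Mollicutes | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 484
Mycoplasma penetrans | Mollicutes | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1037
Mycoplasma pneumoniae | Mollicutes | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 689
Mycoplasma pulmonis | Mollicutes | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 782
Ureaplasma urealyticum | Mollicutes | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 614
Lactobacillus plantarum | Lactobacillus | 6 | 2 | 0 | 0 | 0 | 0 | 8 | 3009
Lactococcus lactis | Lactobacillus | 6 | 1 | 0 | 0 | 0 | 0 | 7 | 2267
Listeria innocua | Listeria | 9 | 2 | 0 | 1 | 0 | 6 | 18 | 2968
Listeria monocytogenes EGD-e | Listeria | 10 | 3 | 0 | 1 | 0 | 5 | 19 | 2846
Oceanobacillus iheyensis | Oceanobacillus | 8 | 5 | 0 | 2 | 0 | 8 | 23 | 3496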
Thermoanaerobacter tengcongensis | Thermoanaerobacter | 9 | 2 | 0 | 3 | 1 | 14 | 29 | 2588
Staphylococcus aureus MW2 | Staphylococcus | 9 | 4 | 0 | 0 | 0 | 6 | 19 | 2632
Staphylococcus aureus Mu50 | Staphylococcus | 9 | 3 | 0 | 0 | 0 | 3 | 15 | 2714
Staphylococcus aureus N315 | Staphylococcus | 9 | 4 | 0 | 1 | 0 | 2 | 16 | 2594
Staphylococcus epidermidis 12228 | Staphylococcus | 9 | 4 | 0 | 0 | 0 | 2 | 15 | 2419
Streptococcus agalactiae 2603V/R | Streptococcus | 13 | 1 | 0 | 0 | 0 | 5 | 19 | 2124
Streptococcus agalactiae NEM316 | Streptococcus | 15 | 2 | 0 | 0 | 0 | 6 | 23 | 2094
Streptococcus mutans | Streptococcus | 11 | 3 | 0 | 0 | 0 | 1 | 15 | 1960
Streptococcus pneumoniae R6 | Streptococcus | 9 | 2 | 0 | 0 | 0 | 3 | 14 | 2043
Streptococcus pneumoniae TIGR4 | Streptococcus | 8 | 2 | 0 | 0 | 0 | 3 | 13 | 2094
Streptococcus pyogenes M1 GAS | Streptococcus | 7 | 1 | 0 | 0 | 0 | 5 | 13 | 1697
Streptococcus pyogenes MGAS315 | Streptococcus | 6 | 1 | 0 | 0 | 0 | 5 | 12 | 1865
Streptococcus pyogenes MGAS8232 | Streptococcus | 8 | 1 | 0 | 0 | 0 | 3 | 12 | 1845
Streptococcus pyogenes SSI-1 | Streptococcus | 7 | 1 | 0 | 0 | 0 | 3 | 11 | 1861
Fusobacterium nucleatum | Fusobacteriaceae | 2 | 0 | 2 | 0 | 0 | 3 | 7 | 2067
Caulobacter crescentus | Caulobacter | 12 | 2 | 3 | 16 | 25 | 6 | 64 | 3737
Agrobacterium tumefaciens (Wash) | Rhizobiaceae | 25 | 6 | 5 | 10 | 11 | 9 | 66 | 2785
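Bradyrhizobium japonicum | Rhizobiaceae | 19 | 22 | 4 | 20 | 18 | 16 | 99 | 8317
Brucella melitensis 16M | Rhizobiaceae | 10 | 3 | 3 | 2 | 1 | 2 | 21 | 2059
Brucella suis | Rhizobiaceae | 9 | 3 | 3 | 3 | 2 | 2 | 22 | 2116
Mesorhizobium loti | Rhizobiaceae | 22 | 10 | 5 | 7 | 7 | 4 | 55 | 6746
Sinorhizobium meliloti | Rhizobiaceae | 17 | 9 | 5 | 10 | 8 | 11 | 60 | 3341
Rickettsia conorii | Rickettsia | 2 | 0 | 1 | 0 | 0 | 2 | 5 | 1374
Rickettsia prowazekii | Rickettsia | 2 | 0 | 1 | 0 | 0 | 2 | 5 | 835
Coxiella burnetii | Rickettsia | 3 | 4 | 0 | 1 | 4 | 0 | 12 | 2095
Ralstonia solanacearum | Ralstonia | 27 | 15 | 5 | 4 | 5 | 17 | 73 | 3440
Neisseria meningitidis MC58 | Neisseria | 1 | 1 | 1 | 0 | 0 | 1 | 4 | 2079
Neisseria meningitidis Z2491 | Neisseria | 1 | 1 | 1 | 0 | 0 | 1 | 4 | 2065
Nitrosomonas europaea | Nitrifying | 7 | 0 | 4 | 3 | 2 | 3 | 19 | 2460
Campylobacter jejuni NCTC11168 | Spirilla | 6 | 0 | 1 | 1 | 1 | 3 | 12 | 1634
Helicobacter pylori 26695 | Spirilla | 3 | 0 | 1 | 1 | 1 | 4 | 10 | 1576
Helicobacter pylori J99 | Spirilla | 3 | 0 | 1 | 1 | 1 | 4 | 10 | 1491
Buchnera aphidicola APS | Enterobacteriaceae | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 564
Buchnera aphidicola Bp | Enterobacteriaceae | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 504
Buchnera aphidicola Sg | Enterobacteriaceae | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 545
Escherichia coli CFT073 | Enterobacteriaceae | 15 | 5 | 4 | 1 | 5 | 10 | 40 | 5379
Escherichia coli K12 | Enterobacteriaceae | 14 | 6 | 4 | 1 | 4 | 10 | 39 | 4279
Escherichia coli O157:H7 | Enterobacteriaceae | 15 | 7 | 3 | 1 | 6 | 7 | 39 | 5361
Escherichia coli O157:H7 EDL933 | Enterobacteriaceae | 17 | 7 | 3 | 1 | 6 | 9 | 43 | 5324
Shigella flexneri 2a | Enterobacteriaceae | 13 | 5 | 3 | 1 | 5 | 4 | 31 | 4180
Shigella f. 2a 2457T | Enterobacteriaceae | 13 | 6 | 3 | 1 | 4 | 5 | 32 | 4706
Wigglesworthia | Enterobacteriaceae | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 654
Yersinia pestis CO92 | Enterobacteriaceae | 14 | 8 | 2 | 1 | 6 | 3 | 34 | 3885
Yersinia pestis KIM | Enterobacteriaceae | 14 | 7 | 2 | 1 | 6 | 3 | 33 | 4090
Shewanella oneidensis | Alteromonadaceae | 17 | 7 | 5 | 4 | 7 | 26 | 66 | 4758
Vibrio cholerae | Vibrio | 12 | 4 | 7 | 5 | 11 | 23 | 62 | 2742
Vibrio parahaemolyticus | Vibrio | 16 | 8 | 8 | 2 | 14 | 21 | 69 | 4832
Vibrio vulnificus | Vibrio | 15 | 5 | 8 | 2 | 17 | 32 | 79 | 2972
Xanthomonas axonopodis | Xanthomonadaceae | 13 | 9 | 4 | 12 | 17 | 28 | 83 | 4312
Xanthomonas campestris | Xanthomonadaceae | 11 | 8 | 5 | 12 | 22 | 18 | 76 | 4181
Xylella fastidiosa 9a5c | Xanthomonadaceae | 5 | 2 | 2 | 3 | 3 | 7 | 22 | 2766
Xylella fastidiosa Temecula 1 | Xanthomonadaceae | 5 | 2 | 2 | 3 | 3 | 7 | 23 | 2034
Haemophilus influenzae | Pasteurellaceae | 4 | 1 | 0 | 0 | 0 | 0 | 5 | 1709
Pasteurella multocida | Pasteurellaceae | 5 | 2 | 1 | 0 | 1 | 3 | 12 | 2015
Pseudomonas aeruginosa | Pseudomonas | 24 | 13 | 8 | 5 | 13 | 24 | 87 | 5567
Pseudomonas putida KT2440 | Pseudomonas | 25 | 12 | 8 | 8 | 22 | 18 | 93 | 5350
Pseudomonas syringae DC3000 | Pseudomonas | 20 | 13 | 9 | 10 | 22 | 19 | 93 | 5471
Salmonella enterica Typhi | Salmonella | 13 | 8 | 4 | 1 | 4 | 4 | 34 | 4395
Salmonella enterica Typhi Ty2 | Salmonella | 13 | 8 | 4 | 1 | 4 | 7 | 37 | 4646
Salmonella typhimurium LT2 | Salmonella | 16 | 7 | 4 | 1 | 4 | 6 | 38 | 4451
Borrelia burgdorferi | Spirochaetales | 0 | 0 | 1 | 3 | 1 | 1 | 6 | 851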
Leptospira interrogans | Spirochaetales | 5 | 2 | 1 | 9 | 11 | 19 | 47 | 4360
Treponema pallidum | Spirochaetales | 0 | 0 | 1 | 1 | 0 | 1 | 3 | 1036
Thermotoga maritima | Thermotogales | 4 | 0 | 0 | 3 | 0 | 4 | 11 | 1858
Deinococcus radiodurans | Thermus/Deinococcus | 6 | 6 | 0 | 2 | 3 | 10 | 27 | 2629

Bacteria were classified according to their lifestyle as soil, complex (forming symbiotic relationships or differentiating), pathogenic, enteric (living in the gastrointestinal tract), aquatic or plant-pathogenic. Pathogens that normally live in the gastrointestinal tract were counted with the enteric group. Several species were placed in more than one category (e.g. Anabaena and C. tetani); the classification for each organism is listed with the data online (http://www.mona.uwi.edu/biochem/courses/bc31m/table2.xls).

3 Results and discussion

The survey (performed in June 2003) of the number of response regulators in complete genome sequences is shown in Table 2. The first four archaeal species listed, which belong to the Crenarchaeota, have no response regulator genes. The 14 species of Euryarchaeota have 0–20 response regulator genes. Among the Archaea, only a weak correlation between genome complexity and the number of response regulators was observed. This is not surprising, since it seems likely that two-component proteins originated in Bacteria and radiated into Archaea by lateral gene transfer [11]. The 105 bacterial species surveyed have between 0 and 131 (Anabaena) response regulators in total. In most cases the simple DNA-binding response regulators (OmpR and NarL) predominate. In addition, many of the response regulators listed in the ‘others’ column are simple DNA-binding response regulators comprising only a receiver and a DNA-binding domain. Some of the response regulators counted in the ‘others’ column await more detailed structural analysis and classification. A survey of the number of response regulators for 34 organisms can be found in the Sentra database (http://www-wit.mcs.anl.gov/sentra/). In that database, some of the response regulator numbers are lower than in this survey. This may be because the emphasis of the Sentra database is on showing the different types of response regulator for each organism rather than the total number.

The link between the lifestyle of a bacterium and the average number of response regulators is shown in Table 3A. Pathogens have the lowest average number of response regulators, followed by enteric bacteria, suggesting that the environments that these organisms inhabit are fairly constant with respect to nutrients and pH and that pathogenicity itself does not depend on sophisticated gene regulation. Organisms living in aqueous environments had the third highest average, probably because these environments show greater variation in nutrient level, pH and mineral concentration (particularly freshwater environments). Soil bacteria live in environments that are more variable than aquatic systems; many form spores, and they have to exist and interact at high densities with many other micro-organisms, which requires a higher number of response regulators. The highest number of response regulators was found in bacteria that exhibit a more complicated lifestyle, such as Anabaena PCC 7120 (which forms heterocysts to fix nitrogen), or that form symbiotic associations, such as the Rhizobiaceae.

Table 3.  The average number of potential response regulator genes sorted by (A) habitat/lifestyle and (B) total number of protein-coding genes for each organism
Category | Average number of response regulators
A: Habitat/lifestyle
Pathogen | 16.5
Enteric | 20.0
Plant-pathogen | 54.0
Aqueous | 43.9
Soil | 48.1
Complex | 51.6
B: Total number of protein-coding genes
<1000 | 1.3
1000–1999 | 7.5
2000–2999 | 22.2
3000–3999 | 34.7
4000–4999 | 44.6
5000–5999 | 60.4
>6000 | 80.8

There is a direct relationship between the number of potential protein-coding genes in a bacterium and the average number of response regulator genes (Table 3B). The average number of response regulator genes increases threefold between the 1000–1999 and the 2000–2999 groups, suggesting there is a requirement for much tighter control of gene expression when the complexity of the genome goes above 2000 potential genes. The biggest single increase is from the 5000–5999 to 6000+ (mostly exhibiting complex lifestyles) groups, with an increase of 20.4 for the average number of response regulator genes.
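The grouping behind Table 3B and the two comparisons quoted above can be reproduced with a few lines of code. The Python sketch below is illustrative only: the function names are hypothetical, only eight genomes from Table 2 are included (so the group averages it prints are not those of Table 3B), and genomes with 6000 or more genes are assumed to fall in the ‘>6000’ group.

```python
from collections import defaultdict

# (organism, total RR, total genes): a few rows copied from Table 2.
ROWS = [
    ("Mycoplasma genitalium", 0, 484),
    ("Chlamydia trachomatis", 1, 895),
    ("Haemophilus influenzae", 5, 1709),
    ("Campylobacter jejuni NCTC11168", 12, 1634),
    ("Bacillus subtilis 168", 39, 4112),
    ("Escherichia coli K12", 39, 4279),
    ("Anabaena PCC 7120", 131, 5368),
    ("Bradyrhizobium japonicum", 99, 8317),
]

# Averages as printed in Table 3B, in order of increasing genome size.
TABLE_3B = [
    ("<1000", 1.3), ("1000–1999", 7.5), ("2000–2999", 22.2),
    ("3000–3999", 34.7), ("4000–4999", 44.6), ("5000–5999", 60.4),
    (">6000", 80.8),
]


def size_bin(total_genes):
    """Return the Table 3B group label for a given number of genes."""
    if total_genes < 1000:
        return "<1000"
    if total_genes >= 6000:  # assumption: exactly 6000 joins the top group
        return ">6000"
    low = total_genes // 1000 * 1000
    return f"{low}–{low + 999}"


def mean_rr_by_bin(rows):
    """Average the total RR count of the genomes in each size group."""
    groups = defaultdict(list)
    for _organism, total_rr, total_genes in rows:
        groups[size_bin(total_genes)].append(total_rr)
    return {label: sum(v) / len(v) for label, v in groups.items()}


def largest_step(table):
    """Return (from_label, to_label, increase) for the biggest rise."""
    steps = [
        (table[i][0], table[i + 1][0], table[i + 1][1] - table[i][1])
        for i in range(len(table) - 1)
    ]
    return max(steps, key=lambda step: step[2])


print(mean_rr_by_bin(ROWS))

# Fold increase between the 1000–1999 and 2000–2999 groups of Table 3B.
print(round(TABLE_3B[2][1] / TABLE_3B[1][1], 2))

# Largest increase between neighbouring groups of Table 3B.
start, end, increase = largest_step(TABLE_3B)
print(start, end, round(increase, 1))
```

Applied to the printed averages, the ratio 22.2/7.5 is about 2.96, which is the threefold increase mentioned above, and the largest step is the rise of 20.4 between the 5000–5999 and >6000 groups.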

Acknowledgments

M.K.A. is supported by a New Initiative Grant from the University of the West Indies, Mona Campus.
