1. Top of page
  2. Summary
  3. Introduction
  4. Results
  5. Acknowledgements
  6. References

EnvDB is a database that classifies the environmental samples, and their associated 16S rDNA sequences, currently stored in GenBank. The samples were categorized in a three-level, hierarchical classification of media: five upper-level categories (terrestrial, aquatic, thermal, host-associated and other) are subdivided into 20 intermediate categories (such as marine, marine sediments, freshwater, soil, gut, etc.) and 47 lower-level categories (for instance, soil is subdivided into forest, agricultural, wetlands, grasslands, tropical, arid, etc.). Each sample was also characterized with nine environmental features: polluted, diseased (for clinical samples), acidic, alkaline, hot environment, cold environment, saline, anoxic and restricted (when the study focuses only on particular taxonomic groups). The classification of samples was aided by text-mining techniques, complemented with careful curation and completion by human experts. EnvDB currently includes 359 928 sequences from 3502 samples. The sequences were clustered at several identity levels to obtain operative taxonomic units (OTUs). Sequences and OTUs have been taxonomically assigned to the maximum possible resolution by different procedures. The user can obtain information about sequences, OTUs, samples and environments, combining these tables through a flexible querying system that supports very diverse queries. Thus, the user can easily inspect the presence and abundance of particular taxa in particular samples and environments. The database also allows users to run analyses with their own data: they can input their sequences and find the closest sequences or samples in the database. EnvDB can be accessed at the web address



Our knowledge of the environmental diversity, distribution and habitat preferences of prokaryotic species remains quite incomplete. The absence of a systematic census of the distribution of prokaryotes in different environments makes it difficult to assess whether patterns govern the different microbial communities, and how relevant individual species are. Therefore, our knowledge of the rules governing the composition of microbial communities, and of the way in which environmental factors exert selective pressures on them, is limited. Consequently, many important questions in microbial ecology are still under discussion, such as the existence of biogeographic patterns (Cho and Tiedje, 2000; Whitaker et al., 2003; Ramette and Tiedje, 2007), the degree of environmental specificity or cosmopolitanism of prokaryotic species (Hugenholtz et al., 1998; Finlay, 2002; Horner-Devine et al., 2004; Green and Bohannan, 2006), or the importance of syntrophic and competitive interactions between members of the microbial communities (Gram et al., 1999; Schink, 2002).

Nevertheless, a vast number of environmental and clinical samplings have been performed, describing the presence of prokaryotic species by means of partial or full 16S rDNA gene sequences. This is a rich source of information about prokaryotic diversity that can potentially help to address many of the questions raised above. However, these data remain largely unexplored, probably because of their unstructured nature and the lack of a proper classification of their environmental features, although some authors have compiled partial sets of samples for particular environments (Pushker et al., 2005; Lozupone and Knight, 2007; Ley et al., 2008). Moreover, the advent of metagenomic sequencing makes it possible to obtain huge amounts of information that, to be fully exploited, must be placed in the context of microbial ecology and of the patterns governing bacterial communities in different habitats (Handelsman, 2004).

To bring together this wealth of information, we have developed envDB, a database that collects all the available 16S rDNA sequences from all the sampling experiments deposited in GenBank, enriched with taxonomic and environmental information. As the most complete census to date of the environmental distribution of prokaryotes, envDB is a very valuable tool for studying the structure of microbial communities.

EnvDB can be accessed in the web address



Obtaining sequences and grouping into samples

The 16S rDNA sequences from diverse samplings (mainly environmental and clinical) and isolates are deposited in the ‘Environmental’ (ENV) section of the GenBank database (Benson et al., 2008). The section contains 710 425 entries in release 164 (February 2008), mainly for 16S rDNA sequences but also for other molecular markers. We collected the sequences whose gene or product annotations match either ‘16S’ or ‘small ribosomal’, but do not match ‘23S’, ‘18S’ or ‘intergenic’. Short sequences (below 250 bp) and long ones (above 1900 bp) were excluded. Sequences annotated as ‘Eukaryota’ or ‘Virus’ were also discarded. In this way we obtained a final data set of 399 098 putative 16S sequences from bacterial and archaeal taxa.
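As an illustration only, the selection criteria described above could be sketched in Python as follows. The record layout (`annotation`, `sequence`, `lineage` keys), the case-insensitive matching and the function name `is_putative_16s` are assumptions made for this sketch; they are not taken from the envDB pipeline.

```python
import re

# Patterns and length limits come from the text; everything else is hypothetical.
INCLUDE = re.compile(r"16S|small ribosomal", re.IGNORECASE)
EXCLUDE = re.compile(r"23S|18S|intergenic", re.IGNORECASE)
EXCLUDED_LINEAGES = ("Eukaryota", "Virus")
MIN_LEN, MAX_LEN = 250, 1900


def is_putative_16s(record):
    """Return True if a record passes the annotation, length and lineage filters."""
    annotation = record["annotation"]
    if not INCLUDE.search(annotation) or EXCLUDE.search(annotation):
        return False
    if not MIN_LEN <= len(record["sequence"]) <= MAX_LEN:
        return False
    return not any(name in record["lineage"] for name in EXCLUDED_LINEAGES)


# Toy records, invented for the example.
records = [
    {"annotation": "16S ribosomal RNA", "sequence": "A" * 800, "lineage": "Bacteria"},
    {"annotation": "16S-23S intergenic spacer", "sequence": "A" * 800, "lineage": "Bacteria"},
    {"annotation": "16S ribosomal RNA", "sequence": "A" * 120, "lineage": "Archaea"},
    {"annotation": "small ribosomal subunit RNA", "sequence": "A" * 900, "lineage": "Eukaryota"},
]
kept = [r for r in records if is_putative_16s(r)]
```

Only the first toy record survives: the second matches an excluded annotation, the third is too short and the fourth is eukaryotic.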

The different samples were identified by their full bibliographic references (titles and authors). We assumed that two sequences belong to the same sample if they have identical references. Because typographic errors are frequent, a disagreement of one word between references was allowed, as long as the differing words do not contain numbers, which usually identify different samples or clones. This procedure grouped the sequences into 4334 samples. We discarded the samples that contain fewer than five sequences each, obtaining a final data set of 3502 samples. The sequences belonging to each of these samples were clustered at 100% identity using cd-hit (Li et al., 2001) to eliminate redundancies, which produced a final data set containing 359 928 sequences.
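The grouping rule can be sketched as below. This is a minimal sketch under stated assumptions: references are compared word by word at the same positions (so a one-word disagreement is treated as a substitution), each new reference is compared only with the first reference of every existing group, and the function names are hypothetical.

```python
def same_sample(ref_a, ref_b):
    """True if two references are identical or differ in one number-free word."""
    words_a, words_b = ref_a.split(), ref_b.split()
    if len(words_a) != len(words_b):
        return False
    diffs = [(a, b) for a, b in zip(words_a, words_b) if a != b]
    if len(diffs) > 1:
        return False
    # A differing word that contains a digit usually marks another sample or clone.
    return all(not any(ch.isdigit() for ch in a + b) for a, b in diffs)


def group_into_samples(references, min_size=5):
    """Greedily group references, then drop groups with fewer than min_size sequences."""
    groups = []  # each group is a list of references; the first is its representative
    for ref in references:
        for group in groups:
            if same_sample(group[0], ref):
                group.append(ref)
                break
        else:
            groups.append([ref])
    return [g for g in groups if len(g) >= min_size]


# Toy references, invented for the example.
refs = (
    ["Smith J. Diversity of bacteria in lake sediments"] * 4
    + ["Smith J. Diversty of bacteria in lake sediments"]   # typo, same sample
    + ["Lee K. Archaea of hot spring 1", "Lee K. Archaea of hot spring 2"]
)
samples = group_into_samples(refs)
```

Here the five lake-sediment references form one sample, while the two hot-spring references stay separate (their differing words contain numbers) and are then discarded for having fewer than five sequences.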

Classifying samples in environmental categories and environmental features

We derived a classification of environments for categorizing the collection of samples. The environments were classified in 5 supertypes, 20 types and 46 subtypes (Table 1). The environmental class of a given sample is the combination of its supertype, type and subtype. Each sample can also be described by one or more of the following eight environmental features: polluted, anoxic, high temperature, low temperature, saline, alkaline, acidic and diseased (this last feature applies only to animal tissues). A sample usually belongs to just one environmental class, but it can be described by several environmental features.

Table 1.  Environmental classification, showing the three hierarchical levels of definition (supertype, type and subtype), examples of the content of each category, the number of samples in each, and the number of OTUs and sequences within them.
  1. OTUs are defined at 97% identity.

| Supertype (samples) | Type (samples) | Subtype | Examples of content | Samples | OTUs | Sequences |
|---|---|---|---|---|---|---|
| Aquatic (1277) | Saline waters (300) | Coastal waters | Coastal waters | 65 | 3 620 | 8 596 |
| | | Open waters | Open marine waters | 159 | 5 087 | 13 088 |
| | | Deep waters | Deep marine waters | 34 | 1 752 | 3 621 |
| | | Lakes | Saline lakes | 23 | 727 | 973 |
| | | Other | Non-marine saline waters, saline wastewaters | 19 | 964 | 1 452 |
| | Saline sediment (199) | | Marine sediments | 199 | 8 514 | 14 300 |
| | Freshwaters (501) | Aquifers | | 42 | 1 606 | 2 087 |
| | | Groundwaters | Groundwaters, ponds, cave waters, etc. Does not include drainages from mines (Soil/Mines) | 47 | 1 768 | 3 212 |
| | | Lakes | | 131 | 4 326 | 8 505 |
| | | Rivers | | 67 | 2 823 | 5 467 |
| | | Drinking waters | Water distribution systems, bottled waters | 14 | 504 | 983 |
| | | Wastewaters | Domestic and industrial residual waters, including water treatment plants | 200 | 5 659 | 9 139 |
| | Freshwater sediment (101) | | | 101 | 4 279 | 6 670 |
| | Freshwaters–saline waters interface (31) | | River estuaries, deltas | 31 | 1 047 | 1 835 |
| | Marine host-associated (145) | | Marine microbiota forming consortia with corals, urchins, sponges, etc. | 145 | 5 116 | 8 029 |
| Terrestrial (732) | Soil (584) | Agricultural | Soils devoted to cultivation of any crop. Does not include samples associated to roots of plants (Plants/Rhizosphere) | 110 | 8 324 | 18 987 |
| | | Arctic | Frozen environments: Arctic and Alpine soils, glaciers, permafrost, ice cores, etc. | 59 | 4 186 | 6 749 |
| | | Arid | Deserts, sandy soils, beaches | 30 | 1 344 | 1 738 |
| | | Cave | Subsurface soils, caves, caverns | 21 | 682 | 1 010 |
| | | Forest | Soils associated to any kind of forest. Does not include samples associated to roots of trees (Plants/Rhizosphere) | 63 | 4 980 | 7 880 |
| | | Grassland | Grasslands, grass pastures, moorlands, prairies, steppes | 14 | 4 910 | 5 860 |
| | | Rocks | Rocky soils, stones, volcanic soils, walls, wall paintings, art objects | 67 | 2 920 | 4 039 |
| | | Saline | Salterns, salars | 27 | 1 365 | 2 859 |
| | | Other | Other or undescribed soil types | 193 | 10 360 | 17 297 |
| | Plants (148) | Rhizosphere | Samples associated to roots of plants or trees | 100 | 4 779 | 7 664 |
| | | Other | Samples associated to other parts of the plant (leaves, seeds, tree trunks, etc.), including remnants of vegetal tissues | 48 | 1 888 | 3 741 |
| Thermal (190) | Hydrothermal (79) | | Hydrothermal fields, chimneys, vents | 79 | 2 981 | 5 077 |
| | Geothermal (111) | | Hot springs, geysers | 111 | 2 705 | 6 027 |
| Host-associated (463) | Animal host (52) | | Samples from terrestrial animals. Does not include samples from humans or gastrointestinal tracts of animals | 52 | 1 292 | 2 661 |
| | Gastrointestinal tract (331) | Human | Samples from the digestive tract of humans. Includes faecal samples | 87 | 9 715 | 54 725 |
| | | Cattle | Samples from the rumen of sheep, goats, cows, rabbits, etc. Includes faecal samples | 73 | 3 418 | 6 519 |
| | | Mouse | Intestinal samples from mice. Includes faecal samples | 19 | 3 582 | 18 330 |
| | | Insect | Samples from the digestive tract of several insects, especially termites | 79 | 3 545 | 8 838 |
| | | Other | Samples from the digestive tract of other animals (fishes, birds, etc.) | 73 | 2 384 | 4 556 |
| | Oral (39) | | Samples from the human oral cavity | 39 | 886 | 10 546 |
| | Vagina (12) | | Samples from the human vagina | 12 | 314 | 2 674 |
| | Other tissue (29) | | Samples from other tissues of humans (skin, blood, bones, semen, etc.) | 29 | 1 553 | 6 521 |
| Other (569) | Aerial (11) | | Samples taken from the atmosphere or exhaled air | 11 | 1 641 | 3 938 |
| | Oil (51) | | Oil fields, oil tanks, asphalt, tar pits | 51 | 1 202 | 1 902 |
| | Artificial (640) | Compost | Composting process from organic matter | 52 | 1 607 | 2 639 |
| | | Food treatment | Elaboration and storage of foods | 20 | 368 | 1 117 |
| | | Industrial | All kinds of bioreactors, cultivations, industrial machines, industrial wastes | 222 | 4 997 | 8 192 |
| | | Mines | Samples from mines, including drainage waters and tailings | 107 | 3 836 | 6 157 |
| | | Other | Several other artificial samples, especially landfills | 39 | 1 645 | 2 628 |
| | Soil-Saline waters interface (13) | | Salt marshes, mangroves | 13 | 2 334 | 3 989 |
| | Soil-Freshwaters interface (54) | | Swamps, marshes, wetlands, peat bogs | 54 | 3 278 | 5 106 |
| Unknown (200) | | | Unclassified instances | 200 | 6 329 | 10 889 |

We used a semi-automatic text-mining procedure for classifying the samples in environmental classes and features. A set of 532 samples was initially classified by human experts, according to their titles and the corresponding PubMed abstracts, when available. This set of samples covered all environmental classes and features. For each of the samples, we created a vector of the words that appeared in its title and abstract, excluding stop words (very frequent words that do not convey useful semantic information, such as articles, prepositions, adverbs, etc.) and words seen just once. The vectors were used to train a Naïve-Bayes classifier that calculates the likelihood that a particular text belongs to each environmental class and feature according to its word usage (Manning and Schütze, 2003). Briefly, each word is given a probability of belonging to a particular class based on its frequency in that class, which is given by

$$P(W \mid C) = \frac{n_{W,C}}{\sum_{W'} n_{W',C}}$$

where $n_{W,C}$ is the number of times that word $W$ has been seen in class $C$. The probability that a given text $D$ belongs to a particular class $C_i$ can be calculated, following Bayes' rule, as

$$P(C_i \mid D) = \frac{P(D \mid C_i)\, P(C_i)}{P(D)}$$

where $P(D)$ can be ignored since we are interested in relative values, $P(C_i)$ is the ratio between the number of documents in that class and the total number of documents, and $P(D \mid C_i)$ can be calculated as the product of the probabilities for each word:

$$P(D \mid C_i) = \prod_{W \in D} P(W \mid C_i)$$

These probabilities were calculated and used to annotate the rest of the samples. A sample was assigned to the environmental class or feature with the highest probability, provided that this probability was above a minimum value and exceeded that of any other class by an adjustable percentage (Table 2). The results were checked by human experts, who corrected possible mistakes and increased the coverage by annotating unclassified instances.
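The training and assignment steps described above can be sketched in Python as follows. This is a minimal illustration, not the code used for envDB: the floor frequency for unseen words (`FLOOR`), the thresholds `min_prob` and `margin`, the normalisation of the scores over classes and the toy training texts are all assumptions. Logarithms are used only to avoid numerical underflow with long texts.

```python
import math
from collections import Counter, defaultdict

FLOOR = 1e-4  # assumed frequency for a word never seen in a class


def train(labelled_docs):
    """labelled_docs: iterable of (class_name, list_of_words)."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for cls, words in labelled_docs:
        word_counts[cls].update(words)
        doc_counts[cls] += 1
    return word_counts, doc_counts


def posteriors(words, word_counts, doc_counts):
    """Relative P(C|D) for every class, normalised to sum to 1."""
    total_docs = sum(doc_counts.values())
    log_scores = {}
    for cls, counts in word_counts.items():
        n_cls = sum(counts.values())
        log_p = math.log(doc_counts[cls] / total_docs)  # log P(C)
        for w in words:
            p_word = counts[w] / n_cls if counts[w] else FLOOR  # P(W|C)
            log_p += math.log(p_word)
        log_scores[cls] = log_p
    top = max(log_scores.values())
    weights = {c: math.exp(s - top) for c, s in log_scores.items()}
    norm = sum(weights.values())
    return {c: w / norm for c, w in weights.items()}


def assign(words, word_counts, doc_counts, min_prob=0.5, margin=0.2):
    """Return the best class, or None if it is not clearly ahead of the others."""
    post = posteriors(words, word_counts, doc_counts)
    ranked = sorted(post.items(), key=lambda kv: kv[1], reverse=True)
    best_cls, best_p = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    if best_p >= min_prob and best_p >= runner_up * (1 + margin):
        return best_cls
    return None


# Toy training set, invented for the example.
training = [
    ("Gut/Human", ["human", "fecal", "microbiota"]),
    ("Gut/Human", ["human", "gut", "flora"]),
    ("Soil/Forest", ["forest", "soil", "bacteria"]),
    ("Soil/Forest", ["soil", "community"]),
]
model = train(training)
label = assign(["human", "fecal", "microbiota"], *model)
```

With this toy model the title words are assigned to `Gut/Human`, whereas a text made only of unseen words scores equally in both classes and is left unclassified (`None`), which mirrors the instances passed on to human curators.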

Table 2.  Example of the classification of the text ‘Diversity of the Clostridium coccoides group in human fecal microbiota as determined by 16S rRNA gene library’, corresponding to the title for one of the samples in the database. Usable words in that text are ‘human’, ‘fecal’ and ‘microbiota’. The table shows the individual frequencies for each word in each category, and the corresponding final score. For calculating the final score, words that do not appear in a particular category are given a very low frequency. In this case, the classifier assigns this sample to the ‘gut/human’ category, which is correct.
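| ENV class | ‘human’ | ‘fecal’ | ‘microbiota’ | Total |
|---|---|---|---|---|
| Vagina | 0.037 | | 0.012 | 4e-08 |
| Gut/Mouse | | 0.006 | 0.033 | 2e-08 |
| Gut/Other | | 0.004 | 0.011 | 4e-09 |
| Gut/Cattle | | 0.005 | 0.005 | 3e-09 |
| Oral | 0.022 | | | 2e-10 |
| Artificial/Food treatment | | | 0.014 | 1e-10 |
| Gut/Insect | | | 0.009 | 9e-11 |
| Animal host | | | 0.009 | 9e-11 |
| Other tissue | 0.005 | | | 5e-11 |
| Artificial/Compost | | | 0.004 | 4e-11 |
| Freshwaters/River | 0.001 | | | 1e-11 |
| Sea/Coastal | | 0.001 | | 1e-11 |
| Plants | 0.001 | | 0.001 | 1e-11 |
| Freshwaters/Lake | | | 0.001 | 1e-11 |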
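As a worked check of how the final scores in Table 2 may arise, the sketch below multiplies the word frequencies of three rows, substituting a floor value for words absent from a category. The floor of 1e-4 is an assumption: the caption only says "a very low frequency", and this value was chosen because it reproduces the listed totals to one significant figure.

```python
import math

FLOOR = 1e-4  # assumed; the text does not state the actual value

# Word frequencies copied from Table 2; None marks a word absent from the category.
rows = {
    "Vagina": (0.037, None, 0.012),
    "Gut/Mouse": (None, 0.006, 0.033),
    "Oral": (0.022, None, None),
}


def score(freqs, floor=FLOOR):
    """Product of the word frequencies, using the floor for missing words."""
    return math.prod(f if f is not None else floor for f in freqs)


scores = {name: score(freqs) for name, freqs in rows.items()}
rounded = {name: f"{value:.0e}" for name, value in scores.items()}
```

Under this assumption the three products round to 4e-08, 2e-08 and 2e-10, matching the totals listed for these rows.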

The information available for the classification was often scarce: a PubMed identifier (PMID) was not found for 64% of the samples, which implies that no abstract was available, so the sample was described only by a single sentence corresponding to its title, as submitted to GenBank. Nevertheless, the classifier performed fairly well even in these circumstances, producing results for 52% of the samples with a precision of 81% (measured as the percentage of correctly classified instances, that is, those that were kept unchanged by curators). This greatly sped up sample categorization, since in these instances the human annotators only had to verify that the classifier's output was correct. At the end of the process, 3302 samples (94% of all samples) were classified, as shown in Table 1. Still, 200 samples could not be classified. It must also be noted that a single sample is sometimes composed of different individual sampling experiments that the authors have merged in the same GenBank entry. Individual contributions are impossible to separate, but since all individual samples usually come from the same or very similar environments (different lakes, different guts of termites, different water treatment plants, etc.), they do not pose an obstacle to classification or to the final objective of describing the taxonomic diversity of the different environments. In a few instances (43 cases, around 1% of the total), the individual samples come from different environments (e.g. rivers and oceans) and have been classified in all of these, to reflect the multiple origins of the sequences.

Identifying restricted samplings

Many of the samplings were performed with the objective of studying the presence and/or abundance of particular taxa, usually using specific primers targeting these taxa. These samplings, which we call ‘restricted’, do not detect other taxa, regardless of whether they are present or not. Consequently, these samples can be valuable for checking the presence of some taxa in particular samples or environments, but they are probably not appropriate for studying patterns of diversity and distribution. It is therefore necessary to distinguish these restricted samplings. Restricted samples were identified by two different procedures: (i) looking for taxonomic names in the title of the article (for instance, ‘Molecular evidence for the presence of novel actinomycete lineages in a temperate forest soil’, or ‘Diversity and depth-specific distribution of SAR11 cluster rRNA genes from marine planktonic bacteria’), and (ii) using the taxonomic assignment of operative taxonomic units (OTUs) (see below) to identify the samples that contain sequences from just a single taxon. This procedure allowed us to identify 869 restricted samples (25% of the total), which were labelled accordingly.
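The two criteria can be illustrated with the short sketch below. The list of taxon names is a tiny hypothetical stand-in for a full taxonomy, the taxonomic rank at which a sample counts as containing "a single taxon" is not stated in the text, and the function names are invented for this example.

```python
import re

# Hypothetical, tiny list of taxon names; a real run would use a full taxonomy.
TAXON_NAMES = {"actinomycete", "sar11", "clostridium", "desulfobacter"}


def title_mentions_taxon(title, taxon_names=TAXON_NAMES):
    """Criterion (i): the article title contains a taxonomic name."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return any(word in taxon_names for word in words)


def has_single_taxon(otu_taxa):
    """Criterion (ii): all assigned OTUs of the sample belong to one taxon."""
    assigned = {taxon for taxon in otu_taxa if taxon is not None}
    return len(assigned) == 1


def is_restricted(title, otu_taxa):
    return title_mentions_taxon(title) or has_single_taxon(otu_taxa)


title = ("Molecular evidence for the presence of novel actinomycete "
         "lineages in a temperate forest soil")
restricted = is_restricted(title, ["Actinobacteria", "Proteobacteria"])
```

In this example the sample is flagged through its title even though its OTUs span more than one taxon.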

Identifying OTUs

Several procedures have been proposed for grouping closely related sequences into OTUs (Seguritan and Rohwer, 2001; Schloss and Handelsman, 2005). We used cd-hit (Li et al., 2001), a very effective procedure for clustering sequences that was originally developed to remove redundancy in biological databases. We chose it because the very high number of sequences in our data set precludes methods relying on time-consuming pairwise alignments, and because cd-hit also performs well with partial sequences. Several identity percentages in the range between 90% and 100% were used for the clustering, and all of them are available in the database. As an example, clustering sequences at 97% identity results in 124 390 different clusters, or OTUs (67% of them composed of a single sequence).

Taxonomic assignment of sequences and OTUs

The sequences were assigned to a reference taxon by using the RDP classifier (Wang et al., 2007), which also provides a confidence estimate for the predictions at each taxonomic rank. We considered only the assignments made with more than 80% confidence. This resulted in predictions to different taxonomic depths for 356 250 sequences. We also collected the original GenBank annotations for all sequences. In practice, few discrepancies exist between these two sources, although RDP assignments are usually more complete and have therefore been used as the main reference. Nevertheless, the user can choose to use GenBank, RDP, or both, as sources for the taxonomic assignments. No assignments have been made below genus level.

Operative taxonomic units were classified by extracting a consensus from the taxonomic assignments of their individual sequences. The objective was to find the lowest taxonomic rank at which one taxon dominates. A taxon is considered dominant when it is the most abundant, is represented by more than five sequences in the OTU, and is the only taxon accounting for more than 25% of the total number of sequences in the OTU. The OTU is then assigned to the dominant taxon.
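The consensus rule can be read as the following sketch, which applies the three conditions literally at each rank, starting from genus and moving upwards. The rank list, the tuple representation of lineages and the function names are assumptions made for the illustration.

```python
from collections import Counter

RANKS = ("domain", "phylum", "class", "order", "family", "genus")


def dominant_taxon(taxa, total):
    """Return the taxon that dominates at one rank, or None."""
    counts = Counter(t for t in taxa if t is not None)
    if not counts:
        return None
    top, n_top = counts.most_common(1)[0]
    over_quarter = [t for t, n in counts.items() if n > 0.25 * total]
    # Most abundant, more than five sequences, and alone above 25% of the OTU.
    if n_top > 5 and over_quarter == [top]:
        return top
    return None


def consensus(lineages):
    """lineages: one tuple per sequence, ordered as RANKS, possibly truncated."""
    total = len(lineages)
    for depth in reversed(range(len(RANKS))):
        taxa = [lin[depth] if depth < len(lin) else None for lin in lineages]
        taxon = dominant_taxon(taxa, total)
        if taxon is not None:
            return RANKS[depth], taxon
    return None


base = ("Bacteria", "Proteobacteria", "Deltaproteobacteria",
        "Desulfobacterales", "Desulfobacteraceae")
otu_a = [base + ("Desulfobacter",)] * 6 + [base + ("Desulfobacula",)] * 2 + [base] * 2
otu_b = [base + ("Desulfobacter",)] * 5 + [base + ("Desulfobacula",)] * 5
result_a = consensus(otu_a)
result_b = consensus(otu_b)
```

In the first toy OTU one genus holds six of ten sequences and is the only one above 25%, so the OTU is assigned at genus level; in the second, two genera split the OTU evenly, so the assignment falls back to the family they share. Read literally, the rule leaves OTUs with five or fewer sequences without a consensus.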

In addition, the closest species to each OTU was identified by running blastn searches for a representative sequence of each OTU against the RDP database. The closest species was that of the best hit having at least 97% identity and aligning over at least 90% of the query, and belonging to an identified organism (uncultured or unidentified species were not considered). This feature must be used with caution, since we rely entirely on database annotations for ‘identified’ organisms, where inconsistencies can indeed be present.
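A possible way to apply these thresholds to a list of parsed blastn hits is sketched below. The hit fields (`organism`, `identity`, `align_length`, `bitscore`), the ranking of hits by bit score and the substring test for unidentified organisms are assumptions for this example; they do not describe the actual envDB implementation.

```python
EXCLUDED_WORDS = ("uncultured", "unidentified")


def closest_species(hits, query_length, min_identity=97.0, min_coverage=0.90):
    """Return the organism of the best qualifying hit, or None.

    hits: dicts with 'organism', 'identity' (percent), 'align_length', 'bitscore'.
    """
    for hit in sorted(hits, key=lambda h: h["bitscore"], reverse=True):
        name = hit["organism"]
        if any(word in name.lower() for word in EXCLUDED_WORDS):
            continue  # skip organisms that are not identified
        long_enough = hit["align_length"] >= min_coverage * query_length
        if hit["identity"] >= min_identity and long_enough:
            return name
    return None


# Toy hits, invented for the example.
hits = [
    {"organism": "uncultured bacterium", "identity": 99.5, "align_length": 800, "bitscore": 1450},
    {"organism": "Desulfobacter postgatei", "identity": 98.1, "align_length": 790, "bitscore": 1390},
    {"organism": "Desulfobacula toluolica", "identity": 93.0, "align_length": 800, "bitscore": 1200},
]
species = closest_species(hits, query_length=800)
```

The top-scoring hit is skipped because it comes from an uncultured organism, so the second hit provides the closest identified species.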

Database architecture

EnvDB is a MySQL database composed of four main tables: samples, environments, sequences and OTUs. The ‘Samples’ table comprises the full reference of each sampling experiment (including dates) and the number of sequences that the sample contains. The ‘Environments’ table contains the environmental classes and features for each of the samples. The ‘Sequences’ table stores all the individual sequences with reference to their original samples, together with their lengths and taxonomic affiliations. Finally, the ‘OTUs’ table contains the OTUs resulting from clustering sequences at the reference identity level.


EnvDB allows the user to perform complex queries using all possible combinations of the four tables described above, via a user-friendly web interface. The queries are constructed by specifying the search criteria and the information to retrieve from each of the tables. The interface is divided accordingly in search and retrieval fields (Fig. 1A). Using this interface, the tables can be combined easily. For instance, choosing to retrieve ‘title’ for samples, and ‘type’ for environments, the database returns the titles for all the samples and their associated environmental types. The queries can be refined by the addition of constraints. For instance, if the last query is modified by adding the taxon ‘Desulfobacter’ for sequence search, the database will provide only the samples containing sequences for the Desulfobacter genus (Fig. 1B). Different constraints can be combined, and therefore a wide range of complex queries can be constructed very easily.
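To make the example query concrete, the sketch below reproduces it on a toy database. It is only an illustration: the table and column names are a hypothetical schema, the rows are invented, and Python's built-in `sqlite3` module is used as a self-contained stand-in for the MySQL server behind envDB.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE samples (sample_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE environments (sample_id INTEGER, type TEXT);
    CREATE TABLE sequences (seq_id INTEGER PRIMARY KEY, sample_id INTEGER, taxon TEXT);
    INSERT INTO samples VALUES (1, 'Sulfate reducers in marine sediment'),
                               (2, 'Bacterial diversity in forest soil');
    INSERT INTO environments VALUES (1, 'Saline sediment'), (2, 'Soil');
    INSERT INTO sequences VALUES (1, 1, 'Desulfobacter'), (2, 1, 'Desulfobacter'),
                                 (3, 2, 'Bradyrhizobium');
""")

# Retrieve 'title' for samples and 'type' for environments,
# constrained to samples containing sequences of a given taxon.
query = """
    SELECT DISTINCT s.title, e.type
    FROM samples AS s
    JOIN environments AS e ON e.sample_id = s.sample_id
    JOIN sequences AS q ON q.sample_id = s.sample_id
    WHERE q.taxon = ?
"""
rows = conn.execute(query, ("Desulfobacter",)).fetchall()
```

Dropping the `WHERE` clause and the join on `sequences` gives the unconstrained version of the query, which lists every sample title with its environmental type.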


Figure 1. A. Snapshot of envDB main page, showing a simple query for retrieving data. B. Snapshot of the results for the query.


Some predefined queries were added to make the most frequent and simple questions easier to ask. These queries also help users to understand the query system, since selecting one automatically fills in the appropriate fields of the form and directs the users to where their input is needed. Some of these queries are: retrieving the different OTUs or taxa present in a given environment, obtaining the sequences belonging to a given taxon or OTU, or finding the samples for a particular environment. More complex queries are also predefined, such as retrieving representative DNA sequences for the OTUs present in one environment (thus retrieving in a single step all the OTU diversity found for that environment), or getting all samples and associated environments in which a given taxon has been found (which allows the distribution of a single taxon or species to be inspected).

In addition to retrieving the stored data, envDB allows users to analyse their own data. Given a set of 16S rDNA sequences input by the user, the system checks for similar sequences in the database (via similarity searches using blastn). The results include the samples and environmental distribution of the similar sequences. The user can also provide a set of query sequences and look for the most similar samples in the database. The OTUs most similar to the query sequences are found by similarity searches as before, and are used to generate a taxonomic profile for the query, which is compared with the profile of each sample by calculating a correlation coefficient. The results show the closest samples and their environmental information, and also indicate the OTUs shared and not shared between the query and the retrieved samples. In this way, the user can identify the origin of the samples most similar to the one provided. This is a very valuable tool for discovering unexpected similarities or differences between prokaryotic communities of diverse environments, and it is also useful for finding species that are particularly interesting in a given sample, for instance, because they were not observed before in related samples.
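The comparison of taxonomic profiles can be sketched as follows. The text does not say which correlation coefficient is used, so Pearson's r is assumed here; representing a profile as a mapping from OTU identifier to sequence count, and comparing profiles over the union of their OTUs, are likewise assumptions made for the illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equally long lists (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_samples(query_profile, sample_profiles):
    """Profiles map OTU id -> number of sequences; returns samples sorted by r."""
    results = []
    for name, profile in sample_profiles.items():
        otus = sorted(set(query_profile) | set(profile))
        r = pearson([query_profile.get(o, 0) for o in otus],
                    [profile.get(o, 0) for o in otus])
        shared = sorted(set(query_profile) & set(profile))
        results.append((name, r, shared))
    return sorted(results, key=lambda item: item[1], reverse=True)


# Toy profiles, invented for the example.
query = {"otu1": 10, "otu2": 5, "otu3": 1}
database = {
    "lake": {"otu1": 8, "otu2": 4, "otu4": 1},
    "gut": {"otu5": 20, "otu6": 3, "otu3": 1},
}
ranked = rank_samples(query, database)
```

In this toy case the lake sample ranks first, sharing `otu1` and `otu2` with the query, while the gut sample shares only a rare OTU and correlates negatively.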


The database is updated on a weekly basis, by downloading the new entries in the environmental section of the GenBank database and analysing them according to the procedure described above. This process is fully automatic. The classification of samples in environments can also be performed automatically, but human curation is important to improve the quality and coverage of the annotations.

Future directions

The next steps in the development of the database will include adding all possible environmental, geographical and physicochemical information that can help to characterize the environmental samples. Much of this information is often cited in the article describing the sampling experiment. We plan to develop advanced text-mining tools that allow us to discover these pieces of information in the texts automatically. In particular, we are very interested in adding the geographical location of the samples, to allow more powerful analyses of biogeographic patterns in prokaryotes.

Future versions will include new environmental classifications following the envO standard, such as Habitat-Lite. We will also use the ‘isolation_source’ field of GenBank entries to improve the environmental classification process. We would also like to assess the possibility of customizing the environmental classifications according to the preferences of the users.

Finally, we are also considering the possibility of automatically calculating a phylogenetic tree from a set of sequences selected by the user. This could be useful for assessing relationships between phylogenetic and environmental distances, for instance, for studying the possible existence of biogeographic patterns for the sequences of a particular taxon.



We wish to acknowledge Ana Durbán and Alexander Neef for critical assessment of the envDB system. This work was supported by projects BFU2006-06003 from the Spanish Ministry of Education and Science (MEC), and GV/2007/050 from the Generalitat Valenciana, Spain. J.T. is a recipient of a contract in the FIS Program from ISCIII, Spanish Ministry of Health.

