INVITED REVIEW: Microbial ecology in the age of genomics and metagenomics: concepts, tools, and recent advances



    1. Department of Biology, McMaster University, 1280 Main Street West, Hamilton, Ontario L8S 4K1, Canada, and Institute of Tropical Medicine, Hainan Medical College, Haikuo, Hainan, China

Jianping Xu, Fax: 1-905-522-6066; E-mail:


Microbial ecology examines the diversity and activity of micro-organisms in Earth's biosphere. In the last 20 years, the application of genomics tools has revolutionized microbial ecological studies and drastically expanded our view on the previously underappreciated microbial world. This review first introduces the basic concepts in microbial ecology and the main genomics methods that have been used to examine natural microbial populations and communities. In the ensuing three specific sections, the applications of genomics in microbial ecological research are highlighted. The first describes the widespread application of multilocus sequence typing and representational difference analysis in studying genetic variation within microbial species. Such investigations have identified that migration, horizontal gene transfer and recombination are common in natural microbial populations and that microbial strains can be highly variable in genome size and gene content. The second section highlights and summarizes the use of four specific genomics methods (phylogenetic analysis of ribosomal RNA, DNA–DNA re-association kinetics, metagenomics, and micro-arrays) in analysing the diversity and potential activity of microbial populations and communities from a variety of terrestrial and aquatic environments. Such analyses have identified many unexpected phylogenetic lineages in viruses, bacteria, archaea, and microbial eukaryotes. Functional analyses of environmental DNA also revealed highly prevalent, but previously unknown, metabolic processes in natural microbial communities. In the third section, the ecological implications of sequenced microbial genomes are briefly discussed. Comparative analyses of prokaryotic genomic sequences suggest the importance of ecology in determining microbial genome size and gene content. The significant variability in genome size and gene content among strains and species of prokaryotes indicates the highly fluid nature of prokaryotic genomes, a result consistent with those from multilocus sequence typing and representational difference analyses. The integration of various levels of ecological analyses, coupled to the application and further development of high throughput technologies, is accelerating the pace of discovery in microbial ecology.


Micro-organisms have been integral to the history and function of life on Earth. They have played central roles in Earth's climatic, geological, geochemical, and biological evolution. However, until very recently, the general importance of micro-organisms has been appreciated by only a few specialists. Indeed, micro-organisms are still most often considered from an anthropocentric perspective, with attention focused on the relatively few species that cause human diseases and the potential of micro-organisms to provide useful products and services. The recent advances in genomics are offering fresh perspectives on this previously underappreciated microbial world.

The microbial world contains a highly heterogeneous group of organisms sharing only one common characteristic, their small size. These organisms make up two (out of three) entire Domains of life on Earth, the prokaryotic Bacteria and Archaea (Woese 1987). Within the third Domain, Eukarya, the majority of the phylogenetic diversity is contained within eukaryotic micro-organisms such as protozoa, algae, and fungi. Prokaryotic life emerged about 3.8 billion years ago, about 2 billion years before eukaryotic life arose. Currently, microbial life forms are found in virtually every imaginable ecological niche on Earth, from the tropics to the Arctic and Antarctica, from underground mines and oil fields to the stratosphere and the top of great mountains, from deserts to the Dead Sea, from above-ground hot springs to underwater hydrothermal vents.

Microbial ecology examines the diversity of micro-organisms and how micro-organisms interact with each other and with their environment to generate and to maintain such diversities. Consequently, microbial ecologists have traditionally focused on two areas of study: (i) microbial diversity, including the isolation, identification and quantification of micro-organisms in various habitats; and (ii) microbial activity, that is, what micro-organisms are doing in their habitats and how their activities contribute to the observed microbial diversity and biogeochemical cycling.

Microbial diversity in the environment can be measured by various indices such as phylogenetic diversity, species diversity, genotype diversity, and gene diversity (Box 1). Above the species level, microbial diversity is commonly quantified based on evolutionary distances among observed taxonomic groups from a specific environment (e.g. the phylogenetic diversity based on a common chronometer such as the 16S ribosomal RNA subunit). Below the species level, microbial diversity is typically described using population genetic parameters such as gene diversity and genotype diversity. Gene diversity and genotype diversity refer respectively to the probability that two randomly drawn genes and genotypes in a population will be different. At the species level, microbial diversity is measured as species diversity. There are various measures of species diversity. One commonly used measure is the probability that two randomly drawn individuals in an environment will belong to different species. This measure takes into account both the number of species (species richness) and the frequency of each species (species abundance) in the environment. Conceptually, this measure of species diversity is similar to those used for gene diversity and genotype diversity.
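
The probability-based measures described above can be illustrated with a short calculation. The following is only a minimal sketch: it assumes each sampled individual (or gene, or genotype) has already been assigned a label, and it uses the standard form 1 − Σp², where p is the observed frequency of each label. The function names and the ten-isolate sample are hypothetical and are not taken from any study cited here.

```python
from collections import Counter

def diversity(labels):
    """Probability that two draws with replacement carry different labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("at least one observation is required")
    return 1.0 - sum((n / total) ** 2 for n in counts.values())

def unbiased_diversity(labels):
    """Same quantity for draws without replacement (small-sample correction)."""
    n = len(labels)
    if n < 2:
        raise ValueError("at least two observations are required")
    return diversity(labels) * n / (n - 1)

# Hypothetical sample: ten isolates assigned to three species.
sample = ["sp1"] * 5 + ["sp2"] * 3 + ["sp3"] * 2
print(round(diversity(sample), 3))           # 0.62
print(round(unbiased_diversity(sample), 3))  # 0.689
```

The same functions apply unchanged to alleles at a locus (gene diversity) or to multilocus genotypes (genotype diversity), which is the conceptual similarity noted in the text.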

Box 1. Measures of microbial diversity in natural environments
Nucleotide diversity | Functional diversity
Gene diversity | Morphological diversity
Genotype diversity | Structural diversity
Species diversity | Metabolic diversity
Phylogenetic diversity | Metabolite diversity
Evolutionary diversity | Protein diversity
Ecological niche diversity |

Species is the fundamental unit of biological classification and is critical for describing, understanding and comparing biological diversities at different levels among ecological niches. However, what constitutes a species remains controversial. For sexual organisms with the meiotic life cycle (such as the majority of plants, animals and sexual microbial eukaryotes), although over 20 species concepts exist in the literature (Mayden 1997), the most widely used is the biological species concept. In this concept, a species consists of individuals capable of interbreeding with each other to produce fertile progeny but incapable of doing so with members of other species. However, this definition is not applicable to asexual organisms lacking a regular meiotic life cycle. Such organisms include a large proportion of eukaryotic micro-organisms as well as all prokaryotes. Because most prokaryotes lack diagnostic morphological characteristics, have no meiotic sexual life cycle, but can exchange genetic material with each other in unusual ways, the biological species concept is not applicable to them. Instead, the currently most widely accepted species concept for prokaryotes is an operational one, rooted in the degree of DNA–DNA re-association. In this definition, two strains belong to the same species when their purified genomic DNAs show at least 70% hybridization. This level of hybridization is equivalent to 94% average nucleotide identity at the whole genome scale (Konstantinidis & Tiedje 2005). It should be noted that this prokaryotic species concept does not translate well to that in plants and animals. For example, using this criterion, all members of primates (e.g. chimpanzees, orangutans, gorillas, gibbons and humans) would belong to the same species (Sibley et al. 1990). For these and other reasons, species concepts for both prokaryotes and eukaryotes are still evolving (e.g. Cohan 2004; Konstantinidis & Tiedje 2005) (Box 2).

The spatial and temporal distributions of microbial diversities are the subjects of microbial population genetics and biogeography. The patterns of distributions are often discussed in the context of environmental factors such as temperature, pH, salinity, pressure, the availabilities of water and nutrients, and the sources of energy and carbon. These ecological factors influence microbial activities and play very important roles in determining the spatial and temporal dynamics of micro-organisms in natural environments. Consequently, microbial ecologists often group micro-organisms into specific metabolic categories. For example, depending on the energy source, micro-organisms are called either phototrophs (obtaining energy from light) or chemotrophs (obtaining energy from chemicals). Among chemotrophs, if the energy sources are inorganic molecules (such as H2S, H2, NH3, and Fe2+), they are called chemolithotrophs. In contrast, if their energy sources are organic compounds, they are called chemoorganotrophs. Similarly, depending on the carbon source, micro-organisms can be either autotrophs (obtaining carbon from inorganic sources such as CO2) or heterotrophs (obtaining carbon from organic compounds). Some micro-organisms, either in a free-living state or in association with other organisms, can use atmospheric nitrogen as their nitrogen source. Indeed, the diversity of microbial metabolisms extends far beyond the typical animal and plant metabolic capabilities. Even more striking are the extreme environmental conditions in which many micro-organisms are found thriving. These conditions include extreme high and low pressure, pH, oxygen and metal concentration, salinity, radiation, desiccation, and temperatures (Rothschild & Mancinelli 2001). For example, the nitrate-reducing chemolithoautotroph Pyrolobus fumarii can grow at temperatures of up to 113 °C (Blochl et al. 1997).
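
The metabolic categories defined above amount to a small lookup on two independent properties, the energy source and the carbon source. The sketch below is purely illustrative: the input labels ("light", "inorganic", "organic") and the function name are choices made for this example, not an established nomenclature.

```python
# Terms follow the definitions in the paragraph above; the input labels
# ("light", "inorganic", "organic") are illustrative choices.
ENERGY_TERMS = {
    "light": "phototroph",
    "inorganic": "chemolithotroph",    # chemotroph using inorganic molecules
    "organic": "chemoorganotroph",     # chemotroph using organic compounds
}
CARBON_TERMS = {
    "inorganic": "autotroph",
    "organic": "heterotroph",
}

def metabolic_categories(energy_source, carbon_source):
    """Return the (energy-based, carbon-based) category names."""
    if energy_source not in ENERGY_TERMS:
        raise ValueError(f"unknown energy source: {energy_source!r}")
    if carbon_source not in CARBON_TERMS:
        raise ValueError(f"unknown carbon source: {carbon_source!r}")
    return ENERGY_TERMS[energy_source], CARBON_TERMS[carbon_source]

# An organism drawing both energy and carbon from inorganic sources,
# as a chemolithoautotroph does:
print(metabolic_categories("inorganic", "inorganic"))
# ('chemolithotroph', 'autotroph')
```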

Micro-organisms in the environment are commonly organized into several hierarchical levels, from simple to complex: individuals, populations, guilds (metabolically related populations), communities (sets of interacting guilds), and ecosystems. A microbial ecosystem consists of both the microbial community and its interacting biotic (macro-organisms such as plants and animals) and abiotic environmental factors (pH, temperature, inorganic and organic nutrients, etc.). While we commonly think of micro-organisms as decomposers of organic wastes and pathogens of plants, animals and humans, micro-organisms can also form mutualistic associations with each other as well as be fierce predators of other micro-organisms. For example, the minute bacterium Bdellovibrio (∼0.3 µm in diameter) can quickly destroy an Escherichia coli cell many times its own size (1 × 2 µm) (Nunez et al. 2003).

Until very recently, most of what we know about microbial diversity and microbial activity was derived from cultured microbes and ex situ laboratory experimental investigations. While such studies are essential, recent investigations using high resolution microelectronic, microscopic, and genomic tools have shown that much of what we thought we knew about our natural microbial world was in fact highly biased.

In the following sections, I will first provide a brief introduction to the main genomic methods that have been used to examine natural microbial populations and communities. This is then followed by three topics dealing with the impact of genomics on microbial ecology. The first topic is on the widespread application of DNA-based genomics technologies in microbial population studies in two specific areas: (i) the use of multilocus sequence typing to address a variety of ecological questions; and (ii) the use of representational difference analysis (RDA) to investigate genome size and gene content differences among bacterial strains. The second topic is on how genomic methods have been used to reveal unexpected microbial diversities in natural populations and communities. I pay special attention to how phylogenetic typing and metagenomics are transforming our views of the diversity and activity of micro-organisms in their natural habitats. The third topic summarizes how large-scale genome sequencing projects have provided unprecedented insights on the potential functions and activities of various groups of micro-organisms. I will conclude with a discussion on some of the long-standing unresolved questions and future perspectives. It should be pointed out that the field of microbial ecological genomics is progressing rapidly with thousands of publications accumulated in the last several years alone. Therefore, an exhaustive review is not possible. Instead, I have used selected examples to illustrate the impact of genomics on our current understanding of microbial ecology and its potential implications for future research.

Genomics tools

The word ‘genomics’ has become a trendy term widely used by the scientific community and the general public. Originally, the term was used to describe a specific discipline in genetics that deals with mapping, sequencing and analysing genomes. A genome refers to the complete set of genes and chromosomes in an organism. While many people use genomics in this narrow sense, an increasing number of people have expanded its use to include functional analysis of entire genomes as well. These functional analytical aspects include those on whole genome RNA transcripts (called transcriptomics), proteins (proteomics), and metabolites (metabolomics). In addition, various combinations of ‘-omics’ terms have recently become highly fashionable. For example, the discipline that uses genomics methods to analyse natural ecological communities has been called metagenomics, ecological genomics, community genomics, and environmental genomics. In this section, the main genomics tools and methods are briefly described with a focus on those dealing with DNA (Box 3).

Box 3. Genomic methods in microbial ecology research
DNA sequencing | Representational difference analysis (RDA)
Polymerase chain reaction | 2-D gel electrophoresis
DNA cloning systems (plasmid, lambda-phage, cosmid, bacterial artificial chromosome or BAC, yeast artificial chromosome or YAC) | Denaturing gradient gel electrophoresis (DGGE)
DNA re-association | Gas chromatography
Fluorescent in situ hybridization (FISH) | Mass spectrometry
Micro-array technology | Bioinformatics

DNA sequencing

The most significant technical advance in genomics is the development of efficient, high throughput DNA-sequencing techniques and instruments. While the basic principle for DNA sequencing was established in the mid-1970s, it was not until the mid-1990s that efficient automated DNA sequencers and fluorescent dyes to tag the dideoxyribonucleotides (with one colour for each of the four types of nucleotides) were developed. At present, high throughput DNA sequencing facilities are found in most academic institutions and many molecular biology laboratories. Furthermore, faster and cheaper sequencing methods and equipment are continuously being developed. For example, the recently developed pyrosequencing protocol used a novel fibre-optic slide of individual wells. This method could sequence 25 million bases in one 4-hour run with an accuracy of 99.96% (Margulies et al. 2005).

Polymerase chain reaction

The second tool is the polymerase chain reaction (PCR), which allows the analysis of minute amounts of DNA from laboratory and environmental sources. In combination with appropriate DNA extraction protocols, PCR allows highly selective amplification of target DNA. Indeed, the PCR technique is permeating almost every aspect of biological research, including many other DNA-based genomics techniques. As will be shown below, in combination with various gel electrophoresis techniques such as denaturing gradient gel electrophoresis (DGGE), amplification and analysis of the nuclear small subunit ribosomal RNA gene from environmental samples have significantly enhanced our understanding and appreciation of natural microbial diversities.

DNA cloning systems

The third highly useful genomics tool for microbial ecological studies is the availability of efficient in vivo cloning systems (including cloning vectors and hosts). These systems allow the separation and amplification of individual DNA sequences from often unknown but heterogeneous gene pools. A large variety of such systems is now available to accommodate different types and sizes of DNA fragments. For example, depending on the size of fragments for cloning, the vectors may be based on plasmids (optimal range of DNA fragments 0.5–2 kb, upper limit, ∼10 kb), bacteriophages (7–10 kb, ∼20 kb), cosmids or fosmids (35–40 kb; ∼45 kb), bacterial artificial chromosomes (BAC, 80–120 kb, ∼200 kb), and yeast artificial chromosomes (YAC, 200–800 kb, ∼1.5 Mb). Vectors with large insert capacities are ideal for studying genome organizations of unculturable micro-organisms in the environment. For example, the blooming field of metagenomics has benefited significantly from the cosmid, BAC and YAC cloning systems.
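
The insert-size ranges quoted above can be turned into a simple lookup for deciding which vector types suit a given fragment. This is a minimal sketch only: the numbers are those given in the text (optimal range and approximate upper limit, in kb), while the function name and the two-tier "optimal" versus "above optimal but within the limit" split are assumptions made for illustration.

```python
# (optimal_min_kb, optimal_max_kb, approximate_upper_limit_kb), as quoted in the text.
VECTORS = {
    "plasmid": (0.5, 2, 10),
    "bacteriophage": (7, 10, 20),
    "cosmid/fosmid": (35, 40, 45),
    "BAC": (80, 120, 200),
    "YAC": (200, 800, 1500),
}

def candidate_vectors(insert_kb):
    """Return (optimal, stretch) lists of vector types for an insert size in kb.

    optimal: the insert falls inside the vector's optimal range.
    stretch: the insert exceeds the optimal range but not the upper limit.
    """
    optimal = [name for name, (lo, hi, _) in VECTORS.items() if lo <= insert_kb <= hi]
    stretch = [name for name, (_, hi, limit) in VECTORS.items() if hi < insert_kb <= limit]
    return optimal, stretch

print(candidate_vectors(100))  # (['BAC'], [])
print(candidate_vectors(9))    # (['bacteriophage'], ['plasmid'])
print(candidate_vectors(150))  # ([], ['BAC'])
```

In practice the choice also depends on factors not captured here, such as host compatibility and clone stability.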

Hybridization techniques

Several other traditional DNA analytical techniques have also been widely used in microbial ecological studies. These include DNA re-association kinetic analysis and fluorescent in situ hybridization (FISH). Using fluorescently tagged specific probes, FISH allows the direct observation and estimation of micro-organisms from specific species, genera, families or phyla in a given environmental sample. In contrast, the analyses of DNA re-association kinetics can be used to provide estimates on the diversity of microbial genomes in environmental DNA samples.

More recently, the high throughput micro-array technology has been applied to analyse the distributions of genes and species in natural microbial consortia (Zhou 2003). DNA micro-arrays are glass surfaces to which arrays of specific DNA fragments of various lengths have been attached at discrete locations. These fragments serve as probes for hybridization. Under conditions suitable for hybridization, the DNA spots on the chip are exposed to a solution containing a complex sample of fluorescent-labelled DNA. These arrays may contain probes of lengths from 25 to several hundred or even over a thousand base pairs. While most micro-arrays are derived from single genomes, arrays containing specific genes from multiple genomes can also be very useful for studying the distributions and activities of groups of micro-organisms in nature (Zhou 2003; Lehner et al. 2005).

Representational difference analysis

Because large-scale DNA sequencing is still an expensive enterprise, for most species, only one or two strains will be completely sequenced. To study variation among strains in species with sequenced representatives, a technique called representational difference analysis (RDA) has been developed (Lisitsyn et al. 1993). This method combines several molecular techniques such as DNA–DNA re-association, selective PCR, cloning, and DNA sequencing. This technique is especially powerful for genome size and gene content comparisons among strains in prokaryotic species, because strains in many prokaryotes vary widely in their genome sizes and the differences often contribute to their metabolic and ecological differences (e.g. Bergthorsson & Ochman 1995, 1998; Table 1; see also the section below, ‘Unexpected microbial diversity from environmental sources as revealed by genomics tools’).

Table 1. Genome size and ecological niche comparisons among 250 sequenced prokaryotic genomes (habitat classification and data are based on NCBI information as of August 2005)
Habitat | No. of genomes | Genome size (Mb), mean (± SD) | Genome size (Mb), range
Terrestrial | 11 | 4.92 (± 1.13) | 3.28–7.25
Multiple | 65 | 4.29 (± 1.87) | 1.40–9.12
Aquatic | 26 | 3.14 (± 1.60) | 1.31–7.15
Host-associated | 122 | 2.57 (± 1.64) | 0.49–9.11
Specialized | 23 | 2.29 (± 0.92) | 0.71–5.37
Unknown | 3 | 3.47 (± 2.37) | 0.80–5.31

Tools for the analyses of the transcriptome, proteome, and metabolome

Aside from advances in techniques for analysing DNA, technical breakthroughs for analysing messenger RNA (mRNA), proteins, metabolites as well as interactions among these cellular constituents have also become common. For example, the high throughput micro-array technology has greatly increased the efficiency of genome-wide gene expression studies, allowing the analysis of potential genome–environment interaction of microbial communities in both laboratory and natural settings. Similarly, 2-D gel electrophoresis, mass spectrometry, and gas chromatography are providing unprecedented access to the constituents of microbial community proteins and small metabolites.


Of all the methods mentioned above, none would have been successful in microbial ecological research without bioinformatics tools. Broadly defined, bioinformatics refers to the use of computers to seek patterns in the observed biological data and to propose mechanisms for such patterns. As shown below, bioinformatics can not only help us directly address experimental research objectives but also integrate information from various sources and seek patterns not achievable through experimentation alone.

Genomics tools in ecological genetics studies of cultured microbial populations

This section highlights the impact of DNA-based molecular techniques on our understanding of microbial diversity below the species level. I will provide examples in two specific areas. The first is on how multilocus sequence typing (MLST) has improved our understanding of microbial diversity and population structure, with a special focus on the inferences of the relative roles of clonality and recombination in generating genotype diversity in microbial populations. The second topic is on how the use of RDA can help us reveal the tremendous diversity in genome content among microbial strains.

Multilocus sequence typing

The development of highly affordable, reliable, and efficient DNA sequencing technology has accelerated many areas of scientific research. One prominent example is the multilocus sequence typing (MLST) of microbial populations. As the name suggests, MLST refers to the use of DNA sequences from multiple regions in the genome for discriminating strains in populations. Though the term was coined only in 1998 for typing human bacterial pathogens (Maiden et al. 1998), its use in microbial ecological and evolutionary analyses dates back more than two decades. It has various other synonyms such as multiple gene genealogical analysis (MGGA) or comparative genealogical analysis (CGA) (e.g. Xu et al. 2000; Xu 2005).
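
To make the idea of discriminating strains by sequences at multiple loci concrete, the sketch below assigns an allele number to each distinct sequence at each locus and a sequence type to each distinct combination of allele numbers. It is a simplified illustration, not the procedure of any particular MLST scheme or database: the function name, locus names, and toy sequences are hypothetical, and it assumes that every strain has been sequenced at every locus and that alleles are called by exact sequence identity.

```python
def assign_sequence_types(strains):
    """Map {strain: {locus: sequence}} to {strain: (allele_profile, sequence_type)}.

    Alleles and sequence types are numbered from 1 in order of first appearance.
    """
    loci = sorted({locus for seqs in strains.values() for locus in seqs})
    allele_ids = {locus: {} for locus in loci}   # per-locus: sequence -> allele number
    profile_ids = {}                             # allele profile -> sequence type
    result = {}
    for name, seqs in strains.items():
        profile = tuple(
            allele_ids[locus].setdefault(seqs[locus], len(allele_ids[locus]) + 1)
            for locus in loci
        )
        sequence_type = profile_ids.setdefault(profile, len(profile_ids) + 1)
        result[name] = (profile, sequence_type)
    return loci, result

# Toy data: four strains typed at two hypothetical loci.
toy = {
    "s1": {"geneA": "ACGT", "geneB": "GGCA"},
    "s2": {"geneA": "ACGT", "geneB": "GGCA"},  # identical to s1 at both loci
    "s3": {"geneA": "ACGA", "geneB": "GGCA"},
    "s4": {"geneA": "ACGT", "geneB": "GGTA"},
}
loci, typed = assign_sequence_types(toy)
for name, (profile, st) in typed.items():
    print(name, dict(zip(loci, profile)), "ST", st)
```

Here s1 and s2 share a sequence type, while s3 and s4 each differ from them at one locus; the resulting sequence types can then be treated as multilocus genotypes in downstream diversity calculations.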

There are several advantages of analysing multiple loci over the analysis of data based on a single locus: (i) it can generate more information, thus generally more robust conclusions; (ii) it samples multiple regions of the genome and thus results are more representative of the whole genome; and (iii) in many prokaryotes, horizontal gene transfer is very common and if the selected single gene happened to have been horizontally transferred, information derived from this gene will not be representative of other parts of the genome (Xu 2005).

Compared to other types of strain-typing methods (e.g. multilocus enzyme electrophoresis or MLEE, random amplified polymorphic DNA or RAPD, amplified fragment length polymorphisms or AFLP, restriction fragment length polymorphisms or RFLP, PCR-RFLP and PCR fingerprinting) that have been applied to analyse microbial populations, DNA sequence-based typing has many advantages. First, nucleotides in a DNA sequence are unambiguous. Such certainty is essential for many analyses. Second, nucleotides in a given DNA fragment typically share extended evolutionary history. Such sharing cannot be assumed between genetic markers in different parts of genomes as those obtained with other methods. Third, DNA sequences can be easily stored in and retrieved from public databases such as GenBank. Existence of such public databases makes data-sharing among investigators possible. Fourth, many analytical tools for DNA sequences are available. Indeed, many methods have been developed to infer a variety of processes governing the changes in populations and species (Xu 2005).

MLST has been used to study the ecological genetics of many microbial populations. It provides fine-scale measures of gene diversity and genotype diversity among microbial populations. These patterns of diversity have been used to infer a variety of ecological and evolutionary processes such as gene flow, cryptic speciation, hybridization, and the relative importance of clonality and recombination among analysed populations (Box 4). In human pathogenic bacteria where much of the initial MLST work was carried out, MLST allows the identification of medically important strains and clones. There are several recent topical reviews for readers interested in MLST of human bacterial pathogens (e.g. Urwin & Maiden 2003; Feil & Enright 2004). In contrast, other environmentally more relevant groups of micro-organisms are less researched or discussed.

Box 4. MLST is a powerful method to address a variety of ecological issues in microbial populations
Recombination | Niche specialization
Speciation/historical divergence | Host shifts
Gene flow/dispersion/migration | Adaptive evolution

Using specific examples, the following two subsections illustrate how MLST has been used to address microbial ecological questions. The first subsection provides a brief description on how MLST has been used to address evolutionary divergence, dispersion, hybridization, and the origin of a population in a soil basidiomycete fungus, Cryptococcus neoformans. The second subsection highlights recent evidence for recombination in natural populations of viruses, bacteria, protozoa, algae and fungi.

MLST in C. neoformans. C. neoformans (=Filobasidiella neoformans) is a soil fungus that can cause significant infections in humans and other mammals throughout the world. This species has been traditionally classified into five serotypes — A, B, C, D, and AD. To understand the evolutionary relationships among strains, geographic populations, and serotypes and to address ecological genetic questions, a series of gene genealogy-based studies were conducted. The first analysed 34 strains from various locations around the world, including 14 serotype A strains, 7 serotype D strains, 3 serotype B strains, 5 serotype C strains, 3 serotype AD strains and 2 strains whose serotypes could not be determined (Xu et al. 2000). Fragments of four genes were analysed for each strain, three from different chromosomes of the nuclear genome and one from the mitochondrial genome. Phylogenetic analysis of each of the four genes indicated considerable divergence among serotypes A, D, B, and C, suggesting that individual serotypes A, D, B, and C are good phylogenetic species (Fig. 1). However, there was little geographic pattern of genetic variation. No correlation between geographic distance and DNA sequence divergence among strains was observed either within a serotype or the whole analysed population. The results are consistent with recent dispersals of C. neoformans throughout the world (Xu et al. 2000; Xu 2002).

Figure 1.

One most parsimonious tree for 34 isolates of Cryptococcus neoformans from each of the four gene regions sequenced. CI, consistency index; RI, retention index. Numbers above each branch are bootstrap values > 50% and based on 500 replicates. For URA5 and LAC trees, branches with > 50% of bootstrap values were also strict consensus branches. Strain designation indicates serotype, isolate name, and geographic origin (CA, California; NYC, New York City; NC, North Carolina, all from the USA). With the exception of five strains (see text), all major phylogenetic groups correspond to traditional classifications. Of the two serologically untypable strains, one (M0024) clustered consistently with the serotype D group and the other (M0053) clustered consistently with the serotype A group. Two of the three strains of serotype AD, CN110.97 and CN196.88, clustered consistently with the serotype A group, while the other (KW5) lacked a consistent affinity with any of the serotypes. Scale bar represents one nucleotide substitution. (Xu et al. 2000). Reproduced by permission.

Strains of serotype AD were quite different from those of serotypes A, B, C, and D. While most strains of serotypes A, B, C, and D examined so far were haploid, strains of serotype AD are diploid or aneuploid. Furthermore, direct sequencing of PCR products from serotype AD strains often failed to yield clear chromatograms and DNA sequences. Such results suggested sequence heterogeneity within individual strains. To investigate their origin and relationships to strains of other serotypes, alleles of two different genes from strains of serotype AD were individually cloned, sequenced and compared to strains of serotypes A, B, C, and D (Xu et al. 2002; Xu & Mitchell 2003). Sequence comparisons revealed that most strains contained two different alleles with one allele highly similar to the serotype A group and the other to the serotype D group. Further phylogenetic analyses identified that these serotype AD strains were recent hybrids between strains of serotypes A and D, and that there have been multiple hybridization events in C. neoformans (Fig. 2; Xu et al. 2002; Xu & Mitchell 2003). A recent study applied the same MLST method to identify the origin of a Cryptococcus population responsible for an unusual outbreak in animal and human populations on Vancouver Island, British Columbia, Canada (Kidd et al. 2005). The analyses suggested that the Vancouver Island population contained at least two evolutionarily divergent elements shared by strains from many other geographic areas, consistent with cryptic speciation and recent migration observed earlier for Cryptococcus (Xu et al. 2000; Kidd et al. 2005).

Figure 2.

One of the 10 most parsimonious trees for the 28 LAC sequences from 14 strains of serotype AD in Cryptococcus neoformans. For comparison, five representative sequences from serotype A (E1, CN-A, MMRL750, J10 and ZG280) and five from serotype D (B10, CN-D, J9, MMRL751 and MMRL757) were included in this figure. These 10 sequences were shown in Fig. 1 and represented the genetic diversity of serotypes A and D strains. Numbers above branches are bootstrap values > 50% and based on 1000 replicates. Designations for strains of serotypes A and D included the isolate name, geographic origin (CA, California; NYC, New York City, both in the USA), and serotype. For the 28 serotype AD sequences, strain designations are followed by ‘−1’ or ‘−2’ to indicate the two alleles within each strain. Midpoint rooting is used for this phylogeny but the tree topology is identical to that when serotype B or C sequences were used as outgroups. Scale bar represents one nucleotide substitution (Xu et al. 2002). Reproduced by permission.

Clonality and recombination in microbial populations. All microbes can reproduce asexually and generate clones and clonal lineages. As expected, in natural populations of all microbial species examined (including viruses, bacteria, protozoa, algae and fungi), signatures of clones and clonal lineages are commonly found. These population genetic signatures include (i) limited or lack of genetic variation among individuals, (ii) over-representation of certain genotypes, and (iii) significant associations among alleles located on the same or different genomic regions (Xu 2004).

While clonal reproduction is expected and commonly observed in natural microbial populations, the importance of recombination has been rather obscure (e.g. Lenski 1993; Maynard Smith et al. 1993; Feil & Enright 2004). Unlike plants and animals where sexual reproduction (hence recombination) can often be observed directly in nature, recombination in natural populations of micro-organisms has to be inferred using gene and genotype frequencies. The key notion of this inference is that in purely clonal populations, alleles from genes in different parts of the genome should give identical evolutionary patterns among individuals in the population and that these alleles should be in significant linkage disequilibria. In contrast, recombination would break up these associations and generate linkage equilibrium. Using MLST, congruent genealogies for genes distributed in diverse genomic locations would be consistent with clonality and incongruent genealogies suggest recombination.
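The contrast between linkage disequilibrium and linkage equilibrium can be made concrete with a small calculation. The following is a minimal sketch, not a method taken from the studies cited here: it assumes haploid isolates scored at two biallelic loci, and the function name `linkage_disequilibrium` and the two toy populations are hypothetical. It computes the classical coefficient D (the observed frequency of one allele combination minus the product of the two allele frequencies) and the standardized measure r².

```python
def linkage_disequilibrium(haplotypes):
    """Return (D, r_squared) for two biallelic loci.

    `haplotypes` is a list of (allele_at_locus_1, allele_at_locus_2)
    tuples, one tuple per haploid isolate.
    """
    n = len(haplotypes)
    # Use the alleles carried by the first isolate as the reference alleles.
    a, b = haplotypes[0]
    p_a = sum(1 for h in haplotypes if h[0] == a) / n
    p_b = sum(1 for h in haplotypes if h[1] == b) / n
    p_ab = sum(1 for h in haplotypes if h == (a, b)) / n
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    # r^2 is undefined when either locus is monomorphic; report 0.0 then.
    r_squared = d * d / denom if denom > 0 else 0.0
    return d, r_squared


# Hypothetical toy populations (illustrative only, not real data).
# Purely clonal: only two multilocus genotypes exist.
clonal = [("A", "B")] * 10 + [("a", "b")] * 10
# Freely recombining: all four allele combinations are equally common.
recombining = ([("A", "B")] * 5 + [("A", "b")] * 5 +
               [("a", "B")] * 5 + [("a", "b")] * 5)

print(linkage_disequilibrium(clonal))
print(linkage_disequilibrium(recombining))
```

In the clonal toy population the two loci are completely associated (r² = 1), whereas in the recombining one D is zero. Real analyses test such associations for statistical significance across many loci rather than relying on a single pair.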

Over the past two decades, numerous studies have confirmed that genetic recombination is ubiquitous in natural populations of viruses, bacteria, protozoa, algae, and fungi. Indeed, despite extensive investigations, and while the frequencies of recombination have been difficult to quantify, no ancient asexual microbial populations or species have been found (Box 5). Below are a few recent examples of genetic recombination identified in natural populations of representative groups of micro-organisms using MLST or whole genome sequences.

Recombination in viral populations. One of the best-known examples of viral sexuality is probably that of the influenza A virus — the causal agent of the common human flu. This virus has a genome with eight segments of single-stranded RNA. When co-infection of different viral strains occurs, a large number of recombinant influenza A viruses can be produced. These recombinants generate antigenic shifts and have been credited for some of the deadliest flu epidemics in recent human history (Capua & Alexander 2002). Recombination has also been observed in many bacteriophages (Hendrix 2003), plant viruses (Keese & Gibbs 1993), and animal and human viruses. Examples of human viruses exhibiting recombination include, but are not limited to, the dengue virus (Tolou et al. 2001), the human immunodeficiency virus (Yamaguchi et al. 2003), and the hepatitis B virus (Miyakawa & Mizokami 2003).

Recombination in prokaryotes. MLST has revealed abundant evidence for recombination in natural populations of prokaryotes. Some of the well-known examples include the common human pathogens Escherichia coli, Neisseria meningitidis, Streptococcus pneumoniae, Haemophilus influenzae, and Staphylococcus aureus (e.g. Feil & Spratt 2001). Different degrees of recombination were detected in populations of these species, with E. coli and H. influenzae showing relatively low rates of recombination while N. meningitidis, Str. pneumoniae, and Sta. aureus showed high rates (Feil & Spratt 2001). In bacteria that are not human pathogens, such as the nitrogen-fixing bacterium Sinorhizobium meliloti, evidence for recombination is also pervasive (Sun S. and Xu J., unpublished). Recent comparative analysis of whole prokaryotic genomes identified that bacterial recombination often extends beyond the traditional species boundary. This phenomenon is commonly referred to as horizontal gene transfer or lateral gene transfer and includes genetic exchange between species from different genera, families, and occasionally across kingdoms and/or domains. Indeed, signatures of horizontal gene transfer are ubiquitous among the sequenced prokaryotic genomes (e.g. Koonin 2003).

Recombination in eukaryotic microbes. Similar to observations in natural viral and bacterial populations, molecular investigations have identified that almost all eukaryotic microbial populations show signatures of recombination in nature. Examples include the algal species Bostrychia moritziana (West & Zuccarello 1999); pathogenic protozoan species such as Trypanosoma cruzi (the causal agent of Chagas disease, Bogliolo et al. 1996) and the malaria parasites Plasmodium falciparum (Conway et al. 1999) and Plasmodium vivax (Putaporntip et al. 2002); and fungal species such as C. neoformans mentioned above (Xu & Mitchell 2003). Interestingly, many of the fungal species previously thought to reproduce only asexually (the Deuteromycota or Fungi Imperfecti) have been found to contain signatures of recombination in natural populations (Xu 2005). Among examined fungi, the degrees of sexuality differ greatly, from panmictic to largely clonal (James 2005; Pujol et al. 2005; Xu et al. 2005). At present, plant and human pathogens dominate the examined species in the literature. However, limited evidence from other groups of fungi suggests a similar pattern: abundant evidence for clonality and limited but unambiguous evidence for recombination (James 2005; Pujol et al. 2005; Xu et al. 2005).

RDA in analysing prokaryotic gene content differences among strains

Variations in genome size among strains within and between species are common in bacteria. For example, the genomes of natural isolates of the common bacterium Escherichia coli can vary by more than 1 Mb (Parkhill & Thomson 2004). Among the serotypes of another common bacterium, Salmonella enterica (var. enteritidis; var. paratyphi; var. typhi, and var. typhimurium), chromosome sizes can differ by ∼300 kb (Parkhill & Thomson 2004). Among the sequenced prokaryotic species, genome sizes vary by over 18-fold, from the obligate archaeal parasite Nanoarchaeum equitans, which has a genome size of about 490 kb (Waters et al. 2003), to the soil bacterium Streptomyces avermitilis, which has a genome size of over 9000 kb (Omura et al. 2001). The genomic differences among sequenced bacterial species will be discussed in a later section. In this section, the focus is on analysing the naturally occurring differences among bacterial strains within species. To date, the focus has been on human pathogenic bacteria, including N. meningitidis (Bart et al. 2000), Neisseria gonorrhoeae (Tinsley & Nassif 1996), Vibrio cholerae (Calia et al. 1998), Bordetella spp. (23), and E. coli (Allen et al. 2001). Using RDA and downstream functional characterization, many strain-specific genes in the above-mentioned species were found to play important roles in ecological adaptations such as host specificity, nutrient utilization, stress tolerance, pathogenicity, and antibiotic resistance. Below, I describe a recent example of using RDA to analyse genome size and gene content variation among strains of the nitrogen-fixing soil bacterium Sinorhizobium meliloti (Guo et al. 2005).

The sequenced Si. meliloti strain Rm1021 has a tripartite genome structure with one chromosome (3.65 Mb) and two megaplasmids, pSymA (1.35 Mb) and pSymB (1.68 Mb). Using the RDA method (Fig. 3), a large number of novel DNA sequences not present in the sequenced laboratory model strain Rm1021 of Si. meliloti were identified. In this study, we used strain Rm1021 as the driver and the type strain of Si. meliloti, ATCC9930, which has a genome ∼370 kb larger than that of strain Rm1021, as the tester. Among the 85 novel DNA fragments examined, 55 showed no obvious homologues anywhere in the public databases. Of the remaining 30 sequences, 24 contained homologues to the Rm1021 genome as well as unique segments not found in the Rm1021 genome; 3 contained sequences homologous to those published for another Si. meliloti strain but absent in Rm1021; 2 contained sequences homologous to those of other symbiotic nitrogen-fixing bacteria, Rhizobium etli and Bradyrhizobium japonicum; and 1 contained a sequence with 87% sequence identity to the 6-aminohexanoate-dimer hydrolase gene on the plasmid of Pseudomonas sp. NK87. Interestingly, this protein was found capable of degrading nylon oligomers (Yomo et al. 1992; Kanagawa et al. 1993). Nylon oligomers are among the compounds that were not present in natural environments until synthesized and released by humans very recently. The distribution of 12 of the above 85 novel sequences among a collection of 59 natural Si. meliloti strains was further analysed using PCR. The frequency of occurrence varied widely among the 12 novel DNA fragments, from 1.7% to 72.9% of strains (Guo et al. 2005). Our recent experiments show that micro-arrays fabricated based on the genome sequence of model strains can also be used very effectively to examine the distributions of genes among strains (Fig. 4; Guo & Xu, unpublished; Box 6). The exact ecological roles of some of these sequences are being examined.

Figure 3.

Overview of the representational difference analysis of genomic differences between strains of Sinorhizobium meliloti (modified from Guo et al. in press). Tester (T): ATCC9930. Driver (D): Rm1021. Filled black boxes: DNA adaptors. Unfilled boxes: tester DNA. Shaded boxes: driver DNA.

Figure 4.

Application of micro-arrays in the analysis of genomic differences between strains of Sinorhizobium meliloti. In this figure, red represents hybridization signal from one strain; green represents hybridization signal from a different strain; and yellow indicates that both strains have the probe sequence. In each of the four subarrays, there are three vertically divided repeats. As the arrays show, micro-array screening for gene content differences among strains is highly repeatable.

Unexpected microbial diversity from environmental sources as revealed by genomics tools

In this section, I will focus on how modern genomics tools are helping us to reveal microbial diversity in natural microbial communities. Until very recently, microbial diversity in the environment was estimated using culture-dependent approaches. However, for two reasons, culture-dependent methods cannot accurately describe naturally occurring microbial communities. First, our current culturing methods target only those organisms we know how to culture. For most unknown micro-organisms, we simply don't know how to grow them. Second, even among culturable micro-organisms, the observed diversity on standard microbiological media may not be representative of that in nature. This is because, while thousands of media and growth conditions have been developed over the years to culture various micro-organisms, very few researchers have the facilities or manpower to test all the conditions on natural microbial samples. The application of culture-independent genomics tools in the last two decades is allowing more accurate estimations. Below, I provide a summary to show how four specific methods (phylogenetic analysis of ribosomal RNA (rRNA) genes, DNA–DNA re-association kinetics, metagenomics, and micro-arrays) have been used to reveal microbial diversity in natural environments.

Phylogenetic analyses of environmental ribosomal RNA

The use of culture-independent methods to estimate microbial diversity in the environment started in the mid-1980s (Pace et al. 1985). The initial scheme involved isolating total DNA directly from the environment, cloning the DNA using vectors such as bacteriophage-lambda, and screening for clones that hybridized to the rRNA probes, and sequencing the positive clones. Many types of rRNA sequences not present among cultured microbes from the same samples were identified. The incorporation of gene-specific PCR before the cloning step in the late 1980s significantly streamlined the procedure and allowed more direct estimation. The very first application of PCR in phylogenetic analysis of mixed microbial communities in ocean waters led to the discovery of ubiquitous and abundant groups of new micro-organisms (Giovannoni et al. 1990). In addition, this study identified significant genetic microheterogeneity among closely related phylogenetic types. Since the beginning of the 1990s, there has been widespread application of PCR-based analyses of 16S rRNA to examine mixed microbial communities in diverse environments.

Phylogenetic comparisons of rRNA genes from environmental sources have led to the discovery of many novel microbial taxonomic groups. Indeed, many new major groups of micro-organisms have been found only through culture-independent surveys. The following sections highlight recent progress for the major microbial groups (Box 7).

Bacteria.  In 1987, based on rRNA sequence data, Woese identified 12 major divisions (phyla) in the Domain Bacteria. The analysed bacteria represent almost all major cultured groups of Bacteria accumulated during the previous century of microbiological research. In just over a decade, culture-independent surveys identified that there are at least 40 well-resolved major bacterial divisions. That is, there are about 30 major bacterial divisions with no or very few cultured representatives in our collection (Hugenholtz et al. 1998; Konstantinidis & Tiedje 2004). These discoveries are now guiding a coordinated effort by the microbiology community to culture representatives from many of the unknown major divisions of Bacteria in order to study their genetic, physiological and ecological properties.

Archaea.  The culture-independent methods have also revealed major new types of Archaea. At present, there are about 300 cultured and named archaeal species, primarily belonging to phylum Euryarchaeota, with a few examples from phylum Crenarchaeota, one from Nanoarchaeota and none from Korarchaeota. Schleper et al. (2005) compiled over 8000 deposited archaeal rRNA gene sequences from various natural environments. Phylogenetic analyses suggested that Domain Archaea contains at least 50 distinct phylogenetic groups with 33 from the current Euryarchaeota, 13 from Crenarchaeota, 1 each from Korarchaeota, Nanoarchaeota, and the ancient archaeal group (AAG). The divergence among these phylogenetic groups is similar to those among many bacterial phyla. Among these 50 phylogenetic groups, only 13 have cultured representatives.

In addition, before the application of culture-independent methods, Archaea were thought to be present only in extreme habitats. Recent investigations have identified that Archaea are also widespread in diverse nonextreme habitats such as gardens and forests and the water and sediments of marine environments and freshwater lakes, as well as in extreme habitats such as hot springs, saline lakes and deep-ocean thermal vents (Black Smokers). For example, in the marine environment at depths of 100–5000 m, the average Archaea density is about 1 × 10⁵/mL, accounting for about 20% of all microbial cells in the ocean (Karner et al. 2001). In 2002, a tiny archaeon appropriately called Nanoarchaeum was reported. This archaeon was found to live in an obligate association with another archaeon in the genus Ignicoccus. Phylogenetic analysis indicated that Nanoarchaeum has diverged significantly from all known archaeal rRNA sequences (Huber et al. 2002). However, it should be pointed out that a recent phylogenetic analysis using ribosomal protein gene sequences from many archaeal species suggested significant uncertainty in the placement of Nanoarchaeum in the tree of life (Brochier et al. 2005).

Eukaryotic microbes from anoxic environments.  The most deeply divergent of known eukaryotic lineages are found in anaerobic or micro-aerobic environments. Ecologically and evolutionarily, this group of organisms is also the least known among eukaryotes. Anoxic environments have existed throughout the history of Earth. Therefore, such environments may harbour unknown diversity of eukaryotic microbes. Indeed, Dawson & Pace (2002) identified a very high eukaryotic diversity in both marine and freshwater sediments. Their analysis identified seven major phylogenetic lineages distinctly different from all known eukaryotic kingdoms such as fungi, plants and animals.

Fungi. Approximately 80 000 fungal species have been identified and named, and these species are grouped into five main phyla: Chytridiomycota, Zygomycota, Glomeromycota, Basidiomycota, and Ascomycota (Moncalvo 2005). Several recent studies of environmental DNA identified major groups of unexpected fungal diversity in a variety of environments. For example, in the analysis of fungal DNA from the roots of the grass Arrhenatherum elatius, Vandenkoornhuyse et al. (2002) found 49 unique phylotypes in a random library of 200 18S rRNA clones. Surprisingly, only 7 of the 49 were found to be closely related to known sequences (> 99% identity). They found five distinct lineages significantly different from all known fungal sequences (in a pool of over 1200 at their time of analysis). In another study by Schadt et al. (2003), culture-independent methods were used to assess the seasonal dynamics of fungal diversity in tundra soil in Colorado. Results revealed three major groups of fungi significantly different from existing classes and phyla. Their results also demonstrated that fungi account for the majority of the biomass under snow in the analysed environment (Schadt et al. 2003). Results from these and other fungal community studies suggest that there are likely over 1.5 million species of fungi in Earth's biosphere, about 20 times the number of currently named fungal species.

Viruses. Viruses are extremely abundant in natural environments. They contribute significantly to both prokaryote and eukaryote population dynamics. Current culture-independent studies identified that both DNA-based and RNA-based viruses are common in terrestrial as well as freshwater and marine environments (Edwards & Rohwer 2005). For example, in an analysis of picorna-like viruses (a group of positive-sense single-stranded RNA viruses that are major pathogens to plants and animals), Culley et al. (2003) identified high, unexpected diversity in the sea. Indeed, all of the picorna-like sequences from marine samples were different from known picorna-like viruses in the databases. Of specific note is a virus isolated in this study that is a lytic pathogen to a toxic-bloom-forming alga Heterosigma akashiwo. This result suggests that picorna-like viruses may be important contributors in the regulation of marine phytoplankton population dynamics.

DNA–DNA re-association kinetics

DNA–DNA re-association kinetics has long been used to determine the overall genomic relationships between organisms. The current operational definition of bacterial species using 70% hybridization is rooted in this kinetics. During DNA–DNA re-association, complementary single-stranded DNA molecules re-anneal to each other to form double strands, and the rate of re-annealing is positively correlated with the degree of similarity. Torsvik et al. (1998) extended this principle to analyse the complexity of environmental DNA samples. The basic idea is that more complex environmental DNA will take longer to re-anneal. The rate of re-association can be compared to that of samples of known complexity, such as the Escherichia coli genome, to derive the total genomic complexity of environmental DNA. They found that estimates of environmental genome complexity derived from DNA–DNA re-association kinetics were about 100 times higher than those derived from laboratory culture estimates (Torsvik et al. 2002). This result is similar to the comparison between phylogenetic methods based on fluorescent in situ hybridization (FISH) using signature prokaryotic sequences and culture-dependent methods (Torsvik et al. 2002). Their analyses identified that terrestrial environments generally contain higher genome complexity than aquatic sediments. Among the three terrestrial niches compared, while the number of prokaryotic cells per cubic centimetre of soil is similar among them (about 10 billion), the pristine pasture and forest soils contain over 10 times the genome complexity (equivalent to 3500–8800 E. coli genomes) of the agricultural field soils (equivalent to 140–350 E. coli genomes) (Torsvik et al. 2002). Recently, improved analytical methods showed that, in fact, more than 1 million distinct genomes might exist in the above-mentioned pristine soil, exceeding previous estimates by two orders of magnitude (Gans et al. 2005). Furthermore, it was estimated that metal pollution could reduce the genomic diversity of pristine environments by more than 99.9%, revealing the highly toxic effect of metal contamination, especially for rare microbial taxa (Gans et al. 2005).
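The logic of converting a re-association rate into a complexity estimate can be sketched numerically. The example below is an idealized illustration, not the analysis used by Torsvik et al. or Gans et al.: it assumes ideal second-order kinetics, in which the fraction of DNA remaining single-stranded is 1 / (1 + C0t / C0t½), and it assumes that C0t½ measured under identical conditions scales linearly with sequence complexity. The function names and the two C0t½ values are hypothetical placeholders, not measured data.

```python
def fraction_single_stranded(c0t, c0t_half):
    """Fraction of DNA still single-stranded under ideal second-order kinetics.

    c0t      -- product of initial DNA concentration and elapsed time
    c0t_half -- C0t value at which half of the DNA has re-annealed
    """
    return 1.0 / (1.0 + c0t / c0t_half)


def complexity_in_reference_genomes(c0t_half_sample, c0t_half_reference):
    """Sample complexity expressed in reference-genome equivalents.

    Assumes both values were measured under identical conditions, so that
    C0t(1/2) is directly proportional to sequence complexity.
    """
    return c0t_half_sample / c0t_half_reference


# Hypothetical placeholder values (arbitrary units, illustrative only).
C0T_HALF_REFERENCE = 4.0      # e.g. a single-genome reference such as E. coli
C0T_HALF_SOIL = 20000.0       # a slowly re-annealing environmental sample

equivalents = complexity_in_reference_genomes(C0T_HALF_SOIL, C0T_HALF_REFERENCE)
print(f"Estimated complexity: {equivalents:.0f} reference-genome equivalents")
print(f"Single-stranded fraction at C0t = C0t(1/2): "
      f"{fraction_single_stranded(C0T_HALF_SOIL, C0T_HALF_SOIL):.2f}")
```

Because natural communities contain genomes at very uneven abundances, a single C0t½ summarizes a broad curve; this is one reason why re-analysis with different abundance models can change the estimates substantially.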


Metagenomics

Metagenomics refers to the study of the collective genomes in an environmental community. Such a community may be a soil or a marine water sample that contains substantially more genetic information than is available in the cultured subset. Studies of metagenomes typically involve cloning fragments of DNA isolated directly from microbes in natural environments, followed by sequencing and functional analysis of the cloned fragments. While most of the techniques for metagenomics have existed for quite some time and are used routinely in molecular biology research, their application in analysing unknown environmental DNA samples has opened a floodgate of exciting research findings.

The phylogenetic analysis of environmental microbial diversity was an early form of metagenomics. Over the years, several significant trends for metagenomic studies have emerged. First, the cloned DNA fragments have been getting larger, in attempts to clone long stretches of DNA from the same genome and so allow the study of the structure and function of potentially whole unknown/uncultured genomes in the environment. Such an objective has propelled the development of new DNA isolation methods as well as improved cloning systems. At present, the bacterial artificial chromosome (BAC) vector system is the most commonly used for metagenomic studies. Second, the study sites have expanded tremendously. At present, metagenomic libraries and DNA sequence information exist for microbial communities from many of the world's ecological niches. Third, the number of sequences generated in individual studies has been increasing. For example, a recent study of a marine environment obtained over 1.6 billion base pairs of DNA sequence, of which about 1.045 billion were nonredundant (Venter et al. 2004) (Box 8). Below I will briefly review and discuss recent metagenomic studies of microbial communities from the ocean, soil, and an acid mine drainage.

Metagenomic analysis of marine microbial communities. Marine microbial communities were among the first to be investigated using culture-independent genomics approaches (Giovannoni et al. 1990). Marine microbial communities are complex and contain heterogeneous micro-organisms including viruses, bacteria, archaea, and eukaryotic micro-organisms. Because of the size differences among these groups of organisms, typical studies use filters to first select the target size category of microbes. Phylogenetic analyses have identified numerous novel DNA sequences and phylogenetic groups in all groups of organisms surveyed. In combination with other genomics tools, these studies have led to other important discoveries. Two specific studies are highlighted below.

In a classic metagenomic study of genome fragments from a BAC library of marine picoplankton, Beja et al. (2000) identified a new class of genes of the rhodopsin family, named proteorhodopsin, from an uncultivated gamma-proteobacterium, SAR86. At that time, this rhodopsin family was known to exist only in extremely halophilic (salt-loving) archaea and had never before been observed in cultured bacteria. Unlike the archaeal rhodopsin, which does not express properly in model laboratory strains, the proteorhodopsin gene from SAR86 expressed readily in the laboratory model bacterium E. coli and functioned as a light-driven proton pump. Later studies identified that this new type of light-driven energy generation process is in fact widespread in the ocean and that the absorption spectra of bacterial rhodopsins are optimized for different depths of ocean water (Beja et al. 2001). In addition to this form of light energy harvesting, the widespread importance of oxygenic phototrophy in the ocean has been confirmed by metagenomic studies, and another form of phototrophy, anoxygenic phototrophy, which was previously regarded as playing only a minor role in ocean water productivity, has also been found to be very common in ocean surface waters (Beja et al. 2002).

One of the most extensive microbial metagenomic studies in the ocean was the shotgun sequencing of micro-organisms in the size range 0.1–3.0 µm from the Sargasso Sea in the Atlantic Ocean near Bermuda (Venter et al. 2004). The study generated almost 2 million sequence reads, yielding over 1.6 billion base pairs of raw DNA sequence. Based on sequence relatedness and unique rRNA gene counts, the analysis suggested that these DNA fragments were derived from at least 1800 genomic species, including 148 previously unknown bacterial phylogenetic types. The analysis also identified spatial variation in species richness and relative abundance among the four sampled sites (Venter et al. 2004). Computational analysis of the data identified over 1.2 million potential unique protein-coding genes. This number is astonishing considering that at that time, only about 140 000 protein data entries were available in the curated SwissProt protein database. Among the 1.2 million potential protein-coding genes, at least 782 new rhodopsin-like photoreceptors were identified, confirming the importance of this type of phototrophy in the open sea. Of the specific groups of micro-organisms identified, one stood out. This organism, most likely a member of the genus Burkholderia, had 21-fold coverage and comprised 38.5% of the sequence data from one of the four samples. Burkholderia is typically found in terrestrial environmental samples, and the identification of a species in this genus in the sea at such a high frequency led the authors to suggest that terrestrial environments or coastal animals might play an important role in marine microbial community structure. However, based on several lines of evidence, DeLong (2005) recently suggested that the high abundance of Burkholderia-like sequences in one sample might be due to contamination of the original water sample in the Venter et al. (2004) study. Such a revelation suggests that extreme caution should be taken when conducting microbial metagenomic analysis. Nevertheless, the reconstruction of complete genomes based on shotgun sequencing of environmental microbial community DNA indicated the power of this approach for future microbial ecology research.

Metagenomic analysis of soil microbial communities. Though microheterogeneity in aquatic environments has been found, its complexity pales beside that of soil environments. Typical soil comprises mineral particles of different sizes and shapes, with attached organic compounds such as humus. The structural and chemical composition of a soil determines its physical–chemical properties such as water-holding capacity, surface-to-volume ratio (hence oxygen availability) within the soil, pH and the availability of various nutrients. In addition, unlike aquatic habitats, soil surfaces may undergo dramatic daily or seasonal cyclic changes in their physical–chemical properties. Such spatial and temporal environmental microheterogeneity poses significant challenges for microbial ecologists. However, recent investigations, especially those based on culture-independent approaches, are revealing the amazing diversity of micro-organisms in the soil.

Many studies of soil microbial diversity have been carried out. Based on a variety of culture-independent methods, current estimates indicate that a single gram of soil may contain over 10 billion microbial cells representing several thousand to over a million distinct genomic species (e.g. Torsvik et al. 2002; Gans et al. 2005). This number is remarkable given that the total number of known prokaryotes listed in the website of the National Center for Biotechnology Information is about 17 000 (including uncultured prokaryotes). Comparisons of culture-dependent and independent methods revealed that in most soil environmental samples, only 0.1–1% of microbial species are cultured by standard microbiological methods. Therefore, a tremendous amount of microbial genetic, physiological and metabolic diversities in the soil remain to be discovered and explored. Significant efforts are underway to clone and analyse the soil metagenome diversity. Daniel (2005) summarized the studies of soil metagenomic libraries constructed to date. These libraries include soil samples from a variety of ecological niches, including meadows, crop fields, and forests.

Functional analyses of the soil metagenome are typically conducted by one of two approaches. The first is based on nucleotide sequences using either PCR or target-specific probes to screen the soil metagenome library. This approach has been used successfully to clone genes with highly conserved domains, e.g. the gluconic acid reductase, an essential enzyme during glucose metabolism (Eschenfeldt et al. 2001). The second approach is based on functional screening for metabolic activity of metagenomic clones. Several novel genes coding for proteases, lipases, amylases, agarases, alcohol oxidoreductases, antibiotics, and antibiotic resistance have been found through this screening (Voget et al. 2003). Some of these products hold great commercial potential and are actively pursued by biotechnology companies.

Metagenomic analysis of a microbial community from an acid mine drainage. Acid mine drainages are seminatural environments rich in extremophiles. These drainages are created as a result of mining and the exposure of predominantly ferrous iron in pyrite (FeS₂) to the oxygen-rich atmosphere. Iron is one of the most abundant elements in Earth's crust and exists naturally in two oxidation states, ferrous (Fe²⁺) and ferric (Fe³⁺). In nature, these two forms cycle as a result of reduction and oxidation by micro-organisms and by abiotic geochemical processes. The reduction of Fe³⁺ to Fe²⁺ occurs in anoxic environments (e.g. bogs and waterlogged soil) by bacteria such as Shewanella putrefaciens, with organic compounds in these environments acting as the electron donor. In contrast, the oxidation occurs in oxygenic environments with O₂ as the electron acceptor. Though the energy released during oxidation is small, several groups of chemolithotrophic organisms (e.g. Acidithiobacillus ferrooxidans and Leptospirillum ferrooxidans) can actively participate in the reaction and thrive in such environments by oxidizing a large amount of ferrous iron. Because pyrite (FeS₂) is one of the most common forms of iron in nature, the oxidation of pyrite will release large amounts of sulphate (SO₄²⁻) and sulphuric acid, allowing the development of acid conditions in the surrounding environment with pH values as low as 0. Mixing of acidic mine water with natural waters in rivers and lakes causes major environmental problems.
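The acid-generating chemistry described above can be written as balanced reactions. The sketch below is an illustration added for clarity, not material from the studies cited: it encodes three standard textbook reactions for pyrite oxidation (oxidation of pyrite by O₂, oxidation of ferrous to ferric iron, and oxidation of pyrite by ferric iron) and checks that atoms and charge balance. The species table and the helper names `totals` and `is_balanced` are hypothetical.

```python
from collections import Counter

# Each species: element counts plus its net charge.
SPECIES = {
    "FeS2":   {"Fe": 1, "S": 2, "charge": 0},
    "O2":     {"O": 2, "charge": 0},
    "H2O":    {"H": 2, "O": 1, "charge": 0},
    "Fe2+":   {"Fe": 1, "charge": 2},
    "Fe3+":   {"Fe": 1, "charge": 3},
    "SO4^2-": {"S": 1, "O": 4, "charge": -2},
    "H+":     {"H": 1, "charge": 1},
}


def totals(side):
    """Sum element counts and charge over (coefficient, species) pairs."""
    out = Counter()
    for coeff, name in side:
        for key, count in SPECIES[name].items():
            out[key] += coeff * count
    return out


def is_balanced(reactants, products):
    """True if every element and the net charge match on both sides."""
    left, right = totals(reactants), totals(products)
    return all(left[k] == right[k] for k in set(left) | set(right))


REACTIONS = {
    # 2 FeS2 + 7 O2 + 2 H2O -> 2 Fe2+ + 4 SO4^2- + 4 H+
    "pyrite oxidation by O2": (
        [(2, "FeS2"), (7, "O2"), (2, "H2O")],
        [(2, "Fe2+"), (4, "SO4^2-"), (4, "H+")],
    ),
    # 4 Fe2+ + O2 + 4 H+ -> 4 Fe3+ + 2 H2O  (the step microbes accelerate)
    "ferrous iron oxidation": (
        [(4, "Fe2+"), (1, "O2"), (4, "H+")],
        [(4, "Fe3+"), (2, "H2O")],
    ),
    # FeS2 + 14 Fe3+ + 8 H2O -> 15 Fe2+ + 2 SO4^2- + 16 H+
    "pyrite oxidation by Fe3+": (
        [(1, "FeS2"), (14, "Fe3+"), (8, "H2O")],
        [(15, "Fe2+"), (2, "SO4^2-"), (16, "H+")],
    ),
}

for name, (reactants, products) in REACTIONS.items():
    print(f"{name}: balanced = {is_balanced(reactants, products)}")
```

The third reaction releases 16 protons per pyrite unit oxidized, which illustrates why regeneration of ferric iron by iron-oxidizing micro-organisms drives the pH of these drainages so low.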

The metagenomic analysis of a single biofilm sample from an acid mine drainage at the Richmond Mine at Iron Mountain, California, has provided important insights into the microbial community structure (Tyson et al. 2004). From the 78 Mb of sequence obtained from this sample, the genomes of the dominant species were reconstructed. These included the dominant bacterium Leptospirillum group II (10X coverage) and the dominant archaeon, Ferroplasma acidarmanus (also 10X coverage). Ferroplasma is a group of cell wall-less prokaryotes. These two species were also found to be dominant in this community by other analytical methods. In addition to the above two genomes, partial genomes of other organisms were also reconstructed, including that of a group III Leptospirillum (3X coverage) and an unknown species in the genus Sulfobacillus (0.5X coverage) that is closely related to the cultured Sulfobacillus thermosulfidooxidans.
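The coverage values quoted above indicate why only the dominant genomes could be reconstructed nearly completely. As a rough guide, and assuming idealized random shotgun sampling (the Lander–Waterman model, which is an assumption of this sketch and not a description of the assembly procedure of Tyson et al.), the expected fraction of a genome represented in at least one read at coverage c is 1 − e^(−c). The function names and the read and genome sizes in the usage line are hypothetical.

```python
import math


def expected_coverage(total_bases_from_genome, genome_size):
    """Average sequencing depth: bases sequenced from a genome / its length."""
    return total_bases_from_genome / genome_size


def expected_fraction_covered(coverage):
    """Expected fraction of a genome hit by at least one read.

    Idealized Poisson (Lander-Waterman) model; ignores read length effects,
    cloning bias and strain heterogeneity.
    """
    return 1.0 - math.exp(-coverage)


# Coverage values quoted in the text; labels are for display only.
for label, cov in [("Sulfobacillus sp.", 0.5),
                   ("Leptospirillum group III", 3.0),
                   ("Leptospirillum group II / Ferroplasma", 10.0)]:
    frac = expected_fraction_covered(cov)
    print(f"{label}: {cov}X coverage, expected fraction sampled = {frac:.3f}")

# Hypothetical usage: 20 Mb of reads from a 2 Mb genome gives 10X coverage.
print(expected_coverage(20_000_000, 2_000_000))
```

Under this model a genome at 0.5X coverage is expected to be less than half sampled, whereas at 10X essentially every position is expected to be sampled at least once.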

Bioinformatics analyses of the metagenome sequence data yielded several interesting results. First, the Leptospirillum group III strain was found to contain genes homologous to those for biological nitrogen fixation. This knowledge subsequently led to the design of a selective isolation strategy that allowed the isolation of this organism (Allen & Banfield 2005). Second, genes involved in essential pathways (such as nitrogen and carbon dioxide fixation and iron metabolism) in the above chemolithoautotrophs were revealed. Third, the genomic sequence data identified genetic polymorphisms for many genes and suggested evidence for genetic recombination in the Ferroplasma acidarmanus population of this community. The metagenome sequence information established a solid foundation for fine-scale comparisons of microbial communities. In addition, a recent proteomic analysis of this community identified an abundant novel protein, a cytochrome, as an essential component of iron oxidation and acid mine drainage formation (Ram et al. 2005). These results have the potential to guide the remediation of sites contaminated by acid mine drainages.


Micro-array technology is a powerful, high throughput experimental system that allows the simultaneous analysis of thousands to hundreds of thousands of genes. Originally developed for monitoring whole-genome gene expression, micro-arrays have been used for other purposes such as genome-wide mutational screening for single nucleotide polymorphisms and determining the distributions of species and strains in natural microbial communities. Recently, several types of micro-arrays have been developed and evaluated for bacterial detection and microbial community analysis. These arrays include (i) phylogenetic oligonucleotide arrays that contain signature sequences from rRNA of specific groups of organisms; (ii) community genome arrays that contain highly specific signature gene sequences from known cultured microbial species; and (iii) functional gene arrays that contain conserved domains of genes involved in specific metabolic pathways such as the biogeochemical cycling of carbon, nitrogen, sulphate, phosphate and metals (Zhou 2003). The number of genes and the sizes of arrayed DNA fragments in the functional gene arrays can vary according to analytical purposes.

Preliminary evaluations suggest that micro-arrays have great potential for the detection, identification and characterization of micro-organisms in natural habitats (Wu et al. 2004). For example, Loy et al. (2002) constructed a micro-array with 132 16S rRNA-targeted oligonucleotide probes (18 nucleotides long) representing all recognized groups of sulphate-reducing prokaryotes and showed that this micro-array could be used to distinguish most of the reference strains. Using this array, they determined the diversity of sulphate-reducing prokaryotes in periodontal tooth pockets and a hypersaline cyanobacterial mat. Results from the micro-array study were similar to those from cloning and sequencing of environmental 16S rRNA. These analyses have recently been extended to other groups of organisms such as the Rhodocyclales in the beta-proteobacteria (Loy et al. 2005) and Enterococcus species (Lehner et al. 2005).

Despite these successes, significant challenges remain with regard to the specificity, sensitivity and quantification of microbes in natural habitats. This is mainly because microbial communities contain highly heterogeneous groups of organisms with undefined or unknown genomic relationships. The highly skewed distribution of microbial species, the potential for cross-hybridization between closely related species, the genetic variation among strains within species, and the differing efficiencies of DNA isolation from different species can all bias the results and influence the interpretation of the data. Further evaluations are needed to understand the specific experimental conditions appropriate for the analyses of various environmental samples using the different types of micro-arrays.
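The specificity problem can be illustrated with a minimal sketch in Python. It is not the probe design procedure of Loy et al. (2002); the 18-nucleotide site, the two target sequences, and the function names are all hypothetical and serve only to show how a single mismatch separates a perfect target from a close relative, and how relaxing the stringency permits cross-hybridization.

```python
# Illustrative sketch only: the sequences and function names are hypothetical,
# not taken from any published probe set.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return dna.translate(COMPLEMENT)[::-1]


def best_match(probe: str, target: str) -> tuple[int, int]:
    """Slide the probe's binding site along the target (ungapped) and
    return (position, mismatches) for the window with the fewest mismatches."""
    site = reverse_complement(probe)  # the sequence the probe would bind
    best = (-1, len(site) + 1)
    for i in range(len(target) - len(site) + 1):
        window = target[i:i + len(site)]
        mismatches = sum(a != b for a, b in zip(site, window))
        if mismatches < best[1]:
            best = (i, mismatches)
    return best


def hybridizes(probe: str, target: str, max_mismatches: int = 0) -> bool:
    """Crude stand-in for hybridization stringency: accept a target if its
    best window has at most max_mismatches mismatches."""
    return best_match(probe, target)[1] <= max_mismatches


# A hypothetical 18-nucleotide signature site and the probe designed against it.
site = "ACGTTGCAAGCTTACGGA"
probe = reverse_complement(site)

# Target A carries the exact site; target B (a close relative) differs at one base.
target_a = "GGCC" + site + "TTAA"
target_b = "GGCC" + site[:9] + "T" + site[10:] + "TTAA"

print(best_match(probe, target_a))
print(hybridizes(probe, target_b, max_mismatches=0))
print(hybridizes(probe, target_b, max_mismatches=1))
```

Real hybridization depends on melting temperature, mismatch position and wash conditions, none of which this mismatch count captures; the sketch only shows why closely related species are difficult to separate on an array.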

Inferences of microbial diversity and activity from completed microbial genome sequences

Micro-organisms were the first organisms to be completely sequenced and remain the most numerous among completed genomes. While most of the original objectives for microbial genome sequencing were guided by practical applications, such as understanding the disease progression mechanisms of human pathogens and the potential generation of useful products and services from these microbes, the sequencing efforts have also revealed much about the ecological roles of microbes in their natural environments and about the potential genomic diversity within and between species. Currently, the sequenced microbial genomes are highly biased towards pathogens of plants, animals and humans. There are many detailed comparisons and reviews of these microbial genomes (e.g. Fraser et al. 2004). In the following paragraphs, I briefly summarize several important features of the relationship among microbial genome size, gene content and ecology (Box 9).

Box 9. The 250 prokaryotic genomes published as of September 2005 suggest several general features of these genomes relevant to microbial ecology:
1. Prokaryotic genomes are highly variable in genome size and gene content among strains from both within and between species.
2. Microbial species with narrow ecological niches generally have smaller genomes than those with broader ecological niches.
3. A large fraction (20–40%) of identified open reading frames in sequenced microbial genomes code for proteins with unknown functions. Most of these genes are likely regulated by ecological-niche specific factors.

First, microbial genome sequence comparisons have revealed that prokaryotic genomes are highly variable in both genome size and gene content (Table 1). Among the 250 unique, completely sequenced and annotated prokaryotic genomes (four strains were sequenced twice, for a total of 254 completed genomes as of August 2005), genome sizes vary by over 18-fold, from the smallest, that of the archaeon Nanoarchaeum equitans (0.49 Mb, Waters et al. 2003), to the largest, that of Streptomyces avermitilis (9.12 Mb, Omura et al. 2001). Genome sizes vary not only among species but also among strains within individual species. An example is the common Escherichia coli, for which whole genome sequences of four strains are now available: the model laboratory strain K12, the enterohemorrhagic O157:H7 RIMD and O157:H7 EDL933, and the uropathogenic CFT073 (Parkhill & Thomson 2004). While all three pathogenic strains have genomes essentially colinear with each other and with the nonpathogenic K12, both genome size and gene content vary considerably among the four strains. For example, the two pathogenic O157:H7 strains have genomes over 5.5 Mb, almost 1 Mb bigger than that of strain K12 (4.6 Mb) and about 300 kb bigger than that of strain CFT073 (5.2 Mb). Furthermore, about 25% of the genes in the pathogenic O157:H7 strains were not found in strain K12. When all four strains are considered, only about 3000 genes are shared among the 4288, 5349, 5361 and 5379 predicted protein-coding genes of strains K12, O157:H7 RIMD, O157:H7 EDL933 and CFT073, respectively. Most of the strain-specific genes have unusual sequence characteristics and were likely obtained through horizontal gene transfer from external sources and by the action of mobile genetic elements. Some of these genes play important roles in ecological adaptation, including adhesion to specific host cell types. Comparisons between strains of other human pathogenic bacteria (e.g. Streptococcus pneumoniae and Burkholderia cepacia) as well as of the nonpathogenic plant symbiont Si. meliloti revealed similarly high variability in genome size and gene content (Fraser et al. 2004; Guo et al. 2005; Sun S., unpublished). At present, population-level studies of variation in genome size and gene content are still largely limited to human pathogens.
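The arithmetic behind these comparisons can be checked with a short Python sketch. It uses only the figures quoted above; the variable names are illustrative, and the shared-gene count is the approximate value ("about 3000") given in the text, so the resulting percentages are rough.

```python
# Figures quoted in the text (genome sizes in Mb).
smallest_mb = 0.49   # Nanoarchaeum equitans
largest_mb = 9.12    # Streptomyces avermitilis

# Fold range in genome size across the sequenced prokaryotes.
fold_range = largest_mb / smallest_mb
print(f"Genome size range: {fold_range:.1f}-fold")

# Predicted protein-coding genes per E. coli strain, in the order given in the text.
genes = {
    "K12": 4288,
    "O157:H7 RIMD": 5349,
    "O157:H7 EDL933": 5361,
    "CFT073": 5379,
}
shared = 3000  # approximate number of genes common to all four strains

# Fraction of each strain's gene complement that belongs to the shared set.
shared_fraction = {strain: shared / n for strain, n in genes.items()}
for strain, frac in shared_fraction.items():
    print(f"{strain}: {frac:.0%} shared, {1 - frac:.0%} not shared by all four")
```

On these numbers, the shared set makes up roughly 70% of the K12 gene complement but only about 56% of each pathogenic strain's complement, which is one way of expressing how much of the pathogens' gene content is strain specific.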

Second, species with narrow ecological niches (e.g. obligate human pathogens) on average have smaller genomes than those capable of living in diverse ecological conditions (Table 1). For example, the obligate intracellular pathogen Mycoplasma genitalium has a genome size of 580 kb (encoding 484 genes) and the aphid endosymbiont Buchnera aphidicola has a genome size of 650 kb (504 genes). These genomes lack many of the genes essential for metabolic functions in free-living organisms. The deletion and degeneration of such genes were likely due to their functions being nonessential in obligate parasites, because the hosts can provide the corresponding resources to the cells. Indeed, in several obligate intracellular parasites such as Rickettsia prowazekii and Rickettsia conorii, there is evidence that the genomes are in the process of deteriorating and shrinking (Andersson 2004). Though the 250 sequenced prokaryotic genomes may not be representative of the community genomes in various natural environments, there appears to be a correlation between genome size and habitat. Among the six groups of prokaryotes classified based on habitats, those from terrestrial environments have, on average, the largest genomes, followed by prokaryotes that live in multiple habitats, in aquatic environments, and in specialized environments (Table 1). Some of the largest bacterial genomes are found in species with complex lifestyles such as the social bacterium Myxococcus xanthus (> 10 Mb), the facultative nitrogen-fixing plant symbiont Bradyrhizobium japonicum (> 9 Mb), and the antibiotic-producing, free-living soil bacteria Streptomyces (> 9 Mb).
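A small Python calculation, using only genome sizes and gene counts quoted in this review, suggests why genome size and gene number can be read almost interchangeably here. The variable names are illustrative, and the E. coli K12 figures are those given in the strain comparison earlier in this section.

```python
# (genome size in kb, predicted genes), as quoted in this review.
genomes = {
    "Mycoplasma genitalium": (580, 484),
    "Buchnera aphidicola": (650, 504),
    "Escherichia coli K12": (4600, 4288),
}

# Average amount of genome per gene, in kb.
kb_per_gene = {name: size / n for name, (size, n) in genomes.items()}
for name, value in kb_per_gene.items():
    print(f"{name}: {value:.2f} kb per gene")
```

For these three genomes the ratio stays close to one gene per kilobase despite an eight-fold difference in genome size, so the small genomes of the two host-restricted species reflect fewer genes rather than more compact ones.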

Third, in almost all microbial genomes sequenced, a significant percentage (20–40%) of the putative open reading frames show no obvious homology to any known proteins or to any sequences in the database, including those from other micro-organisms and macro-organisms. While one reason for this high percentage of unknown open reading frames is our limited knowledge about the microbial world (e.g. the limited number of genomes that have been sequenced and limited knowledge about the functional properties of these sequenced genomes even in standard laboratory conditions), the ubiquitous distribution of such unknown sequences suggests their potential importance in natural environments. Indeed, a transcriptome analysis of the radiation-resistant Deinococcus radiodurans revealed that about 48% of the poorly characterized or uncharacterized genes were highly expressed in at least one experimental condition (Liu et al. 2003). Systematic investigations into the potential roles of this group of genes are now underway in the nitrogen-fixing bacterium Si. meliloti using high throughput gene knockouts, systematic screening of these mutants across hundreds of growth conditions, and genome-wide transcriptome and metabolome analyses (Finan T. et al., personal communication).

Conclusions and perspectives

With the development and application of genomics tools, microbial ecology is undergoing a renaissance. Genomics tools have allowed us unprecedented access to natural microbial diversity and to the potential activities of natural microbial communities. However, genomics tools have also exposed how little we know about the vast diversity of micro-organisms colonizing and transforming our planet. Indeed, many fundamental questions remain to be addressed. For example, how many microbial species are there on Earth? How many unknown metabolic pathways are there in the microbial world? What is the relationship between microbial diversity and microbial activity in natural environments? Do laboratory analyses of microbial activity reflect those in natural environments? And how can microbial ecological data gained through genomic analysis best be used in practical applications such as mining, environmental remediation, the control of infectious diseases, the modulation of the global climate, and the production of biotechnology goods and services?

To address these questions, an interdisciplinary systems approach is needed. This approach requires the integration of analyses at various levels of ecological organization, from subcellular and cellular levels to those of individuals, populations, communities and ecosystems. The approach also requires the development and complementary analysis of biological variation at the genome, transcriptome, proteome and metabolome levels. Indeed, the American Society for Microbiology has issued a call to create systems microbiology and systems microbial ecology to coordinate such efforts and to make this a priority area for future development (Buckley 2005). There is no doubt that such coordinated efforts will lead to many exciting new discoveries.


  • Box 2

    The current species concepts for prokaryotes (bacteria and archaea) and eukaryotes (plants, animals and eukaryotic microbes such as fungi, protozoa and algae) are not comparable.

  • Box 5

    Genomic studies suggest all microbial populations have a clonal component. However, signatures of recombination are pervasive in natural populations of viruses, bacteria, fungi, algae and protozoa. Despite significant efforts, no ancient asexual microbes have been convincingly demonstrated.

  • Box 6

    Representational difference analysis (RDA) and micro-array technology are powerful methods for discovering whole genome differences among natural prokaryotic strains.

  • Box 7

    Genomic analyses of natural microbial communities are revealing extremely rich and highly variable DNA sequences from forest soils, pastures and aquatic environments, both pristine and contaminated. Bioinformatic analyses of such sequences suggest the existence of many uncultured taxonomic groups of viruses, bacteria, archaea, fungi and protozoa.

  • Box 8

    Metagenomic studies have identified many novel microbial genes coding for metabolic pathways, such as those for energy acquisition and for carbon and nitrogen metabolism, in natural environments that were previously considered to lack such metabolisms.


I thank Dr Hong Guo for preparing Figs 3 and 4 and Dr Turlough M. Finan for comments on the manuscript. During the preparation of this review, research in my lab was supported by the Natural Sciences and Engineering Research Council (NSERC) of Canada, the Ontario Premier's Research Excellence Award, and Genome Canada.

J-P Xu's general research interests are in microbial ecology and evolutionary genetics. The current research in his laboratory attempts to understand the origins and maintenance of genetic variation in micro-organisms. His research group examines both natural microbial populations from environmental and clinical sources and experimental populations evolved in the laboratory. Specifically, by using microbiological, molecular and computational tools, their research seeks to determine the rate and effect of spontaneous mutations on microbial life history traits; the rate and route of spread of microbes in natural environments and human populations; the origins of novel strains and species; and the origin and evolution of sex.