Diversity can be described in two ways, as species diversity or as biodiversity. Species diversity consists of three parameters: species richness (the number of species), species evenness (the significance of species in terms of abundance, biomass or activity) and species difference (the taxonomic relatedness). These parameters can be used to calculate diversity indices; however, such indices are not routinely applied to microbes. Biodiversity might be described as the total (structural and functional) diversity of life (Wilson, 1992). In other terms, diversity is the amount and distribution of (genetically encoded) information or the ‘dictionary of living nature’ (Margalef, 1997). Microbial biodiversity is even less well known than species diversity; however, metagenomics (community genome sequence analysis) provides us with a tool to start tackling this Herculean task.
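As an illustration of how richness and evenness combine into a diversity index, the following minimal sketch computes the Shannon index and Pielou's evenness from a list of abundances. The choice of these two indices and the example counts are assumptions made for illustration; they are not taken from the studies cited here, and species difference (taxonomic relatedness) is not covered.

```python
import math


def shannon_index(abundances):
    """Shannon index H' = -sum(p_i * ln p_i) over taxa with non-zero abundance."""
    total = sum(abundances)
    if total <= 0:
        raise ValueError("abundances must sum to a positive value")
    return -sum((n / total) * math.log(n / total) for n in abundances if n > 0)


def pielou_evenness(abundances):
    """Pielou's evenness J' = H' / ln(S), where S is the observed richness."""
    richness = sum(1 for n in abundances if n > 0)
    if richness < 2:
        # Evenness is undefined for a single taxon; 0.0 is a convention chosen here.
        return 0.0
    return shannon_index(abundances) / math.log(richness)


# Hypothetical counts for four taxa in two samples of equal richness.
even_sample = [25, 25, 25, 25]
skewed_sample = [85, 5, 5, 5]

for name, counts in (("even", even_sample), ("skewed", skewed_sample)):
    print(name, round(shannon_index(counts), 3), round(pielou_evenness(counts), 3))
```

Both hypothetical samples have the same richness (four taxa), yet the skewed sample yields a lower index, which shows why richness alone does not describe species diversity.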
Prokaryotic classification and the prokaryotic species concept are controversial topics. This is partially because of technological progress, which results in a reconsideration of such concepts (Rossello-Mora and Amann, 2001), but it has also been argued that a more ‘natural view’ of microorganisms should be adopted regarding species in their ecological and evolutionary context (Ward, 1998). DNA–DNA hybridization is a crucial parameter for a species definition of prokaryotic isolates; however, it is not sufficient on its own to define a ‘species’ in a strict taxonomic sense. A 70% DNA–DNA hybridization, which corresponds to ≈ 96% sequence similarity, is typically used as delineation between species (Stackebrandt and Goebel, 1994). 16S ribosomal RNA gene analysis is the most widely used method of estimating diversity in environmental samples. This is based on the observation that a threshold of 97% similarity of the entire 16S rRNA gene often matches the concept of 70% DNA–DNA hybridization (Stackebrandt and Goebel, 1994). However, growing evidence suggests that this threshold might be higher (Rossello-Mora and Amann, 2001). Assessing species richness based on 16S rRNA gene sequences is sometimes regarded as a futile enterprise because of selective amplification of sequences and the presence of multiple operons, which can differ with respect to their sequence (Wintzingerode et al., 1997; Dahllof et al., 2000). Also, rare species are probably not detected by the methods used. Moreover, it has been argued that sequence differences might not always correspond to different species and that the phylogeny of a gene, such as the 16S rRNA gene, might not match the phylogeny of the genome or cell line of a species. For example, data on lateral gene transfer, such as that mediated by viruses, have even raised questions about the validity of the universal tree of life.
Lateral gene transfer (LGT) may ‘shake the tree of life’ (Pennisi, 1998), as swapping of genes among species could make a universal classification based on phylogenetic reconstructions impossible (Doolittle, 1999; Martin, 1999). Finally, 16S rRNA gene sequences alone are not a valid tool for bacterial species description sensu stricto (Rossello-Mora and Amann, 2001). However, 16S rRNA gene approaches may serve as a pragmatic way of determining the number of phylotypes in the environment. Indeed, it has been argued that the methods that microbial ecologists are applying to assess species diversity are often roughly as precise as methods used by ecologists of eukaryotes (Hughes et al., 2001). Thus, the phylogeny of a conserved gene (or several conserved genes) may be used for reconstructing phylogenetic relatedness, and the phylogeny of non-conserved genes may reflect evolutionary plasticity and adaptation. Analysing both types of genes then becomes useful in tackling species diversity and biodiversity of prokaryotes.
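The pragmatic counting of phylotypes with a 97% similarity threshold can be sketched as follows. This is a toy illustration under stated assumptions, not the procedure of any cited study: the sequences are assumed to be pre-aligned to equal length, the greedy clustering rule and the function names are hypothetical, and real 16S rRNA gene analyses rely on full alignments and dedicated software.

```python
def percent_identity(seq_a, seq_b):
    """Percent identity of two pre-aligned, equal-length sequences.

    Columns in which both sequences carry a gap ('-') are ignored.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = matches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" and b == "-":
            continue
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def count_phylotypes(aligned_seqs, threshold=97.0):
    """Greedy clustering: a sequence founds a new phylotype unless it matches
    an existing representative at or above the threshold."""
    representatives = []
    for seq in aligned_seqs:
        if not any(percent_identity(seq, rep) >= threshold for rep in representatives):
            representatives.append(seq)
    return len(representatives)


# Hypothetical 100-nt sequences: 'close' differs from 'base' at 2 positions,
# 'distant' at 5 positions.
base = "ACGT" * 25
close = "CA" + base[2:]
distant = "CATGC" + base[5:]

print(count_phylotypes([base, close, distant]))
```

Raising the `threshold` argument, as the evidence mentioned above might suggest, splits the same set of sequences into more phylotypes, so any richness estimate of this kind depends directly on the chosen cut-off.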
Culture-independent 16S rRNA gene and other DNA-based methods have provided us with an estimation of the number of ‘types’ of prokaryotes in various environments, which will be used as an indication of prokaryotic species richness in this review. Using various ways to estimate richness, between a few and 160 bacterial species can be expected in a bulk water sample of several to tens of litres from marine or freshwater pelagic systems (Table 1). This concurs well with a maximum possible richness of 163 species per water sample estimated on the basis of log-normal species abundance curves (Curtis et al., 2002).
Attempts have been made to estimate prokaryotic species richness in parts of the ocean or in the entire ocean. Using entire 16S rRNA gene data from cultured and uncultured sequences found in GenBank and a species threshold of 97% sequence similarity, it was suggested that there are about 1200 species in the surface ocean (Hagström et al., 2002). If this number holds roughly true, we have not detected the tip but the bulk of the iceberg of species richness in marine surface waters. However, using log-normal species abundance curves, the maximum possible species richness in the entire ocean should be < 2 × 10⁶ (Curtis et al., 2002). Archaea, which are as abundant as Bacteria in the deep ocean and may account for about one-third of the pelagic prokaryotes in the world's oceans (Karner et al., 2001), cannot be responsible for this difference, as their maximum richness was estimated to be only 20 000 species (Curtis et al., 2002). Although such data provide first insights into prokaryotic diversity, assessing the ‘true’ species richness in single samples and in ecosystems remains a challenge for future research.
Genome complexity of prokaryotes in pelagic systems is basically unknown. Torsvik et al. (2002) have estimated the genome complexity of communities in marine sediments based on genome equivalents relative to the Escherichia coli genome and assessing bacterial species by reassociation kinetics of DNA. Genome complexity was 4.8 × 10¹⁰ bp in a pristine sediment and 2 × 10⁸ bp in marine fish farm sediment. The genome complexity of microbes in the fish farm sediment is probably similar to pelagic systems, as about 50 genome equivalents were present, which falls within the range of species richness of marine bacterioplankton (Table 1). Assuming that there are a maximum of 163 ‘species’ or genome equivalents in a sample (Curtis et al., 2002) and using the genome size of E. coli, maximum genome complexity would be 6.7 × 10⁸ bp in a given water sample. However, as the genome of an average marine bacterium might be smaller than that of E. coli (Button et al., 1998; Rappé et al., 2002), this might be a slight overestimation. Using the estimates of species richness outlined above, the genome complexity in the surface ocean would be at least 4.9 × 10⁹ bp, and genome complexity in the entire ocean should be < 8.2 × 10¹² bp. These estimates are uncertain because of the unknown number of species and the unknown genetic variability within and between species. Although genetic variability within species can be large, it should be smaller than among species. If not, concepts such as DNA similarity, e.g. a 70% DNA–DNA hybridization, would make no sense as crucial parameters for species descriptions. Despite these uncertainties, such estimates may describe the possible range of prokaryotic genome complexity in the ocean.
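The arithmetic behind these estimates is a simple product of species richness and genome size, as the following sketch shows. The E. coli genome size of 4.1 × 10⁶ bp is an assumption: it is not stated in the text but is back-calculated from the quoted 6.7 × 10⁸ bp for 163 species. The function names are hypothetical.

```python
# Assumption: E. coli genome size back-calculated from 6.7e8 bp / 163 species.
E_COLI_GENOME_BP = 4.1e6


def genome_complexity(n_species, genome_size_bp=E_COLI_GENOME_BP):
    """Community genome complexity (bp) as richness times mean genome size."""
    return n_species * genome_size_bp


def genome_equivalents(complexity_bp, genome_size_bp=E_COLI_GENOME_BP):
    """Number of E. coli-sized genomes corresponding to a given complexity."""
    return complexity_bp / genome_size_bp


# Species richness values taken from the estimates cited in the text.
estimates = {
    "single water sample (163 species)": genome_complexity(163),
    "surface ocean (1200 species)": genome_complexity(1200),
    "entire ocean (2e6 species)": genome_complexity(2e6),
}

for label, complexity_bp in estimates.items():
    print(f"{label}: {complexity_bp:.1e} bp")

# Fish farm sediment: complexity of 2e8 bp expressed as genome equivalents.
print(f"fish farm sediment: {genome_equivalents(2e8):.0f} genome equivalents")
```

Passing a smaller `genome_size_bp` lowers every estimate proportionally, which is the sense in which the E. coli-based values may overestimate the complexity of marine bacterioplankton.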
Viral diversity and diversification
The species concept has been applied to viruses by the International Committee on the Taxonomy of Viruses (ICTV) (Regenmortel, 1992). Recently, Rohwer and Edwards (2002) have proposed a genome-based taxonomy for phages based on the overall sequence similarity of predicted protein sequences of 105 completely sequenced phages. This taxonomy is roughly similar to the one proposed by the ICTV, although there are also some differences. It has also been argued that a definition of viral species is not meaningful in the presence of excessive gene transfer (Lawrence et al., 2002).
Assessing viral diversity in the environment is difficult. This is due not only to conceptual difficulties in defining a viral species but also to the great plate count anomaly, as hosts have to be cultured before phages can be isolated. Moreover, there is no common molecule for viruses, not even for the tailed dsDNA phages, which could be used in an analogous way to the 16S rRNA gene for cellular microorganisms (Hendrix et al., 1999). However, it might be possible to use conserved or ‘core’ genes as genetic markers for different viral groups. Sequence differences of such genes would then correspond to genotypes, and the number of genotypes can be used as an estimation of viral species richness. Genetic marker molecules have been found in several viral groups. For example, primers are available for algal viruses (Phycodnaviridae), targeted against the DNA polymerase gene (Chen and Suttle, 1995; Short and Suttle, 2002), and for cyanophages, targeted against the gene of the g20 capsid protein (Fuller et al., 1998; Wilson et al., 1999; Zhong et al., 2002), although there is some evidence that the cyanophage primers also cover some other phages. Techniques such as denaturing gradient gel electrophoresis (DGGE) or clone libraries were used to assess genotype richness of cyanophages and algal viruses (Short and Suttle, 2002; Zhong et al., 2002). Also, a community approach has been developed based on the separation by size of viral genomes using pulsed-field gel electrophoresis (PFGE; Wommack et al., 1999a,b). The number of viral genotypes, estimated by the number of genome size classes, is conservative because more than one genotype might be hidden in a single band.
Assuming for the sake of argument that every prokaryotic species typically has at least one specific virus, we can make a rough estimation of viral diversity. Using this line of argument, species richness of viruses should be at least as high as the species diversity of microbes. Using various approaches, many of them only targeting algal or cyanophage diversity, up to 36 different viral types were found in water samples (litres to tens of litres) from marine systems (Table 2). Recently, assessing the metagenome of two natural viral communities from 200 l of coastal waters using shotgun cloning, Breitbart et al. (2002) have estimated between 374 and 7114 viral genotypes in the samples. This number is higher than that for bacterial species, thus supporting the idea that viral diversity is enormous and higher than that of cellular microbes. Curiously, the diversity of phages, as assessed by the metagenomics approach, is about an order of magnitude higher than the diversity of microbes (Tables 1 and 2), concurring with a viruses-to-bacteria ratio of 10 typically found in coastal surface waters (Wommack and Colwell, 2000). This would indicate that there are on average about 10 specific viruses per bacterial species, which corresponds well with data from isolated bacteriophages (Weinbauer, 2003). At a viral abundance of 10⁷ viruses ml⁻¹ for coastal systems, the average abundance of single viral genotypes would be ≈ 10⁴ ml⁻¹. Using an average genome size of 5 × 10⁴ bp for a marine virus, as determined by PFGE of viral communities (Steward and Azam, 2000) and the species richness data from the metagenomics approach, a genome complexity for viruses between 1.5 × 10⁷ bp and 3.5 × 10⁸ bp per sample can be calculated. These numbers are close to the maximum estimated genome complexity of bacterioplankton per sample. Thus, there seems to be a comparable genome complexity of viruses and prokaryotes in a system.
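The same kind of back-of-the-envelope calculation can be written out for viruses. The sketch below uses only the input values quoted in the text; the function names are hypothetical. With these rounded inputs the simple products come out at the same order of magnitude as the quoted figures rather than reproducing them exactly, presumably because the original calculation used slightly different or unrounded inputs.

```python
# Input values quoted in the text.
VIRAL_ABUNDANCE_PER_ML = 1e7     # coastal systems
MEAN_VIRAL_GENOME_BP = 5e4       # PFGE-based average (Steward and Azam, 2000)
GENOTYPE_RANGE = (374, 7114)     # metagenomics estimates (Breitbart et al., 2002)


def mean_abundance_per_genotype(n_genotypes, total_per_ml=VIRAL_ABUNDANCE_PER_ML):
    """Average abundance (ml^-1) of one genotype if all were equally abundant."""
    return total_per_ml / n_genotypes


def viral_genome_complexity(n_genotypes, genome_size_bp=MEAN_VIRAL_GENOME_BP):
    """Viral community genome complexity (bp) as genotypes times mean genome size."""
    return n_genotypes * genome_size_bp


for n in GENOTYPE_RANGE:
    print(
        f"{n} genotypes: "
        f"{mean_abundance_per_genotype(n):.1e} per ml per genotype, "
        f"complexity {viral_genome_complexity(n):.1e} bp"
    )
```

The equal-abundance assumption in `mean_abundance_per_genotype` is a simplification made here for illustration; natural communities are unlikely to be perfectly even.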
Viruses are a source of diversity and harbour specific genes. These particular genes include structural genes (e.g. encoding capsid and tail fibre proteins), genes encoding insertion sites into the host genome or enzymes to lyse host cells. Other unique genes enable a lytic phage to infect a lysogenized host. Viruses can also carry genes for repairing DNA damage. For example, homologues of the DenV gene originally found in the T4 phage were also detected in Chlorella viruses, i.e. viruses infecting a symbiotic alga (Furuta et al., 1997). These genes have no resemblance to genes of cellular organisms. Whole-genome sequencing of marine viral isolates and metagenomics of viral communities also suggest a predominance of genes specific for viruses (Rohwer et al., 2000; Breitbart et al., 2002; Chen and Lu, 2002; Paul et al., 2002). In general, databank analysis of genes has shown that phage and viral genes were not stolen from hosts, but ‘. . . the large majority of phage and viral genes are unique to various families of viruses, not hosts . . .’ (Villarreal, 2001).
The genomes of the Caudovirales seem to be composed of modules such as head assembly, tail assembly or lysogeny and lysis cassettes, which consist of genes belonging together functionally (Hendrix et al., 1999; 2000; Lawrence et al., 2002). This mosaic organization of the viral genome favours the exchange of functional genetic units between phages. Such events will occur most frequently during a co-infection with two or more phages or when a phage infects a cell containing a prophage (Moineau et al., 1994; 1995). Thus, cells might be ‘phage factories’, which release a wide variety of recombinant phages in the environment (Ohnishi et al., 2001), and these chimeric phages might finally result in new viral species.