Estimation of metagenome size and structure in an experimental soil microbiota from low coverage next-generation sequence data


  • T. Frisli,

    1. Faculty of Education and Science, Hedmark University College, Hamar, Norway
    2. Centre for Evolutionary and Ecological Synthesis (CEES), Department of Biology, University of Oslo, Blindern, Oslo, Norway
  • T.H.A. Haverkamp,

    1. Centre for Evolutionary and Ecological Synthesis (CEES), Department of Biology, University of Oslo, Blindern, Oslo, Norway
    2. Microbial Evolution Research Group, MERG, Department of Biology, University of Oslo, Blindern, Oslo, Norway
  • K.S. Jakobsen,

    1. Centre for Evolutionary and Ecological Synthesis (CEES), Department of Biology, University of Oslo, Blindern, Oslo, Norway
    2. Microbial Evolution Research Group, MERG, Department of Biology, University of Oslo, Blindern, Oslo, Norway
  • N.Chr. Stenseth,

    1. Centre for Evolutionary and Ecological Synthesis (CEES), Department of Biology, University of Oslo, Blindern, Oslo, Norway
  • K. Rudi

    Corresponding author
    1. Department of Chemistry, Biotechnology and Food Science, Norwegian University of Life Sciences, Ås, Norway
    • Faculty of Education and Science, Hedmark University College, Hamar, Norway


Knut Rudi, PO Box 4010, 2306 Hamar, Norway. E-mail:



A major challenge in metagenome studies is to estimate the true size of all combined genomes. Here, we present a novel approach to estimate the size of all combined genomes for low coverage next-generation sequencing (NGS) data through empirically determined copy numbers of random DNA fragments.

Methods and Results

Size estimates were based on analyses of two experimental soil micro-ecosystems simulating soil with and without earthworms. Our analyses showed combined genome sizes of about log 11 nucleotides (roughly 10¹¹ nt) for each of the soil micro-ecosystems, as estimated from qPCR-determined copy numbers of random DNA fragments. This corresponds to more than 20 000 unique bacterial genomes in each sample. There seemed, however, to be a bacterial subpopulation in the earthworm soil that was not present in the nonearthworm soil. To describe the structure of the metagenomes, both total DNA and amplified 16S rRNA gene sequence libraries were generated with 454-sequencing. Bioinformatic analysis of the 454 sequence libraries showed a large functional but low taxonomic overlap between the samples with and without earthworms. A neutrality test indicated that rare species have a competitive advantage over abundant species in both micro-ecosystems, providing a potential explanation for the large metagenome sizes.


We have shown that the soil metagenome is very large and that the large size is probably a consequence of top-down selection of the dominant bacterial species.

Significance and Impact of the Study

Estimates of metagenome size from low coverage NGS data will be important for guiding future NGS set-ups.


The measurement of microbial diversity and the estimation of the sequencing depth of complex samples still represent tremendous challenges. It is well-established knowledge that microbial diversity by far exceeds that of larger animals and plants and that only a limited number of environmental micro-organisms can be grown in culture (Amann et al. 1995). This has until recently challenged the possibility to describe the true diversity of micro-organisms in an environmental sample (Venter 2004). Even though there has been huge progress in the field of metagenomics, our understanding of the microbial diversity in complex ecosystems today still remains limited.

Next-generation sequencing (NGS) has tremendously increased the throughput of DNA sequencing and improved the description of microbial diversity through random shotgun sequencing of environmental DNA or through in-depth targeted sequencing of amplified products like the 16S rRNA gene (Rothberg and Leamon 2008). One of the main challenges in performing NGS analyses of the microbiota in complex samples is to estimate the expected number of unique sequences and taxa. The main approaches until now for estimating metagenome diversity have been based on extrapolation (Raes et al. 2007). However, for most metagenomes, the coverage is still low, which makes it difficult to obtain reasonable diversity estimates by extrapolation.

The microbiotic systems in soil are still not well understood, despite their essential role in biogeochemical cycling on earth. Earlier estimates suggest that 1 g of soil contains about 10³–10⁷ microbial species (Torsvik et al. 1990, 1998; Ovreas and Torsvik 1998; Ovreas et al. 1998; Gans et al. 2005), and the general belief is that <1% of the bacteria from soil have been cultured (Torsvik et al. 1990, 2002; Torsvik and Ovreas 2002; Schloss and Handelsman 2003; Curtis and Sloan 2004). This makes soil a particularly interesting system for studying microbial diversity. An international initiative on characterizing the soil metagenome diversity has for this reason recently been initiated (Vogel et al. 2009).

The aim of the work presented here was to estimate, from low coverage NGS data, the size (total length of unique sequences) and structure of two experimental soil micro-ecosystems simulating soil with and without earthworms. This was performed using a novel approach where the total metagenome size was estimated based on direct measurements of random DNA fragments, in combination with state-of-the-art standard metagenome and 16S rRNA gene analytical approaches. The main rationale for estimates based on low coverage NGS data is to provide data for the design of large-scale projects, so that they are not underpowered by having too few sequence reads to uncover the true diversity.

Materials and methods

An overview of the analytical strategy is presented in Fig. 1.

Figure 1.

An overview of the workflow from the two micro-ecosystems representing soil with and without earthworms. DNA was extracted from the start and the endpoint of a time-series experiment, and both 454 random shotgun sequencing and in-depth targeted sequencing of the V3 and V4 regions of the 16S rRNA gene were chosen as approaches to sequence DNA from the endpoint DNA samples. The DNA sequence information from the sequencing was quality checked in Prinseq and filtered using Mothur and CD-hit before the data were analysed using BlastN and BlastX, Megan, MG-Rast and Stamp. Random DNA fragments from the total DNA sequence libraries and the amount of 16S rRNA gene were measured by real-time PCR using the Taqman technology in samples isolated from the two soil micro-ecosystems, and these results were used to estimate the size of the metagenomes and the amount of 16S rRNA copies present in the start and endpoint of the two soil micro-ecosystems.

Experimental set-up and sampling

Soil micro-ecosystems were created in aerated containers (14″×20″×7″) using a commercial earthworm soil (Magic Products Inc., Amherst Junction, WI, USA) in two parallel experiments. In one of the two micro-ecosystems, the earthworm Dendrobaena veneta (Oves sportsfiske, Sandefjord, Norway) was added. Every week, 50 g of Magic® Worm Food (Magic Products Inc.) in water slurry was added to the two boxes. The food was composed of around 12% protein, 1% fat, 81% carbohydrates and 6% fibre. Soil samples were collected at the start point and after a period of 14 weeks and stored at −80°C until further use. The sampling was performed by collecting 1 kg of soil from the centre of each of the containers. The soil was then manually mixed, before random sampling of 0·5 g soil samples for further analysis.

Chemical characterization

The amount of total carbon, nitrogen, organic carbon, inorganic carbon and ammonium, as well as the pH, was measured on the starting point and the two endpoint soils. The chemical measurements were taken by Eurofins.

DNA isolation

DNA was isolated from endpoint samples using E.Z.N.A. Soil DNA Kit (Omega Bio-Tek Inc., Norcross, GA, USA) according to the manufacturer's instructions. 0·6 ng of an external standard (plasmid containing a random DNA sequence not found in nature) was added to the soil samples before DNA isolation if the samples were used in real-time PCR analyses. The other samples were extracted without the external standard.

Preparation for metagenome sequencing

Four DNA samples for whole 454 metagenome DNA sequencing were pooled and precipitated to obtain the amount of DNA needed and further prepared according to the instructions by the manufacturer (Roche, Basel, Switzerland).

Total bacterial quantitative real-time PCR

Real-time PCRs targeting the 16S rRNA gene using the forward primer sequence 5′-TCCTACGGGAGGCAGCAGT-3′, reverse primer sequence 5′-GGACTACCAGGGTATCTAATCCTGTT-3′ and probe sequence (6-FAM) 5′-CGTATTACCGCGGCTGCTGGCAC-3′ (TAMRA) (Nadkarni et al. 2002) were performed on DNA isolated from the soil samples containing a known amount of externally added reference DNA. Standard curves were created using a known concentration of the 16S rRNA gene product and externally added reference DNA. This was performed to quantify the amount of 16S rRNA gene copies per gram of soil from the micro-ecosystems.
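The standard-curve step can be illustrated with a minimal Python sketch. It assumes the usual linear relation between Ct and log10 of the template copy number; the dilution series and Ct values below are invented for illustration and are not data from this study, and the function names are hypothetical.

```python
def fit_standard_curve(log10_copies, ct_values):
    """Ordinary least-squares fit of Ct = slope * log10(copies) + intercept."""
    n = len(log10_copies)
    mean_x = sum(log10_copies) / n
    mean_y = sum(ct_values) / n
    sxx = sum((x - mean_x) ** 2 for x in log10_copies)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(log10_copies, ct_values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def copies_from_ct(ct, slope, intercept):
    """Invert the standard curve to estimate template copies from a Ct value."""
    return 10 ** ((ct - intercept) / slope)


# Hypothetical tenfold dilution series of a standard (made-up values).
standards_log10 = [3, 4, 5, 6, 7]
standards_ct = [30.0, 26.7, 23.4, 20.1, 16.8]

slope, intercept = fit_standard_curve(standards_log10, standards_ct)
unknown_copies = copies_from_ct(25.0, slope, intercept)
```

The same inversion would be applied to the externally added reference DNA, so that losses during DNA isolation could in principle be corrected for; that correction is not modelled here.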

For the real-time PCR measurements, we used 25-μl reaction volumes containing 1 unit DyNAzyme II Hot Start DNA polymerase (Thermo Fisher Scientific Inc., Finnzymes, Vantaa, Finland), 0·2× Rox (Life Technologies, Invitrogen, CA, USA), 0·2 μmol l⁻¹ probe (Life Technologies, Applied Biosystems, CA, USA), 0·4 μmol l⁻¹ of each primer (Life Technologies, Invitrogen), 1× DyNAzyme II Hot Start Reaction buffer (Thermo Fisher Scientific Inc.) and 400 μmol l⁻¹ dNTPs (Thermo Fisher Scientific Inc., Waltham, USA). The cycling conditions started with an activation step at 94°C for 10 min, followed by 40 cycles of 95°C for 30 s, 63°C for 30 s and 72°C for 1 min.

PCR amplification for 454-sequencing

Three replicates of extracted DNA from soil samples from each endpoint were PCR amplified using primers targeted against the V3/V4 region of the 16S rRNA gene (modified from Nadkarni et al. 2002) (Table S7). Each reaction was performed using the same procedure as for the 16S rRNA gene PCR presented above. All samples were amplified with the same complementary primers; however, they were tagged with different MID (10-base Multiplex Identifier) sequences. The concentration of the PCR-amplified samples was then measured using an Agilent 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA, USA), and the samples were pooled and purified using Agencourt AMPure XP (Beckman Coulter Inc., Brea, CA, USA) before the amplicons were sequenced using the 454 GS FLX instrument at the Norwegian High-Throughput Sequencing Centre.


The total DNA samples were sequenced on half a plate each with the Genome Sequencer FLX System, while the amplicons were pooled in equal amounts and sequenced on 1/16 part of a plate with Genome Sequencer FLX Titanium System. Signal intensity from the sequencing was converted to sequence reads with the standard software provided by 454 Life Sciences.

Quality check of the 454-sequences

The quality of the data was checked using the Web-based application Prinseq (Schmieder and Edwards 2011). The sequences had quality values above 20, with the majority of sequences having values between 35 and 40. Sequences containing ambiguous bases, homopolymers longer than 10 bases or primer sequences were removed from the shotgun libraries using Mothur (Schloss et al. 2009). Exact duplicate sequences were removed with an in-house Perl script (Kumar et al. 2011). Artificial duplicates were removed using the program cdhit-454 with default settings except for the parameters: sequence identity threshold (-c) set to 0·96, alignment (-g) set to 1 and band width (-b) set to 10 (Niu et al. 2010).
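The filtering criteria can be sketched in a few lines of Python. This is only an illustration of the rules stated above (no ambiguous bases, no homopolymer longer than 10 bases, no exact duplicates); it is not the Mothur, Perl or cdhit-454 code used in the study, and primer removal and near-duplicate clustering are not modelled. The function names are hypothetical.

```python
import re

MAX_HOMOPOLYMER = 10  # reads with a longer run of one base are discarded


def has_ambiguous_base(seq):
    """True if the read contains anything other than A, C, G or T."""
    return re.search(r"[^ACGT]", seq.upper()) is not None


def longest_homopolymer(seq):
    """Length of the longest run of a single repeated base."""
    longest = 0
    for match in re.finditer(r"(.)\1*", seq.upper()):
        longest = max(longest, len(match.group(0)))
    return longest


def filter_reads(reads):
    """Keep (name, sequence) pairs that pass all three criteria."""
    seen = set()
    kept = []
    for name, seq in reads:
        seq = seq.upper()
        if has_ambiguous_base(seq):
            continue
        if longest_homopolymer(seq) > MAX_HOMOPOLYMER:
            continue
        if seq in seen:  # exact duplicate of an earlier read
            continue
        seen.add(seq)
        kept.append((name, seq))
    return kept
```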

Metagenome analyses

Taxonomic classification was performed on the filtered 454-sequences using BlastX against the NCBI nonredundant Protein database with the e-value cut-off set to 10⁻³ in Bioportal. Blast output files were analysed in Megan (Huson et al. 2007) using default settings. Megan takes the Blast results for the sequencing reads (comparison against a reference database) as the starting point and assigns each read to a node in the NCBI taxonomy using the lowest common ancestor (LCA) algorithm. The filtered 454-sequences were also submitted to the MG-Rast database (Meyer et al. 2008). The MG-Rast results were further analysed in the program Stamp (Parks and Beiko 2010), created for statistical analysis of phylogenetic and metabolic metagenome data. Statistical settings in Stamp were set to Fisher's exact test, two sided, 95% confidence interval and CI method Newcombe–Wilson with Bonferroni correction.
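The idea behind the LCA assignment can be shown with a toy example. The sketch below is not Megan's implementation: it assumes a taxonomy given as a simple child-to-parent dictionary and ignores the score and hit-count thresholds that a real tool would apply. The small taxonomy is an illustrative fragment only.

```python
def lineage(taxon, parent):
    """Path from a taxon up to the root of the taxonomy."""
    path = [taxon]
    while taxon in parent:
        taxon = parent[taxon]
        path.append(taxon)
    return path


def lowest_common_ancestor(taxa, parent):
    """Deepest node shared by the lineages of all hit taxa of one read."""
    paths = [lineage(t, parent) for t in taxa]
    common = set(paths[0]).intersection(*paths[1:])
    # Walking up from the first taxon, the first shared node is the deepest.
    for node in paths[0]:
        if node in common:
            return node
    return None


# Illustrative child -> parent fragment of a taxonomy.
PARENT = {
    "Rhizobiales": "Alphaproteobacteria",
    "Sphingomonadales": "Alphaproteobacteria",
    "Pseudomonadales": "Gammaproteobacteria",
    "Alphaproteobacteria": "Proteobacteria",
    "Gammaproteobacteria": "Proteobacteria",
    "Proteobacteria": "Bacteria",
}
```

A read whose hits fall in Rhizobiales and Sphingomonadales would thus be placed at Alphaproteobacteria, whereas a read hitting Rhizobiales and Pseudomonadales would only be assigned at the phylum level.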

Analysis of 454-amplicon sequences

The amplicon sequences were split into groups according to their MID sequence after the 454-sequencing. This resulted in six amplicon sequence libraries that were named Soil + worms amplicon rep.1, Soil + worms amplicon rep.2, Soil + worms amplicon rep.3, Soil − worms amplicon rep.1, Soil − worms amplicon rep.2 and Soil − worms amplicon rep.3, representing three replicates from the soil containing earthworms micro-ecosystem and three replicates from the micro-ecosystem without earthworms.
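Splitting reads by their MID tag amounts to a lookup on the first 10 bases of each read. The following sketch assumes exact tag matches at the 5′ end; the tag sequences and sample labels are placeholders, not the MIDs used in this study, and the function name is hypothetical.

```python
MID_LENGTH = 10  # length of the Multiplex Identifier tag in bases


def demultiplex(reads, mid_to_sample):
    """Assign (name, sequence) reads to samples by their leading MID tag."""
    bins = {sample: [] for sample in mid_to_sample.values()}
    unassigned = []
    for name, seq in reads:
        sample = mid_to_sample.get(seq[:MID_LENGTH].upper())
        if sample is None:
            unassigned.append((name, seq))
        else:
            # Store the read with the tag trimmed off.
            bins[sample].append((name, seq[MID_LENGTH:]))
    return bins, unassigned


# Placeholder tags for two of the six libraries.
EXAMPLE_MIDS = {
    "AAAAACCCCC": "Soil + worms amplicon rep.1",
    "GGGGGTTTTT": "Soil - worms amplicon rep.1",
}
```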

Mothur (Schloss et al. 2009) was used to filter the sequence reads. Reads that had ambiguous bases, homopolymers over 10 bases, amplicon length <200 bp and sequences with deficient primer sequence (two mismatches were allowed) were filtered away. To remove chimeric sequences, the implemented chimera slayer algorithm in Mothur was used (Haas et al. 2011). Chimeric amplicon sequences were removed by aligning the amplicon sequences to the reference data set alignment (Haas et al. 2011) and further comparing them with chimera slayer in Mothur. After removal of chimeric sequences, the remaining sequences from each sample were preclustered to reduce the number of artefacts in the data (Huse et al. 2010; Haas et al. 2011).

Mothur was used for the analysis of the amplicon libraries. The sequence reads were first classified against the RDP training set, a distance matrix was created (cluster command, average neighbour algorithm), and the sequences were then assigned to operational taxonomic units (OTUs) on the basis of the distance matrix. The number of OTUs was calculated for each replicated sample until a similarity cut-off of 0·5 was reached. To describe the diversity within each sample, the richness and alpha-diversity estimates Shannon, Chao 1 and ACE were calculated for the OTU values: unique, 0·01, 0·03, 0·05, 0·1, 0·15 and 0·3. The beta-diversity was studied by creating Venn diagrams showing how many of the OTUs are shared between the different replicates of DNA extractions from the same micro-ecosystem and between the two different micro-ecosystems. The sequences were compared to the NCBI nr nucleotide database using BlastN with an e-value cut-off of 10⁻³ in Bioportal. The Blast output files were analysed in Megan (Huson et al. 2007).
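For readers unfamiliar with these estimators, the following sketch computes the Shannon index and a Chao 1 richness estimate from a list of OTU abundances. It assumes the natural-log Shannon index and the bias-corrected form of Chao 1; the exact variants and settings used by Mothur in this study are not restated in the text, so the choice here is an assumption. ACE is omitted.

```python
import math


def shannon(counts):
    """Shannon index H = -sum(p_i * ln p_i) over OTUs with non-zero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def chao1(counts):
    """Bias-corrected Chao 1: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1)),
    where F1 and F2 are the numbers of singleton and doubleton OTUs."""
    s_obs = sum(1 for c in counts if c > 0)
    f1 = sum(1 for c in counts if c == 1)
    f2 = sum(1 for c in counts if c == 2)
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


# Made-up OTU abundance vector: 7 observed OTUs, 3 singletons, 2 doubletons.
example_counts = [10, 5, 2, 2, 1, 1, 1]
```

A high proportion of singletons relative to doubletons pushes the Chao 1 estimate well above the observed richness, which is the situation expected for undersampled libraries.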

The neutral theory assumes that if an organism dies in a saturated community, then it will be replaced by the other community members by chance, just based on their numbers (Rosindell et al. 2011). The neutrality testing was based on species richness distributions (0·03 OTU cut-off) using the program Parthy (Jabot and Chave 2011). This program provides an extension to Hubbell's neutral theory with the parameterization of the deviation from neutrality through the parameter δ. This parameter is negative when rare species are more frequent than expected based on the neutral theory, while being positive if abundant species are more frequent than expected.

Estimates of metagenome size from randomly picked DNA fragments using real-time PCR

Eight random DNA fragments were chosen from the two total DNA sequence libraries. The DNA fragments were named Rand.seq.1, Rand.seq.2, Rand.seq.3, Rand.seq.4, Rand.seq.5, Rand.seq.6, Rand.seq.7 and Rand.seq.8, and the levels of these fragments were measured in the DNA samples isolated from the endpoints of the two soil micro-ecosystems. We used real-time PCR using primers and probes designed to detect the random DNA fragments (Table S8).

The real-time PCR measurements were taken with the same reaction volumes and concentrations as described earlier, with the cycling conditions starting with an activation step at 95°C for 10 min followed by 40 cycles of 95°C for 30 s and 60°C for 1 min.

For each qPCR amplicon, we first established the relation between Ct values and DNA copies of synthetic templates using standard curves. The number of copies of the synthetic template was deduced by multiplying the number of moles by Avogadro's number. Based on the standard curves, we could then estimate the number of copies of the qPCR amplicon targets in the respective DNA samples. Then, using the molecular weight of the double-stranded amplicons, we calculated the weight of the target. Finally, the total weight of the DNA in the sample was divided by the weight of the target. This gave the number of DNA fragments with equal weight and approximate size as the target. With the assumption that these DNA fragments are present in the same copy numbers as the target, one can calculate the total metagenome size. A complete description of this calculation including formulas can be found in Appendix S1.
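The calculation described above can be written out as a short sketch. It is a reading of the prose, not a transcription of Appendix S1: it assumes an average molar mass of 650 g mol⁻¹ per base pair of double-stranded DNA (a common approximation), and the function names and example quantities are hypothetical.

```python
AVOGADRO = 6.02214076e23   # molecules per mole
BP_MOLAR_MASS = 650.0      # g/mol per base pair of dsDNA (assumed approximation)


def copies_from_mass(mass_g, length_bp):
    """Copies of a dsDNA template: moles multiplied by Avogadro's number."""
    moles = mass_g / (length_bp * BP_MOLAR_MASS)
    return moles * AVOGADRO


def metagenome_size_nt(total_dna_g, target_copies, amplicon_bp):
    """Metagenome size in nucleotides from one random target.

    total_dna_g   -- total mass of DNA in the sample
    target_copies -- copies of the target, read from the standard curve
    amplicon_bp   -- length of the target amplicon
    """
    # Mass contributed by all copies of the target.
    target_mass_g = target_copies * amplicon_bp * BP_MOLAR_MASS / AVOGADRO
    # Number of equally sized, equally abundant fragments filling the sample.
    n_fragments = total_dna_g / target_mass_g
    return n_fragments * amplicon_bp


# Hypothetical example: 100 ng of DNA, 1000 copies of a 100-bp target.
example_size = metagenome_size_nt(1e-7, 1000, 100)
```

Under these assumptions the amplicon length cancels out, so the estimate reduces to the total number of nucleotides in the sample divided by the copy number of the target, that is, nucleotides per copy.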


The chemical composition of the two micro-ecosystems

The chemical parameters total carbon, nitrogen, total organic carbon, total inorganic carbon, pH and ammonium were measured on the starting point soil and the two endpoints from the two soil micro-ecosystems (Table S1). The chemical measurements showed that, from the start to the end of the micro-ecosystem experiment, the pH was lowered from 7·6 to 7·2 in the soil with earthworms (soil + worms) and to 7·0 in the nonearthworm soil (soil − worms). The amount of total carbon changed from 33·7% dry mass to 34·4% dry mass in the soil sample taken from the micro-ecosystem containing earthworms and to 32·2% dry mass in the soil sample where no earthworms were present. The amount of dry mass nitrogen was measured to 0·6% in the start sample and was lowered in both endpoint samples, to 0·5% in the sample from soil with earthworms and 0·3% in the sample from soil without earthworms.

Characteristics of metagenome data

DNA extracted from both soil micro-ecosystems was used to create two 454 shotgun data sets. The soil + worm data sets have a total of 258 995 reads with average read length 202·16 (std. 56·56) and a total amount of 52 358 980 bases. The soil − worms data sets contain 225 319 reads with average read length 209·22 (std. 52·10) and a total of 47 142 252 bases. After filtration, the total number of sequences was 233 291 sequences for the soil + worms sequence library and 201 868 sequences for the soil − worms sequence library (Table S2).

Direct estimates of metagenome size by qPCR

To obtain a direct estimate of the metagenome sizes of the two micro-ecosystems, we used real-time PCR for quantification of four sequences randomly chosen from each of the two shotgun DNA sequence libraries (Table 1). There was a surprisingly even distribution of the copy number of the random sequences relative to the total amount of DNA within each sample. The average number of nt/copy was calculated to be log 11·1 nt (std. 1·62) for the micro-ecosystem soil containing earthworms, while the micro-ecosystem soil without earthworms had an estimated metagenome size of log 11·1 nt (std. 1·38). The log values detected for each random DNA fragment are presented in Table S11.

Table 1. Estimated metagenome sizes presented in log nucleotide values. The values are estimated from the eight different random DNA fragments measured on DNA isolated from soil + and − worms

Random DNA fragment name   Sequence origin                                        DNA from soil +/− worms   Estimated metagenome size (a)   Standard deviation
Rand.seq.1                 Random DNA fragment from 454-metagenome Soil + worms   +                         12·77                           0·25
                                                                                                            N.a.                            N.a.
Rand.seq.2                 Random DNA fragment from 454-metagenome Soil + worms   +                         10·43                           0·12
Rand.seq.3                 Random DNA fragment from 454-metagenome Soil + worms   +                         9·22                            0·19
Rand.seq.4                 Random DNA fragment from 454-metagenome Soil + worms   +                         12·17                           0·29
Rand.seq.5                 Random DNA fragment from 454-metagenome Soil − worms   +                         9·88                            0·10
Rand.seq.6                 Random DNA fragment from 454-metagenome Soil − worms   +                         8·95                            0·06
Rand.seq.7                 Random DNA fragment from 454-metagenome Soil − worms   +                         11·83                           0·76
Rand.seq.8                 Random DNA fragment from 454-metagenome Soil − worms   +                         10·43                           0·06

N.a., not applicable.
(a) Numbers are presented in log nt.
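As a consistency check, the average and standard deviation reported in the text for the soil containing earthworms can be recomputed from the Table 1 values. The sketch assumes that the reported figures are the arithmetic mean and sample standard deviation of the log values for Rand.seq.1 to Rand.seq.4; the corresponding values for the soil without earthworms cannot be recomputed from the rows shown here.

```python
import statistics

# Log10 metagenome size estimates for Rand.seq.1-4 (Table 1, soil + worms).
log_sizes_plus = [12.77, 10.43, 9.22, 12.17]

mean_log = statistics.mean(log_sizes_plus)   # arithmetic mean of log values
sd_log = statistics.stdev(log_sizes_plus)    # sample standard deviation (n - 1)

print(f"mean log size: {mean_log:.1f}, std: {sd_log:.2f}")
# prints "mean log size: 11.1, std: 1.62"
```

Averaging on the log scale corresponds to a geometric mean of the individual size estimates, which limits the influence of the largest single estimate.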

We found no significant differences between DNA samples isolated from soil with and without earthworms for the random sequences originating from the nonearthworm soil sequence library. On the other hand, all the random sequences originating from the earthworm soil sequence library showed a 10- to 100-fold overrepresentation in earthworm soil as compared to the nonearthworm soil (P < 0·02, Mann–Whitney U-test). This suggests that a bacterial subpopulation is overrepresented in the micro-ecosystem soil containing earthworms.

Taxonomic signatures in the metagenome libraries

Results from MG-Rast showed that 96% of the sequences in the shotgun DNA sequence libraries were assigned to the bacteria domain using the Seed database as reference. MG-Rast showed Proteobacteria as more abundant than the other groups at the phylum level, with close to 80% of the assigned reads. The Alphaproteobacteria was the most abundant group in the Proteobacteria phylum. For this group, there was a clear difference between the two samples: the soil − worms seq. library had 45% of reads assigned and the soil + worms seq. library had 37% of reads assigned. In the Alphaproteobacteria group, the Rhizobiales showed the main differences, with 56% of the total amount of hits in the Proteobacteria class for the soil + worms seq. library and 65% for the soil − worms seq. library. For Gammaproteobacteria, the Pseudomonadales and Xanthomonadales showed clear differences, with the Pseudomonadales being 23% higher represented in soil containing earthworms, and Xanthomonadales being 26% higher represented in the soil without earthworms (Fig. S1). Stamp comparison at the order level (Fig. 2) shows that Rhizobiales, Xanthomonadales and Sphingomonadales were significantly overrepresented (P < 1e-15) in the metagenome from soil without earthworms, while Pseudomonadales, Rhodocyclales, Planctomycetales, Solibacteres and Verrucomicrobiae were among the groups overrepresented in the metagenome from soil with earthworms.

Figure 2.

Stamp comparisons of metagenomes. The yellow sample represents the soil without earthworms, and the blue represents the soil containing earthworms. Both samples are compared at the order level according to the MG-Rast system. These results were tested using Fisher's exact test, two sided, 95% confidence interval and CI method (difference between proportions): Newcombe–Wilson with Bonferroni correction.

BlastX was used to compare the two metagenome sequence libraries with the NCBI nr Protein database. The BlastX results compared using Megan showed the same patterns as the MG-Rast analysis (Fig. S1).

Metabolic pathways in the metagenome libraries

Metabolic profiles for both metagenomes were created in MG-Rast. These analyses gave 24·08% hits for the soil + worms metagenome, while the soil − worms metagenome had 26·72% hits. Comparative analyses of the metabolic profiles using subsystem level 3 with Stamp gave only four significantly different subsystems (Fig. S2). A comparison using subsystem level 1 between the two total DNA metagenome libraries revealed more hits for the clustering-based subsystem group, followed by the carbohydrate, amino acids and protein metabolism subsystems (Table S3).

Based on only two whole-genome shotgun samples, it is very difficult to connect the bacterial taxonomic composition and functions to earthworm additions; however, these results give a first indication of which functions are connected to earthworm additions. The taxonomic composition can be explored using statistical tests on the amplicon data.

16S rRNA gene copy estimates and phylogenetic composition

Quantitative real-time analyses of the 16S rRNA gene copies gave estimates in the range from log 11·02 to log 11·93 copies per gram of soil at the start of the experiment (Table S4). During the time period of the experiment, the level of 16S rRNA gene copies increased to above log 13 copies per gram for the endpoints from both soil micro-ecosystems.

The pyrosequencing of the 16S rRNA gene amplicons resulted in an average of 4336 raw sequence reads per library (range 2828 to 5481). The 16S rRNA gene amplicon libraries were filtered using a five-step filtering approach in Mothur, removing reads with ambiguous bases, homopolymers over 10 bases, amplicon length <200 bp or deficient primer sequences, and removing chimeric sequences using chimera slayer and preclustering. The resulting libraries were used in further analyses (Table S5).

Rarefaction curves (Fig. 3a–f) showed a nearly straight line. This indicated that our libraries did not cover a large portion of the OTUs and that the amplicon sequencing gave a relatively low OTU coverage. The number of OTUs detected reached 10 or fewer for all replicates at the 80% similarity level, and a maximum of between 775 and 1203 at 100% similarity (Fig. 3g), indicating that there is a high level of closely related phylotypes. The high diversity was confirmed by the richness estimators Sobs (richness), ACE and Chao1 (Table S6).
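Why a nearly straight curve signals low coverage can be seen from a simple accumulation sketch. The code below draws reads in one random order and counts the OTUs seen so far; a proper rarefaction curve averages over many such orderings, which is not done here, and the function name and inputs are hypothetical.

```python
import random


def rarefaction_curve(otu_labels, step=1, seed=0):
    """OTUs observed as reads are drawn in a single random order.

    otu_labels -- one OTU label per read
    Returns a list of (reads sampled, OTUs observed) points.
    """
    rng = random.Random(seed)
    shuffled = list(otu_labels)
    rng.shuffle(shuffled)
    seen = set()
    curve = []
    for i, otu in enumerate(shuffled, start=1):
        seen.add(otu)
        if i % step == 0 or i == len(shuffled):
            curve.append((i, len(seen)))
    return curve
```

If almost every additional read introduces a new OTU, the curve rises close to linearly and shows no plateau, as observed for the libraries in this study.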

Figure 3.

Rarefaction curves at different cut-off levels. The operational taxonomic unit (OTU) similarity cut-offs at levels 0·00, 0·01, 0·02, 0·03, 0·05, 0·10 and 0·20 are plotted for the six different 16S rRNA gene amplicon libraries (Soil + worms rep1, 2 and 3 and Soil − worms rep1, 2 and 3). Graph G shows the number of OTUs plotted against the different similarity cut-offs. Here, Soil +/− e. rep 1, 2 and 3 represent the same samples as Soil +/− worms rep. 1, 2 and 3. Legend entries: cut-offs 0·00, 0·01, 0·03, 0·05, 0·10 and 0·20; samples Soil + e.rep1, Soil + e.rep2, Soil + e.rep3, Soil − e.rep1, Soil − e.rep2 and Soil − e.rep3.

Venn diagrams were used to illustrate the β diversity between sequence libraries. The Venn diagrams (Fig. S3) show that the three samples from the same condition are relatively diverse. The Venn diagrams for the OTUs from the two different micro-ecosystems show that only 377 of the total richness of 3970 OTUs are shared between all the samples, while 198 of 2214 OTUs are shared between all three replicates from soil with earthworm and 196 of 2133 OTUs are shared between the nonearthworm samples.
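The counts underlying such Venn diagrams are set operations on the OTU labels of each library. A minimal sketch, with hypothetical function names and made-up OTU labels:

```python
def shared_otus(*libraries):
    """OTUs present in every one of the given libraries."""
    sets = [set(lib) for lib in libraries]
    return set.intersection(*sets)


def total_richness(*libraries):
    """All distinct OTUs observed across the given libraries."""
    return set().union(*libraries)


# Made-up OTU labels for three replicate libraries.
rep1 = ["otu1", "otu2", "otu3", "otu4"]
rep2 = ["otu2", "otu3", "otu5"]
rep3 = ["otu3", "otu2", "otu6", "otu7"]

shared = shared_otus(rep1, rep2, rep3)
richness = total_richness(rep1, rep2, rep3)
```

The fraction of shared OTUs is then the size of the intersection divided by the size of the union, which for the replicate libraries reported above is below 10%.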

Taxonomic classifications were performed on the filtered amplicon libraries using BlastN against the NCBI nr nucleotide database. The Blast results were analysed in Megan using the same approach as for the total DNA data. The Blast results were also analysed using MG-Rast. Megan and MG-Rast mainly revealed the same trends (Fig. S1). The Proteobacteria were the largest group at the phylum level. At deeper taxonomic levels, the Alpha- and Gammaproteobacteria groups showed that Rhizobiales and Xanthomonadales were clearly overrepresented in the DNA extracted from soil without earthworms, and Pseudomonadales and Enterobacteriales in DNA isolated from soil containing earthworms.

Based on the 0·03 OTU cut-off, we determined the deviation from neutrality (using the δ parameter) for the species richness distributions using the Parthy software. These analyses showed a clear tendency of a correlation between sequencing depth and negative deviation from neutrality (Fig. 4), indicating that there is a major overrepresentation of rare species in the data set compared to what was expected for neutral communities and that the sequencing depths used were not sufficient to uncover the true δ.

Figure 4.

Correlation between δ and the size of the clone library. Red diamonds represent samples from soil with earthworms, and blue diamonds represent nonearthworm samples.


We found a relatively equal number of copies for the different amplicons evaluated by our direct estimation approach, which indicates an even distribution of most of the genomes in the soil metagenome. The even distribution is also supported by the neutrality test, showing an overrepresentation of rare species. Furthermore, OTU analysis of the alpha- and beta-diversity showed that relatively few of the OTUs were shared between the replicates and that the rarefaction curves were not asymptotic. Together with the direct estimate of a combined metagenome size of about 100 billion base pairs, these data support large soil metagenome sizes with a low content of eukaryotic species (<5% of all random DNA fragments). Most likely, the true size of the metagenome in natural soil is even larger than the estimates for the experimental soil microbiota.

Setting these estimates into perspective, the human genome has over 2·85 billion base pairs (IHGSC 2004), which corresponds to a log value of 9·45, and the Escherichia coli genome has a little over 4·5 million base pairs (Blattner et al. 1997), which corresponds to a log value of 6·65. This indicates that the metagenomes from the two micro-ecosystem samples sequenced here are more than 30-fold larger than the human genome, and more than 20 000-fold larger than an average bacterial genome. Given one gene per kb, our data indicate that there are about 100 million genes in the soil metagenome. Although our total diversity estimates are rough, they suggest that we are still far away from a complete coverage of the soil metagenome, both with respect to sequence characterization and functional annotation. The information presented here should thus be considered when designing large-scale studies of the soil microbiota (Vogel et al. 2009), to avoid underpowered studies.
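The comparisons in this paragraph follow from simple arithmetic on the log values, as the short sketch below shows. It takes log 11·1 nt as the metagenome size and uses the genome sizes quoted in the text; the variable names are illustrative.

```python
import math

metagenome_nt = 10 ** 11.1   # estimated metagenome size (log 11.1 nt)
human_nt = 2.85e9            # human genome size quoted in the text
ecoli_nt = 4.5e6             # E. coli genome size quoted in the text

log_human = math.log10(human_nt)           # about 9.45
log_ecoli = math.log10(ecoli_nt)           # about 6.65

fold_vs_human = metagenome_nt / human_nt   # more than 30-fold
fold_vs_ecoli = metagenome_nt / ecoli_nt   # more than 20 000-fold

# One gene per kb gives on the order of 10^8 genes.
genes = metagenome_nt / 1000
```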

Despite the limited coverage in our study, we found relatively consistent evidence for functional redundancy due to the major differences in taxonomic composition and small differences in metabolic profiles. Furthermore, the overrepresentation of rare species and negative selection of dominant species suggest the importance of top-down selection, while suggesting that bacterial competition plays a minor role in the soil micro-ecosystem. Given a negative selection for the dominant species, this can explain the enormous diversity of the soil microbiota (Wildschutte et al. 2004).

Top-down selection can either be mediated by predation (Meyer and Kassen 2007) or parasitism (Reyes et al. 2010). The low level of eukaryotic DNA (<5%), however, suggests that eukaryotic predators do not play a major role in our micro-ecosystems. Bacteriophages represent common bacterial parasites. In a preliminary search not included in the current study, we actually found DNA sequences with homology to a range of bacteriophages. Whether these are important in controlling the dominant bacteria in the soil micro-ecosystem needs to be verified by additional experimentation.

The differences in the microbiota between the soil samples with and without earthworms are interesting. In particular, the possibility that earthworms could have introduced a microbial subpopulation is intriguing. We have recently shown that the gut microbiota among earthworms are changed to a homogenous state upon feeding (Rudi et al. 2009). This is consistent with a subpopulation of bacteria in soil with the ability to colonize the earthworm gut. However, whether the observed subpopulation is an effect of earthworms also needs to be confirmed in follow-up experiments.

In conclusion, we have shown that the soil metagenome is very large and that the large size is probably a consequence of top-down selection of the dominant bacterial species. The mechanisms for the top-down selection, however, need to be verified by further experimentation.


This work was financially supported by the Hedmark University College and Hedmark Sparebank. The Norwegian Sequencing Centre is acknowledged for the 454-sequencing of metagenomes and amplicons, and we would like to thank Else-Berit Stenseth for excellent technical assistance.