COVER: a priori estimation of coverage for metagenomic sequencing
Article first published online: 17 APR 2012
© 2012 Society for Applied Microbiology and Blackwell Publishing Ltd
Environmental Microbiology Reports
Volume 4, Issue 3, pages 335–341, June 2012
How to Cite
Tamames, J., de la Peña, S. and de Lorenzo, V. (2012), COVER: a priori estimation of coverage for metagenomic sequencing. Environmental Microbiology Reports, 4: 335–341. doi: 10.1111/j.1758-2229.2012.00338.x
- Issue published online: 10 MAY 2012
- Received 8 August, 2011; revised 27 February, 2012; accepted 1 March, 2012.
Fig. S1. Accuracy of the estimation of the fraction of 16S rRNA sequences belonging to unobserved OTUs (Good's sample coverage). The results were obtained using a simulated data set composed of 16S rRNA sequences corresponding to 200 genomes, with abundances following a log-normal distribution (upper panel) or a broken-stick distribution (lower panel). Both distributions are used in ecology: the first is found in many natural communities, whereas the second is predicted for communities where the resources are partitioned into niches at random. Although microbial communities usually do not follow the broken-stick distribution, we wanted to test the performance of our calculation under this model of extremely high evenness. The insets are rank-abundance graphs showing the shapes of the respective distributions, with species ranked by abundance on the x-axis. The expected number of sequences is calculated using Good's estimator, as described in the main text, whereas the real numbers are obtained by randomly sampling the number of sequences indicated on the x-axis.
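As an illustration of the quantity evaluated in Fig. S1, Good's sample coverage can be computed from a list of OTU assignments as 1 − F1/N, where F1 is the number of OTUs seen exactly once and N is the total number of sequences; F1/N then estimates the fraction of sequences belonging to unobserved OTUs. The following is a minimal sketch only: the function name `goods_coverage` and the input format (one OTU label per sequence) are assumptions made for this example and are not taken from COVER.

```python
from collections import Counter


def goods_coverage(otu_labels):
    """Good's sample coverage, 1 - F1/N, for a list of OTU labels.

    otu_labels: one label per 16S sequence (hypothetical input format).
    """
    counts = Counter(otu_labels)
    n_sequences = sum(counts.values())
    if n_sequences == 0:
        raise ValueError("at least one sequence is required")
    # F1: number of OTUs observed exactly once (singletons)
    singletons = sum(1 for c in counts.values() if c == 1)
    return 1.0 - singletons / n_sequences


# Six sequences, one singleton OTU ("b"): coverage = 1 - 1/6
sample = ["a", "a", "b", "c", "c", "c"]
coverage = goods_coverage(sample)
unobserved_fraction = 1.0 - coverage
```

Under this estimator, a sample with many singletons yields low coverage, which indicates that further sequencing is still likely to reveal new OTUs.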
Fig. S2. Accuracy of the estimation of unknown genome sizes. Upper: The difference in genome size (expressed as |S1 − S2|/max(S1, S2), with S1 and S2 representing the real sizes of the genomes) for pairs of genomes of known sizes, in relation to their taxonomic proximity. The relationship between genome size and taxonomic relatedness is apparent. For instance, genomes related at the species level (i.e. different strains from the same species) usually differ by less than 10% in genome size. If the genomes belong to the same genus, the difference can extend to 25%, although in most cases it remains at 10% or less. Lower: Use of the genome sizes of sequenced species to infer the sizes of species currently being sequenced (species ‘in progress’ in the NCBI database, http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi, whose size has been estimated, usually via PFGE). The plot shows the probability of inferring the size correctly using the sizes of other species at different taxonomic ranks. For instance, the case marked by a dashed line in the plot corresponds to estimating the size of a species from the known sizes of other species of the same genus. In that case, there is approximately a 75% probability of inferring its genome size with less than 10% error.
Fig. S3. Accuracy of the estimation of the 16S rRNA copy number. Differences in copy number (expressed as |C1 − C2|/max(C1, C2), with C1 and C2 representing the numbers of 16S copies in the genomes) for pairs of genomes of known copy number, in relation to their taxonomic proximity.
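The difference measure used in Figs S2 and S3 can be written as a single function, reading the expression as the absolute difference divided by the larger of the two values. This is a sketch for illustration; the function name `relative_difference` and the example values are assumptions and do not come from the article.

```python
def relative_difference(a, b):
    """Return |a - b| / max(a, b) for two positive values.

    Applies equally to genome sizes (Fig. S2) and to
    16S rRNA copy numbers (Fig. S3).
    """
    if a <= 0 or b <= 0:
        raise ValueError("both values must be positive")
    return abs(a - b) / max(a, b)


# Hypothetical genome sizes in Mb: 4.0 vs 5.0 gives a 20% difference
size_diff = relative_difference(4.0, 5.0)

# Hypothetical 16S copy numbers: 7 vs 1 gives 6/7
copy_diff = relative_difference(7, 1)
```

Dividing by the larger value keeps the measure between 0 and 1 and makes it symmetric in its two arguments.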
Fig. S4. Variation of the estimated coverage in relation to the number of 16S rRNA sequences provided. A community of 100 species was simulated, and the estimated coverage for the first 10 members was calculated by COVER using different initial numbers of 16S sequences, supposing a sequencing effort of 500 000 reads of 400 base pairs each. The coverage estimates oscillate greatly when few sequences are provided, indicating that the community composition is still not well determined. When a substantial number of 16S sequences is provided (between 2000 and 3000, in this case), the estimated coverage values stabilize and are very similar to the real coverage values (last point in the plot).
Fig. S5. Results of the estimation of coverage for a controlled data set composed of 100 genomes, with abundances following a log-normal distribution. The results are obtained by simulating the sequencing of 500 000 reads of 400 bp each. The plot shows the real coverage for each species (red line) and the coverage predicted by COVER (green points). Species (genomes) are sorted according to their abundances. Estimated coverage values match the real values very well. Some species have no coverage estimated. These species have been merged with closely related ones because the 16S identity between the related species is 98% or more. For example, Burkholderia cenocepacia is given a coverage of zero because it was merged with Burkholderia pseudomallei, whose coverage is thus overestimated. Both species share 98% identity in their 16S rRNA. The same occurred in two more cases in this experiment: Bacillus anthracis was merged with Bacillus cereus, and Escherichia fergusonii was merged with Escherichia coli.
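For orientation, the kind of per-species coverage plotted in Fig. S5 can be approximated with a simplified calculation. This is not COVER's implementation; it is a sketch under stated assumptions: 16S counts divided by copy number give relative cell abundances, each species contributes DNA in proportion to cell abundance times genome size, and reads are drawn uniformly from the pooled DNA. The function name `expected_coverage`, its arguments and the two-species example are hypothetical.

```python
def expected_coverage(counts_16s, copy_numbers, genome_sizes,
                      n_reads, read_length):
    """Simplified expected coverage per species (not COVER's own code).

    counts_16s:   species -> number of 16S sequences observed
    copy_numbers: species -> 16S rRNA copies per genome
    genome_sizes: species -> genome size in bp
    """
    # Relative cell abundance: 16S counts corrected for copy number
    cells = {sp: counts_16s[sp] / copy_numbers[sp] for sp in counts_16s}
    # Total DNA in the community, in arbitrary units of cells * bp
    total_dna = sum(cells[sp] * genome_sizes[sp] for sp in cells)
    total_bases = n_reads * read_length
    # Share of bases for a species is cells * size / total_dna;
    # dividing by genome size gives the coverage
    return {sp: total_bases * cells[sp] / total_dna for sp in cells}


# Two hypothetical species with equal cell abundance after correction
cov = expected_coverage(
    counts_16s={"A": 70, "B": 10},
    copy_numbers={"A": 7, "B": 1},
    genome_sizes={"A": 5_000_000, "B": 5_000_000},
    n_reads=500_000,
    read_length=400,
)
```

In this simplified model the genome size of a species cancels out of its own coverage, so coverage depends on its cell abundance relative to the total DNA of the community.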
Table S1. Upper: Number of taxa for each rank, as listed in NCBI's taxonomy database (http://www.ncbi.nlm.nih.gov/Taxonomy) and the number of taxa containing at least one member with known size (from either complete genomes, genomes in progress or genomes with PFGE size estimates, http://www.genomesize.com/prokaryotes). Lower: Presence of families without any members of known genome size in the environmental samples (http://metagenomics.uv.es/envDB). In a set of 3035 samples, 810 contain a member from one of these families.
Table S2. Results obtained for the estimation of the number of reads needed to obtain 5× coverage of the most represented genome in a controlled data set composed of 300 genomes, with abundances following a log-normal distribution. To study the influence of inaccurate estimates of genome size, we allowed these sizes to vary by some percentage of their original values. We drew a random value between 0 and a given percentage of the estimated genome size, and added it to or subtracted it from the estimate. The results obtained when allowing 20% and 50% variation are shown. The values change by about 10% when allowing 20% variation in the estimated sizes, and by only about 25% when allowing 50% variation.
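The perturbation described for Table S2 can be sketched as follows. The caption does not state how the random value was distributed or how the sign was chosen, so this example assumes a uniform draw and an equal chance of adding or subtracting; the function name `perturb_size` and the example genome size are also assumptions.

```python
import random


def perturb_size(size, max_fraction, rng):
    """Return size shifted up or down by a random amount.

    The amount is drawn uniformly from [0, max_fraction * size]
    (assumed distribution), and the direction is chosen with
    equal probability (assumed).
    """
    delta = rng.uniform(0, max_fraction * size)
    sign = rng.choice((-1, 1))
    return size + sign * delta


rng = random.Random(42)  # fixed seed so the sketch is reproducible
original = 5_000_000     # hypothetical genome size in bp

# Allow up to 20% and up to 50% variation, as in Table S2
varied_20 = perturb_size(original, 0.20, rng)
varied_50 = perturb_size(original, 0.50, rng)
```

Repeating the read-number estimate with sizes perturbed in this way gives a sense of how sensitive the result is to errors in the inferred genome sizes.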
Table S3. Comparison of the real and expected results for two metagenomic sequencing projects. The metagenomes were kindly provided by Dr Alejandro Mira (CSISP, Valencia, Spain), and they consist of two coupled sets of 16S and metagenomic sequences from oral samples. The first was obtained by sequencing amplicons from clone libraries. The contig length distributions for the real and expected instances were calculated as described in the text.
|EMI4_338_sm_FigS1.jpg||165K||Supporting info item|
|EMI4_338_sm_FigS2.jpg||208K||Supporting info item|
|EMI4_338_sm_FigS3.jpg||61K||Supporting info item|
|EMI4_338_sm_FigS4.jpg||94K||Supporting info item|
|EMI4_338_sm_FigS5.jpg||46K||Supporting info item|
|EMI4_338_sm_TabS1.doc||23K||Supporting info item|
|EMI4_338_sm_TabS2.doc||28K||Supporting info item|
|EMI4_338_sm_TabS3.doc||34K||Supporting info item|