Fig. S1. The relationship between the accuracy of metagenomic abundance estimates and genomic G+C%. The ratio between observed genomic coverage and known genome abundance in the community is plotted against the GC content of each genome for each of the sequencing platforms. Shaded zones indicate a low level of bias (dark: < 1.5-fold, light: 1.5- to 2-fold) relative to the perfect agreement value of 1. Genomes above or below those zones display an increased bias, correlated with low or high GC content.
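The accuracy ratio and shading zones described for Fig. S1 can be illustrated with a minimal sketch. It assumes that observed coverage and known abundance are expressed in the same relative units (for example, both as fractions of the community total). The function names `fold_bias` and `bias_zone` and the example values are hypothetical and not taken from the article.

```python
def fold_bias(observed_coverage, known_abundance):
    """Return (accuracy ratio, fold deviation from the perfect agreement value of 1)."""
    if observed_coverage <= 0 or known_abundance <= 0:
        raise ValueError("coverage and abundance must be positive")
    ratio = observed_coverage / known_abundance
    # Treat over- and under-representation symmetrically: 0.5 and 2.0 are both 2-fold.
    fold = max(ratio, 1.0 / ratio)
    return ratio, fold


def bias_zone(fold):
    """Map a fold deviation onto the zones named in the legend."""
    if fold < 1.5:
        return "dark zone (< 1.5-fold)"
    if fold <= 2.0:
        return "light zone (1.5- to 2-fold)"
    return "outside zones (> 2-fold)"


# Hypothetical genome observed at half of its known abundance.
ratio, fold = fold_bias(0.5, 1.0)
print(ratio, fold, bias_zone(fold))
```

Using the symmetric fold value means a genome recovered at half its expected abundance and one recovered at twice its expected abundance fall in the same zone, which matches a band drawn around 1 on a logarithmic axis.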


Fig. S2. A. Depth of coverage by 454 and Illumina sequence reads on the Archaea–Bacteria metagenome. Genomes of all included organisms served as the reference, and the plot displays, for each GC content level, the mean read coverage of 100 bp reference segments with that GC content. The overlapping shaded area represents the quantitative GC content distribution in the reference metagenome (moving 100 bp sequence segments); it is not scaled to the y-axis and is included only for comparing the shape of the distribution with the 454 and Illumina data.

B. Differential GC bias in metagenomic quantitative inferences between 454 and Illumina platforms. The three GC window intervals (27–40%, 40–60% and 60–70%) were used for pairwise t-test comparisons.
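The per-GC coverage summary described for Fig. S2A can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: per-base read depth is assumed to be already available as a list aligned to the reference sequence, windows are non-overlapping by default (the `step` parameter allows moving windows), and GC content is rounded to the nearest whole percent. The function name `coverage_by_gc` and the toy data are hypothetical.

```python
from collections import defaultdict


def coverage_by_gc(sequence, depth, window=100, step=100):
    """Mean read depth of fixed-size reference segments, grouped by segment GC%."""
    if len(sequence) != len(depth):
        raise ValueError("sequence and depth must have the same length")
    sums = defaultdict(float)
    counts = defaultdict(int)
    for start in range(0, len(sequence) - window + 1, step):
        segment = sequence[start:start + window].upper()
        # GC content of this segment, rounded to a whole percent (assumption).
        gc = round(100 * (segment.count("G") + segment.count("C")) / window)
        sums[gc] += sum(depth[start:start + window]) / window
        counts[gc] += 1
    # Average the segment means within each GC level.
    return {gc: sums[gc] / counts[gc] for gc in sorted(sums)}


# Toy reference: one GC-rich, one AT-rich and one balanced 100 bp segment.
reference = "GC" * 50 + "AT" * 50 + "GGAT" * 25
per_base_depth = [10] * 100 + [2] * 100 + [6] * 100
print(coverage_by_gc(reference, per_base_depth))
```

The number of segments at each GC level, held in `counts` inside the function, corresponds to the GC content distribution that the legend describes as the shaded area.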


Fig. S3. Depth of coverage by 454 and Illumina sequence reads of three genomes with low (Nanoarchaeum equitans, 31.6%), medium (Nitrosomonas europaea, 50.7%) and high (Salinispora tropica, 69.5%) average GC content. Each genome contains regions of GC content that depart significantly from the average value (e.g. in ribosomal RNA genes, in non-coding or repetitive regions). To enable overlapped representation of the coverage bias, the y-axis scale is in relative units, the absolute values being different depending on genome and sequencing platform.


Fig. S4. Taxonomic diversity composition of the Archaea–Bacteria community inferred by IMG/M and MG-RAST. The accuracy ratio was calculated between the percentage of sequences (454 or Illumina) assigned to individual phyla by the two analysis systems and the known quantitative distribution of those taxa in the community. The shaded region indicates a twofold accuracy window. Sequences with no assignments or assigned to non-present phyla were not taken into account.


Fig. S5. MEGAN-based analysis of taxonomic accuracy for 454 and Illumina metagenomes. (A) Heatmap of the accuracy ratio (observed abundance/expected abundance) at species (s), genus (g) and family (f) level. The heatmap uses the megablast and blastn output against three different databases: (Ref) only genomes of the synthetic community organisms; (All) all microbial genomes; and (XRef) microbial genomes excluding the synthetic community members. (B–D) Domain-level distribution of Illumina sequences that mapped to the reference sequences, after megablast or blastn searches against the XRef database, shown for each organism (B and C) or globally for each blast output (D). No hit indicates the percentage of sequences that were not mapped to any genome. Bacteria and Archaea represent the percentage of sequences that were correctly mapped to corresponding genomes. Other Bacteria and Other Archaea represent the fraction of sequences incorrectly mapped to genomes not present in the synthetic community.


Fig. S6. SSU rDNA primer pair sequence coverage map. The consensus for all the sequences in the synthetic community and the occasional differences observed for some taxa are illustrated. Nucleotide differences that correlate with observed sequencing bias are in red rectangles. For the various taxa and rRNA amplicons, filled squares indicate more than twofold overestimation (blue) or underestimation (red) for most or all species from that taxon. Isolated cases of bias are indicated by a triangle.


Fig. S7. Pairwise sequence identity levels between species/strains in different amplicons.


Fig. S8. See main article, Fig. 3 legend.


Fig. S9. Log-linear representation of the increased variability among technical replicates (standard deviation) as a function of measured organism abundance in the synthetic community, based on bacterial V13 amplicon data.


Table S1. List of organisms used for the synthetic community and their sources.


Table S2. Taxonomic distribution of Archaea–Bacteria community sequences based on MG-RAST and IMG/M analysis.


Table S3. List of SSU rRNA primers used for amplicon-based diversity characterization of the synthetic communities.


Table S4. Summary of 454 amplicon reads and the error rates before and after the different QC steps.


Table S5. PCR conditions used to generate the different SSU rRNA gene amplicons.

Please note: Wiley Blackwell is not responsible for the content or functionality of any supporting information supplied by the authors. Any queries (other than missing content) should be directed to the corresponding author for the article.