Multiple marker parallel tag environmental DNA sequencing reveals a highly complex eukaryotic community in marine anoxic water


  • Research Interests: T.S. and H.-W.B. are interested in the ecology, diversity and biogeography of marine protists, extremophile protists, anoxic marine systems and eukaryote evolution; D.B. works on microbial ecology and interactions, protist/fungal evolution and phylogenetics, studies on the phylum Cercozoa, and uniting phenotypic and genotypic data in environmental microbiology; M.N. works on applied bioinformatics of sequence data and algorithmic analyses of RNA secondary structure predictions; R.C. is interested in computational processing of high-throughput sequencing data; and finally, T.A.R. and M.D.M.J. work on the comparative genomics of eukaryotic microbes and the complexity and branching order of the eukaryotic tree of life.

Thomas A. Richards, Fax: +44 (0)1392 263434; Thorsten Stoeck, Fax: +49 631 205 2496; E-mail: stoeck@rhrk.uni-kl.de

Thorsten Stoeck and David Bass contributed equally.


Sequencing of ribosomal DNA clone libraries amplified from environmental DNA has revolutionized our understanding of microbial eukaryote diversity and ecology. The results of these analyses have shown that protist groups are far more genetically heterogeneous than their morphological diversity suggests. However, the clone library approach is labour-intensive, relatively expensive, and methodologically biased. Therefore, even the most intensive rDNA library analyses have recovered only small samples of much larger assemblages, indicating that global environments harbour a vast array of unexplored biodiversity. High-throughput parallel tag 454 sequencing offers an unprecedented scale of sampling for molecular detection of microbial diversity. Here, we report a 454 protocol for sampling and characterizing assemblages of eukaryote microbes. We use this approach to sequence two SSU rDNA diversity markers (the variable V4 and V9 regions) from 10 L of anoxic Norwegian fjord water. We identified 38 116 V4 and 15 156 V9 unique sequences. Both markers detect a wide range of taxonomic groups but in both cases the diversity detected was dominated by dinoflagellates and close relatives. Long-tailed rank abundance curves suggest that the 454 sequencing approach provides improved access to rare genotypes. Most tags detected represent genotypes not currently in GenBank, although many are similar to database sequences. We suggest that current understanding of the ecological complexity of protist communities, genetic diversity, and global species richness is severely limited by the sequence data hitherto available, and we discuss the biological significance of this high amplicon diversity.


Over the past 10 years culture-independent molecular techniques for detecting and elucidating microbial eukaryote communities have been used with increasing regularity across a broad range of terrestrial and aquatic environments, including many ‘extreme’ environments such as anoxic and deep sea habitats (e.g. Bass & Cavalier-Smith 2004; López-García et al. 2003; López-García et al. 2007; Moon-van der Staay et al. 2001; Richards & Bass 2005; Richards et al. 2005; Stoeck & Epstein 2003; Stoeck et al. 2003). These eukaryotic studies were based on earlier pioneering gene cloning techniques for the environmental detection and enumeration of prokaryotes (Medlin et al. 1988; Giovannoni et al. 1990; Lane 1991; Pace 1997). This approach involves PCR amplification of a marker gene, usually the small subunit ribosomal RNA gene (SSU rDNA), from DNA extracted from environmental samples. The amplicon products of this ‘environmental PCR’ are then ligated into linearized plasmid DNA, resulting in circular recombinant molecules that are then used to transform competent bacterial cells. Positive transformants are grown clonally, each replicating a single PCR amplicon, which is then purified and sequenced using the Sanger method (Sanger et al. 1977).

Results from clone library studies have shown that genetic diversity both within known protist taxa and representing new taxa is much greater than was previously suspected. This is because diversity assessment using clone library methods does not rely on traditional culturing techniques, which are highly selective and detect only a very small proportion of organisms in an environmental sample. The term ‘diversity’ is used here only in the sense of the quantification of unique types and does not imply relative abundance of different types. Furthermore, it is now clear that the degree of genetic diversity within morphologically similar strains (as a result of both morphological conservation and convergent evolution) is often very high (e.g. Bass et al. 2007; Howe et al. 2009). However, even the largest SSU rDNA clone library has not approached sampling saturation and fails to accurately characterize the diversity of eukaryotic microbes in an environmental sample (e.g. Edgcomb et al. 2002; Stoeck et al. 2006, 2007). These shortcomings are in part due to the massive genetic diversity of protists in most environments but also because of a number of methodological issues with the clone library approach. These methodological limitations include biases inherent in the plasmid ligation step leading to preferential ligation of shorter SSU rRNA gene homologues (e.g. Huber et al. 2009), and limitations imposed by the relatively high expense and labour-intensity of sequencing sufficient numbers of clones using the Sanger method. This systematic under-sampling of diversity limits the power of subsequent analyses, especially when comparing microbial diversity between environments or temporal samples.

The recent development of high-throughput sequencing technology based on micro-fabricated high-density picolitre reactors, known as 454 sequencing (Margulies et al. 2005), reduces these limitations by eliminating the ligation and transformation steps of the classical clone library approach (Sogin et al. 2006). Furthermore, 454 sequencing produces hundreds of thousands of sequences in a single run, which means that samples can be sequenced much more deeply and comprehensively at a fraction of the cost and effort (per sequence) compared to the classical approach.

It is well established that bacterial biodiversity in an ecosystem comprises relatively few actively growing and abundant taxa with many rare taxa that are dormant or less active. These rarer taxa are difficult or impossible to detect by clone library-based molecular techniques (Pedrós-Alió 2006). Sogin et al. (2006) used the first generation of 454 sequencing to show that the SSU rDNA diversity of prokaryotes in marine habitats was one or two orders of magnitude greater than previously reported, and that most of this diversity was represented by rare sequence types (tags) detected at low frequency. The authors referred to this long tail of amplicon diversity as the ‘rare biosphere’ (Sogin et al. 2006). Although rank abundance ‘tails’ of rare taxa are a well-known feature of some microbial eukaryote communities (e.g. Caron 2009; Countway et al. 2005; Howe et al. 2009) it is currently unclear to what extent microbial eukaryotes show this trend.

Based on previous 454 environmental diversity surveys of bacterial communities it is reasonable to assume that microbial eukaryotes may also be many times more diverse than clone libraries have recorded, and that 454 sequencing would provide a much more realistic and less biased estimate of their total diversity and community structure. Therefore, we used GS-FLX 454 sequencing technology to recover 602 070 ∼250 bp (±50 bp) sequence tags from a 10-L sample of anoxic water from the redoxcline of a Norwegian fjord. This fjord is characterized by a stable stratification and a permanently anoxic water column with steep physico-chemical gradients below the redoxcline at 18 m depth. We chose this site as an extensive SSU rDNA clone library sequence collection and associated metadata are available from previous studies (Behnke et al. 2006). Using primers targeting a wide diversity of eukaryotic taxonomic groups, we amplified the V4 and V9 regions of the SSU gene, generally the most variable regions of this well-characterized ribosomal RNA-encoding gene (Van de Peer et al. 1997; Wuyts et al. 2000). This PCR strategy was designed to generate relatively short amplicons suitable for 454 sequencing technology and to maximize amplicon diversity (Huber et al. 2009). As variation in the distant V9 region does not exactly track that in the V4, the use of both regions provides two partially independent calibration points for investigating eukaryotic diversity within our sample.

Materials and Methods

Sampling sites and collection procedure

A 10-L water sample was collected from the central basin of the Framvaren Fjord located in Southwest Norway (58.09°N, 06.45°E) from the oxic/anoxic interface at 20 m depth in May 2008. No oxygen was detectable at the point of collection so the sample was judged anoxic. The sampling protocol as well as the protocols for measurement of physico-chemical and biological parameters are in accordance with those described elsewhere (Behnke et al. 2006). In short, samples were taken with Niskin bottles and drawn onto 0.45 μm Durapore membranes (Millipore) under anoxic conditions with no prefiltration step. Samples were frozen immediately in liquid nitrogen until further processing in the laboratory. Environmental parameters at the time of sampling are shown in Table 1.

Table 1. SSU rDNA data and environmental metadata of the Norwegian Framvaren Fjord anoxic water sample

Sequence data                                      V4                  V9
N 454 reads, total (a)                             271 197             330 873
Total (unique) high-quality (b) tags               254 521 (39 396)    270 454 (17 190)
Total (unique (c)) protist tags including fungi    245 935 (38 116)    264 008 (15 156)
Total (unique) Metazoa tags                        7627 (662)          3455 (1373)
Total (unique) Archaea tags                        0 (0)               152 (69)
Total (unique) Bacteria tags                       5 (4)               1610 (312)
Total (unique) unassignable (d) tags               954 (587)           1229 (280)

Environmental parameter                            Value
Temperature (°C)                                   10.7
Depth (m)                                          20
Salinity (pss)                                     27
Nitrate (μmol/L)                                   nd
Ammonium (μmol/L)                                  0.22
O₂ (μmol/L)                                        nd
H₂S (μmol/L)                                       nd
Bacteria (×10⁶ cells/mL)                           4.6
Bacterial production (³H-Leu, mg/m²/day)           670
Chlorophyll a (μg/L)                               1.22
DNA concentration (ng/μL)                          160
Water volume sampled                               10 L
Sampling date                                      May 2008

Physico-chemical parameters were determined as described elsewhere (Behnke et al. 2006; Stoeck & Epstein 2003). nd, not detectable.

(a) Sequences deposited in the short read archive of GenBank, under the accession number SRA009779.1 (all V9 tags are available in the SRP001150 accession file and all V4 tags are available in the SRP001151 accession file).

(b) Defined as sequence reads > 100 bp, no ambiguous or unidentified nucleotides, complete and correct PCR-primer sequence, correct calibration key sequence.

(c) Data sets provided as Supporting Data.

(d) At 80% BLAST similarity.

DNA isolation, PCR amplification, and 454 sequencing

DNA was isolated from the environmental sample (chloroform-isoamyl extraction with a high-salt buffer) and quality-checked as described previously (Stoeck & Epstein 2003). PCR amplification and pyrosequencing principally followed the protocol of Sogin et al. (2006). The two PCR primer pairs we used target the variable V9 and V4 regions of the eukaryote SSU rRNA gene. Highly conserved priming sites flank these two variable regions, making them well-suited for the development of PCR primers targeting a maximally broad phylogenetic range of eukaryotic organisms. For V9, the primers were 1391F (5′-GTACACACCGCCCGTC-3′; Lane 1991; S. cerevisiae NCBI GenBank nucleotide database accession U53879, position 1629–1644) and EukB (5′-TGATCCTTCTGCAGGTTCACCTAC-3′; Medlin et al. 1988; S. cerevisiae position 1774–1797). The forward primer is a highly conserved three domain-level primer, thus risking the recovery of non-eukaryotic tags even when used in combination with a eukaryote-specific PCR primer (Stock et al. 2009). However, the highly conserved nature of this 1391F primer appears to recover more higher-taxon groups compared to an alternative primer targeting the same gene region but designed for eukaryotes only (Amaral-Zettler et al. 2009).

For V4, the newly-designed primers were TAReuk454FWD1 (5′-CCAGCA(G/C)C(C/T)GCGGTAATTCC-3′, S. cerevisiae position 565-584) and TAReukREV3 (5′-ACTTTCGTTCTTGAT(C/T)(A/G)A-3′, S. cerevisiae position 964-981). These were designed to anneal to conserved regions adjoining the 5′ and 3′ region of the V4 rRNA loop and were identified by manual inspection of an alignment of over 1000 eukaryote SSU rDNA sequences from all major taxonomic groups and including environmental clone sequences. The primers identified were generally suitable for all eukaryotes with some exceptions in the excavates and the microsporidia. The primer sites were chosen to minimize amplicon length (Huber et al. 2009) but still encompass the whole V4 region. Because the choice of primer sites meeting these conditions was limited the reverse primer was 2 bp shorter than the forward primer and had a considerably lower annealing temperature (because of length and GC content differences). Therefore, to maximize specificity to the V4 region during thermo-cycling we designed a two-step PCR approach involving 10 cycles where only the forward primer would operate and the V4 template population would be increased by forward priming, followed by 25 cycles of PCR at a lower annealing temperature where both forward and reverse primers would amplify. Details of the thermo-cycling protocol are given below.
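The length and GC-content difference between the two V4 primers can be illustrated with a short sketch. It expands the degenerate positions written as (X/Y) in the primer sequences above and applies the Wallace rule of thumb (2 °C per A/T, 4 °C per G/C) as a rough melting-temperature proxy. The Wallace rule and the function names are assumptions made for illustration; the text does not state how annealing temperatures were estimated.

```python
import itertools
import re

# V4 primer sequences as given in the text; (X/Y) marks a degenerate position.
FWD = "CCAGCA(G/C)C(C/T)GCGGTAATTCC"   # TAReuk454FWD1
REV = "ACTTTCGTTCTTGAT(C/T)(A/G)A"     # TAReukREV3


def expand_degenerate(primer):
    """Return every concrete sequence encoded by a primer with (X/Y) positions."""
    parts = re.findall(r"\(([^)]+)\)|([ACGT])", primer)
    choices = [alt.split("/") if alt else [base] for alt, base in parts]
    return ["".join(combo) for combo in itertools.product(*choices)]


def wallace_tm(seq):
    """Rough melting temperature (deg C) by the Wallace rule: 2*(A+T) + 4*(G+C)."""
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return 2 * at + 4 * gc


fwd_variants = expand_degenerate(FWD)
rev_variants = expand_degenerate(REV)
fwd_tm = [wallace_tm(s) for s in fwd_variants]
rev_tm = [wallace_tm(s) for s in rev_variants]

print("forward:", len(fwd_variants[0]), "nt, Tm range", min(fwd_tm), "-", max(fwd_tm))
print("reverse:", len(rev_variants[0]), "nt, Tm range", min(rev_tm), "-", max(rev_tm))
```

Under this crude estimate every reverse-primer variant melts well below every forward-primer variant, which is the situation the two-step cycling scheme is designed to handle.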

Both the primers and thermo-cycling protocol were tested and optimized on a range of marine and freshwater DNA samples using standard clone libraries and Sanger sequencing to identify protocols generating maximum eukaryotic sequence diversity. 454 Life Science’s A (5′-GCCTCCCTCGCGCCATCAG-3′) and B (5′-GCCTTGCCAGCCCGCTCAG-3′) sequencing adapters were added on to the primer sequences at the 5′ end of the forward and reverse primers respectively. With each primer pair we ran 10 independent 50 μL PCR reactions with reaction mix consisting of 5U of ProofStart high fidelity taq polymerase (Qiagen), 1× ProofStart reaction buffer, 200 μM dNTPs, 0.5 μM final concentration of each primer and 3–10 ng environmental genomic DNA as template.

The PCR protocol for V9 amplification employed an initial activation step at 95 °C for 5 min, followed by 30 three-step cycles consisting of 94 °C for 30 s, 57 °C for 45 s, and 72 °C for 1 min; and a final 2-min extension at 72 °C. The V4 amplification comprised an initial activation step at 95 °C for 5 min, followed by 10 three-step cycles consisting of 94 °C for 30 s, 57 °C for 45 s, and 72 °C for 1 min, which was followed by 25 further cycles consisting of 94 °C for 30 s, and varying annealing temperatures, each for two of the eight samples (45, 47, 48, and 49 °C) for 45 s, and 72 °C for 1 min; and a final 2-min extension at 72 °C. PCR products from 10 separate amplifications using the same primer pair were pooled and cleaned by using the MinElute PCR purification kit (Qiagen).

The quality of the products was assessed on a Bioanalyzer 2100 (Agilent) using a DNA1000 LabChip. Only sharp, distinct amplification products with a total yield of >200 ng were used for 454 sequencing. 454 sequencing was conducted externally by the University of Liverpool Advanced Genomics Facility using the standard 454 amplicon sequencing protocol for pyrosequencing on a Genome Sequencer FLX system (Roche). All V4 tags were sequenced from the forward primer (5′-end) and all V9 tags from the reverse primer (3′-end). In total we recovered 271 197 V4 sequence reads (tags) and 330 873 V9 tags that were then subjected to quality controls. We deposited all 454 sequences in GenBank (accession numbers SRP001150 for the V9 tags and SRP001151 for the V4 tags).

Bioinformatic removal of low-quality sequence tags

All tags that met any of the following conditions were considered ‘low quality’ and removed from further analyses: sequences <100 nucleotides, inaccurate calibration key, incomplete or wrong primer sequence, or presence of an ambiguous nucleotide or N (unidentified nucleotide).
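The filtering criteria above can be sketched as a single predicate. This is a minimal illustration, not the pipeline actually used: the read layout (key, then primer, then tag), the key value "TCAG", the choice to measure length after trimming, and the function name are all assumptions. The primer shown is EukB from the text.

```python
VALID_BASES = set("ACGT")


def is_high_quality(read, key, primer, min_length=100):
    """Return True if a raw read passes all four quality criteria.

    Assumed layout of a raw read: calibration key + primer + tag.
    """
    if not read.startswith(key):          # inaccurate calibration key
        return False
    body = read[len(key):]
    if not body.startswith(primer):       # incomplete or wrong primer
        return False
    tag = body[len(primer):]
    if len(tag) < min_length:             # too short (measured after trimming; an assumption)
        return False
    return set(tag) <= VALID_BASES        # no N or other ambiguity codes


KEY = "TCAG"                               # hypothetical key sequence, for illustration only
PRIMER = "TGATCCTTCTGCAGGTTCACCTAC"        # EukB, as given in the text

good_read = KEY + PRIMER + "ACGT" * 30
short_read = KEY + PRIMER + "ACGT" * 10
ambiguous_read = KEY + PRIMER + "ACGN" * 30
bad_key_read = "TCAA" + PRIMER + "ACGT" * 30

for name, read in [("good", good_read), ("short", short_read),
                   ("ambiguous", ambiguous_read), ("bad key", bad_key_read)]:
    print(name, is_high_quality(read, KEY, PRIMER))
```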

Identification of unique sequence tags (dereplication) and similarity clustering

Unique tags were identified in two steps. In the first step we compared all tags to each other and identified those that never occurred as a substring of another tag (i.e. identical in primary structure but not in length); these were called unique candidates. In the second step, the unique candidates were checked for replicates (identical in length and primary structure), yielding one representative of each unique tag. For each unique tag the number of sequences occurring either as a substring or as a replicate was counted. The gathered information was stored in an SQL database (MySQL 5.0.24-community via TCP/IP,
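The dereplication logic can be sketched in a few lines. This is a simplified illustration under stated assumptions: exact replicates are collapsed first and substrings second (the reverse of the order described, which gives the same representatives), the all-against-all comparison is quadratic and only suitable for small inputs, and a tag that is a substring of several unique tags is credited to the first one found. The function name is hypothetical.

```python
from collections import Counter


def dereplicate(tags):
    """Map each unique tag to the number of reads it represents.

    A read is represented by a unique tag if it is an exact replicate of it
    or occurs within it as a substring.
    """
    counts = Counter(tags)                                  # collapse exact replicates
    by_length = sorted(counts, key=len, reverse=True)       # longest first
    unique = {}
    for tag in by_length:
        # Look for an already accepted (longer) tag that contains this one.
        host = next((u for u in unique if tag in u), None)
        if host is None:
            unique[tag] = counts[tag]                       # a new unique tag
        else:
            unique[host] += counts[tag]                     # absorbed as a substring
    return unique


example = ["ACGTACGT", "ACGTACGT", "GTAC", "TTTT", "CGTA"]
print(dereplicate(example))
```

Sorting by length first means a tag can only ever be absorbed by a tag that has already been accepted, so no second pass is needed.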

To investigate the clustering boundaries of ‘unique’ environmental sequences we clustered both the V4 and V9 datasets at 99%, 98%, and 97% identity using the DNA sequence-editing program Sequencher (Gene Codes) and plotted the tag diversity at these three clustering boundaries alongside the 100% (unique tag) level.
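Sequencher's clustering algorithm is not described here, so the following is only a stand-in that shows what clustering at a percent-identity boundary means. It uses greedy centroid clustering with identity defined as 1 minus the edit distance divided by the longer sequence length. Both of those choices, and all function names, are assumptions for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def meets_identity(a, b, percent):
    """True if identity between a and b is at least `percent` (integer arithmetic)."""
    longest = max(len(a), len(b))
    return (longest - edit_distance(a, b)) * 100 >= percent * longest


def greedy_cluster(tags, percent):
    """Assign each tag to the first centroid it matches, else start a new cluster."""
    clusters = {}
    for tag in tags:
        for centroid in clusters:
            if meets_identity(tag, centroid, percent):
                clusters[centroid].append(tag)
                break
        else:
            clusters[tag] = [tag]
    return clusters


base = "ACGT" * 25                                   # 100-nt toy tag
one_diff = "T" + base[1:]                            # 1 substitution (99% identity)
three_diff = "".join("T" if i in (0, 40, 80) else c  # 3 substitutions (97% identity)
                     for i, c in enumerate(base))
toy_tags = [base, one_diff, three_diff]

for percent in (100, 99, 98, 97):
    print(percent, len(greedy_cluster(toy_tags, percent)))
```

Greedy clustering depends on input order, so in practice tags would usually be sorted (for example by abundance) before clustering.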

Taxonomic assignment

We assigned taxonomy to each tag by conducting blastn searches (using parameters -m 7 -r 5 -q -4 -G 8 -E 6 -b 50) of each unique tag against a local installation of NCBI’s nucleotide database (nr/nt, release 168). Only unique tags with a best blast hit of at least 80% sequence similarity were assigned to a taxonomic category, giving a reliable assignment at least at the class level (data not shown). The remaining tags were assigned to an artificial category ‘others’. This information was stored in our SQL database. To investigate the taxonomic diversity within the dinoflagellates (the taxon group most extensively sampled) we extracted all sequences that had the closest blast hit above 80% to a named dinoflagellate sequence. Based on this same taxonomic assignment protocol we could identify the species that is closest to our tag. However, for reliability we only assigned taxonomic labels to the class level.
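The 80% threshold rule can be sketched as follows. The input structure (a mapping from tag identifier to its best hit as a taxon label and percent identity, or None when there is no hit), the tag identifiers, and the function name are hypothetical. Parsing of the actual blastn output is not shown.

```python
def assign_taxonomy(best_hits, min_identity=80.0):
    """Assign each tag to its best-hit taxon, or to 'others' below the threshold.

    best_hits: dict mapping tag id -> (taxon, percent_identity) or None.
    """
    assignments = {}
    for tag_id, hit in best_hits.items():
        if hit is not None and hit[1] >= min_identity:
            assignments[tag_id] = hit[0]
        else:
            assignments[tag_id] = "others"    # artificial category from the text
    return assignments


# Hypothetical best hits for three tags.
example_hits = {
    "tag_0001": ("Dinophyceae", 97.5),
    "tag_0002": ("Chlorophyta", 79.9),
    "tag_0003": None,
}
print(assign_taxonomy(example_hits))
```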


Results

Using the methods described above we identified 16 676 (6.1% of total) low quality V4 reads and 60 419 (18.3%) low quality V9 reads leaving 254 521 V4 and 270 454 V9 tags for further analysis. Using blastn we identified all tags from bacteria, archaea, and metazoa using an 80% similarity-gathering threshold (Table 1), and excluded these from the rest of the analyses, leaving 245 935 V4 and 264 008 V9 tags. The datasets were then clustered by grouping identical tags, resulting in 38 116 unique V4 and 15 156 unique V9 tags of average length 270 and 200 bp respectively (unique sequences are provided in two separate files as Supporting Data).

Investigating breadth and depth of sequence tag diversity

To investigate completeness of our sampling strategy we plotted accumulation curves of unique tags against total number of tags recovered (Fig. 1). These analyses showed that the accumulation of unique tags does not reach saturation even after ∼250 000 sequence reads; this is particularly evident in the V4 sampling analysis. Rank abundance analyses of protist and fungal data alone (Fig. 2) show a similarly high proportion of rare tags (singletons) in both the V4 (75%; 28 550 singleton tags) and V9 (68%; 10 359 singleton tags) datasets.
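The rank abundance and singleton calculations can be sketched from a table of per-tag read counts. The function names and the toy counts are illustrative assumptions. The final lines simply recompute the singleton percentages from the tag counts reported in the text.

```python
def rank_abundance(counts):
    """Return read counts per unique tag, sorted from most to least abundant."""
    return sorted(counts.values(), reverse=True)


def singleton_fraction(counts):
    """Fraction of unique tags that were observed exactly once."""
    singletons = sum(1 for n in counts.values() if n == 1)
    return singletons / len(counts)


# Toy example: five unique tags, three of them singletons.
toy_counts = {"tagA": 120, "tagB": 7, "tagC": 1, "tagD": 1, "tagE": 1}
print(rank_abundance(toy_counts))
print(singleton_fraction(toy_counts))

# Singleton percentages recomputed from the counts reported in the text.
v4_percent = 100 * 28550 / 38116
v9_percent = 100 * 10359 / 15156
print(round(v4_percent), round(v9_percent))
```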

Figure 1.

 Accumulation curves of V4 and V9 sequence tags. These curves illustrate sampling saturation of unique sequences (and therefore do not control for possible sources of error) and include only fungal and protist sequence tags.

Figure 2.

 Ranked abundances of unique eukaryotic SSU rDNA tags in 10 L of anoxic fjord water.

Community diversity

To show the sequence diversity in the sample when the unique tags are clustered at different levels of similarity we clustered the tags at 99%, 98%, and 97% sequence similarity (Fig. 3). The higher levels of diversity indicated by V4 tags relative to V9 at each similarity level are consistent with the principle that the V4 is the most variable SSU region (Wuyts et al. 2000), and perhaps also with the V4 being more prone to sequencing errors than V9 (discussed below). Particularly striking is the differential increase in diversity detected by V4 between 99% (14 098 tags) and 100% (38 116 tags) identity: almost two-thirds of the diversity (24 018 tags) indicated by the V4 is reflected by 1% (two-nucleotide) difference between tags (Fig. 3).

Figure 3.

 Clustering profiles for the V4 and V9 datasets at 97%, 98%, 99%, and 100% similarity. The difference in the number of unique sequence clusters between the V4 and V9 tags is especially evident when the V4 dataset is clustered at 99% (2 nucleotide differences) in comparison to 100% similarity. The estimated combined PCR nucleotide misincorporation and sequencing mis-read error rate in our data is 0.2503% and is illustrated by a dashed line. The 98% cluster group diversity estimates are given by arrows.

Taxonomic composition of the anoxic marine eukaryotic microbial community

We used a custom-built taxonomic assignment pipeline (see Materials and Methods) to investigate the taxonomic composition of the target community. To compare our 454 methods with a standard clone library method we plotted taxonomic diversity of the V4 and V9 sequence matches against the taxonomic diversity of a 1000-sequence clone library dataset. This dataset was generated from three clone libraries sampled at three different times/seasons; each sampling time point was conducted using four different primer sets, each of which targeted nearly the full length SSU rDNA sequence (Behnke et al. 2006 and Behnke & Stoeck, unpublished). Fig. 4a shows all taxonomic groups accounting for ≥1% of the diversity of the 454 tags and Fig. 4b shows groups accounting for <1% of the diversity. Of the 36 higher taxonomic groups listed in the 454 dataset, clone libraries detected only 14 taxonomic groups, confirming that microbial eukaryote diversity in the fjord water is considerably more complex and inclusive of divergent evolutionary groups than suggested by the clone library approach and that this diversity can only be accessed by deep sequencing as afforded by the 454 approach.

Figure 4.

 Taxonomic diversity profile of V4 and V9 454 tags clustered at 100% and 98% similarity in comparison with a clone library-derived diversity profile, clustered at 98%. (a) Higher rank taxon groups that were represented by a proportion ≥1% of all unique tags in at least one of the two sets of amplicons used for 454 sequencing are shown. The category ‘others’ denotes tags that could not be assigned to a taxonomic entity based on an 80% blastn similarity threshold and tags which fell into defined taxon groups but were represented by <1% of the diversity of unique tags in either of the two PCR amplicon libraries used for 454 sequencing (Fig. 4b). Only 0.42% of V4 tags and 0.46% of V9 tags could not be assigned to a taxon group using the 80% blast similarity threshold (Table 1). Clustering at 98% did not substantially change the taxonomic distribution of tags. (b) Higher rank taxon groups that were present in at least one of the two PCR amplicon libraries used for 454 sequencing or the clone libraries but that were represented by <1% of unique tags. It appears that although the clone library was subjected to relatively large sequencing effort (for clone library studies; c. 1000 clones), this approach missed a substantial number of higher taxonomic groups, most of which were only recovered in low abundance within 454-derived tags. The V9 amplicon library retrieved a greater number of higher rank taxonomic groups than the V4 amplicon library. CL = clone library-derived data (from Behnke et al. 2006 and Behnke and Stoeck, unpublished).

Furthermore, although there are some differences between the diversity profiles of the V4 and V9 regions, they are far more similar to each other than to the results of the clone library analysis. The taxonomic group showing the highest sequence diversity in the 454 data is the dinoflagellates, for both V4 and V9 tags. In the V4 dataset dinoflagellates represented 10.6% (4180) of all unique tags clustered at 100% identity and 12.3% (575) of tags clustered at 98%. For V9 the corresponding figures are 30.1% (5170) for 100% similarity clusters and 29.2% (789) for 98% similarity clusters. However, the V4 and V9 markers detected very different diversity profiles within the dinoflagellates (Fig. 5). V4 taxon diversity was dominated by Gymnodiniaceae (63.3%; 2021 unique tags) and Heterocapsaceae (28.2%; 901 unique tags), while V9 taxon diversity was dominated by Prorocentraceae (63.6%; 862 unique tags). Possible explanations of these differences are given in the Discussion.

Figure 5.

 Comparison of taxonomic distributions of V4 and V9 tags within dinoflagellates. The 454 sequencing results demonstrated that the dinoflagellates were the most diverse taxonomic group sampled (Fig. 4a). The V4 and V9 primer pairs detect very different taxonomic profiles, further confirming the extensive diversity of dinoflagellate sequences within the sample.


Discussion

Potential sources of sequence error and estimating community diversity

We detected 38 116 and 15 156 high quality unique V4 and V9 sequence tags respectively. Of these, 75% of the V4 and 68% of the V9 tags were detected only once. These proportions may directly represent very large fractions of rare genotypes. However, it is necessary to take into account three potential sources of error which may artificially inflate the apparent levels of diversity detected: the combined effects of nucleotide misincorporation and read errors during PCR and sequencing, PCR chimaera formation, and intragenomic polymorphism among multiple copies of the rRNA cistron within a single nucleus.

The estimated nucleotide misread error rate in our data including the Taq polymerase error (2.34–3.6 × 10⁻⁶ (Qiagen Technical Service, personal communication)) and the 454 sequencing accuracy (99.75% with low quality reads removed (Huse et al. 2007)) equals a combined error rate of 0.002503 substitutions per site (or 0.2503%). This equates to a minimum of 0.676 and 0.501 nucleotide changes due to error per V4 and V9 sequence read (of 270 and 200 bp average length) respectively. Fig. 3 shows diversity estimates corrected for the minimum expected error rate. However, the 454 sequencing error rates can be higher in gene regions where homopolymers (runs of the same nucleotide) occur. Such errors include mis-counting the number of nucleotide repeats and incorporation of an erroneous base after the homopolymer run (Huse et al. 2007).
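The arithmetic behind these figures can be reproduced as a short worked example. The text does not state how the two error sources were combined; simple addition with a polymerase term of about 3 × 10⁻⁶ (a value inside the quoted range, assumed here) reproduces the reported 0.002503. Variable names are illustrative.

```python
# Assumed polymerase error: a value inside the quoted range 2.34e-6 to 3.6e-6.
taq_error = 3.0e-6
# 454 per-base error implied by the quoted 99.75% accuracy.
sequencing_error = 1 - 0.9975

# Assumption: the two sources are simply added (not stated in the text).
combined_error = taq_error + sequencing_error

# Expected erroneous positions per read at the average tag lengths.
v4_errors_per_read = combined_error * 270
v9_errors_per_read = combined_error * 200

print(f"combined error rate: {combined_error:.6f} per site")
print(f"V4: {v4_errors_per_read:.3f} expected errors per 270-bp read")
print(f"V9: {v9_errors_per_read:.3f} expected errors per 200-bp read")
```

The sequencing term dominates by roughly three orders of magnitude, so the choice of polymerase value within the quoted range barely affects the result.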

As the V4 region is longer than the V9 region and the eukaryotic V4 region incorporates a very large diversity of secondary structure types (Wuyts et al. 2000) it is likely that this region encompasses a higher number of homopolymers than the V9 region. Indeed, an analysis of 35 040 eukaryotic V4 regions extracted from 18S rDNA sequences and 3640 V9 regions deposited in GenBank (a relatively low number of these SSU rDNA sequences include the 3′-end of the gene) revealed that the mean number of homopolymers per sequence (4 or more of the same nucleotide in a row) is 6.8 times higher in the V4 region compared to the V9 region (Christen and Stoeck, unpublished; data available upon request from the authors). Therefore, the V4 sequences sampled are likely to be more prone to 454 homopolymer read errors than the V9 sequences. It is thus possible that some of the increased diversity detected in the V4 dataset at the 100% clustering boundary is the product of an increased homopolymer error rate. Because the homopolymer error rate is impossible to accurately predict on such a large sample of novel sequences, it is difficult to distinguish genuine biological variation from experimental artefact. We therefore recognize the need to adjust estimates of diversity and taxonomic profile in light of this, and other possible sources of error (see below).

Chimaera formation has been shown to be an important source of error in clone library analyses that has led to over-estimation of taxonomic novelty and diversity (Berney et al. 2004; Hugenholtz & Huber 2003; von Wintzingerode et al. 1997). However, our protocol amplifies relatively short DNA sequences (<500 bp) in comparison to Sanger sequence reads from standard clone library approaches (>1 kb), which theoretically minimizes the incidence of chimaera formation (Cronn et al. 2002).

Degrees of sequence variation among copies of the SSU rDNA gene within individual nuclei (intragenomic polymorphism) can vary considerably among taxa despite the homogenizing effects of concerted evolution (Arnheim et al. 1980): unequal crossing over and/or gene conversion (Dover 1982). Larger cells (with larger nuclei and therefore more rRNA cistron copies per nucleus (Zhu et al. 2005)) may have higher levels of intragenomic polymorphism than smaller cells, as has been observed in large-celled foraminifera (Pawlowski et al. 2002). The taxonomic groups producing the highest number of unique tags in our data, dinoflagellates and chlorophytes, contain many large-celled taxa, and have been shown to contain many rRNA cistron copies (Zhu et al. 2005). The same is true for diatoms, in which intragenomic SSU polymorphism levels of c. 0.5%—2% have been reported (e.g. Alverson & Kolnick 2005). Conversely, other taxa show very low levels of intragenomic polymorphism, e.g. some species of Cercomonas (Bass et al. 2007). It is therefore not possible to make a single estimate of intragenomic polymorphism level within our sample. However, the %-cluster plot (Fig. 3) enables us to identify and plot a minimum boundary for PCR/sequencing error and estimate diversity accounting for previously observed intra-nuclear polymorphism levels.

We suggest that the 2135 V4 and 1224 V9 types recovered at 97% similarity (which is commonly used to define operational taxonomic units (OTUs) in bacteria (Pedrós-Alió 2006)) are too conservative an estimate of taxon diversity in microbial eukaryotes, as there is growing evidence that the SSU in many protozoa does not evolve quickly enough to resolve species (or OTU-equivalent) boundaries. Numerous lines of evidence now suggest that variation in the faster-evolving ITS regions provides a better marker for speciation in many groups (Strüder-Kypke et al. 2006; Amato et al. 2007; Bass et al. 2007; Coleman 2007; Chantangsi & Lynn 2008; Bass et al. 2009). However, as discussed above, we cannot accurately distinguish between the most closely related SSU-types in this study because of the minimum combined effects of PCR/sequencing error, untraceable intragenomic polymorphism, and the undetectable incidence of PCR recombination (chimaeras) within our 200–270 bp sequence reads. Therefore, it is important to consider clustering of tags below the 100% level in order to identify diversity estimates and taxonomic profiles where these sources of error are minimized.

Clustering at the 98% level would allow for an average of four errors within the V9 sequence reads and five errors within the V4. We therefore conducted taxonomic assignment both at the 100% level and the 98% level in order to identify the taxonomic profile of this community accounting for the possible sources of error and the possibility that errors would have a higher rate of representation among the V4 sequences (Fig. 3). This suggests a minimum tag diversity of 3993 V4 tags and a minimum of 2633 V9 tags within the 10 L sample of anoxic marine water. Such deep sampling is critical for enabling biologically meaningful and statistically robust comparisons between communities and addressing fundamental issues in ecology including true richness, evenness of microbial communities, biogeography, and community dynamics. Progress in these areas has been limited because of the unavoidably high levels of under-sampling in clone library analyses (Ramette et al. 2005; Green & Bohannan 2006; Hughes Martiny et al. 2006). Consequently, technologies equivalent to the 454 methods reported here are required for investigating these important ecological subjects.

Taxonomic composition of the anoxic marine community

A key issue to resolve is whether the caveats described above radically affect the taxonomic distribution and diversity patterns shown in Figs 3 and 4a, b. The striking differential increase in diversity suggested by the V4 and V9 tags between 99% and 100% identity (Fig. 3) may be caused by a higher incidence of homopolymers in the longer and structurally more complex V4 region combined with the deterioration of sequencing accuracy towards the end of the V4 read. At 100% identity the two most diverse groups are dinoflagellates and chlorophytes. However, even at 98% similarity clustering, these two groups remain the most diverse: dinoflagellates account for 575 (12.3%) V4 and 789 (29.2%) V9 tags, and chlorophytes account for 244 (5.2%) V4 and 104 (3.8%) V9 tags. Fig. 4a shows that the distribution of the sequence tags across the taxonomic profile remains relatively similar between the datasets clustered at 98% and 100% identity.

Across our complete eukaryote dataset, the V4 and V9 tags independently produce broadly similar taxonomic profiles (Fig. 4a). Obtaining this result with two separate PCR primer pairs strongly suggests that it is not an artefact of primer bias. Within the dinoflagellates, however, there is evidence of detection bias between V4 and V9, as the two sets of amplicons represent different subsets of dinoflagellate diversity (Fig. 5). This may be caused by the different primer sequences preferentially detecting different dinoflagellate subgroups. In this case, using two separate markers would reveal more diversity than either on its own. The alternative explanation is that the sets of dinoflagellate taxa represented in GenBank by V4 and V9 SSU regions only partially overlap, which would artefactually lead to apparently different taxa being detected. In support of this explanation, GenBank contains many more V4 sequences than V9 sequences (discussed above). The reality is likely to be a combination of these factors.

Interestingly, for some groups (including dinoflagellates) the diversity detected by V9 is greater than that detected by V4. This may be because the V9 primers are better suited to amplifying genotypes within those groups, or because in these groups, as in some cercomonads (Bass et al. 2009), the V4 is not always the most variable SSU region. We compared the variability within these two regions in 7503 pairwise comparisons using publicly available dinoflagellate SSU rDNA sequences that include both of these gene regions and showed that the V4 region in dinoflagellates is generally less variable than the V9 region (Christen & Stoeck, unpublished; data available from the authors upon request). Dinoflagellates and their close relatives are increasingly recognized as important primary producers, symbionts, parasites (of each other as well as other organisms), and heterotrophic grazers in marine systems (Baillie et al. 2000; Chambouvet et al. 2008; Gast & Caron 1996; Guillou et al. 2008; Knowlton & Rohwer 2003; Rowan & Powers 1992). Our results suggest they represent a highly complex community in anoxic marine habitats; the roles of these diverse genotypes in such ecosystems deserve detailed study.
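A pairwise comparison of regional variability of the kind described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the unpublished analysis itself: it assumes the sequences for each region are already aligned, uses the uncorrected proportion of differing sites (p-distance) as the variability measure, and relies on hypothetical toy alignments and function names.

```python
from itertools import combinations
from math import comb


def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences, skipping gap columns."""
    sites = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not sites:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in sites) / len(sites)


def mean_pairwise_distance(aligned_seqs):
    """Mean p-distance over all unordered pairs of aligned sequences."""
    distances = [p_distance(a, b) for a, b in combinations(aligned_seqs, 2)]
    return sum(distances) / len(distances)


# Hypothetical toy alignments of the same three taxa for each region.
toy_v4 = ["ACGTACGT", "ACGTACGA", "ACGTACGT"]
toy_v9 = ["ACGTAC", "ACCTAA", "TCGTAC"]

print("V4 mean p-distance:", round(mean_pairwise_distance(toy_v4), 3))
print("V9 mean p-distance:", round(mean_pairwise_distance(toy_v9), 3))

# n sequences give n * (n - 1) / 2 unordered pairs.
print("pairs among 123 sequences:", comb(123, 2))
```

As a point of arithmetic, 7503 is the number of unordered pairs among 123 sequences (123 × 122 / 2), although the number of sequences used in the comparison is not stated here.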

Comparison of the V4 and V9 SSU markers for understanding eukaryotic community diversity

The 454 V9 sequencing protocol detected a wider range of higher taxonomic groups than the V4 protocol (Fig. 4b), although groups not detected by V4 were at low levels in the V9 data. This suggests that the V9 provides a more comprehensive overview of the community across all eukaryotes, whereas V4 more powerfully differentiates between closely related strains within a less inclusive set of higher-level taxa. However, this observation is based on a very unbalanced representation of V4 and V9 sequences in existing databases (discussed above). Therefore, this conclusion may require revision as database taxonomic sampling of the eukaryotic SSU rDNA increases. It is also possible that V4 detects fewer of the higher-level taxonomic groups than V9 because the higher sampling of ‘microdiversity’ within some groups reduces the capacity for the detection of others. With deeper sequencing of the pool of V4 amplicons and improved database taxon sampling for both the V4 and V9 regions, our protocol may detect a comparable set of higher-level taxonomic groups for both markers. Nevertheless, it is clear that these two molecular markers have different advantages depending upon the taxonomic group and the particular biological question being studied.

Novel taxon diversity

The majority of tags identified in our study (71% and 40% for V4 and V9 respectively) matched unnamed environmental sequences in GenBank rather than previously identified species. While this is to some extent a consequence of the increasingly large representation of environmental sequences in GenBank, it also demonstrates that a vast proportion of eukaryotic microbial diversity is currently undescribed. This result therefore emphasizes the multitude of uninvestigated protist and fungal branches on the tree of life, many of which will represent important ecological agents and evolutionary forms critical for understanding the evolution of microbial life (e.g. Massana et al. 2002, 2004; Not et al. 2007).


Our results are consistent with the understanding that an extremely complicated microbial eukaryote community underpins marine ecosystems, and suggest that this complexity may be much greater than previously thought. This community is composed of a wide range of taxonomic groups, including representatives from all six eukaryotic supergroups (Simpson & Roger 2004). However, a large proportion of the SSU sequence diversity affiliates with dinoflagellates, demonstrating the importance of this group to marine ecosystems. The 454 sequence datasets are dominated by a very high diversity of amplicons detected in very low numbers. The significance of this high proportion of rare tags awaits elucidation, and depends on controlling for factors that artificially inflate the apparent diversity detected: PCR and sequencing errors, levels of intragenomic SSU rDNA polymorphism, and PCR chimaera formation. Correct interpretation of environmental sequence data also depends on knowledge of the degree of sequence difference between biologically distinct strains; this is unknown for many protists (for the majority of which morphological and autecological data are not available), and may vary significantly among groups. The nature of the eukaryote and prokaryote ‘rare biospheres’ raises interesting ecological questions: To what extent are ‘rare’ strains active and ecologically significant? Do rare strains become more abundant in response to particular environmental changes, and do they act as ecosystem ‘buffers’? Which environmental changes effect shifts in community structure, and how? Understanding the biological significance of the huge genetic diversity of microbial eukaryotes is the next major challenge in microbial ecology, and is necessary in order to understand the functioning of all ecosystems.


This research was supported by a CoSyst (BBSRC, NERC, Linnean Society, and Systematics Association) grant awarded to TAR, and by grant STO414/3-1 from the Deutsche Forschungsgemeinschaft awarded to TS. TAR thanks the Leverhulme Trust for fellowship support. We thank Neil Hall and Margaret Hughes at The University of Liverpool Advanced Genomics Facility for sequencing and general advice. We also acknowledge the suggestions of two anonymous reviewers who helped to improve the first version of this manuscript.

Conflicts of interest

The authors have no conflict of interest to declare and note that the sponsors of the issue had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.