Insidious effects of sequencing errors on perceived diversity in molecular surveys


Soil communities are increasingly described using molecular methods, with very high levels of diversity frequently reported, including estimates as high as 2240 fungal species in 4-g samples of forest soils (Buée et al., 2009) and similarly high diversity reported for bacterial communities in a variety of habitats (e.g. Sogin et al., 2006; Keijser et al., 2008). In molecular surveys, species or operational taxonomic units (OTUs) are detected and characterized on the basis of DNA sequence divergence, with very few of these putative taxa matched to actual physical specimens. Many of these sequence-based taxa are only found once, termed ‘singletons’, with singletons comprising > 60% of taxa in some surveys (Buée et al., 2009). Singletons are problematic, as they represent inherently unreplicated data. Tedersoo et al. (2010) have now shown that most singletons in pyrosequencing of fungal communities are probably the result of DNA sequencing errors creating false sequence-based taxa. Tedersoo et al. (2010) suggest that errors in sequence-based species detection were predominantly caused by insertions (see also Huse et al., 2007). Further errors in detecting species can arise from, among other causes, deletion errors, low-quality and variable length reads, nontarget amplification, inadequate clustering (Huse et al., 2010) and chimeric sequences, where DNA from different species combines to create a hybrid sequence (Jumpponen & Johnson, 2005), not all of which are detected by chimera-checking software (Gonzalez et al., 2005). While it has been recognized that sequencing errors may inflate diversity estimates by creating false taxa (Quince et al., 2009; Reeder & Knight, 2009; Kunin et al., 2010), the quantitative effect of errors in sequence-based species detection and extrapolation to estimate species richness has not been widely considered.

To address this, I developed and used a community simulation model to assess the effects of DNA sequence species detection errors on perceived species richness, based on two assumptions: DNA sequence-based species identification has a non-zero error rate; and erroneous species are likely to be singletons. The model is particularly relevant to pyrosequencing surveys of communities (e.g. 454, Tedersoo et al., 2010), but applies to any present or future molecular technique that fits the two assumptions. Both assumptions are supported by empirical data (Ashelford et al., 2005; Porazinska et al., 2009; Quince et al., 2009; Tedersoo et al., 2010), although some errors can be systematic, resulting in nonsingleton errors (Gomez-Alvarez et al., 2009). Accepting that many errors can be detected and controlled for with sufficient care and data curation, I assumed a final error rate of 1% for the model, which is probably conservative (but see Huse et al., 2010). I tested the model across a range of assumed theoretical species abundance distributions (broken stick, unified neutral theory, power function, log normal, geometric; Magurran, 2004) and across a range of real species richness (from 50 to 2000 species). The model was run in R (2.11.1; R Development Core Team, 2010) and the complete, annotated script is given in the Supporting Information, Notes S1.
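As an illustration only, a model built on these two assumptions can be sketched in a few lines of Python. This is not the annotated R script of Notes S1: the function names, the random-break form of the broken stick distribution, and the choice to let each erroneous read replace a true read and receive its own new taxon label are all assumptions made for the sketch.

```python
import random
from collections import Counter

def broken_stick(n_species, rng):
    """Relative abundances from breaking a unit stick at n_species - 1 random points
    (one common form of the broken stick model; an assumption of this sketch)."""
    cuts = sorted(rng.random() for _ in range(n_species - 1))
    edges = [0.0] + cuts + [1.0]
    return [b - a for a, b in zip(edges, edges[1:])]

def simulate_survey(n_species=500, n_sequences=10000, error_rate=0.01, seed=1):
    """Return a Counter mapping taxon label -> number of reads.

    Labels below n_species are real species; labels at or above n_species
    are false taxa created by sequencing error, each seen exactly once.
    """
    rng = random.Random(seed)
    abundances = broken_stick(n_species, rng)
    reads = rng.choices(range(n_species), weights=abundances, k=n_sequences)
    counts = Counter()
    next_false_label = n_species
    for species in reads:
        if rng.random() < error_rate:
            # Assumption 2: an erroneous read becomes a new singleton taxon.
            counts[next_false_label] += 1
            next_false_label += 1
        else:
            counts[species] += 1
    return counts

survey = simulate_survey(n_sequences=100000)
observed_richness = len(survey)
false_taxa = sum(1 for label in survey if label >= 500)
```

With 100,000 reads and a 1% error rate, on the order of 1000 reads become false singletons, so the observed richness in this sketch can far exceed the 500 species actually present.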

Given these assumptions, observed species richness of a community substantially exceeded real species richness at high sequencing rates, never reached an asymptote and became increasingly inaccurate as the number of DNA sequences analysed was increased (Fig. 1a). The effect of sequencing error is particularly large if species richness extrapolation is performed (e.g. Chao or Jackknife estimators, Fig. 1a). The Chao estimator, which has been recommended for fungal communities (Unterseher et al., 2008), was particularly sensitive, as it scales with the square of the number of singletons. These, and similar, metrics rely on the number of singletons to estimate undetected species (Magurran, 2004). As singletons become an increasingly large proportion of the community, these species richness estimators therefore become increasingly inaccurate. The failure of species accumulation curves to saturate and the increasingly inaccurate species richness estimation were both qualitatively robust to varying the underlying species abundance distribution and diversity.
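The sensitivity of the Chao estimator to singletons can be made concrete with a small worked example. The sketch below uses the bias-corrected form of Chao1, S_obs + F1(F1 - 1) / (2(F2 + 1)), where F1 and F2 are the numbers of singletons and doubletons; this form is chosen here only because it avoids division by zero when there are no doubletons, and it may differ from the variant used for Fig. 1. The function name and the toy read counts are hypothetical.

```python
def chao1(counts):
    """Bias-corrected Chao1 richness estimate from per-taxon read counts."""
    counts = [c for c in counts if c > 0]
    s_obs = len(counts)                      # taxa observed
    f1 = sum(1 for c in counts if c == 1)    # singletons
    f2 = sum(1 for c in counts if c == 2)    # doubletons
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))

# Toy community of 500 real species: 480 well sampled, 10 doubletons, 10 singletons.
clean = [20] * 480 + [2] * 10 + [1] * 10

# The same community with 100 false singleton taxa added by sequencing error.
with_errors = clean + [1] * 100

estimate_clean = chao1(clean)
estimate_with_errors = chao1(with_errors)
```

In this toy case the 100 false taxa raise observed richness from 500 to 600, but raise the Chao1 estimate from about 504 to 1145, because the singleton count enters the estimator approximately as its square.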

Figure 1. The effect of increasing numbers of sequences on perceived species richness (a), including number of species observed (blue), and three different species richness extrapolations; (b) the proportion of species represented by singletons in perceived community; and (c) rarefied species richness at a sample size of 100. This simulation was run with a true species richness of 500 (black line), a broken stick species abundance distribution, and an assumed error rate of 1% of sequences. Equivalent figures for a true species richness of 2000 with 1% error rate, and for a 500-species community measured without error are given in the Supporting Information, Figs S1 and S2, respectively, and an error sensitivity analysis is given in Fig. S3.

An interesting signature of erroneous data is seen in plotting the proportion of singletons in a perceived community as a function of sequence numbers obtained (Fig. 1b). As real diversity increases, the nadir of the curve moves to the right and the degree of subsequent rise is reduced (see Fig. S1) but the general pattern remains similar. This pattern is driven by two simultaneous processes. In the absence of errors, the number of singletons as a proportion of species will decline from 100% to approach zero asymptotically (Fig. S2). If errors are present, however, they will increase in number with sampling effort and hence the proportion of recorded species that represent erroneous singletons will increase from near zero to approach 100% asymptotically. The resulting pattern (Fig. 1b) could provide a useful diagnostic for sequencing errors in real datasets and for selecting an optimal sequencing effort in real communities.
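The two opposing processes can be seen by tracking the singleton proportion as reads accumulate in a simulated survey. The following is a minimal sketch under the same two assumptions as the model, not a reproduction of Fig. 1b: the function name, the checkpoint depths, and the random-break broken stick distribution are all choices made for illustration.

```python
import random
from collections import Counter

def singleton_curve(n_species=500, error_rate=0.01,
                    checkpoints=(100, 1000, 10000, 100000), seed=1):
    """Proportion of observed taxa that are singletons at each sequencing depth."""
    rng = random.Random(seed)
    # Broken stick abundances from random breaks of a unit stick (an assumption).
    cuts = sorted(rng.random() for _ in range(n_species - 1))
    edges = [0.0] + cuts + [1.0]
    weights = [b - a for a, b in zip(edges, edges[1:])]

    reads = rng.choices(range(n_species), weights=weights, k=max(checkpoints))
    wanted = set(checkpoints)
    counts = Counter()
    next_false_label = n_species
    curve = {}
    for depth, species in enumerate(reads, start=1):
        if rng.random() < error_rate:
            counts[next_false_label] += 1   # erroneous read: new singleton taxon
            next_false_label += 1
        else:
            counts[species] += 1
        if depth in wanted:
            singletons = sum(1 for c in counts.values() if c == 1)
            curve[depth] = singletons / len(counts)
    return curve

with_error = singleton_curve(error_rate=0.01)
without_error = singleton_curve(error_rate=0.0)
```

Comparing `with_error` and `without_error` at the same depths separates the two processes: without errors the proportion only declines with depth, whereas with a 1% error rate it declines to a minimum and then rises again as false singletons accumulate.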

While species richness estimators such as Chao or jackknife are based on extrapolation of a sampling curve, an alternative approach to comparing diversity across studies is rarefaction analysis. Rarefaction analysis is often used to construct sampling accumulation or collector’s curves (Magurran, 2004). In contrast to observed richness and species richness extrapolation, rarefaction analysis (sampling random subsets of the community) was not adversely influenced by increasing the number of sequences obtained (Fig. 1c). As erroneous singletons represent a very small proportion of total individuals, they are essentially eliminated by subsampling the community at low or moderate effort levels in rarefaction. This suggests that rarefaction can be a robust method for comparing diversity across studies, unlike species richness extrapolation.
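The robustness of rarefaction can be checked with the standard hypergeometric expectation for the number of taxa in a random subsample drawn without replacement. The sketch below is illustrative only: the function names and the toy communities (50 equally abundant species, with false singletons making up roughly 1% of reads) are hypothetical and much simpler than the simulated communities of Fig. 1c.

```python
from math import comb

def rarefied_richness(counts, sample_size):
    """Expected number of taxa in a random subsample of sample_size reads.

    For each taxon with c reads out of N total, the probability that it is
    missed by the subsample is C(N - c, n) / C(N, n).
    """
    total = sum(counts)
    if sample_size > total:
        raise ValueError("sample_size exceeds the number of reads")
    denom = comb(total, sample_size)
    return sum(1 - comb(total - c, sample_size) / denom for c in counts)

def toy_community(scale):
    """50 real species with 200 * scale reads each, plus 100 * scale false singletons."""
    return [200 * scale] * 50 + [1] * (100 * scale)

shallow = toy_community(1)    # 10,100 reads, 150 observed taxa
deep = toy_community(10)      # 101,000 reads, 1050 observed taxa

rarefied_shallow = rarefied_richness(shallow, 100)
rarefied_deep = rarefied_richness(deep, 100)
rarefied_error_free = rarefied_richness([200] * 50, 100)
```

In this toy case a tenfold increase in sequencing depth raises observed richness from 150 to 1050 taxa, yet richness rarefied to 100 reads is nearly unchanged and differs from the error-free value by less than one species.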

Although methods have been suggested to reduce error rates (Reeder & Knight, 2009; Huse et al., 2010; Kunin et al., 2010), it is unlikely that sequencing errors can ever be completely eliminated and the assumption of 1% error rates is likely to be conservative. Sensitivity analysis (Fig. S3) suggests that the effect of even small errors is large when species richness extrapolation is performed. Researchers must therefore carefully consider the effect of errors on the statistical analyses used. A conservative approach would be to treat all singleton sequences as suspect, and delete singletons from analysis (Medinger et al., 2010; Tedersoo et al., 2010), although this may well eliminate some real species as well. Species richness estimators that rely on singleton frequency are probably inappropriate for sequence-based community analyses, particularly where large numbers of samples are sequenced. Developing more robust techniques for richness estimation will be required if the enumeration of species numbers remains a research objective.

While the effect of sequencing errors is not unique to any one method of molecular analysis, the development of very high-throughput molecular techniques has vastly increased the number of sequences obtained in ecological studies of soil organisms (Buée et al., 2009; Tedersoo et al., 2010) and future techniques may drive these numbers even higher. It is generally assumed that an increase in sampling effort will increase the accuracy of diversity estimates in ecological studies. The results from this modeling exercise show that the exact opposite is true for studies where taxa are detected and identified based on DNA sequences. In sampling with non-zero species detection error rates, increased sampling effort results in erroneous data dominating perceived species richness.


I thank L. Tedersoo, G. Grelet, P. Bellingham, S. Richardson, M. McGlone, M. St John, A. Marburg, T. Easdale, and D. Peltzer for helpful discussions, and three reviewers for their constructive input. This research was funded by the Ecosystem Resilience OBI (Outcome Based Investment) of the FRST (Foundation for Research, Science and Technology) of New Zealand.