Massively parallel rRNA gene sequencing exacerbates the potential for biased community diversity comparisons due to variable library sizes


E-mail; Tel. (+1) 865 576 3982; Fax (+1) 865 576 8646. The submitted manuscript has been authored by a contractor of the US Government under contract DE AC05-00OR22725. Accordingly, the US Government retains a nonexclusive, royalty-free licence to publish or reproduce the published form of this contribution, or allow others to do so, for US Government purposes.


Technologies for massively parallel sequencing are revolutionizing microbial ecology and are vastly increasing the scale of ribosomal RNA (rRNA) gene studies. Although pyrosequencing has increased the breadth and depth of possible rRNA gene sampling, one drawback is that the number of reads obtained per sample is difficult to control. Pyrosequencing libraries typically vary widely in the number of sequences per sample, even within individual studies, and there is a need to revisit the behaviour of richness estimators and diversity indices with variable gene sequence library sizes. Multiple reports and review papers have demonstrated the sample-size bias in non-parametric richness estimators (e.g. Chao1 and ACE) and diversity indices when applied to clone libraries. Nevertheless, we found that biased community comparisons continue to accumulate in the literature. Here we demonstrate the effects of sample size on Chao1, ACE, CatchAll, Shannon, Chao–Shen and Simpson's estimates, specifically using pyrosequencing libraries. The need to equalize the number of reads being compared across libraries is reiterated, and investigators are directed towards available tools for making unbiased diversity comparisons.
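As an illustration of what equalizing the number of reads can look like in practice, the following minimal Python sketch subsamples each library without replacement to the size of the smallest library before computing Chao1 and the Shannon index. It is not the procedure used in this study: the function names (`chao1`, `shannon`, `rarefy`, `equalize_libraries`), the per-OTU count-list input format and the choice of the bias-corrected form of Chao1 are all assumptions made for the example.

```python
import math
import random
from collections import Counter


def chao1(counts):
    """Bias-corrected Chao1 richness from per-OTU read counts (assumed form)."""
    counts = [c for c in counts if c > 0]
    s_obs = len(counts)                        # observed OTUs
    f1 = sum(1 for c in counts if c == 1)      # singletons
    f2 = sum(1 for c in counts if c == 2)      # doubletons
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


def shannon(counts):
    """Shannon index (natural log) from per-OTU read counts."""
    counts = [c for c in counts if c > 0]
    n = sum(counts)
    return -sum((c / n) * math.log(c / n) for c in counts)


def rarefy(counts, depth, rng):
    """Randomly subsample `depth` reads without replacement."""
    # Expand the count vector into one OTU label per read.
    reads = [otu for otu, c in enumerate(counts) for _ in range(c)]
    if depth > len(reads):
        raise ValueError("depth exceeds the number of reads in the library")
    tally = Counter(rng.sample(reads, depth))
    return [tally.get(otu, 0) for otu in range(len(counts))]


def equalize_libraries(libraries, seed=0):
    """Rarefy every library to the size of the smallest one."""
    rng = random.Random(seed)
    depth = min(sum(counts) for counts in libraries.values())
    return {name: rarefy(counts, depth, rng) for name, counts in libraries.items()}


if __name__ == "__main__":
    # Hypothetical libraries of very different sizes (illustrative counts only).
    libraries = {
        "sample_A": [120, 60, 30, 10, 5, 2, 1, 1, 1],
        "sample_B": [12, 6, 3, 2, 1, 1],
    }
    for name, counts in equalize_libraries(libraries).items():
        print(name, sum(counts), chao1(counts), round(shannon(counts), 3))
```

A single random subsample is itself noisy, so in practice the subsampling would typically be repeated many times and the resulting estimates averaged.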