Fungi play fundamental roles in the nutrient cycling process in most terrestrial ecosystems, notably through forming symbiotic associations such as mycorrhiza with plants and through decomposition of wood and plant debris (Stajich et al., 2009). The fact that fungi spend most of their life cycle below ground or within substrates has left the scientific community with a fragmentary understanding of fungal diversity, and a modest c. 7% of the estimated 1.5 million extant species of fungi have been described (Kirk et al., 2008). The poor correlation between the presence of fungal fruiting bodies or other macroscopic structures and the full diversity of the mycobiome at any sample site has shifted the focus in fungal ecology from fruiting bodies to molecular (DNA sequence) data, and nearly all recent attempts to characterize fungal communities are based on sequence data (Taylor, 2008). Such studies have hitherto been limited in sequence depth by the high cost and investment of effort associated with traditional Sanger sequencing of large numbers of samples, but recent methodological progress in the form of next-generation sequencing (NGS) technologies (Shendure & Ji, 2008) offers a remedy to these problems. One of these NGS technologies – massively parallel (‘454’) pyrosequencing (Margulies et al., 2005) – has the capacity to generate more than a million sequences of c. 500 base-pairs (bp) length in the course of a day, making it a groundbreaking tool for environmental sequencing of fungi.
For all the research avenues opened by NGS in general and pyrosequencing in particular, the technologies remain fairly complicated and may, in the absence of generally acknowledged standards, even prove counterproductive to fungal ecology and mycology at large. Various types of incompletely understood biases are introduced at different steps of the analyses (Quince et al., 2009; Bellemain et al., 2010; Tedersoo et al., 2010), and these often receive little attention during the interpretation of the results. Approaches to delimitation of species or operational taxonomic units (OTUs) from molecular data differ widely among users, as do ideas on how to handle abundance data, taxonomic standards, and ecological classifications (Hibbett et al., 2011). New web-based software has been developed specifically for sequence processing and analysis of fungal pyrosequencing data – for example, CLOTU (http://www.bioportal.uio.no), SCATA (http://scata.mykopat.slu.se), and PlutoF (http://unite.ut.ee) – and these resources approach sequence clustering and identification in different ways. Other areas where standards are lacking include what data and files to make available with the study in question and what estimates, statistics, and level of detail to report as part of the results. As a consequence, most NGS studies measure and quantify slightly to fundamentally different things and often do so in ways that are neither very clear to the reader nor directly amenable to independent repetition and verification.
Hibbett et al. (2011) provided an overview of the first 10 454-based studies of fungal communities. While each study represented a significant achievement, differences in the nature and level of detail reported made precise comparison difficult. In more than half of the cases, email exchange with the corresponding authors was necessary for clarification, verification, and access to additional data. If allowed to continue unabated, the heterogeneous, nonstandardized reporting of NGS data on fungal communities will prevent the detailed comparisons of communities and biological processes that NGS technologies were expected to enable. That is not in anyone’s interest, and in this letter we introduce a set of core elements we feel every NGS-based study of fungal communities should report on. A standard for how environmental NGS data should be generated and analysed is probably not warranted – or even desirable – at this stage, since many aspects of the processing and analysis of NGS data remain tentative and are likely to vary with the scope of the study. Our proposals are instead oriented towards how data and results should be reported and made available to the scientific community (Table 1 and see later). We hope that they will be considered a lower bound on the level of detail necessary, although we recognize that it may not be possible to generate all of them for every NGS dataset.
|Elements to report on||Example(s)|
|Sequence data: filtering, denoising, and availability|
|Raw sequence data file||The 454 SFF file and the corresponding unprocessed FASTA and tag/barcode files were deposited in the European Nucleotide Archive (ENA) as ERR00000X|
|Filtering and trimming||Leading and trailing, but not intercalary, ambiguity symbols were pruned in Flower 0.7 (http://biohaskell.org/Applications/Flower) before analysis. Sequences with more than 2% intercalary ambiguity symbols were discarded. All sequences shorter than 250 base-pairs (bp) after the removal of barcodes, tags, and primers were discarded. All sequences longer than 450 bp were trimmed down to 450 bp from the 3′-end|
|Sequence denoising||AmpliconNoise 1.2 (Quince et al., 2011) was used to denoise the entries; default settings were used|
|Number of discarded and retained sequences||Out of a total of 140 000 sequences, 18 234 were found to be of poor read quality, 15 213 too short, and 5430 potentially chimeric, leaving 101 123 (72%) sequences for downstream analyses|
|Sequencing depth||(A) Of the 101 123 sequences retained, 24 832 represented sample A, 31 323 sample B, 25 118 sample C, and 19 850 sample D. (B) The number of sequences per sample ranged from 4201 to 7912 in the 15 samples|
|FASTA files||All processed sequences passing the filtering step are available in the FASTA format as Supplementary Item X together with separate FASTA files representing each sample|
|Sequence data: analysis and taxonomic assignment|
|Details of the genetic marker used||(A) The full ITS1 of the nuclear ITS region as extracted using Nilsson et al. (2010) was used. (B) The V2 and V3 regions – and their conserved intercalary segment – of the 18S were used (Hartmann et al., 2010). Sequences covering < 75% of the full length of the target region were discarded|
|Type and specifics of sequence clustering||Complete-link clustering at 98.5% similarity (global alignment) was done in UCLUST 2.1 (Edgar, 2010) with the most abundant sequence types serving as cluster seeds. A list of the reads in each operational taxonomic unit (OTU) is provided in Supplementary Item X|
|Sequence data used for taxonomic annotation||The most frequent sequence type in each cluster was used for the BLAST searches, and the corresponding FASTA file is provided as Supplementary Item X|
|Specification of the taxonomic reference database||All fully identified entries in INSD (Benson et al., 2011) and UNITE (Abarenkov et al., 2010) as of December 2010 were used as reference sequences|
|Specification of the taxonomic annotation procedure||BLAST 2.2.22 (Altschul et al., 1997) was used. A similarity of ≥ 97% across the entire length of the pairwise alignment was taken to indicate conspecificity; however, ≥ 99% was required in the cases of Cortinarius, Aspergillus and Penicillium. If only one reference sequence was available for a given species, or if the taxonomic annotation was contradicted by another, equally close reference sequence, an asterisk or a question mark, respectively, was added to the taxonomic annotation of the sequence. A similarity of ≥ 90% was taken to approximate the genus level (e.g. Hydnum sp.) and ≥ 60% similarity the ordinal level (e.g. Boletales sp.). Only sequences determined at least to ordinal level were used for the phylum-level comparison of Fig. X. Sequences not having a ≥ 60% BLAST match to any reference sequence were considered potentially compromised, were marked as such, and were excluded from the ecological statistics but not from the list of OTUs recovered|
|Handling of singletons and OTUs with few reads||(A) All singletons were discarded. (B) All OTUs with fewer than 5 reads were excluded from further analysis. (C) Only singletons at least 90% identical to the reference sequence of a species not otherwise recovered in the study were kept|
|Count of OTUs recovered||A total of 242 nonsingleton OTUs were recovered. Forty-two were unique to single samples, and the rest were shared by two or more samples (Supplementary Item X)|
|List of OTUs recovered||A complete annotated list of the abundance of all fungal OTUs recovered in each sample is provided in the QIIME format (http://qiime.sourceforge.net/documentation/file_formats.html) as Supplementary Item X|
|Taxonomic affiliations||The OTUs were identified to species, genus, or order as applicable, and the taxonomic affiliations are provided as a separate file (Supplementary Item X)|
|Proportion of fully identified OTUs||We tentatively identified 72 (30%) of the 242 OTUs to species level. Of the remaining 170 OTUs, we tentatively identified 60 to genus and 88 to order level|
|Phylum-level distribution||Ninety-seven per cent of the OTUs were of fungal origin. Of these, 58% belonged to Basidiomycota; 22% to Ascomycota; 12% to Glomeromycota; and 8% were found to belong to other fungal lineages|
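
To illustrate how the 'Filtering and trimming' and 'Number of discarded and retained sequences' elements of Table 1 fit together, the following is a minimal Python sketch of the example rules given there (pruning of terminal ambiguity symbols, a 2% ceiling on intercalary ambiguity symbols, a 250 bp minimum length, and trimming to 450 bp). It is not the procedure of Flower or of any other published tool. The function names, the order in which the checks are applied, the treatment of every symbol other than A, C, G, and T as ambiguous, and the reading of 'trimmed from the 3′-end' as keeping the first 450 bases are all assumptions made for illustration; barcodes, tags, and primers are assumed to have been removed already.

```python
from collections import Counter

UNAMBIGUOUS = set("ACGT")    # assumption: anything else counts as an ambiguity symbol
MIN_LEN = 250                # minimum read length in bp (Table 1 example)
MAX_LEN = 450                # longer reads are trimmed to this length
MAX_AMBIG_FRACTION = 0.02    # more than 2% intercalary ambiguity symbols -> discard


def prune_ends(seq):
    """Strip leading and trailing ambiguity symbols; intercalary ones are kept."""
    seq = seq.upper()
    start, end = 0, len(seq)
    while start < end and seq[start] not in UNAMBIGUOUS:
        start += 1
    while end > start and seq[end - 1] not in UNAMBIGUOUS:
        end -= 1
    return seq[start:end]


def filter_read(seq):
    """Return (processed sequence or None, outcome label) for one read."""
    seq = prune_ends(seq)
    n_ambiguous = sum(base not in UNAMBIGUOUS for base in seq)
    if seq and n_ambiguous / len(seq) > MAX_AMBIG_FRACTION:
        return None, "too_ambiguous"
    if len(seq) < MIN_LEN:
        return None, "too_short"
    # Keep the first MAX_LEN bases, i.e. remove the excess from the 3' end.
    return seq[:MAX_LEN], "retained"


def filter_reads(reads):
    """Filter a dict of read id -> sequence; return (kept reads, outcome tally)."""
    kept, tally = {}, Counter()
    for read_id, seq in reads.items():
        processed, outcome = filter_read(seq)
        tally[outcome] += 1
        if processed is not None:
            kept[read_id] = processed
    return kept, tally
```

Because every read receives exactly one outcome label, the tally sums to the number of input reads, which is the property that allows the discarded and retained counts to be reported without gaps.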
Most current NGS studies of fungal communities rely on pyrosequencing, but this may change. Although our recommendations should be at least conceptually compatible with other existing and emerging NGS technologies such as Illumina sequencing (Bentley, 2006), it is likely that both refinement and adaptation will eventually be needed.
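
The similarity thresholds in the 'Specification of the taxonomic annotation procedure' example of Table 1 amount to a small decision rule, which can be sketched in Python as follows. The sketch assumes that the percentage similarity, the species name, and the order of the best reference match have already been obtained from the BLAST results. The function names and arguments are hypothetical, and applying the asterisk and question mark only to species-level annotations is our reading of the example rather than something it states explicitly.

```python
# Genera for which the Table 1 example requires >= 99% for conspecificity.
STRICT_GENERA = {"Cortinarius", "Aspergillus", "Penicillium"}


def assign_rank(similarity, genus):
    """Map a percentage similarity to the rank at which the match is trusted."""
    species_cutoff = 99.0 if genus in STRICT_GENERA else 97.0
    if similarity >= species_cutoff:
        return "species"
    if similarity >= 90.0:
        return "genus"
    if similarity >= 60.0:
        return "order"
    return "potentially compromised"


def annotate(similarity, species, order, single_reference=False, contradicted=False):
    """Return (rank, label) for one OTU given its best reference match.

    species: binomial of the best match, e.g. "Hydnum repandum"
    order: order of the best match, e.g. "Cantharellales"
    single_reference: only one reference sequence exists for the species
    contradicted: an equally close reference carries a different name
    """
    genus = species.split()[0]
    rank = assign_rank(similarity, genus)
    if rank == "species":
        label = species
        if single_reference:
            label += "*"
        if contradicted:
            label += "?"
    elif rank == "genus":
        label = f"{genus} sp."
    elif rank == "order":
        label = f"{order} sp."
    else:
        label = "potentially compromised"
    return rank, label
```

Stating the rule in this form makes the genus-specific exceptions and the handling of low-similarity matches explicit, which are among the details a reader would need in order to repeat the annotation.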