Towards standardization of the description and publication of next-generation sequencing datasets of fungal communities


Fungi play fundamental roles in the nutrient cycling process in most terrestrial ecosystems, notably through forming symbiotic associations such as mycorrhiza with plants and through decomposition of wood and plant debris (Stajich et al., 2009). The fact that fungi spend most of their life cycle below ground or within substrates has left the scientific community with a fragmentary understanding of fungal diversity, and a modest c. 7% of the estimated 1.5 million extant species of fungi have been described (Kirk et al., 2008). The poor correlation between the presence of fungal fruiting bodies or other macroscopic structures and the full diversity of the mycobiome at any sample site has shifted the focus in fungal ecology from fruiting bodies to molecular (DNA sequence) data, and nearly all recent attempts to characterize fungal communities are based on sequence data (Taylor, 2008). Such studies have hitherto been limited in sequence depth by the high cost and investment of effort associated with traditional Sanger sequencing of large numbers of samples, but recent methodological progress in the form of next-generation sequencing (NGS) technologies (Shendure & Ji, 2008) offers a remedy to these problems. One of these NGS technologies – massively parallel (‘454’) pyrosequencing (Margulies et al., 2005) – has the capacity to generate more than a million sequences of c. 500 base-pairs (bp) length in the course of a day, making it a groundbreaking tool for environmental sequencing of fungi.

For all the research avenues opened by NGS in general and pyrosequencing in particular, the technologies remain fairly complicated and may, in the absence of generally acknowledged standards, even prove counterproductive to fungal ecology and mycology at large. Various types of incompletely understood biases are introduced at different steps of the analyses (Quince et al., 2009; Bellemain et al., 2010; Tedersoo et al., 2010), and these often receive little attention during the interpretation of the results. Approaches to delimitation of species or operational taxonomic units (OTUs) from molecular data differ widely among users, as do ideas on how to handle abundance data, taxonomic standards, and ecological classifications (Hibbett et al., 2011). New web-based software has been developed specifically for sequence processing and analysis of fungal pyrosequencing data (for example, CLOTU, SCATA, and PlutoF), and these resources approach sequence clustering and identification in different ways. Other areas where standards are lacking include what data and files to make available with the study in question and what estimates, statistics, and level of detail to report as part of the results. As a consequence, most NGS studies measure and quantify slightly to fundamentally different things and often do so in ways that are neither very clear to the reader nor directly amenable to independent repetition and verification.

Hibbett et al. (2011) provided an overview of the first 10 454-based studies of fungal communities. While each study represented a significant achievement, differences in the nature and level of detail reported made precise comparison difficult. In more than half of the cases, email exchange with the corresponding authors was necessary for clarification, verification, and access to additional data. If allowed to continue unabated, the heterogeneous, nonstandardized reporting of NGS data on fungal communities will prevent the detailed comparisons of communities and biological processes that the NGS technologies were meant to enable. That is not in anyone’s interest, and in this letter we introduce a set of core elements we feel every NGS-based study of fungal communities should report on. A standard for how environmental NGS data should be generated and analysed is probably not warranted – or even desirable – at this stage, since many aspects of the processing and analysis of NGS data remain tentative and are likely to vary with the scope of the study. Our proposals are instead oriented towards how data and results should be reported and made available to the scientific community (Table 1 and see later). We hope that they will be considered a lower bound on the level of detail necessary, although we recognize that it may not be possible to generate all of them for every NGS dataset.

Table 1. Elements we suggest all next-generation sequencing (NGS) studies of fungal communities should report on in a clear and comprehensive way

| Elements to report on | Example(s) |
| --- | --- |
| Sequence data: filtering, denoising, and availability | |
| Raw sequence data file | The 454 SFF file and the corresponding unprocessed FASTA and tag/barcode files were deposited in the European Nucleotide Archive (ENA) as ERR00000X |
| Filtering and trimming | Leading and trailing, but not intercalary, ambiguity symbols were pruned in Flower 0.7 before analysis. Sequences with more than 2% intercalary ambiguity symbols were discarded. All sequences shorter than 250 base-pairs (bp) after the removal of barcodes, tags, and primers were discarded. All sequences longer than 450 bp were trimmed down to 450 bp from the 3′-end |
| Sequence denoising | AmpliconNoise 1.2 (Quince et al., 2011) was used to denoise the entries; default settings were used |
| Number of discarded and retained sequences | Out of a total of 140 000 sequences, 18 234 were found to be of poor read quality, 15 213 too short, and 5430 potentially chimeric, leaving 101 123 (72%) sequences for downstream analyses |
| Sequencing depth | (A) Of the 101 123 sequences retained, 24 832 represented sample A, 31 323 sample B, 25 118 sample C, and 19 850 sample D. (B) The number of sequences per sample ranged from 4201 to 7912 in the 15 samples |
| FASTA files | All processed sequences passing the filtering step are available in the FASTA format as Supplementary Item X together with separate FASTA files representing each sample |
| Sequence data: analysis and taxonomic assignment | |
| Details of the genetic marker used | (A) The full ITS1 of the nuclear ITS region as extracted using Nilsson et al. (2010) was used. (B) The V2 and V3 regions – and their conserved intercalary segment – of the 18S were used (Hartmann et al., 2010). Sequences covering < 75% of the full length of the target region were discarded |
| Type and specifics of sequence clustering | Complete-link clustering at 98.5% similarity (global alignment) was done in UCLUST 2.1 (Edgar, 2010) with the most abundant sequence types serving as cluster seeds. A list of the reads in each operational taxonomic unit (OTU) is provided in Supplementary Item X |
| Sequence data used for taxonomic annotation | The most frequent sequence type in each cluster was used for the BLAST searches, and the corresponding FASTA file is provided as Supplementary Item X |
| Specification of the taxonomic reference database | All fully identified entries in INSD (Benson et al., 2011) and UNITE (Abarenkov et al., 2010) as of December 2010 were used as reference sequences |
| Specification of the taxonomic annotation procedure | BLAST 2.2.22 (Altschul et al., 1997) was used. ≥ 97% similarity across the entire length of the pairwise alignment was taken to indicate conspecificity; however, ≥ 99% was required in the cases of Cortinarius, Aspergillus and Penicillium. If only one reference sequence was available for some given species, or if the taxonomic annotation was contradicted by another, equally close reference sequence, an asterisk and a question mark, respectively, were added to the taxonomic annotation of the sequence. Greater than or equal to 90% similarity was taken to approximate the genus level (e.g. Hydnum sp.) and ≥ 60% similarity the ordinal level (e.g. Boletales sp.). Only sequences determined at least to ordinal level were used for the phylum-level comparison of Fig. X. Sequences not having a ≥ 60% BLAST match to any reference sequence were considered potentially compromised; were marked as such; and were excluded from the ecological statistics but not from the list of OTUs recovered |
| Handling of singletons and OTUs with few reads | (A) All singletons were discarded. (B) All OTUs with fewer than 5 reads were excluded from further analysis. (C) Only singletons at least 90% identical to the reference sequence of a species not otherwise recovered in the study were kept |
| Post-clustering/taxonomic results | |
| Count of OTUs recovered | A total of 242 nonsingleton OTUs were recovered. Forty-two were unique to single samples, and the rest were shared by two or more samples (Supplementary Item X) |
| List of OTUs recovered | A complete annotated list of the abundance of all fungal OTUs recovered in each sample is provided in the QIIME format as Supplementary Item X |
| Taxonomic affiliations | The OTUs were identified to species, genus, or order as applicable, and the taxonomic affiliations are provided as a separate file (Supplementary Item X) |
| Proportion of fully identified OTUs | We tentatively identified 72 (30%) of the 242 OTUs to species level. Of the remaining 170 OTUs, we tentatively identified 60 to genus and 88 to order level |
| Phylum-level distribution | Ninety-seven per cent of the OTUs were of fungal origin. Of these, 58% belonged to Basidiomycota; 22% to Ascomycota; 12% to Glomeromycota; and 8% were found to belong to other fungal lineages |

Note: One or more examples (A–C as applicable) are given for each element; they remain examples, however, and should not necessarily be seen as recommendations of methodology or specific software packages. We have no opinion on the exact form in which the elements should be reported in a publication, and we view this item as a checklist rather than as a mandatory table.
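
The filtering, trimming, and sequence-count examples in Table 1 can be made concrete with a short Python sketch. It is only an illustration of how such rules might be stated unambiguously, not the behaviour of Flower or any other package: the names `filter_and_trim` and `summarize` and the constants are assumptions introduced here, the numeric values are taken from the Table 1 examples, and reads are assumed to have had barcodes, tags, and primers removed already.

```python
from typing import Dict, Optional, Tuple

# Values from the Table 1 example; they are illustrations, not recommendations.
MIN_LEN = 250           # discard reads shorter than this (bp)
MAX_LEN = 450           # trim longer reads to this length from the 3' end
MAX_AMBIG_FRAC = 0.02   # discard reads with > 2% intercalary ambiguity symbols
UNAMBIGUOUS = set("ACGT")


def filter_and_trim(seq: str) -> Optional[str]:
    """Return the cleaned read, or None if the read should be discarded."""
    seq = seq.upper()
    # Prune leading and trailing (but not intercalary) ambiguity symbols.
    start, end = 0, len(seq)
    while start < end and seq[start] not in UNAMBIGUOUS:
        start += 1
    while end > start and seq[end - 1] not in UNAMBIGUOUS:
        end -= 1
    seq = seq[start:end]
    # Discard reads that are too short.
    if len(seq) < MIN_LEN:
        return None
    # Discard reads with too many remaining (intercalary) ambiguity symbols.
    ambiguous = sum(1 for base in seq if base not in UNAMBIGUOUS)
    if ambiguous / len(seq) > MAX_AMBIG_FRAC:
        return None
    # Keep at most MAX_LEN bases, dropping the 3' end of longer reads.
    return seq[:MAX_LEN]


def summarize(total: int, discarded: Dict[str, int]) -> Tuple[int, int]:
    """Return (retained reads, retained percentage rounded to an integer)."""
    retained = total - sum(discarded.values())
    return retained, round(100 * retained / total)


# Counts from the "Number of discarded and retained sequences" example.
retained, percent = summarize(
    140_000,
    {"poor read quality": 18_234, "too short": 15_213, "potentially chimeric": 5_430},
)
print(retained, percent)
```

Reporting the rules at this level of detail (thresholds, order of operations, and per-category counts) is what allows a reader to repeat the filtering step on the deposited raw data.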

Most current NGS studies of fungal communities rely on pyrosequencing, but this is a situation that may change. Whereas our recommendations should be at least conceptually compatible with other existing and emerging NGS technologies such as Illumina sequencing (Bentley, 2006), it is likely that both refinement and adaptation will eventually be needed.

Data availability

Detailed comparisons and meta-analyses of studies are only possible if the underlying data are made available for download. This is not always the case, however, necessitating email exchange with authors regarding files that may no longer exist (Whitlock et al., 2010). To make the data available through the authors’ personal web pages is similarly a makeshift solution that does not meet the criterion of long-term availability to the scientific community (Wren, 2008). We propose that all data relevant to the re-analysis and interpretation of fungal NGS studies should be deposited in central data archives. The European Nucleotide Archive (ENA; Leinonen et al., 2011) is the recommended resource for storage of raw NGS data, including flowgram (‘SFF’) and unprocessed FASTA files in the case of pyrosequencing datasets. In addition, we propose that all relevant processed and derived files that were used to generate the results of any given NGS study – and that normally cannot be archived in ENA – should be deposited at the publisher’s site as supplementary data along with the article presenting the study in question.

Description of sample site and laboratory procedures

For the description of the sample site and sampling conditions, we propose that the MIMARKS/MIxS standard (Yilmaz et al., 2010) should be followed. In so far as the specifics of each sample differ, we argue that full metadata should be given for each sample. We advocate that the laboratory procedures should be described in comprehensive detail – including full primer sequence data, polymerase chain reaction (PCR) enzyme specifics, and other points routinely left out by many authors – and we discourage the practice of referring to other articles instead of providing the corresponding information in writing.

Sequence data: analysis and taxonomic assignment

Fully automated, high-quality species identification from fungal sequence data is presently not possible for any non-trivial assemblage of fungal lineages, leaving caution and taxonomic expertise as important elements in molecular identification of fungi. In particular we wish to discourage the use of single, static similarity thresholds for species demarcation as far as possible; these thresholds should be allowed to vary to better account for differences among fungal lineages (e.g. Nilsson et al., 2008; Hughes et al., 2009; Seifert, 2009). Any specific threshold value tailored for, e.g., the internal transcribed spacer (ITS) region, will typically carry over poorly to other genetic markers, such as the ribosomal small subunit. The molecular identification procedure is underspecified in many NGS studies, making independent repetition difficult.
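
One way to specify a lineage-aware identification procedure precisely is to publish the thresholds as data together with the rule that applies them. The Python sketch below is a hedged illustration only: the function `assign_rank`, the dictionary layout, and the returned labels are assumptions made here, and the threshold values are those of the Table 1 example rather than recommendations.

```python
# Thresholds (% similarity) from the Table 1 example; illustrative only.
DEFAULT_SPECIES_THRESHOLD = 97.0
GENUS_SPECIFIC_SPECIES_THRESHOLD = {
    "Cortinarius": 99.0,
    "Aspergillus": 99.0,
    "Penicillium": 99.0,
}
GENUS_THRESHOLD = 90.0
ORDER_THRESHOLD = 60.0


def assign_rank(percent_identity: float, best_hit_genus: str) -> str:
    """Map the similarity to the best reference hit onto a taxonomic rank.

    The species cutoff depends on the genus of the best hit, so that
    lineages known to need a stricter threshold are treated separately.
    """
    species_cutoff = GENUS_SPECIFIC_SPECIES_THRESHOLD.get(
        best_hit_genus, DEFAULT_SPECIES_THRESHOLD
    )
    if percent_identity >= species_cutoff:
        return "species"
    if percent_identity >= GENUS_THRESHOLD:
        return "genus"
    if percent_identity >= ORDER_THRESHOLD:
        return "order"
    # Below the ordinal threshold the sequence is flagged, not silently dropped.
    return "potentially compromised"


for identity, genus in [(97.5, "Hydnum"), (97.5, "Cortinarius"), (75.0, "Boletus")]:
    print(genus, identity, assign_rank(identity, genus))
```

A lookup of this kind does not replace taxonomic expertise; it only makes explicit which threshold was applied to which lineage, so that the annotation can be repeated and questioned.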

Discretionary consideration

Molecular identification of fungi is fraught with methodological complications, but above all it is severely hampered by the lack of reference sequences for much of the extant diversity of fungi. It would seem appropriate to plan each NGS study of environmental samples so that taxonomic expertise is accounted for among, or available to, the authors of the study. If some part of the budget could also be allocated to generating reference sequences from fruiting bodies relevant to the ecosystem and geographical region under study, then those NGS studies would contribute to the reference sequence databases and ultimately to the ability to disentangle the diversity discovered (Brock et al., 2009).


Each NGS study represents a considerable investment in terms of time and money, and for that investment to be of maximum use to the broader scientific community, the data generated and results obtained should be presented and made available in a comprehensive, transparent way. We believe the elements discussed earlier are a significant contribution to the development of such a standard, and their specification is unlikely to take more than a few hours. NGS is the most exciting development in fungal ecology for many years, and correctly employed it will enable great strides to be made towards a much deeper understanding of fungi and their trophic roles in ecosystems.