Keywords:

  • massively parallel sequencing;
  • community profiling;
  • sequence identification

Abstract

  1. Top of page
  2. Abstract
  3. Introduction
  4. Materials and methods
  5. Results
  6. Discussion
  7. Acknowledgement
  8. References
  9. Supporting Information

The advent of new high-throughput DNA-sequencing technologies promises to redefine the way in which fungi and fungal communities – as well as other groups of organisms – are studied in their natural environment. With read lengths of some few hundred base pairs, massively parallel sequencing (pyrosequencing) stands out among the new technologies as the most apt for large-scale species identification in environmental samples. Although parallel pyrosequencing can generate hundreds of thousands of sequences at an exceptional speed, the limited length of the reads may pose a problem to the species identification process. This study explores whether the discrepancy in read length between parallel pyrosequencing and traditional (Sanger) sequencing will have an impact on the perceived taxonomic affiliation of the underlying species. Based on all 39 200 publicly available fungal environmental DNA sequences representing the nuclear ribosomal internal transcribed spacer (ITS) region, the results show that the two approaches give rise to quite different views of the diversity of the underlying samples. Standardization of which subregion from the ITS region should be sequenced, as well as a recognition that the composition of fungal communities as depicted through different sequencing methods need not be directly comparable, appear crucial to the integration of the new sequencing technologies with current mycological praxis.


Introduction

Mycologists face the daunting task of characterizing the very large and unwieldy fungal kingdom in a taxonomic context. Estimated at 1.5 million species and reported from little short of all biota on Earth, fungi are thought to be responsible for many key ecological functions such as wood and litter decomposition, mycorrhizal associations, and other forms of nutrient recycling (Hawksworth, 2001). Inconspicuous by default, fungi are typically noticed only when they form above-ground fruiting bodies or other propagules. The study of fungi is thus plagued by a reliance on ephemeral structures whose presence or absence is only weakly correlated with the actual mycoflora of the collection site (Porter et al., 2008). Adding to the complexity, even outwardly very similar or identical fruiting bodies often prove to represent several distinct (cryptic) species (Geml et al., 2006; Paulus et al., 2007). These observations make a compelling case for DNA sequences as a vital information source in contemporary mycology, and the scientific study of fungi is indeed as much a molecular as a morphological enterprise today (Blackwell et al., 2006; Hibbett, 2007).

The last few years have witnessed a surge of interest in characterizing the mycoflora of entire localities and ecosystems (Bruns et al., 2008; Taylor, 2008). The desire to sequence whole communities of fungi from any given study site places such high demands on sequencing throughput as to call into question the use of the presently popular capillary (Sanger)-based techniques in the first place (cf. Metzker, 2005; Kahvejian et al., 2008). Indeed, emerging sequencing technologies with the capacity to generate hundreds of thousands of limited-length sequences within a matter of hours promise to take over the sequencing role for environmental studies (Strausberg et al., 2008). Three major platforms (Applied Biosystems SOLiD, Illumina Sequencing, and 454 Life Science/Roche massively parallel pyrosequencing) are presently in use for high-throughput sequencing, but only pyrosequencing yields sequences long enough to be considered for rigorous use in a species-level classification framework (SOLiD, 31 bp; Illumina, 36 bp; pyrosequencing, c. 250 bp; Shendure & Ji, 2008). Although the current pyrosequencer GS FLX Standard is bound by an upper sequence length of about 250 bp, pyrosequencing of target genes and regions known to be sufficiently variable should in theory yield enough information to allow identification to the species level (Liu et al., 2008).

In mycology, the internal transcribed spacer (ITS) region of the nuclear ribosomal repeat unit is by far the most commonly sequenced region for queries of systematics and taxonomy at and below the genus level. Although the ITS region is not entirely unproblematic (Feliner & Rosselló, 2007), >100 000 fungal ITS sequences have been deposited in the International Nucleotide Sequence Databases (INSD; Benson et al., 2008) since the early 1990s (Nilsson et al., 2008). The roughly 650-bp region is normally obtainable in a single round of Sanger DNA sequencing, and of its three subregions (the spacers ITS1 and ITS2 and the 5.8S gene), two (ITS1 and ITS2) show a high rate of evolution and are typically species specific (Bruns & Shefferson, 2004; Kõljalg et al., 2005). The large number of ITS copies per cell (upwards of 250; Vilgalys & Gonzalez, 1990) makes the region an appealing target for sequencing substrates where the initial amount of DNA is low, such as in environmental samples from soil and wood. Jointly, these observations make a compelling case for the ITS region as a prime target for pyrosequencing – targeted at either the ITS1 or ITS2 – of environmental samples of fungi. Based on the 39 200 available environmental ITS sequences of fungi, the present study investigates the ramifications of choosing either of these two subregions over the other, as well as over the whole ITS region, for purposes of molecular characterization of fungal communities. Questions of how to make the most of the data from high-throughput sequencing of environmental samples are cast in a taxonomic perspective.

Materials and methods

All fungal ITS sequences annotated as such in INSD as of November 2008 were downloaded and divided into two datasets: those that were identified to the species level (fully identified sequences, FIS) and those that were not (insufficiently identified sequences, IIS) following the procedure of Nilsson et al. (2005). The fungus-specific Hidden Markov Models of Ryberg et al. (2009) were used to locate and extract the ITS1 and ITS2 from the sequences, and all entries were stored in a local MySQL database (http://www.mysql.com). The IIS are, to a large degree, obtained through environmental sampling, which makes them attractive as query sequences in studies addressing the properties of environmental sequencing. Thus, to simulate the authentic situation where unidentified sequences have been obtained through sequencing of environmental samples and are queried against the INSD for taxonomic affiliation using blast (Altschul et al., 1997), all IIS featuring both the ITS1 and the ITS2 (in full or in part; defined as >40 bp) were compared in full against the FIS dataset using blast 2.2.18. These comparisons were repeated using only the ITS1, and then the ITS2, of these IIS to mimic limited-length pyrosequencing data. All entries were tagged with the best blast match to the FIS dataset for the complete sequence data as well as for each of the ITS1 and ITS2. To minimize the impact of questionable matches and technical artefacts (Nilsson et al., 2006), only sequences where both the ITS1 and ITS2 found relevant matches (blast E-value threshold, ≤10^-10) among the FIS were used for comparison. To examine the impact of partial vs. full ITS1 and ITS2 data, respectively, the results from blast analysis of the entire ITS1 region of four sets of 10 000 ITS sequences were contrasted with the results obtained through analysing only the first 100 bp of the same sequences (and similarly for the ITS2; Supporting Information, Appendix S1). The IIS for which one or both of the ITS1 and ITS2 were missing are not treated any further in this study and are excluded from the statistics reported below. Synonyms and anamorph–teleomorph relationships were established through the Centraalbureau voor Schimmelcultures databases (Crous et al., 2004; http://www.cbs.knaw.nl/databases/) and are accounted for in the following.
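The filtering and comparison steps described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the record layout (`BestHit`, `Query`), the function names and the category labels are hypothetical and introduced here only for clarity. Only the >40 bp length criterion and the E-value threshold of 10^-10 are taken from the text.

```python
from collections import Counter
from typing import Iterable, NamedTuple

MIN_SUBREGION_LENGTH = 40   # from the text: a usable ITS1/ITS2 is >40 bp
MAX_E_VALUE = 1e-10         # from the text: blast E-value threshold


class BestHit(NamedTuple):
    """Best reference match for one query region (hypothetical layout)."""
    species: str
    e_value: float


class Query(NamedTuple):
    """One unidentified sequence with its three best hits (hypothetical layout)."""
    its1_length: int
    its2_length: int
    full: BestHit
    its1: BestHit
    its2: BestHit


def passes_filters(q: Query) -> bool:
    """Keep a query only if both spacers are long enough and both found a relevant match."""
    long_enough = (q.its1_length > MIN_SUBREGION_LENGTH
                   and q.its2_length > MIN_SUBREGION_LENGTH)
    relevant = q.its1.e_value <= MAX_E_VALUE and q.its2.e_value <= MAX_E_VALUE
    return long_enough and relevant


def agreement_category(q: Query) -> str:
    """Classify how the species named by the full region, ITS1 and ITS2 relate."""
    full, its1, its2 = q.full.species, q.its1.species, q.its2.species
    if full == its1 == its2:
        return "all_agree"
    if its1 == its2:
        return "spacers_agree_full_differs"
    if full == its1:
        return "full_agrees_with_its1_only"
    if full == its2:
        return "full_agrees_with_its2_only"
    return "all_differ"


def summarize(queries: Iterable[Query]) -> Counter:
    """Count agreement categories over the queries that pass both filters."""
    return Counter(agreement_category(q) for q in queries if passes_filters(q))
```

Calling `summarize` on a list of `Query` records returns a tally per category; dividing each count by the number of retained queries would give percentages of the kind reported in the Results. Synonym and anamorph–teleomorph handling is deliberately left out of this sketch.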

Results

A total of 100 639 fungal ITS sequences dating from 1992 onwards were downloaded from INSD. Sixty-one percent (61 471 sequences) were identified to species level, leaving 39% (39 168 sequences) insufficiently identified. A complete or partial ITS1 was extracted and found to have a sufficiently good match to the FIS dataset for 77% of the IIS; the corresponding value for ITS2 was 80%. A total of 26 577 sequences (68% of the IIS) fulfilled all the criteria, that is, they had ITS1 and ITS2 of sufficient length and produced sufficiently good matches to the FIS dataset for both ITS1 and ITS2; these were designated as the query sequences of the study. The average length of the full IIS under scrutiny (including all three subregions and any part of the flanking ribosomal subunit genes) was 646 bp; that of the ITS1 was 182 bp; and that of the ITS2 was 183 bp.
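As a quick consistency check, the percentages in this paragraph can be recomputed from the stated counts. The variable names below are illustrative; the counts themselves are those given in the text.

```python
# Counts as stated in the text.
total_sequences = 100_639
fully_identified = 61_471            # FIS
insufficiently_identified = 39_168   # IIS
query_sequences = 26_577             # IIS that met all criteria

# The two datasets together account for every downloaded sequence.
assert fully_identified + insufficiently_identified == total_sequences

fis_share = 100 * fully_identified / total_sequences
iis_share = 100 * insufficiently_identified / total_sequences
query_share_of_iis = 100 * query_sequences / insufficiently_identified
iis_to_fis = insufficiently_identified / fully_identified

print(f"{fis_share:.0f}% {iis_share:.0f}% {query_share_of_iis:.0f}% {iis_to_fis:.2f}")
# prints: 61% 39% 68% 0.64
```

The last figure corresponds to the IIS/FIS proportion for the whole dataset listed in Table 1.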

A moderate 22% of the entries were found to have the same INSD entry (accession number) as their best blast match regardless of which one of the regions (full sequence, ITS1, or ITS2) was used as a query (Table 1). The choice of region had a clear impact on the perceived taxonomic affiliation of the sequence, with no less than 51% of the IIS showing not just another INSD entry but another species altogether as their best match (and in 21% of the total number of cases even a different genus) depending on which region was used for comparison. The three regions disagreed completely on the species level in 14% of the cases (but in only 4% on the genus level). Thus, with respect to taxonomic affiliation, only in 49% of the cases did the choice of target region not matter at all. The full ITS region yielded the same blast results in terms of taxonomic affiliations as one, but not both, of its ITS1 and ITS2 26% of the time, with ITS2 (14%) concurring more often than the ITS1 (12%) with the taxonomic affiliation suggested by the entire ITS region. The ITS1 and ITS2 reported the same species, which was not suggested by the complete sequence, as their best blast match in a total of 11% of the cases. Roughly 20% of the ITS1 sequences under examination were assigned a different taxonomic affiliation by blast depending on whether the full ITS1 data or only the first 100 bp of the ITS1 were used (ITS2, 22%; Appendix S1).
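The species-level percentages above fit together as a partition of the query sequences, which the short sketch below makes explicit. Treating the four outcomes as mutually exclusive and exhaustive is an interpretation of the text rather than something it states outright, and the variable names are illustrative.

```python
# Species-level outcomes as reported in the text (percent of query sequences).
all_agree = 49                    # full region, ITS1 and ITS2 name the same species
full_agrees_with_its1_only = 12
full_agrees_with_its2_only = 14
spacers_agree_full_differs = 11   # ITS1 == ITS2, full region names another species
all_differ = 14                   # three different species

full_agrees_with_one = full_agrees_with_its1_only + full_agrees_with_its2_only

total = all_agree + full_agrees_with_one + spacers_agree_full_differs + all_differ

# Any disagreement at all between the three regions.
any_disagreement = total - all_agree

# ITS1 and ITS2 differ exactly when the full region sides with one spacer or with neither.
spacers_disagree = full_agrees_with_one + all_differ

print(full_agrees_with_one, total, any_disagreement, spacers_disagree)
# prints: 26 100 51 40
```

Read this way, the 40% figure for ITS1/ITS2 disagreement in Table 1 is the sum of the 26% and 14% categories.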

Table 1.   Summary statistics for the fungal ITS sequences in INSD as of December 2008 and the results from their analysis (in full and as broken up into constituent subregions) using blast
Number of ITS sequences in INSD: 100 639
Number of ITS sequences with >40 bp ITS1: 90 200
Number of ITS sequences with >40 bp ITS2: 93 655
Number of ITS sequences with >40 bp of both ITS1 and ITS2: 85 914
Percentage of cases where the whole ITS region, its ITS1 and ITS2 are best matched by the same INSD entry (accession number): 22%
Percentage of cases where the whole ITS region, its ITS1 and ITS2 are best matched by the same species: 49%
Percentage of cases where the whole ITS region, its ITS1, and its ITS2 each are matched by different species: 14%
Percentage of cases where the ITS1 and ITS2 are best matched by the same species, but the whole region is best matched by another species: 11%
Percentage of cases where the ITS1 and ITS2 are best matched by different species: 40%
Total number of species in the whole FIS dataset: 13 351
Total number of species in the FIS ITS1 dataset: 12 699
Total number of species in the FIS ITS2 dataset: 13 103
Proportion of IIS/FIS in the whole dataset: 0.64
Proportion of IIS/FIS in the ITS1 dataset: 0.60
Proportion of IIS/FIS in the ITS2 dataset: 0.61

Discussion

Present pyrosequencing methods yield read lengths up to about 250 bp, a marked improvement over the 80–100 bp generated by the first generation of pyrosequencing machines, but only a third or perhaps half of both the length of a typical capillary sequencing round and the length needed to cover the ITS region in full for a wide selection of fungi. Improvements in the length of pyrosequencing reads are anticipated over time, but, at present, the user interested in sequencing the ITS region with pyrosequencing technology has to make a choice as to what part of the ITS to target. As if to underline the dangers of taking this decision too lightly, the present study shows that the choice of target region will have an effect on one's perception of the taxonomic diversity in the sample at hand. This is, at some level, expected due to the variable nature of the ITS1 and ITS2, which is done full justice only when they are compared separately from the very conserved flanking and intercalary genes. Furthermore, the partial state of some ITS sequences in INSD, with either the ITS1 or the ITS2 missing entirely from a proportion of the sequences (10% of the FIS and 21% of the IIS), can be expected to introduce a degree of bias in such comparisons. Even so, the magnitude of the discrepancies is such that it is likely to find its way into large pyrosequencing datasets where automated processing of the output is the only feasible approach to species identification. More worrying still is the observation that ITS1 and ITS2 disagree over the taxonomic affiliation of the underlying query sequence in no less than 40% of the cases (Table 1), although this figure is in part explained by the presence of species groups with no or little interspecific variation. The blast output order for hits with identical match statistics – even though the species names may differ – is for all practical purposes random. Incorrectly annotated sequences, too, are likely to have influenced these estimates somewhat (cf. Bidartondo et al., 2008).
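The point about tied blast hits suggests one safeguard for automated processing: rather than accepting the first hit listed, collect every species whose hit shares the best match statistics and flag the query as ambiguous when there is more than one. The sketch below illustrates that idea. It was not part of the present analysis, and the `Hit` layout and function names are hypothetical.

```python
from typing import List, NamedTuple, Set


class Hit(NamedTuple):
    """One blast hit reduced to the fields needed here (hypothetical layout)."""
    species: str
    bit_score: float
    e_value: float


def tied_top_species(hits: List[Hit]) -> Set[str]:
    """Return every species whose hit shares the best score and E-value."""
    if not hits:
        return set()
    # Best hit: highest bit score, ties broken by lowest E-value.
    best = max(hits, key=lambda h: (h.bit_score, -h.e_value))
    return {
        h.species
        for h in hits
        if h.bit_score == best.bit_score and h.e_value == best.e_value
    }


def is_ambiguous(hits: List[Hit]) -> bool:
    """True when two or more species are tied for the best match."""
    return len(tied_top_species(hits)) > 1
```

A query flagged in this way could be reported at the genus level, or with the full set of tied species, instead of under whichever name happened to be listed first.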

These results show that species-oriented ecosystem studies based on the whole of the ITS region – as is normally done today – and those based on pyrosequencing of either the ITS1 or the ITS2 – an approach expected to gain popularity rapidly over the next few years – may portray different pictures of the fungal diversity under scrutiny, a fact that strongly militates against ecological conclusions based on a direct comparison of such sets of results. The present study, along with others, testifies to the benefits of analysing the ITS1/ITS2 in isolation (i.e. with the flanking and intercalary, highly conserved genes removed), at least if the goal is to identify the sequences to the species or the genus – as opposed to ordinal or phylum – level (cf. Bruns & Shefferson, 2004). In the interest of comparison of ecosystems from different studies, the mycological community would be best served by standardizing on one of the two subregions of the ITS as the basis of such pyrosequence-based studies of environmental samples. The two subregions are roughly equal in terms of variability and length, but there are more ITS2 than ITS1 sequences available for comparison in INSD (Table 1). More importantly, the gene in the downstream region of the ITS2 (encoding the ribosomal large subunit, or the 25/28S) is known to be substantially more useful for species identification and phylogenetic inference up to the ordinal level than the gene downstream of the ITS1, i.e. the very conserved 5.8S. Thus, any additional downstream region retrieved while sequencing the ITS2 may contribute a further signal to the identification procedure while those downstream of the ITS1 are less likely to be helpful. These observations, together with the wide selection of auxiliary resources available for the ITS2 (e.g. Selig et al., 2008; Coleman, 2009; Keller et al., 2009), make a case for the ITS2 as the better choice for parallel sequencing, although the issue of primer optimization in the fungal ITS region needs further attention.

The data presented above leave little room for interpretation on one pressing issue: the largest obstacle to routine, en masse identification of fungal sequences to the species level is the striking paucity of well-identified, extensively annotated, and sequence coverage-wise complete reference sequences, preferably stemming from vouchered specimens kept in public herbaria, in INSD (cf. Brock et al., 2009). Indeed, the sheer number of sequences from any pyrosequencing study is likely to further dilute the already limited presence of FIS in the blast hit lists so as to complicate any identification procedure even more. The present study shows the INSD to contain FIS from the ITS region – regardless of their suitability as reference sequences – for about 13 350 species, a very modest 0.9% of the estimated number of extant fungal species. Of the many issues elaborated on in the barcoding and molecular identification debate, taxonomy may well be the least considered and furthermore the one where progress is the slowest and most painstaking. The mycological community will soon be awash in data in the form of unidentified – and often unidentifiable – fungal ITS sequences from an abundance of study sites and ecosystems, data with which taxonomy in its current practice cannot be expected to keep pace. It would be a severe set-back for mycology if such unidentified taxa were to be given a different, ad hoc name in each study in which they were recovered, as this would effectively close the route to interpreting the taxa in the light of other studies. A temporary system for formalizing clusters of unidentifiable and to all appearances conspecific sequences into standardized and nonarbitrarily named molecular species pending formal taxonomic interpretation and assignment is likely to prove to be the only sustainable way to maintain data comparability across studies and sites (cf. Ryberg et al., 2008; Horton et al., 2009). Any other, nonstandardized way of delimiting and referring to such sequence clusters will only serve to add further to the mounting burden of the progressively fewer, and severely underfinanced, still active fungal taxonomists. High-throughput sequencing represents an amazing technological feat that promises to reshape mycology, but unless a unified infrastructure for processing and interpretation of the results in a taxonomic context can be agreed upon and implemented, the benefits of community profiling may well come at the price of the integrative nature of current public sequence repositories.

Acknowledgement

R.H.N. and K.A. acknowledge infrastructural support from the Fungi in Boreal Forest Soils network.

References

Supporting Information

Appendix S1. Additional statistics pertaining to the IIS and FIS datasets.

Please note: Wiley-Blackwell is not responsible for the content or functionality of any supporting materials supplied by the authors. Any queries (other than missing material) should be directed to the corresponding author for the article.

Filename: FML_1618_sm_appendixS1.pdf
Size: 123K
Description: Supporting info item
