An outlook on the fungal internal transcribed spacer sequences in GenBank and the introduction of a web-based tool for the exploration of fungal diversity

Author for correspondence:
Martin Ryberg
Tel: +46 31 786 48 07
Fax: +46 31 786 25 60
Email: martin.ryberg@dpes.gu.se

Summary

  • The environmental and distributional data associated with fungal internal transcribed spacer (ITS) sequences in GenBank are investigated and a new web-based tool with which these sequences can be explored is introduced.
  • All fungal ITS sequences in GenBank were classified as either identified to species level or insufficiently identified and compared using BLAST. The results are made available as a biweekly updated web service that can be queried to retrieve all insufficiently identified sequences (IIS) associated with any fungal genus.
  • The most commonly available annotation items in GenBank are isolation source (55%); country of origin (50%); and specific host (38%). The molecular sampling of fungi shows a bias towards North America, Europe, China, and Japan whereas vast geographical areas remain effectively unexplored. Mycorrhizal and parasitic genera are on average associated with more IIS than are saprophytic taxa. Glomus, Alternaria, and Tomentella are the genera represented by the highest number of insufficiently identified ITS sequences in GenBank.
  • The web service presented (http://andromeda.botany.gu.se/emerencia.html#genus_search) offers new means, particularly for mycorrhizal and plant pathogenic fungi, to examine the IIS in GenBank in a taxon-oriented framework and to explore their metadata in an easily accessible and time-efficient manner.

Introduction

Fungi are a large and diverse group of organisms that serve many essential ecological functions, such as wood and litter decomposition, mycorrhizal symbiosis, and parasitism. Many aspects of the lives of fungi are difficult to study, however, as most species are only observable when they form conspicuous fruiting bodies. By contrast, the main part of the fungal life cycle takes place as a somatic mycelium inside, or otherwise associated with, the substrate inhabited. The advent of molecular methods has therefore facilitated a deeper understanding of fungal ecology (e.g. Crozier et al., 2006; Taylor & McCormick, 2008). This is particularly true for mycorrhizal communities (Horton & Bruns, 2001). As sporulation structures such as fruiting bodies and conidiophores can be linked to species descriptions, it is possible to infer the identity of sequences from environmental samples by correlating them to sequences from specimens of known identity. In mycology, sequences from the internal transcribed spacer (ITS) region of the nuclear ribosomal DNA are commonly used for the identification of fungi (Kõljalg et al., 2005; Naumann et al., 2007; Nilsson et al., 2008). However, although this is one of the most frequently sequenced regions, ITS sequences from well-identified fruiting bodies are estimated to be available for < 1% of the hypothesized number of fungal species (Nilsson et al., 2005). There are thus many sequences from environmental samples whose species affiliation remains unknown in that they cannot be satisfactorily matched to a sequence of known taxonomic identity (cf. Horton et al., 2005; Bastias et al., 2006; Kjøller, 2006).

The International Nucleotide Sequence Database (INSD: GenBank, European Molecular Biology Laboratory (EMBL), and DNA Database of Japan (DDBJ); Benson et al., 2008) is the major open repository for sequence data. As a part of the documentation of a scientific study, most international journals require that all sequences used in a manuscript be made available through such public databases. Consequently, as a part of the publication process of environmental studies, many sequences that are not identified to the species level (i.e. that are insufficiently identified) are submitted to the INSD. Although these sequences are far from rigorously and homogeneously annotated, their metadata can provide valuable information on the ecology and distribution of a group of organisms or contribute important data in a phylogenetic context (Weiss et al., 2004; Porter et al., 2008; Ryberg et al., 2008).

Here we investigate the taxonomic distribution of all insufficiently identified fungal ITS sequences in INSD (referred to as ‘insufficiently identified sequences’ (IIS)). We also compare the IIS with fully identified fungal ITS sequences (referred to as ‘fully identified sequences’ (FIS)) and examine the classes of metadata that are available for both types of sequence. In addition, we introduce a web service that enables searches for IIS associated with any user-specified fungal genus. This offers new possibilities to explore the IIS in a taxon-oriented way and to synthesize pertinent information on taxonomy, ecology, and distribution of the focal genus for subsequent use.

Materials and Methods

The software package emerencia (Nilsson et al., 2005; http://andromeda.botany.gu.se/emerencia.html) was used to download all fungal ITS sequences from GenBank and to separate fully identified sequences (i.e. the FIS) from those without a full species name (i.e. the IIS). The sequences were stored in two separate tables in a MySQL 5.16.3 database (http://www.mysql.com). In addition, the variable subregion ITS2 was detected and extracted using the hidden Markov models of Nilsson et al. (2008) and was also stored in two separate tables. BLAST 2.2.9 (Altschul et al., 1997) was used to find the closest FIS for all IIS. A Perl script (Supporting Information Text S1) was implemented to search the database for the species among the identified sequences that form the best BLAST match of at least one IIS for the full sequence data. The results were used to calculate how many IIS are associated with each genus. For the genera associated with more than 100 IIS, the number of fully identified sequences and the number of species they belong to according to their taxonomic annotation were obtained from the database. The total number of species for these genera was obtained from Kirk et al. (2001) as this is the most comprehensive recent work including such estimates. In addition, a function to perform searches to determine which IIS are associated with any user-specified genus was constructed (Supporting Information Text S2). To evaluate the proportion of IIS that are assigned by BLAST to an incorrect genus, 100 IIS were randomly selected and aligned with their 10 closest BLAST matches using Clustal W 1.83 (Thompson et al., 1994). These alignments were then investigated to determine whether the assigned generic affiliations were probable.
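The two core steps described above, separating FIS from IIS by species name and tallying IIS per genus from their best BLAST matches, can be illustrated with a minimal Python sketch. This is not the Perl implementation of Text S1: the list of words taken to signal an incomplete name, the function names, and the toy accessions are all assumptions made for illustration, and the actual emerencia rules are not given in the text.

```python
import re
from collections import Counter

# Hypothetical markers of an incomplete species name (an assumption; the
# real emerencia classification rules are not described in the paper).
_VAGUE = re.compile(
    r"\b(sp|cf|aff|uncultured|unidentified|environmental|fungal|mycorrhizal)\b",
    re.IGNORECASE,
)


def is_fully_identified(organism: str) -> bool:
    """Return True if the organism name looks like a full Latin binomial."""
    parts = organism.split()
    if len(parts) < 2 or _VAGUE.search(organism):
        return False
    # Genus capitalised, epithet in lower-case letters (hyphens allowed).
    return parts[0][0].isupper() and re.fullmatch(r"[a-z-]+", parts[1]) is not None


def iis_per_genus(best_hits: dict) -> Counter:
    """Count IIS per genus.

    best_hits maps an IIS accession to the organism name of its best FIS
    BLAST match; the genus is taken as the first word of that name.
    """
    return Counter(name.split()[0] for name in best_hits.values() if name.split())


# Toy input with invented accessions, for illustration only.
toy_hits = {
    "IIS_0001": "Glomus intraradices",
    "IIS_0002": "Glomus mosseae",
    "IIS_0003": "Tomentella sublilacina",
}
for genus, n in iis_per_genus(toy_hits).most_common():
    print(genus, n)
```

Taking the genus from the best match alone mirrors the approach described in the text and inherits its limitation: a best BLAST hit is not proof of generic affiliation.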

To give the user of the web service an overview of the present state of ITS sampling of fungi, the ITS sequences were compared for taxonomic affiliation, identification status, and ecological roles. Furthermore, information on the classes of metadata that come bundled with the FIS and the IIS, respectively, was obtained through parsing all entries in Perl for information in the Features field of the GenBank annotation. The software package of Nilsson et al. (2006) was used to investigate the number of fully/insufficiently identified sequences submitted to INSD over time and the proportion of IIS that originate from unpublished studies.
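As a rough illustration of how metadata coverage could be measured, the following Python sketch scans GenBank flat-file text for feature qualifiers and reports the percentage of records carrying each one. It is not the authors' Perl parser; the function names and the toy records are hypothetical, and the qualifier names passed in (for example country, isolation_source, lat_lon) are assumed to follow the INSD feature table conventions.

```python
import re
from collections import Counter

# A qualifier line in a GenBank flat file starts with "/" after indentation,
# e.g.   /country="Sweden: Uppsala"
QUALIFIER = re.compile(r"^\s*/(\w+)")


def qualifiers_in_record(record_text: str) -> set:
    """Collect the names of all qualifiers that occur in one record."""
    keys = set()
    for line in record_text.splitlines():
        match = QUALIFIER.match(line)
        if match:
            keys.add(match.group(1))
    return keys


def coverage(records: list, fields: list) -> dict:
    """Percentage of records that carry each of the requested qualifiers."""
    if not records:
        raise ValueError("no records supplied")
    counts = Counter()
    for record in records:
        present = qualifiers_in_record(record)
        counts.update(field for field in fields if field in present)
    return {field: 100.0 * counts[field] / len(records) for field in fields}


# Two invented records, for illustration only.
toy_records = [
    '     source          1..600\n'
    '                     /organism="uncultured fungus"\n'
    '                     /country="Sweden"\n'
    '                     /isolation_source="root tip"',
    '     source          1..550\n'
    '                     /organism="Russula sp."\n'
    '                     /country="USA: California"',
]
print(coverage(toy_records, ["country", "isolation_source", "lat_lon"]))
```

A line-based scan like this is deliberately simple; a multi-line qualifier value whose continuation line happens to begin with a slash would be miscounted, so a production parser would need to track quoted values.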

Results

The genus search

A new search function has been added to the emerencia web service and is accessible through http://andromeda.botany.gu.se/emerencia.html#genus_search. By providing a genus name, the user can retrieve a list of all insufficiently identified INSD sequences found to be associated with the genus through BLAST searches. The user is given the choice to base the search on the entire ITS region (default) or only the more variable sublocus ITS2 to lessen the impact of the very conserved 5.8S and Large Subunit (LSU) regions. The ITS2 option does, however, mean a loss of c. 6% of the available sequences (i.e. sequences in which the ITS2 could not be detected for whatever reason) and the default mode is what is used in this paper. The IIS are output together with their best BLAST match and are extensively linked to more detailed presentations of each entry and to tools and resources to assist in the quality control of the data (Supporting Information Text S3). The output is also summarized in three tables. The first table presents the source literature of the IIS and details how many sequences are associated with each reference (including links to the study in Google Scholar; http://scholar.google.com). The second table lists the source literature of the identified sequences that constitute the best BLAST matches of the IIS. Finally, the third table specifies the individual species (in the genus) that form the best BLAST match of at least one IIS, the individual sequences that constitute these best BLAST matches, and the number of IIS that are associated with each species. It is also possible to have the IIS output in the FASTA format (Pearson & Lipman, 1988) for incorporation into, for example, alignment programs, phylogenetic analyses, or additional quality control steps. Sequences inadvertently submitted to the National Center for Biotechnology Information (NCBI) as reverse complements are detected and displayed in the correct orientation in the web service as long as they feature the last third of the 5.8S.
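The logic of the genus search, its per-species summary (the third table), and the FASTA export can be sketched in a few lines of Python. This is an illustration only and not the code behind the web service: the Hit record, the function names, and the header layout of the FASTA output are assumptions, and the toy accessions are invented.

```python
from collections import Counter
from typing import NamedTuple


class Hit(NamedTuple):
    """One IIS together with its best fully identified BLAST match."""
    iis_id: str       # accession of the insufficiently identified sequence
    fis_id: str       # accession of the best-matching identified sequence
    fis_species: str  # species name annotated on that best match
    iis_seq: str      # nucleotide sequence of the IIS


def genus_search(hits, genus: str):
    """Return all hits whose best match belongs to the requested genus."""
    wanted = genus.lower()
    selected = []
    for hit in hits:
        parts = hit.fis_species.split()
        if parts and parts[0].lower() == wanted:
            selected.append(hit)
    return selected


def species_summary(hits) -> Counter:
    """Number of IIS associated with each species of the genus."""
    return Counter(hit.fis_species for hit in hits)


def to_fasta(hits, width: int = 60) -> str:
    """Format the IIS as FASTA, wrapping sequence lines at `width` characters."""
    lines = []
    for hit in hits:
        lines.append(f">{hit.iis_id} best_match={hit.fis_id} ({hit.fis_species})")
        lines.extend(hit.iis_seq[i:i + width] for i in range(0, len(hit.iis_seq), width))
    return "\n".join(lines) + "\n"


# Invented data, for illustration only.
toy = [
    Hit("IIS1", "FIS1", "Tomentella sublilacina", "ACGT" * 20),
    Hit("IIS2", "FIS2", "Russula emetica", "ACGT"),
    Hit("IIS3", "FIS1", "Tomentella sublilacina", "GGCC"),
]
found = genus_search(toy, "Tomentella")
print(species_summary(found))
print(to_fasta(found), end="")
```

The detection of reverse complementary entries is not covered here, since it relies on locating the 5.8S region and the details of that step are not given in the text.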

A closer look at the fungal ITS sequences

There were 50 956 (65%) fully identified and 27 364 (35%) insufficiently identified ITS sequences in GenBank as of February 2008. The proportion of IIS among the fungal ITS sequences being submitted to INSD each year increases steadily such that their deposition rate now parallels that of FIS (Fig. 1). The IIS were found to be associated with 1148 different fungal genera, which represents 57% of the 2014 fungal genera present in the FIS data set of INSD. A total of 260 (23%) of the 1148 genera had only one IIS associated with them while 391 (34%) genera had ≥ 10, and 49 (4%) genera had > 100 associated sequences. Of the 49 genera with > 100 associated sequences, 15 (30%) are well-known mycorrhiza formers and 5 (10%) include mycorrhizal, or putatively mycorrhizal, species while 25 (51%) are known to include parasites (Table 1). Twenty-six (53%) of the 49 genera associated with > 100 sequences were found to belong to the Ascomycota, 21 (42%) to the Basidiomycota, one (2%) to the Glomeromycota, and one (2%) to the Zygomycota. The full list of genera is automatically updated and available through the emerencia web service (http://andromeda.botany.gu.se/genuslist.html). Of the 100 IIS that were investigated more thoroughly as a quality assessment measure, one was determined to belong to a genus other than the genus to which it had been assigned by BLAST and it was not possible to assign 13 to any genus with certainty. Of these 13, about half (54%) were classified as associated with various anamorphic Ascomycota genera.
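The headline proportions in this paragraph follow directly from the reported counts, as the short Python check below shows. All numbers are taken from the text; the helper name pct is a convenience introduced here, and the figure of 86 sequences with a plausible generic assignment is derived here by subtraction rather than stated by the authors.

```python
def pct(part: int, whole: int) -> int:
    """Percentage rounded to the nearest whole number."""
    return round(100 * part / whole)


fis, iis = 50_956, 27_364
total = fis + iis                      # all fungal ITS sequences considered

genera_with_iis, genera_in_fis = 1148, 2014

print(total)                           # combined number of sequences
print(pct(fis, total), pct(iis, total))
print(pct(genera_with_iis, genera_in_fis))
print(pct(260, genera_with_iis), pct(391, genera_with_iis), pct(49, genera_with_iis))

# Quality assessment: 100 sampled IIS, 1 assigned to the wrong genus and
# 13 that could not be assigned to any genus with certainty.
sampled, wrong, uncertain = 100, 1, 13
plausible = sampled - wrong - uncertain
print(plausible)
```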

Figure 1.

The number of fungal internal transcribed spacer (ITS) sequences submitted to (or modified in) the International Nucleotide Sequence Database (INSD) each year divided into fully identified and insufficiently identified sequences with respect to taxonomic affiliation. Fully identified sequences are represented as light grey and insufficiently identified sequences as dark grey.

Table 1.  The 49 fungal genera that have more than 100 insufficiently identified sequences (IIS) associated with them

Number of IIS¹ | Number of fully identified sequences² | Genus name³ | Estimated total number of species in the genus⁴ | Main nutritional mode⁵
1279 | 426 (30) | Glomus (G) | 85 | M
946 | 409 (60) | Alternaria (A) | 50 | P/S
762 | 79 (28) | Tomentella (B) | 75 | M
724 | 1195 (440) | Cortinarius (B) | 2000 | M
582 | 65 (11) | Sclerotinia (A) | 8 | P
564 | 1439 (75) | Fusarium (A) | 50 | S/P
530 | 306 (145) | Russula (B) | 750 | M
498 | 671 (2) | Thanatephorus (B) | 11 | P
422 | 269 (43) | Cladosporium (A) | 60 | S/P
402 | 718 (49) | Trichoderma (A) | 34 | Sym/P
392 | 28 (11) | Tulasnella (B) | 46 | S/M/Sym
379 | 630 (78) | Cryptococcus (B) | 37 | S/P
348 | 20 (9) | Sebacina (B) | 6 | M
348 | 59 (5) | Piloderma (B) | 6 | M
346 | 246 (64) | Rhizopogon (B) | 150 | M
326 | 198 (24) | Phoma (A) | 40 | P
311 | 457 (129) | Lactarius (B) | 400 | M
303 | 270 (110) | Inocybe (B) | 500 | M
295 | 21 (9) | Ceratobasidium (B) | 11 | P/S/M?
291 | 121 (14) | Lophodermium (A) | 103 | P/S/Sym?
283 | 867 (72) | Hypocrea (A) | 100 | S/P?
278 | 35 (9) | Mortierella (Z) | 90 | S/P
277 | 909 (163) | Penicillium (A) | 223 | S
230 | 67 (1) | Cenococcum (A) | 1 | M
226 | 331 (63) | Tricholoma (B) | 200 | M
222 | 945 (126) | Candida (A) | 165 | S/Sym/P
201 | 40 (2) | Epicoccum (A) | 2 | S/Sym?
195 | 110 (9) | Phialocephala (A) | 20 | S/M/P
182 | 67 (7) | Cadophora (A) |  | P/S/Sym
181 | 429 (64) | Puccinia (B) | 4000 | P
172 | 154 (18) | Diaporthe (A) | 75 | P/S
160 | 84 (2) | Aureobasidium (A) | 7 | P/Sym
156 | 80 (38) | Phaeosphaeria (A) | 45 | P/S
155 | 149 (32) | Phomopsis (A) | 100 | P
152 | 846 (115) | Mycosphaerella (A) | 500 | P
142 | 21 (3) | Leptodontidium (A) | 10 | P/Sym
140 | 852 (37) | Tuber (A) | 63 | M
128 | 227 (33) | Ganoderma (B) | 50 | S/P
128 | 185 (9) | Gibberella (A) | 10 | P/Sym?
128 | 7 (3) | Craterellus (B) | 20 | M
127 | 16 (2) | Tylospora (B) | 2 | M
125 | 164 (14) | Nectria (A) | 28 | P
123 | 119 (33) | Leucoagaricus (B) | 75 | S
122 | 4 (1) | Amphinema (B) | 4 | M
121 | 38 (2) | Meliniomyces (A) |  | M?
116 | 266 (50) | Rhodotorula (B) | 34 | S/Sym?
113 | 281 (63) | Pestalotiopsis (A) | 50 | P/S/Sym
109 | 70 (30) | Xylaria (A) | 100 | S/Sym
104 | 2 (2) | Tremellodendron (B) | 8 | M/Sym

  ¹ Associated with the genus through BLAST.
  ² The number of species these sequences represent (according to the International Nucleotide Sequence Database (INSD) annotation) is given in parentheses.
  ³ Phylum affiliation given in parentheses as: A, Ascomycota; B, Basidiomycota; G, Glomeromycota; Z, Zygomycota.
  ⁴ According to Kirk et al. (2001).
  ⁵ M, mycorrhizal; P, parasitic; S, saprophytic; Sym, symbiotic (other than mycorrhizal). Uncertainty in the nutritional mode is indicated by ?.

The proportion of species that have been sequenced varies considerably among the 49 genera with > 100 associated IIS. The mycorrhizal genera with more than 10 species in total (as estimated by Kirk et al., 2001) have rather few species represented as FIS (15–54% of the species) while saprophytic and parasitic genera are generally better represented (2–150% of the estimated species represented as FIS). Indeed, many of the parasitic genera are represented by more species than they have been estimated to contain (Table 1).
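The proportion of species represented as FIS is the number of sequenced species divided by the estimated total number of species in the genus. The Python sketch below works this out for five genera, using values read from Table 1; the selection of genera and the variable names are choices made here for illustration.

```python
# genus: (species represented as FIS, estimated total species, nutritional mode)
# Values read from Table 1; the five genera are an arbitrary illustrative subset.
table1_subset = {
    "Glomus":      (30, 85, "M"),
    "Cortinarius": (440, 2000, "M"),
    "Russula":     (145, 750, "M"),
    "Fusarium":    (75, 50, "S/P"),
    "Puccinia":    (64, 4000, "P"),
}


def species_coverage(sequenced: int, estimated: int) -> float:
    """Percentage of the estimated species that are represented as FIS."""
    return 100.0 * sequenced / estimated


coverage = {
    genus: species_coverage(sequenced, estimated)
    for genus, (sequenced, estimated, _mode) in table1_subset.items()
}

# Genera with more sequenced species than their estimated total.
over_represented = [genus for genus, value in coverage.items() if value > 100]

for genus, value in coverage.items():
    print(f"{genus:12s} {value:6.1f}%")
print("More species in INSD than estimated:", over_represented)
```

A value above 100% does not indicate an error in the calculation; it means that the INSD annotation lists more species names for the genus than Kirk et al. (2001) estimated it to contain.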

Considering geographical metadata, the co-ordinates of the collection locality were given for only 5% of the IIS. A more encouraging 50% of the entries were explicitly annotated with a country of origin and of these a full 51% had a more precise geographical annotation (state, province, or similar; Table 2). For the FIS, the corresponding values were even lower: only 0.5% of the entries had geographical co-ordinates, and a modest 37% had an explicit country annotation. Taken together, the sequences were found to originate from all continents; the IIS from a total of 102 different countries and the FIS from 158 countries (Supporting Information Text S3). Despite this broad geographical sampling there was a clear overrepresentation of sequences from North America, Europe, China, and Japan while other regions have been less well sampled (Fig. 2). A specific host was given for 38% of the IIS and 18% of the FIS (Table 2; see Supporting Information Text S3 for a more detailed list) and isolation source was specified for 55% and 10%, respectively; both fields are good sources of information on host species for the fungi. The isolation source field can also provide more specific information on the substrate of the fungi. Additionally, 47% of the IIS and 65% of the FIS had auxiliary annotations in the Note field.

Table 2.  Available metadata in the Features field for the insufficiently (IIS) and fully (FIS) identified fungal internal transcribed spacer (ITS) sequences in the International Nucleotide Sequence Database (INSD)

Type of metadata | IIS (%) | FIS (%)
Country | 50 | 37
Latitude/longitude | 5 | 0.5
Note | 47 | 65
Specific host¹ | 38 | 18
  Vascular plants | 35.3 | 15.7
  Nonvascular plants | 0.4 | 0
  Fungi | 0.1 | 0.2
  Animals | 2.2 | 2
  Other substrates | 0.07 | 0.001
Isolation source² | 55 | 10
Specimen voucher | 7 | 21
Collected by | 3 | 1
Identified by | 0.4 | 1

  The table shows the percentage of fully identified and insufficiently identified sequences, respectively, with an annotation for the different annotation fields.
  ¹ Poorly applicable to generalists and often applied in a loose sense in INSD.
  ² Poorly applicable to unculturable fungi and often applied in a loose sense in INSD.

Figure 2.

Maps illustrating the number of insufficiently identified (a) and fully identified (b) fungal internal transcribed spacer (ITS) sequences originating from each country according to their International Nucleotide Sequence Database (INSD) country annotation. Full specifications and complementary information are provided in Supporting Information Text S3.

Although not examined here, the most important source of information for any entry is probably the publication from which the sequence originates. More than half (58%) of the IIS were marked as unpublished; the real proportion is probably considerably lower, however, as many sequence authors neglect, or take a very long time, to update the sequence annotations in INSD when their work is published (Nilsson et al., 2006).

Discussion

The increasing use of sequence-based methods for identification of fungi has opened new windows to the scientific pursuit of ecology. The knowledge gained from such data has, for example, changed our view of the ectomycorrhizal community in showing that it is much more species-rich and site-specific than previously thought (Horton & Bruns, 2001). It has moreover provided new insights into the ecology of many different groups of fungi where mycorrhizal associations are common, such as the Pezizales, Sebacinales, and Agaricales (Tedersoo et al., 2006; Selosse et al., 2007; Ryberg et al., 2008).

The present analysis of the IIS in INSD reveals strong biases in terms of both geography and taxon sampling. There is a preponderance of studies on mycorrhizal fungi and of studies on important parasites, while decay fungi have been less often recovered in environmental studies. This particular bias is further demonstrated by the fact that the overwhelming majority of the IIS annotated with a specific host were associated with plants (Table 2). We also found a geographical bias towards North America, Western Europe, and China and Japan and away from Africa and parts of Asia and South America (Fig. 2), an observation that also holds true for the FIS. This bias away from tropical areas is most unfortunate as they are predicted to be particularly rich in undescribed species (Hawksworth, 2004). Although falling short of giving a full description of diversity, sequencing of environmental samples might be an efficient way to obtain an overview of the richness of these areas (cf. Herbert et al., 2003). The lack, or low number, of FIS from these areas reduces the chances of integrating such environmental sequences into a more informative taxonomic framework.

There is furthermore a bias in how well represented different genera are, as seen by the proportion of their species for which FIS are available. Several genera were represented by more species in GenBank than they have been estimated to contain. This could partly reflect rapid taxonomic development in these genera over the last few years, which has rendered the estimates of Kirk et al. (2001) somewhat outdated. It could also be a result of taxonomic difficulties such as widespread use of synonymous names – particularly in relation to anamorphic stages – or use of names that are not formally published. The opposite problem – that most genera are poorly covered by DNA sequencing efforts – is, however, more substantial and more pressing. The fact that less than half of the species in most of the larger mycorrhizal genera have been sequenced emphasizes the urgent need for extended sequencing efforts targeted at these fungi. Unfortunately, such projects rarely seem to rank high in the priorities of the funding agencies. It would be preferable if such efforts were to involve type specimens or well annotated and easily accessible voucher specimens to avoid adding to the taxonomic complications in INSD (cf. Bidartondo et al., 2008). If the ambition is to be able to determine environmental sequences to species level, it seems especially important to focus on species in Glomus, Cortinarius, Tomentella, and Russula as these account for a large proportion of the IIS (Table 1). However, Nilsson et al. (2006) showed that as many as 6% of the fungal ITS sequences lack a satisfactory BLAST match altogether, which is likely to be a reflection of the complete absence of sequences in INSD for many genera. In order to increase taxonomic resolution in molecular environmental studies, such genera should be prime targets for sequencing efforts.

In this study, BLAST was used to infer the taxonomic affiliation of IIS. This represents a fast and efficient way to perform similarity searches for large data sets, but it does not give reliable results on taxonomic identity in all situations (Koski & Golding, 2001). The number of IIS per genus should therefore not be interpreted as the absolute abundance of each genus but rather as an estimation to be used for further investigations. Our evaluation of the method does, however, show that it is reasonably reliable. It is also well known that the taxonomic annotations of INSD are not always correct, which may negatively affect our ability to infer the taxonomic identity of the IIS. It has been shown that c. 5% of the IIS are best matched by a sequence with a questionable species annotation (Nilsson et al., 2006), but the proportion of sequences identified to the wrong genus is probably lower. The genus list of Table 1 is therefore a presentation of the data in INSD as seen through emerencia and represents a blend of the natural occurrence of the genera, methodological challenges, and the efforts of the mycological community.

In allowing the user to retrieve all IIS associated with any specified genus, the web service unites and automates two hitherto separate and arduous processes: (1) the iteration of BLAST over all fully identified sequences of the focal genus, and (2) the manual parsing of the BLAST results to pinpoint IIS with a relation to the genus. The sequences thus obtained are likely to represent constituent species of the focal genus, although this relation may have gone unnoticed before as a result of the poor state of the taxonomic annotation of the entries. The metadata associated with the entries are thus pertinent also to the genus itself, which opens up the possibility of synthesizing as yet incompletely explored information at the generic level. Areas where this data mining may have particular potential include: (1) nutritional mode(s) of the genus (where the IIS can be used as evidence to bind species of the focal genus to nutritional modes); (2) the taxonomic span of the genus (where the IIS fall inside the genus and yet do not produce a satisfactory match to any of its species, suggesting the presence of hitherto unsequenced taxa); and (3) the geographical distribution of the genus (where the IIS have been reported from locations extending beyond the known geographical range of the genus). These pursuits are further facilitated by the web service through the generation of summaries of the literature annotation for the entries in question and through the possibility of examining the underlying pairwise alignments. The entries may be exported for further sequence analysis and they are hyperlinked to GenBank, Google, and emerencia.
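The third data mining use listed above, finding IIS reported from outside the known range of a genus, amounts to comparing country annotations against a reference set. The Python sketch below illustrates the idea; the function names, the toy accessions, and the example range are hypothetical, and it assumes that the INSD country annotation follows the common "Country: locality" layout.

```python
def country_of(annotation: str) -> str:
    """Extract the country from an annotation such as 'USA: California'."""
    return annotation.split(":", 1)[0].strip()


def range_extensions(iis_countries: dict, known_range: set) -> dict:
    """Group IIS accessions by country for countries outside the known range.

    iis_countries maps an IIS accession to its country annotation;
    known_range is the set of countries the genus is already known from.
    """
    outside = {}
    for accession, annotation in iis_countries.items():
        country = country_of(annotation)
        if country and country not in known_range:
            outside.setdefault(country, []).append(accession)
    return outside


# Invented accessions and an invented known range, for illustration only.
toy_annotations = {
    "IIS_A": "Sweden: Uppsala",
    "IIS_B": "Gabon",
    "IIS_C": "USA: California",
    "IIS_D": "Gabon: Ogooue-Ivindo",
}
toy_known_range = {"Sweden", "USA", "Japan"}
print(range_extensions(toy_annotations, toy_known_range))
```

Any candidate range extension found this way would still need to be checked against the underlying alignment and source publication, given the caveats on BLAST-based assignment discussed above.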

IIS form more than a third of the fungal ITS sequences in INSD, and their proportion is steadily increasing. As this study shows, the IIS are often better annotated with respect to environmental and geographical data than the FIS. This indicates that the IIS, if used in the proper taxonomic context, could contribute valuable data on the ecology, distribution, and taxonomy of fungi. The new search function of the emerencia web service presented here represents a means to retrieve such IIS associated with any user-specified genus in a format that makes this auxiliary information readily accessible to the scientific community.

Acknowledgements

Financial support for this study was received from the Royal Society of Arts and Sciences in Göteborg (MR) and the Carl Stenholm Foundation (RHN and MR). Fig. 2 was compiled in collaboration with Villa Geografica. We are also grateful for comments and suggestions by Tom Bruns and two anonymous reviewers on a previous draft of this paper.
