Author for correspondence: Urmas Kõljalg Tel: +372 738 3027 Fax: +372 738 3013 Email: firstname.lastname@example.org
• Identification of ectomycorrhizal (ECM) fungi is often achieved through comparisons of ribosomal DNA internal transcribed spacer (ITS) sequences with accessioned sequences deposited in public databases. A major problem encountered is that annotation of the sequences in these databases is not always complete or trustworthy. In order to overcome this deficiency, we report on UNITE, an open-access database.
• UNITE comprises well annotated fungal ITS sequences from well defined herbarium specimens that include full herbarium reference identification data, collector/source and ecological data. At present UNITE contains 758 ITS sequences from 455 species and 67 genera of ECM fungi.
• UNITE can be searched by taxon name, via sequence similarity using blastn, and via phylogenetic sequence identification using galaxie. Following implementation, galaxie performs a phylogenetic analysis of the query sequence after alignment either to pre-existing generic alignments, or to matches retrieved from a blast search on the UNITE data. It should be noted that the current version of UNITE is dedicated to the reliable identification of ECM fungi.
Over the past few years there has been a dramatic increase in the number of studies on ectomycorrhizal (ECM) fungi involving molecular identification of species and individuals (for review see Horton & Bruns, 2001). In the majority of these studies identification to species level has been achieved by comparing sequence data obtained from mycorrhizal root tips with sequences derived from sporocarps contained within public sequence libraries [e.g. the European Molecular Biology Laboratory (EMBL), the National Center for Biotechnology Information (NCBI) and the DNA Data Bank of Japan (DDBJ)]. However, the taxonomic coverage in public databases is still limited, and misidentifications and cases involving incorrect use of names are, unfortunately, far from rare (Vilgalys, 2003). Furthermore, existing databases were not constructed to be taxon-oriented and thus have clear limitations for use in ecology and taxonomy (Tautz et al., 2003). Given that the information derived from ecological field studies of ECM fungi is only as accurate as the identification of the taxa involved, we have developed UNITE, a curated ribosomal DNA sequence database, with a view to enabling accurate and reliable identification of sequences derived from ECM fungi in environmental samples. The database, in its present form, covers ECM ascomycetes and basidiomycetes.
The development of UNITE is a Northern European initiative presently involving 10 core institutions. The project links experts in fungal taxonomy, ecology and bioinformatics, and also forms part of a student-training network funded by the Nordic Academy for Advanced Study (NorFA). For the full list of participants see the UNITE homepage (http://unite.zbi.ee).
Guiding principles behind the UNITE database
Recently we have seen intensive debates on DNA taxonomy and on how DNA-based identification systems should be constructed (Hebert et al., 2003; Tautz et al., 2003; Will & Rubinoff, 2004). UNITE, in its conception and open-ended development, represents a good working example of DNA taxonomy in action.
The UNITE database presently contains internal transcribed spacer (ITS) sequences generated from identified fungal sporocarp voucher specimens, but will be further expanded to include sequences derived from living cultures of ECM fungi and environmental samples (e.g. plant roots and soil). Sporocarp sequences can be regarded as directly identified, while all other sequences are identified by reference to a sporocarp sequence. Certain ECM fungi, however, do not produce sporocarps, and those of others remain undetected at present. In such cases living cultures can be used for species description (e.g. Pezizella ericae D.J. Read) and their ribosomal sequences are essential for identification purposes. Within the database, each sequence source is marked, making it possible to choose whether a query should be performed against the full database or only on ‘safe’ sporocarp sequences. To ensure the highest quality of sequences derived from sporocarps within the UNITE database, a number of guiding principles have been adopted, as follows.
6. Whenever possible, specimens from which DNA is derived should be illustrated (e.g. by photos).
In addition to these safeguards, authorized persons act as ‘curators’ and only they have the right to update and enter data (for the list of curators see page ‘Contributors’ in UNITE homepage). Queries from users on specimens and associated information should be directed to the appropriate curator. Currently, only the authors of this paper act as curators for the UNITE database.
The UNITE database is not intended as an alternative to other sequence databases, and contributors are encouraged to submit sequences into the EMBL, NCBI or DDBJ databases. However, UNITE also includes unpublished sequences. These sequences, under the direct control of the contributor, are available to the database identification tool suites but are not freely downloadable. Further information can be obtained only through agreement with the named contributor.
UNITE is a relational database built on a mysql/apache–linux platform that communicates with the web interface through php/perl. For the structure of the database and the relationships among tables see the entity relationship diagrams (Figs S1–S3). Management of the UNITE database is performed over the web using the software myAdmin (InTraGate Netsolutions, Inh., Pegnitz, Germany).
UNITE includes several tools that aid in the identification of unknown sequences. As unequivocal identification is the main purpose behind UNITE, the implementation of tools that extend beyond simple similarity searches (as offered by blast and variations thereof) was an essential part of the database development. This requirement has been met by the development of galaxie (Nilsson et al., 2004), which allows web-based, basic phylogenetic analyses. Galaxie provides maximum parsimony heuristic and neighbour-joining analyses under different evolutionary models. To date, two galaxie (galaxieblast, galaxieHMM) and one blast script have been implemented. We recommend galaxieblast as the most appropriate tool for identification of unknown ITS sequences. Other identification methods will be considered for inclusion in the future. We stress again that the UNITE database is, in its present form, restricted to ITS sequences specific to ECM fungi, and the input of query sequences from other fungi not covered by the database (saprophytic or parasitic fungi) is not recommended for obvious reasons. An overview of the identification process and data retrieval is presented in Fig. 1.
galaxieblast uses the incoming sequence as a query in a blastn search against the UNITE data. Either the 15 best matches (as judged by the e value) or the best three matches followed by the next 12 matches of mutually distinct e values are collected; the latter option can be used to reduce the impact of identical sequences on the analysis. The inclusion of too many identical sequences in a phylogenetic analysis is likely to result in a tree with little, if any, resolution. Such trees are generally considered unsafe for inferring sequence relatedness. The matches and the query sequence are aligned in clustalW (Thompson et al., 1994) for joint phylogenetic analysis in phylip (Felsenstein, 1993), using either neighbour joining or the parsimony optimality criterion. The phylogenetic analyses feature random sequence addition, outgroup rooting, bootstrapping and branch swapping (parsimony). The results from blast and the phylogenetic tree are displayed together with the multiple alignment.
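The two ways of collecting matches described above can be illustrated with a minimal sketch. It is not the galaxieblast source code: the function name `select_matches`, its parameters and the (identifier, e value) hit representation are hypothetical, and it assumes that "mutually distinct" means the 12 additional matches differ in e value from one another (they are not compared with the best three).

```python
from typing import List, Tuple

Hit = Tuple[str, float]  # (subject identifier, e value); hypothetical layout


def select_matches(hits: List[Hit], distinct: bool = False,
                   total: int = 15, head: int = 3) -> List[Hit]:
    """Pick the matches that go forward to alignment and tree building."""
    # Rank by e value, smallest (best) first. sorted() is stable, so hits
    # with equal e values keep their original relative order.
    ranked = sorted(hits, key=lambda h: h[1])
    if not distinct:
        # Option 1: simply the `total` best matches.
        return ranked[:total]

    # Option 2: the best `head` matches, then further matches whose
    # e values are all different from one another (assumed reading).
    chosen = ranked[:head]
    seen = set()
    for hit in ranked[head:]:
        if len(chosen) == total:
            break
        if hit[1] in seen:
            continue  # same e value as an already collected match
        seen.add(hit[1])
        chosen.append(hit)
    return chosen
```

With many identical reference sequences in the result list, the second option keeps one representative per e value after the best three, which leaves room in the 15-sequence alignment for more divergent matches.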
The hmmer package (Eddy, 1998) is used to compute hidden Markov models (HMMs) for prealigned nucleotide matrices. At present there are four such alignments: one for the Hymenoscyphus ericae (Ascomycetes, Helotiales) aggregate; and three for the resupinate thelephoroid fungi (genera Pseudotomentella, Thelephora, Tomentella and Tomentellopsis). The query sequence is compared with each of the HMMs; if it produces a significant match with any of the alignments, the (best) significantly matching alignment is selected for phylogenetic analysis, the procedure of which is similar to that of galaxieblast. To obtain threshold values to signify match vs nonmatch, 30 ingroup and 30 outgroup sequences (where applicable) were matched to the HMMs and appropriate cut-off values were decided. The use of manually adjusted alignments opens up the possibility of using galaxieHMM to obtain very accurate estimates of a sequence's position within a group of sequences, notably at the genus or family level. We are in the process of adding more generic alignments.
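The choice of alignment in galaxieHMM can be sketched as follows. This is an illustration only and does not call hmmer: the function `pick_alignment`, the alignment names and all numbers are hypothetical, and it assumes that each HMM yields a score for which higher is better and that "best" means the highest score among the alignments that reach their own cut-off.

```python
from typing import Dict, Optional


def pick_alignment(scores: Dict[str, float],
                   cutoffs: Dict[str, float]) -> Optional[str]:
    """Return the best significantly matching alignment, or None."""
    # Keep only alignments whose score reaches that alignment's own cut-off.
    significant = {name: s for name, s in scores.items()
                   if s >= cutoffs[name]}
    if not significant:
        return None  # no significant match: no phylogenetic analysis
    # Among the significant matches, take the highest score (assumption).
    return max(significant, key=significant.get)


# Hypothetical scores of one query against three prealigned matrices.
scores = {"alignment_A": 50.0, "alignment_B": 120.0, "alignment_C": 80.0}
cutoffs = {"alignment_A": 40.0, "alignment_B": 150.0, "alignment_C": 60.0}
best = pick_alignment(scores, cutoffs)
```

Because the cut-off values are set per alignment, a high raw score (as for `alignment_B` here) does not by itself qualify an alignment; it must first exceed the threshold derived for that alignment.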
blast similarity search
For the sequence similarity search the program blastn (Altschul et al., 1997) is incorporated into UNITE. This freeware program is based on the blast algorithm and used by NCBI and many other databases. The default E (expectation) value threshold for the similarity search is set to 1e−80. The stringency of the default E value represents an attempt to rule out weakly matching sequences.
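The effect of the default threshold can be illustrated with a short sketch that filters blastn hits by E value. It assumes that the hits are available in the standard 12-column tab-separated blast output, in which the subject identifier is the second field and the E value the eleventh, and that a hit is kept when its E value is at or below the threshold. The function name `filter_hits` and the example identifiers are hypothetical and are not part of UNITE.

```python
from typing import Iterable, List, Tuple

E_VALUE_THRESHOLD = 1e-80  # default threshold stated in the text


def filter_hits(lines: Iterable[str],
                threshold: float = E_VALUE_THRESHOLD) -> List[Tuple[str, float]]:
    """Keep (subject, E value) pairs whose E value is <= threshold."""
    kept = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        subject, e_value = fields[1], float(fields[10])
        if e_value <= threshold:
            kept.append((subject, e_value))
    return kept


# Three made-up hits: the last one matches too weakly and is dropped.
example = [
    "query\tsubj1\t99.5\t600\t3\t0\t1\t600\t1\t600\t0.0\t1100",
    "query\tsubj2\t90.0\t300\t30\t0\t1\t300\t1\t300\t1e-95\t350",
    "query\tsubj3\t85.0\t120\t18\t0\t1\t120\t1\t120\t2e-20\t100",
]
strong = filter_hits(example)
```

A less stringent threshold can be passed explicitly, for example `filter_hits(example, threshold=1e-10)`, when short or divergent query sequences would otherwise return no matches.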
A 3 yr project (2004–06) to sequence most ECM species present in Northern Europe has been set up and funded. The same morphospecies from different communities and from different countries will be sequenced in order to cover the ITS variation of most species, and to detect possible cryptic species. To date, UNITE has had an emphasis on Nordic ECM species. However, it is our intention to expand the database to include sequences from taxa from other geographic regions. Subsequent versions of the UNITE database will be expanded to include sets of saprophytic and parasitic fungi covering major clades of basidiomycetes and ascomycetes, including many known to co-occur with ECM fungi. This will allow the full potential of the database to be realized.
Emerging methods for the detection of microbial diversity will be included successively. Possibilities for the inclusion of terminal restriction fragment length polymorphism (T-RFLP) libraries will be considered in future. Microarray technology is another emerging tool, and taxon identification microarrays (TIM) or phylogenetic oligonucleotide arrays are some of its applications (for overview see Zhou & Thompson, 2002; Zhou, 2003). Taxon-specific oligos (Grönberg et al., 2003) are necessary prerequisites for TIM development. In this respect, oligo-designing software will also be implemented into UNITE in future versions. UNITE is built as a robust and flexible system so that new methods can be easily implemented.
Accurate quantification of ECM fungi in root and soil samples is increasingly being demanded in ecological research and for commercial application of these plant symbiotic fungi. In theory, it is possible to obtain semiquantification using species-specific probes and real-time PCR. While ITS sequences are ideal for designing species-specific probes and primers for qualitative species detection (presence/absence) in environmental and mixed genome samples, they cannot be used for quantification purposes, as the number of ITS copies within genomes is mostly unknown and may vary considerably within and between species (Pukkila & Skrzynia, 1993; Sonnenberg et al., 1996). For quantification purposes, new reliable single- or low-copy fungal markers that discriminate at the species level are therefore urgently needed. Such sequences are likely candidates for inclusion within future versions of UNITE.
We are continually updating taxon descriptions, illustrations and references linked to the sequence data. For certain groups we plan to cooperate with existing web-based taxonomic databases by establishing links between sequences and their taxon descriptions. Similar cooperation is planned with a number of available nomenclatural databases. Currently UNITE is linked to Cortbase (Parmasto et al., 2004), which mainly covers the nomenclature of corticioid fungi.
UNITE has been supported in part by grants to individual authors as follows: U. Kõljalg, K. Abarenkov and L. Tedersoo, ESF grant no. 5232 and FORMAS grant no. 23.9/2001–1063; S. Erland, FORMAS and the Royal Physiographic Society in Lund; R. Kjøller, the Carlsberg Foundation and the Danish Natural Science Research Council; T. Vrålstad, the Research Council of Norway (NFR-145324/432). NorFA is thanked for providing a solid basis for exchange between the scientists involved in UNITE by sponsoring the network ‘Identification and Ecology of Ectomycorrhizal fungi’ (2003–05) (http://www.systbot.gu.se/research/unite). We are grateful to the Nordic Forest Research Co-operation Committee who funded our common research project (SNS-92, 2004–06) to sequence ectomycorrhizal species present in Northern Europe. We thank numerous researchers who provided well annotated fruit bodies for DNA extraction. Their names appear on the UNITE home page. The Institute of Zoology and Botany of the Estonian Agricultural University and Mr Aavo Kuslapuu are thanked for making their database server available to us. We also would like to thank three anonymous reviewers for their constructive comments.