The UNITE database for molecular identification of fungi – recent updates and future perspectives
(Author for correspondence: tel +372 738 3027; email firstname.lastname@example.org)
Ectomycorrhizal (ECM) fungi are typically examined for taxonomic affiliation through sequence similarity searches involving the internal transcribed spacer (ITS) region of the nuclear ribosomal repeat unit and the International Nucleotide Sequence Databases (INSD) (Peay et al., 2008; Taylor, 2008; Benson et al., 2009; Tedersoo et al., 2010). However, the usefulness of these searches is constrained by the technical quality and the taxonomic reliability of the reference sequences in the databases (Nilsson et al., 2006; Bidartondo, 2008). The meagre data on voucher specimen, country of collection and host, which are associated with many of the entries, place a further restriction on the usefulness of the entries in an ecological or taxonomic context (Ryberg et al., 2009). The UNITE project (Kõljalg et al., 2005) was initiated in 2001 to address these problems through a free online database for high-quality reference records of ITS sequences from North European ECM fungi. Taxonomic reliability was the founding principle of the initiative; all records were determined to species level (or as far as possible) by researchers well versed in the taxonomic group in question, and all sequences were obtained from, or in association with, richly annotated fruiting bodies (voucher specimens) deposited in public herbaria.
The years since the 2005 publication of UNITE have witnessed a proliferation of environmental sequencing efforts from all over the world, and there is a clear tendency in the recent literature to target entire communities of fungi rather than individual taxa or subsets of the full diversity. Such studies may still focus on ECM fungi, but inherent to many of these projects is a desire also to examine the non-ECM sequences to obtain a better view of the trophic processes and potential interdependence or interactions among taxa (Lindahl et al., 2007). Any modern initiative aiming to provide facilities for sequence identification must therefore be prepared for geographically diverse, and ecologically disparate, query sequences. These observations, together with the prospects of emerging high-throughput sequencing technologies such as massively parallel (454) pyrosequencing (Margulies et al., 2005; Shendure & Ji, 2008; Hibbett et al., 2009), suggest that a sequence database with a limited geographical or nutritional coverage of taxa – and where sequences are processed one at a time – may no longer serve the needs of the research community in a fully efficient way. We have been working to keep the UNITE database abreast of developments in the field, and in the present Letter we list the major updates in technology, methodology and policy that UNITE has undergone since its initial publication. A set of guidelines for its future development is also provided.
Sequence coverage and taxonomic inclusiveness: statistics and policy updates
The number of voucher-associated sequences in UNITE has increased from 811 in 2005 to nearly 3000 at present (Table 1). The number of species of North European ECM fungi represented in UNITE has more than doubled to 896 (73% of the known ECM fungi in North Europe; Hansen & Knudsen, 1997; Knudsen & Vesterholt, 2008), and the total number of species has increased from 480 to 1078. Taxon sampling is inevitably uneven, and reflects the availability of taxonomic expertise and updated generic revisions. For example, Ramaria (two species in UNITE), Pezizales (30 species in UNITE) and Helotiales (32 species in UNITE) are taxa for which substantial amounts of data remain to be generated; by contrast, sequences for 80% of the known North European species of Lactarius and Boletus have been deposited in UNITE.
Table 1. Statistics on the records of the UNITE database as of November 2009
The first major change to announce is that the previous geographical and ecological restrictions on the scope and sequence coverage of UNITE have been lifted. Although we will continue to expand and enhance the taxon sampling of mycorrhizal fungi, we now accept fully identified and well-annotated reference sequences from any geographical locality, nutritional mode and group of fungi, as long as they are supported by vouchers or type cultures, and the sequence authors have a documented expertise of the taxa in question. The curators of UNITE reserve the right to send new reference sequences for peer review. Unidentified and environmental sequences may also be deposited in UNITE, provided that they are of high quality, well annotated and mutually nonredundant (i.e. large sets of identical sequences from the same author and study site will not be accepted); in addition they must cover the ITS region in full. The option to submit reference sequences with the sequence data open to query, but with the species name withheld (i.e. ‘locked’), will be allowed only when the sequences have been accepted for publication but are not yet online. We are furthermore looking into supporting the provision of operational names of the accession number type for clusters of hypothetically conspecific sequences that cannot be identified to species level. Such informal names represent an unambiguous way of referring to such taxa until the data are there to warrant formal description of the species.
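The nonredundancy requirement for environmental sequences described above could be pre-checked by a depositor along the following lines. This is a minimal sketch, not the procedure used by UNITE: the record layout (dictionaries with `author`, `site` and `sequence` keys) and the function name `drop_redundant` are assumptions made for illustration, and "identical" is taken here to mean an exact, case-insensitive string match.

```python
def drop_redundant(records):
    """Keep the first record for each (author, site, sequence) combination.

    `records` is a list of dicts with the hypothetical keys
    'author', 'site' and 'sequence'. Identity is an exact,
    case-insensitive match of the sequence string (an assumption).
    """
    seen = set()
    kept = []
    for rec in records:
        key = (rec["author"], rec["site"], rec["sequence"].upper())
        if key not in seen:
            seen.add(key)
            kept.append(rec)
    return kept


# Illustrative input: two identical sequences from the same author and site,
# plus the same sequence from another site and from another author.
demo_records = [
    {"author": "A", "site": "plot1", "sequence": "ACGTACGT"},
    {"author": "A", "site": "plot1", "sequence": "acgtacgt"},
    {"author": "A", "site": "plot2", "sequence": "ACGTACGT"},
    {"author": "B", "site": "plot1", "sequence": "ACGTACGT"},
]
nonredundant = drop_redundant(demo_records)
```

Only the exact duplicate from the same author and study site is dropped; identical sequences from other sites or authors are retained, since they carry independent ecological information.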
Although UNITE will maintain a focus on the ITS region, there is now generic support for any other gene and genetic marker pertinent to the identification of fungi. The nuclear large subunit (nLSU/28S) gene, arguably one of the mainstays of current fungal phylogenetic inference, is an example of this. Although not always discriminative at the species level, the nLSU can be used, in the absence of a good ITS match, to assign specimens to a higher taxonomic level. The number of nLSU sequences in UNITE is modest, but several sizable data sets of primarily saprophytic fungi have been scheduled for inclusion in the near future. We hope that these will be followed by more, and invite the mycological community to deposit nLSU sequences in UNITE. It is important that such data be accompanied by primer and amplification details, if this information is not available in a tagged publication.
Richer and more dynamic project-oriented relational database model
The initial database model has been expanded into a 130-table SQL-compliant database structure compatible with the Taxonomic Databases Working Group standards (http://www.tdwg.org/standards/). The structure draws from Taxonomer (Pyle, 2004; and subsequent additions) to capture the full complexity of modern mycological taxonomy and nomenclature. Metadata pertaining to sequences or sets of sequences can now be stored in a way open to direct query and include locality, habitat, soil type and host (Supporting Information Fig. S1). Sequence sets can be formed to reflect contexts such as studies, plots and samples, and the sequences in such sets can be addressed jointly, separately, or in combination with all other sequences. Particular care has been taken to make sure that information can be represented in a fully nuanced way through many-to-many relations: a sequence can have more than one correct name (to account for anamorph–teleomorph relationships and synonyms), a species can have many habitats and ecological characteristics, and a study may be composed of any number of distinct or coupled plots and subplots. A researcher may, for instance, divide the sequences from some given project into sets – reflecting, for example, host or plot of origin – and compare these for differences in taxonomic composition and species richness. Research groups can be granted far-reaching access to the system, allowing, for example, tailoring of the submission procedure in the interest of exact and efficient storage and representation of particularly complex data. Indeed, we envision UNITE as being a fully fledged sequence-management system that individual researchers or research groups can use to store and analyze data from entire projects or study sites. Many of the features of such a system are already in operation. We feel that the inclusion of this sequence-management environment to process new and existing sequences distinguishes the UNITE database from the INSD.
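The many-to-many relation between sequences and names mentioned above can be illustrated with a small sketch. The table and column names (`sequence`, `taxon_name`, `sequence_name`), the accessions and the names are all hypothetical and do not reproduce the actual UNITE schema; the example uses an in-memory SQLite database purely for convenience.

```python
import sqlite3


def build_demo():
    """Create an in-memory database with a junction table linking
    sequences to names (hypothetical schema, for illustration only)."""
    con = sqlite3.connect(":memory:")
    con.executescript(
        """
        CREATE TABLE sequence (
            id INTEGER PRIMARY KEY,
            accession TEXT UNIQUE NOT NULL
        );
        CREATE TABLE taxon_name (
            id INTEGER PRIMARY KEY,
            name TEXT UNIQUE NOT NULL
        );
        -- Junction table: one row per (sequence, name) pairing.
        CREATE TABLE sequence_name (
            sequence_id INTEGER NOT NULL REFERENCES sequence(id),
            name_id INTEGER NOT NULL REFERENCES taxon_name(id),
            PRIMARY KEY (sequence_id, name_id)
        );
        """
    )
    con.executemany(
        "INSERT INTO sequence (id, accession) VALUES (?, ?)",
        [(1, "DEMO0001"), (2, "DEMO0002")],
    )
    con.executemany(
        "INSERT INTO taxon_name (id, name) VALUES (?, ?)",
        [(1, "Example teleomorph name"), (2, "Example anamorph name")],
    )
    # DEMO0001 carries both names; DEMO0002 shares one of them.
    con.executemany(
        "INSERT INTO sequence_name (sequence_id, name_id) VALUES (?, ?)",
        [(1, 1), (1, 2), (2, 1)],
    )
    con.commit()
    return con


def names_for(con, accession):
    """Return all names linked to an accession, sorted alphabetically."""
    rows = con.execute(
        """
        SELECT t.name
        FROM sequence AS s
        JOIN sequence_name AS sn ON sn.sequence_id = s.id
        JOIN taxon_name AS t ON t.id = sn.name_id
        WHERE s.accession = ?
        ORDER BY t.name
        """,
        (accession,),
    ).fetchall()
    return [row[0] for row in rows]


demo_con = build_demo()
demo_names = names_for(demo_con, "DEMO0001")
```

Because the pairing lives in its own table, a further synonym can be attached to a sequence by adding one row, without altering either the sequence or the name record. The same pattern would apply to species and habitats, or to studies and plots.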
Improved support for storage of auxiliary data
It is now possible to associate any number of binary files with sequences, species, studies, or other objects or contexts within UNITE. While this service was initially conceived as a response to the debate on the availability of primary sequence data such as chromatograms and other raw sequence data (Costello, 2009), any noncopyrighted and freely available file relevant to the interpretation of the underlying or downstream data will be accepted. This includes, but is not limited to, photographs of fruiting bodies or root tips, drawings of spores and mantle structure, maps, GIS (geographic information system) data files and PDF (Portable Document Format) documents; size restrictions may, however, apply and the data authors are requested to use file formats with manageable file sizes. Scientific publications may be deposited for public view alongside sequence data as long as no copyright laws or legal limitations in distribution or dissemination are violated: it is the responsibility of the depositor/author to obtain such permission where needed. We are also willing to hyperlink to relevant external files provided that these are maintained by – or otherwise under the control of – the sequence author in question and that a reasonable permanency can be guaranteed (Ducut et al., 2008). All such links are checked for validity every 6 months.
New sequence submission and maintenance procedure, as well as improved INSD connectivity
The sequence-deposition process in UNITE has been reworked and now features a log-in system through which the user can deposit and annotate even large sets of sequences. Information can be updated by the user (sequence author) through the log-in system; any such change will take effect immediately. Users are not allowed to modify the records of other sequence authors, but a Wikipedia-style system for commenting on individual sequences is under development. There is a batch-submission system for environmental sequences, and a software package to examine fungal ITS sequences for the presence of chimeric elements is in the final stages of development. All sequence authors are encouraged also to submit their sequences to the INSD; UNITE is now an INSD LinkOut provider, such that all sequences in INSD that are also present in UNITE (possibly with richer annotation) are hyperlinked there. UNITE similarly offers the possibility to link entries to the INSD. UNITE exchanges data with the INSD on a trimestrial basis and keeps a local copy of all fungal ITS sequences in INSD (approximately 135 000 sequences belonging to some 13 500 fully identified species as of July 2009). The fully identified sequences from INSD can be included in, or excluded from, sequence queries in UNITE. Similarly, environmental/unidentified sequences from UNITE and INSD can be excluded from searches.
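The search options described above, in which fully identified INSD sequences and environmental/unidentified sequences can be switched on or off, amount to a simple filter over the reference set. The following is a sketch under stated assumptions rather than the UNITE implementation: the record keys (`source`, `identified`), the accessions and the function name `select_references` are invented for illustration.

```python
def select_references(records, include_insd_identified=True,
                      include_unidentified=True):
    """Return the reference records that a query should be compared against.

    Each record is a dict with the hypothetical keys 'source'
    ('UNITE' or 'INSD') and 'identified' (True for fully identified
    sequences, False for environmental/unidentified ones).
    """
    selected = []
    for rec in records:
        # Environmental/unidentified sequences from either database.
        if not rec["identified"] and not include_unidentified:
            continue
        # Fully identified sequences that originate from INSD.
        if (rec["source"] == "INSD" and rec["identified"]
                and not include_insd_identified):
            continue
        selected.append(rec)
    return selected


demo_refs = [
    {"accession": "U1", "source": "UNITE", "identified": True},
    {"accession": "U2", "source": "UNITE", "identified": False},
    {"accession": "I1", "source": "INSD", "identified": True},
    {"accession": "I2", "source": "INSD", "identified": False},
]
# Restrict the search to fully identified UNITE reference sequences.
strict = select_references(demo_refs,
                           include_insd_identified=False,
                           include_unidentified=False)
```

With both options left at their defaults, all four demonstration records are returned; with both switched off, only the identified UNITE record remains.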
Usage statistics as a window on the road ahead
UNITE (Kõljalg et al., 2005) has been cited about a hundred times since publication, with 2008 being the year with the highest number of citations (30 in total). The studies citing UNITE cover all five continents. The proportion of users from fields other than mycorrhizal and systematic mycology is growing. This increases the pressure on UNITE to provide information that is clear, accurate and up-to-date; the information should ideally be presented with a general scientific, rather than a strictly mycological, target audience in mind. It is equally clear that in future many users will not turn to UNITE with one or a handful of sequences for query and analysis, but with hundreds or thousands. This trend is taken to its extreme by the 454 pyrosequencing platform, whose voluminous output forms a challenge to any database effort (Buée et al., 2009; Jumpponen & Jones, 2009; Öpik et al., 2009). We do not currently envision UNITE as a full solution for newly generated, unprocessed raw sequence data from 454-based projects, but we will seek to make it a swift and useful resource for analysing pre-processed, clustered 454 data sets. As a first step we are preparing a new batch BLAST search function for joint analysis of multiple query sequences. A second, and more challenging, step is to employ phylogenetic analysis in the batch-mode identification process, but the details remain to be formalized.
The pursuit of mycological knowledge is a global scientific enterprise. UNITE collaborates with the Fungal Environmental Sampling and Informatics Network (FESIN) (Bruns et al., 2008; Horton et al., 2009) to establish guidelines and standards for how environmental samples of fungi should be processed and analysed. Much will be gained in terms of time and resources if software and infrastructural development can be co-ordinated. Furthermore, the geographical coverage of UNITE and FESIN together is considerable and should lead to a significant leap in the number of reference sequences in UNITE over the next few years. The challenges remain substantial, however, and we welcome assistance and collaboration to further the underlying objectives and help to bridge the gap between mycology and other disciplines. We invite any researcher or research group with data or resources relevant to reliable molecular identification of fungi to either deposit their data in UNITE or to contact the UNITE team for further discussions. We similarly invite anyone with a set of fungal sequences in need of taxonomic assignment to consider the sequence-processing environment of UNITE and to make any information that would cast further light on data already residing in the database available to the scientific community. We have secured at least basic funding for UNITE for the foreseeable future, and we intend the database to be a permanent resource for the scientific community.