The wealth of shared resources: Improving molecular taxonomy using eDNA and public databases

Public databases such as the NCBI's GenBank have been used as repositories for genomic studies for more than 30 years. In this time, our understanding of the natural world, and especially the genomic world, has expanded vastly, and the size of these databases reflects this genomic revolution. Databases like GenBank now help populate many molecular studies, supplementing a researcher's newly gathered data with publicly available sequences. Despite this, older sequence records, particularly those from understudied taxa, are frequently not updated in line with this burgeoning understanding, which means that analyses that leverage this public data, from BLAST searches through to phylogenetic analyses, cannot do so with the full force of that collective understanding. This is particularly true for environmental DNA (eDNA) records, where older sequence records may identify sequences only to the phylum level, limiting their use in many studies. Here, with a case study of tardigrade 18S sequences, the family identities of 630 sequences, previously only identified to the phylum level, were established using 501 family, genus and species level 18S sequences, effectively doubling the depth and taxonomic resolution of tardigrade 18S sequences in GenBank.


| INTRODUCTION
The modern era of molecular biology owes a great deal of its development to the establishment of the major sequence databases throughout the 1980s and 1990s (Benson et al., 2009). GenBank opened in 1982, and, alongside the European Molecular Biology Laboratory and the DNA Data Bank of Japan, provides access to a vast array of sequence and taxonomic data. In addition, through resources such as the BLAST API (Camacho et al., 2009; Madden, 2003), these sequences are readily available and easily accessible through both text and sequence-based search queries. Grant organisations and the broader scientific consensus now encourage the deposition of new genomic and proteomic data into these databases, where they form the basis of a wide array of new studies and inform future directions in science. This trend towards making sequence data publicly available has contributed to a rapid expansion of accessible sequence data. GenBank alone now encompasses more than 17 trillion base pairs of sequence data (Genetic Sequence Data Bank, 2022).

However, the taxonomic resolution of this data can be variable. In the last 30 years, both analytical and sequencing techniques have changed, and both the questions and answers that molecular biologists are interested in have changed in turn (Lozano-Fernandez, 2022; Przybyla & Gilbert, 2022; Varshney et al., 2021). Where previously the ability to identify an organism to phylum level purely from a molecular sequence was revolutionary (Blaxter, 2004), in the present day a genus- or species-level identification of the target sequence is consistently possible using DNA barcoding techniques (Hernández-Triana et al., 2019; Li et al., 2021). This suggests that, for many taxa, high-resolution taxonomic identification through a mix of BLAST and phylogenetic techniques is possible.
As outlined by Pappalardo & Osborn in Bachmann et al. (2023), keeping public databases updated in line with current taxonomic understanding is important. From a convenience and usability perspective, well-kept and frequently revised databases can both aid the identification of new genetic material and provide new uses for older material. From a scientific and environmental perspective, this can help us target future sampling whilst making economical use of the data we have, especially in the case of rare and remote habitats. Frequent maintenance and revision of sequence identities can help us identify a variety of errors and contaminants that, though rare in public genetic databases, are no doubt present (often due to morphological taxonomy) and could potentially mislead researchers in future work (Bachmann et al., 2023).
In the case of environmental DNA studies, the potential for inaccurate taxonomic assignment is far higher than in studies that combine DNA extraction from a single specimen with morphological specimen identification. This is a result of the methodology: a single sequence in an eDNA study does not necessarily represent a single organism; the advantage of such studies lies in presenting a broad picture of an ecosystem. In addition, owing to the nature of this form of data collection, morphological specimens are not present for later examination and confirmation of sequence identity.
Taken together, this means that eDNA studies should not be used to assign identities to unknown sequences. Doing so presents the issue that misassignments from previous environmental DNA studies might agglomerate over time, accumulating further inaccuracies as inaccurate data could be used to assign sequence identities to environmental DNA sequences in future studies. However, in many current databases, such as GenBank, these results will be returned if the user does not explicitly filter them out. Particularly in the case of early environmental DNA studies, where few comparative sequences were available to assign identities to the sequences obtained, this could be cause for some concern. In this respect, a clear, simple and accessible methodology is needed to revise prior environmental DNA studies to increase the accessibility of these public databases and the utility of these sequence resources.
The Tardigrada are a historically undersampled and understudied taxon (Hashimoto et al., 2016; Koutsovoulos et al., 2016; Yoshida et al., 2017). In recent years, following the publication of the first tardigrade reference genomes, tardigrade genomics and proteomics have entered an exciting new era (Gąsiorek et al., 2021; Guil et al., 2019; Jørgensen et al., 2018; Murai et al., 2021; Yoshida et al., 2017). However, many basic questions regarding the core systematics of the group remain, including conflicts about the number and distribution of its classes, the monophyletic nature of its classes, and the assignment of species to families and genera (Guidetti et al., 2021; Guil et al., 2019; Jørgensen et al., 2010). Owing to their nature as meioscopic limno-terrestrial organisms, many tardigrade samples were collected during the early days of environmental DNA sampling and molecular taxonomy (Blaxter et al., 2005; Nold et al., 2010; Robeson et al., 2009). Unfortunately, due to the paucity of known reference sequence data in the phylum at the time, more thorough identification of these samples beyond the phylum level was not possible. In this respect, expanding the molecular sampling of Tardigrada, both by collecting new samples and by unlocking the hidden potential that exists in our public genetic archives, is a productive endeavour that can not only allow us to identify future sequences with more accuracy and confidence, but also allow us to use older, previously unusable, data in new situations.
In this paper, 630 tardigrade 18S sequences available in GenBank, sourced from previously published environmental DNA studies and only identified to the phylum level, have been identified and their taxonomy revised to the family or genus level. In addition, a further 282 sequences of the 630 which were previously identified as tardigrade 18S sequences have been identified as belonging to other phyla and in some cases were further identified as plant or fungal in origin. We additionally present our methods here as a template for future database revision.

| Open access software
Only open access software and data repositories were used for the analysis, including within the phylogenetic analysis step. This was done to better support the conclusion that the taxonomic prediction and database improvement depicted within this manuscript are achievable with little to no specialised resources, thanks to the wealth of publicly available data and analytical servers.

| Calibration dataset construction
An 18S dataset of tardigrade sequences was first constructed by searching NCBI GenBank's Nucleotide database with the query phrase "Tardigrada 18S". All 1408 sequences were downloaded and the 778 sequences identified to a tardigrade clade at the genus or species level were separated to begin the process of forming a calibration dataset (Figure 1). The remaining 630 sequences were assessed later (see 2.1.2, Figure 1). The identities of the 778 sequences were assessed using the NCBI BLAST API (Madden, 2003; Camacho et al., 2009; https://blast.ncbi.nlm.nih.gov/Blast.cgi). Ninety-five sequences were excluded, either because they did not return multiple best hits in a Blastn search versus the nt database (hits against five or more tardigrade voucher sequences with an e-value greater than e-100) that corroborated their identity as tardigrade 18S (as opposed to fungal or bacterial contamination, or another tardigrade ribosomal RNA sequence retrieved by NCBI's keyword search), or because they were shorter than 400 nucleotides in length. This resulted in a final calibration dataset size of 683 sequences ranging in length from 416 to 1952 nucleotides. Twenty-one 18S sequences from a range of ecdysozoan taxa used as an outgroup in prior tardigrade 18S phylogenetic analyses (Guil et al., 2019) were then added to the dataset. These sequences are referred to as "Calibration Sequences" for the remainder of this study.
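The exclusion rule for the calibration dataset can be sketched as a simple filter. This is a minimal illustration rather than the procedure actually run for the study: the `Candidate` record, its field names and the `EX...` accessions are hypothetical, and the BLAST step is assumed to have already been summarised as a count of corroborating tardigrade 18S voucher hits per sequence. Only the two thresholds (five or more voucher hits, and a minimum length of 400 nucleotides) come from the text.

```python
from dataclasses import dataclass

MIN_LENGTH = 400        # sequences shorter than 400 nucleotides were excluded
MIN_VOUCHER_HITS = 5    # "five or more tardigrade voucher sequences"


@dataclass(frozen=True)
class Candidate:
    """Hypothetical summary of one downloaded record (illustration only)."""
    accession: str      # placeholder identifier, not a real GenBank accession
    length: int         # sequence length in nucleotides
    voucher_hits: int   # corroborating tardigrade 18S voucher hits from BLAST


def keep_for_calibration(candidate: Candidate) -> bool:
    """Return True if the record passes both the length and the BLAST check."""
    return (candidate.length >= MIN_LENGTH
            and candidate.voucher_hits >= MIN_VOUCHER_HITS)


def build_calibration_set(candidates):
    """Keep only the records that pass both checks, preserving input order."""
    return [c for c in candidates if keep_for_calibration(c)]
```

Applied to the 778 genus- or species-level records, a filter of this kind would be expected to leave the 683 sequences reported above, with the 95 failures set aside.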

| Test sequence identification
Of the 1408 sequences downloaded from NCBI, 630 were identified only as "Tardigrada 18S" (Figure 1). These sequences were then each interrogated using Blastn searches versus the nt database through the NCBI BLAST API (Madden, 2003; Camacho et al., 2009; https://blast.ncbi.nlm.nih.gov/Blast.cgi). The taxonomic family of the best hit with an e-value greater than e-100 against an 18S sequence that was published in a paper that contained an accompanying morphological description of the specimen was recorded. Sequences that did not recover a best hit against a tardigrade, or against an 18S sequence, were excluded from the analysis, resulting in the exclusion of 282 of the 630 sequences (Figure 1, Table S1). The final number of new sequences included in the analysis was 348, with a sequence length between 226 and 1763 nucleotides. These sequences are referred to as "Test Sequences" for the remainder of this study.
FIGURE 1 A flowchart of the methodology used to determine the family level identity of the tardigrade 18S sequences, accompanied by the number of sequences in each dataset across the study.
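The best-hit rule used for the test sequences can be illustrated with a short parser for BLAST tabular output. This is a sketch under stated assumptions: it expects the standard 12-column tabular format (query, subject, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, e-value, bit score); it reads the "e-100" threshold as "at least as strong as 1e-100"; and the `voucher_family` lookup, which maps morphologically vouchered accessions to their families, is a hypothetical input that would have to be curated by hand.

```python
E_VALUE_CUTOFF = 1e-100  # assumption: hits must be at least this strong


def best_voucher_hits(tabular_text, voucher_family):
    """Map each query to the family of its best-scoring vouchered hit.

    tabular_text   -- BLAST tabular output, one tab-separated hit per line
    voucher_family -- dict of subject accession -> family for sequences with
                      an accompanying morphological description (hypothetical)
    Queries with no qualifying hit are absent from the result.
    """
    best = {}
    for line in tabular_text.strip().splitlines():
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        # Skip hits that are not vouchered or that are too weak.
        if subject not in voucher_family or evalue > E_VALUE_CUTOFF:
            continue
        # Keep the hit with the highest bit score for each query.
        if query not in best or bitscore > best[query][1]:
            best[query] = (voucher_family[subject], bitscore)
    return {query: family for query, (family, _) in best.items()}
```

Queries missing from the returned dictionary correspond to the sequences that would be set aside at this step.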

| Phylogenetic analysis and taxonomic identification
Test sequences over 1000 nucleotides were first aligned in MAFFT version X (Katoh & Standley, 2013) alongside calibration sequences over 1000 nucleotides to create a less fragmented identification dataset comprising 438 sequences, of which 68 were test sequences (Alignment S1). This alignment was used as a scaffold and guide for fragment alignment in MUSCLE v5 (Edgar, 2004; http://www.drive5.com/muscle/) of more fragmented sequences in two stages: first sequences between 800 and 1000 nucleotides long (534 sequences, of which 69 were test sequences, Alignment S2), and then sequences between 400 and 800 nucleotides long (1052 sequences, of which 348 were test sequences, Alignment S3) (Katoh & Frith, 2012). Separate phylogenies were then constructed for each of the three alignments.
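The staged design can be sketched as a length-based binning step that precedes alignment. The sketch below is illustrative only: the function names are hypothetical, the treatment of sequences exactly 800 or 1000 nucleotides long is an assumption, and the final stage simply takes every remaining sequence (the text reports test sequences as short as 226 nucleotides, so a strict 400-nucleotide floor is not enforced here). Each stage is cumulative, mirroring the growing sequence counts of Alignments S1 to S3.

```python
def length_tier(length):
    """Return the first alignment stage (1, 2 or 3) a sequence joins."""
    if length > 1000:
        return 1   # scaffold alignment
    if length >= 800:
        return 2   # first round of fragment addition
    return 3       # second round of fragment addition


def cumulative_alignments(lengths):
    """Group sequence names into three cumulative alignment stages.

    lengths -- dict of sequence name -> length in nucleotides
    Returns a dict {1: [...], 2: [...], 3: [...]} in which each stage also
    contains every sequence from the earlier stages.
    """
    stages = {1: [], 2: [], 3: []}
    for name, length in lengths.items():
        for stage in range(length_tier(length), 4):
            stages[stage].append(name)
    return stages
```

The alignment itself would still be carried out in the external tools named above; this step only decides which sequences enter which alignment.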
All phylogenetic analyses were performed in IQ-Tree (Nguyen et al., 2015) using Modelfinder (Kalyaanamoorthy et al., 2017) to find the best model for each alignment (GTR + F + I + G4 for sequences over 1000 nucleotides, and GTR + F + G4 for the remaining two alignments). In keeping with only using publicly available scientific resources, all analyses additionally used the IQ-Tree Web Server (Trifinopoulos et al., 2016; http://iqtree.cibiv.univie.ac.at/). For each tree, 1000 ultra-fast bootstrap replicates were performed.
If a test sequence was found as sister to a calibration sequence, and this grouping was additionally sister to further calibration sequences identified as belonging to the same family as the initial calibration sequence, the definition of the test sequence was revised to that family. Family was chosen as the target taxonomic level, as opposed to genus or species, as this takes into consideration both the sample paucity of the Tardigrada and the lack of morphological identification of the target unidentified sequences.
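The assignment rule can be expressed as a small tree check. The sketch below is a strict reading of that rule and not the procedure used to inspect the trees in this study: the rooted tree is represented as nested tuples of uniquely named tips (a hypothetical encoding), `family_of` is a hypothetical lookup holding the family of each calibration tip, and other test sequences in the neighbouring clade are ignored, which is an assumption.

```python
def leaves(node):
    """Return the tip names under a node (a tip is a plain string)."""
    if isinstance(node, str):
        return [node]
    return [tip for child in node for tip in leaves(child)]


def path_to(node, target):
    """Return the list of nodes from the root down to the target tip."""
    if isinstance(node, str):
        return [node] if node == target else None
    for child in node:
        sub = path_to(child, target)
        if sub is not None:
            return [node] + sub
    return None


def assign_family(tree, test_tip, family_of):
    """Return a family for test_tip, or None if the rule is not satisfied."""
    path = path_to(tree, test_tip)
    if path is None or len(path) < 3:
        return None
    tip, parent, grandparent = path[-1], path[-2], path[-3]
    # Tips sharing the immediate parent with the test sequence.
    sister = [t for c in parent if c is not tip for t in leaves(c)]
    # Tips in the clade(s) sister to the (test + calibration) grouping.
    outer = [t for c in grandparent if c is not parent for t in leaves(c)]
    # The immediate sister must be a single calibration sequence.
    if len(sister) != 1 or sister[0] not in family_of:
        return None
    family = family_of[sister[0]]
    # Every calibration tip in the neighbouring clade must agree on family.
    outer_families = [family_of[t] for t in outer if t in family_of]
    if outer_families and all(f == family for f in outer_families):
        return family
    return None
```

For example, with `tree = ((("test1", "calA"), ("calB", "calC")), "out1")` and all three calibration tips labelled Macrobiotidae, `assign_family(tree, "test1", family_of)` returns that family; if `calB` belonged to a different family, it returns `None` and the sequence would be left for manual inspection.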

| Exploring the diversity of unidentified tardigrade 18S
The whole tardigrade 18S phylogeny of all 1052 sequences produces a topology that is broadly consistent with most recent studies of tardigrade systematics (Guil et al., 2019), suggesting that the inferences that can be made as to the identity of the test tardigrade sequences are robust.
In addition, most currently known tardigrade families were recovered as monophyletic groups with strong bootstrap support (80-100). Calohypsibiidae was recovered as paraphyletic with respect to Hypsibiidae: this is the result of a single Calohypsibiid sequence identified as belonging to Calohypsibius ornatus (HQ604914), which clusters with Mixibius saracenus (a Hypsibiid) and Microhypsibius bertolanii (the sole representative in the study of the Microhypsibiidae) (Figure 2). This particular Calohypsibiid sequence was recovered in this position in the study in which it was first published (Bertolani et al., 2014). However, here, further Calohypsibiid sequences originating from three studies (MH279652; MH079472 and MH079471; EU266940, EU266941 and EU266942 respectively, Table S2) form a monophyletic Calohypsibius clade sister to the Hypsibiidae and the Calohypsibiidae + Hypsibiidae + Microhypsibiidae clade (Figure 2, Figure S2). This could imply a need for reexamination of tardigrade morphological taxonomy and systematic revisions around these nodes.
The majority of the test tardigrade 18S sequences were recovered inside the Macrobiotidae (Figure 2, Figure S1). Of these 213 sequences, 129 were recovered inside the broader M. hufelandi complex. A further 82 sequences were recovered from Paramacrobiotus and one sequence each was recovered inside Minibiotus and Mesobiotus. The hufelandi group is one of the most commonly identified and readily recorded tardigrade groups across the globe (Kaczmarek & Michalczyk, 2017), and so in this respect it is unsurprising that this species complex comprises 37% of the test tardigrade 18S sequences in GenBank. More exciting are the putative Minibiotus and Mesobiotus sequences, which effectively increase sampling from these undersampled genera: Minibiotus contains 12 positively identified sequences in this dataset, and Mesobiotus 15, though in this latter case many sequences identified only as Macrobiotus sp. appear to be positively identifiable as members of Mesobiotus (Figure 2, Figure S1).
One hundred and two of the test tardigrade 18S sequences were recovered inside the Hypsibiidae (Figure 2, Figure S2). Of these, 56 were found in a clade sister to Cryobiotus klebelsbergi. Prior to the addition of these sequences to the dataset, this species was known from three voucher specimen sequences, and as such this might represent further diversity within the genus in an undersampled cladistic region of the Hypsibiidae. These sequences are from glacial soil at a single locality in Colorado (Tables S1 and S2). As Cryobiotus klebelsbergi is understood to be a glacial obligate species (Zawierucha et al., 2022), this further corroborates the veracity of these identifications.
Ten of the 348 test sequences were recovered within the Superfamily Isohypsibioidea (Figure 2, Supporting Figure S3). Of these, six are recovered as sister to a calibration Dianea sequence (labelled Isohypsibius papillifer in GenBank), two as sister to the clade comprising the aforementioned Dianea/Isohypsibius sequence and the six test sequences, one as sister to Pseudobiotus and one as sister to Eremobiotus. Notably, each of these divisions corresponds to a division by type locality (Table S2). In the case of the Eremobiotus and Pseudobiotus sequences especially, these represent relatively large additions to the known 18S sequence pool for these genera, as only five Eremobiotus sequences and four Pseudobiotus 18S sequences can be found in GenBank. In addition, six of these 10 test sequences, including the Pseudobiotus 18S sequence identified by this study, are above the 1000 base threshold that we used to describe "phylogenetically reliable" sequences, and so could prove useful for future phylogenetic analyses. [Correction added on 30 March 2023, after first online publication: The taxonomy for some of the taxa has been corrected in the paragraph]
A further 10 test sequences were recovered in the Ramazzottiidae (Figure 2, Figure S4). These sequences were recovered in a single clade, sister to a calibration Ramazzottius varieornatus. Ramazzottius is one of the more widely studied genera of Tardigrada (Horikawa, 2008; Tanaka et al., 2015; Yoshida et al., 2017; Fukuda & Inoue, 2018; Neves et al., 2020) and whilst these sequences likely represent further sampling of R. varieornatus, they may additionally belong to a new sister species within the genus. Here, this methodology reveals its potential application to direct future work, suggesting that further sampling and collection of voucher specimens in Edinburgh (from which all these sequences originate) may reveal a new species of Ramazzottius.
Similarly, the five sequences recovered from the Milnesiidae were found nested in a monophyletic group containing a number of calibration Milnesium tardigradum sequences (Figure 2, Figure S5).
Two test sequences were recovered inside the Murrayidae (Figure 2, Figure S6). These were both recovered as sister to a clade comprising three calibration sequences of Murrayon dianeae. As provenance of the Murrayidae is part of a broader systematic debate regarding the composition of eutardigrade systematics, the prospect of new 18S sequences for this group is particularly exciting (Guidetti et al., 2005;Guil et al., 2019). Unfortunately, both test sequences were below 500 nucleotides in length.
Of particular note in this topology is the grouping of Adorybiotus with Richtersius (Figure 2, Supporting Figure S6). These affinities were presented in both  and Guil et al. (2019), but recent studies that propose Richtersiusidae (Guidetti et al., 2021; Stec & Morek, 2022) define it as comprising Richtersius and Diaforobiotus to the exclusion of Adorybiotus (which then becomes sister to the Murrayidae). Due to inadequate sampling from Diaforobiotus in this study, the five tardigrade 18S test sequences recovered as sister to Adorybiotus were instead identified as members of the Order "Parachela", rather than attributing them to Murrayidae, Adorybiotidae or Richtersiusidae. Parachela comprises this current phylogenetic controversy and a number of additional families (Macrobiotidae, Eohypsibiidae, Hypsibiidae, Calohypsibiidae, Ramazzottiidae and Isohypsibiidae) that are highly sampled in this study, and the controversial grouping is not found as an early diverging group within Parachela (being sister to Macrobiotidae). As such, this revision should in theory be robust both to the lack of resolution in our own phylogeny and the contentious nature of the family organisation in this section of the tardigrade tree in the wider literature (Guidetti et al., 2021; Guil et al., 2019; Stec & Morek, 2022). [Correction added on 30 March 2023, after first online publication: The taxonomy for some of the taxa has been corrected in the paragraph]
FIGURE 2 An 18S phylogeny of the Tardigrada, including all 630 sequences previously identified as only "tardigrade 18S" in GenBank. These 18S sequences are identified as red branches, and each family is independently coloured with the outgroup and interstitial branches in black. The same phylogeny with tip names and bootstrap node support values can be found in the Appendix S1. Detailed inspections at the family level can be found in Figures S1-S7. [Correction added on 30 March 2023, after first online publication: Figure 2 was corrected.]
Outside of the Eutardigrada, only one of the test sequences in GenBank belonged to the Heterotardigrada (Figures 1 and S7). This is unsurprising, as all the type localities of the unidentified sequences were terrestrial (Figure S1), and many known heterotardigrade species are predominantly marine (Gąsiorek et al., 2018). The sole unidentified 18S heterotardigrade sequence was recovered as sister to Viridiscus viridissimus (labelled Echiniscus viridissimus in GenBank). Notably, in this tree, V. viridissimus and the test sequence were recovered as sister to the newly discovered genus Nebularmis (Gąsiorek et al., 2019), though the lower bootstrap support of this particular topology suggests this arrangement may be artefactual. This sequence is thereby identified as "Echiniscoidea". [Correction added on 30 March 2023, after first online publication: The taxonomy for some of the taxa has been corrected in the paragraph]
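As a cross-check, the per-group counts reported in this section can be tallied against the 348 test sequences. The snippet below is a simple worked sum using only figures stated in the text; the dictionary keys are labels chosen for this illustration.

```python
# Test sequences assigned to each group, as reported in the results above.
assigned = {
    "Macrobiotidae": 213,
    "Hypsibiidae": 102,
    "Isohypsibioidea": 10,
    "Ramazzottiidae": 10,
    "Milnesiidae": 5,
    "Murrayidae": 2,
    "Parachela (family unresolved)": 5,
    "Echiniscoidea": 1,
}

# Breakdown of the Macrobiotidae assignments.
macrobiotidae = {
    "M. hufelandi complex": 129,
    "Paramacrobiotus": 82,
    "Minibiotus": 1,
    "Mesobiotus": 1,
}

total_assigned = sum(assigned.values())            # 348
macrobiotidae_total = sum(macrobiotidae.values())  # 213

# Share of all test sequences falling in the M. hufelandi complex.
hufelandi_percent = round(100 * macrobiotidae["M. hufelandi complex"] / total_assigned)  # 37
```

The group counts sum to 348, the Macrobiotidae breakdown sums to 213, and the hufelandi complex accounts for about 37% of the test sequences, in agreement with the figures given in the text.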

| Cautions and concerns
A primary concern explicated by this analysis, and by Pappalardo & Osborn in Bachmann et al. (2023), is the number of misattributed sequences in GenBank: sequences identified as belonging to, in some cases, a different phylum owing to the nature of eDNA studies. In our initial BLAST analysis, we rejected 283 of the 631 test sequences (44.8%) identified as Tardigrada 18S (Table S1). The identities of these sequences were determined from the taxonomic family of the best hit with an e-value greater than e-100 against an 18S sequence that was published in a paper that contained an accompanying morphological description of the specimen. According to a BLAST search against the nt database filtered to remove uncultured and environmental samples, they comprised other ecdysozoans (Acari, Collembola, Nematoda), lophotrochozoans (Annelida), plants (Cucurbitales, Lamiales, Rosales), fungi (Capnodiales, Pezizomycotina), ciliates and bacteria. Of these contaminant sequences, 236 (83.3%) belonged to the Nematoda, which we suggest may be an artefact of the environmental DNA nature of most of these samples, combined with the phylogenetically close relationship of the two clades.
The number of misidentifications in this study shows that the use of older environmental DNA samples to establish the identity of sequences from environmental DNA studies would be poor practice: it could readily cause an agglomeration of misidentification errors. Thankfully, the web interface for GenBank BLAST offers the ability to filter out these sequences with a simple tick box option; being explicit about its use when describing BLAST searches in methods sections is clearly vital information for readers and reviewers. The fear of an agglomeration of errors causing misidentification is particularly concerning for the nematode contaminants in this study, as the sheer number of these misidentified sequences in an unfiltered BLAST search may drown out hits against correctly identified nematode sequences. This could cause new nematode sequences to be misidentified as belonging to Tardigrada, considering the closely related nature of the two phyla and the numerous ecologies they share (Giribet & Edgecombe, 2017).
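Where BLAST results have already been downloaded without that option selected, a comparable filter can be approximated afterwards. The sketch below is a title-based heuristic and not the mechanism NCBI itself uses: the marker strings, the function names and the `(accession, title)` input format are all assumptions made for illustration.

```python
# Assumed marker strings; hit titles containing any of these are dropped.
ENVIRONMENTAL_MARKERS = ("uncultured", "environmental sample", "metagenome")


def is_environmental(hit_title):
    """Return True if a hit title looks like an uncultured or eDNA record."""
    title = hit_title.lower()
    return any(marker in title for marker in ENVIRONMENTAL_MARKERS)


def drop_environmental(hits):
    """Filter a list of (accession, title) pairs, keeping vouchered-looking hits."""
    return [(acc, title) for acc, title in hits if not is_environmental(title)]
```

Whichever approach is taken, reporting it explicitly in the methods allows readers to judge whether eDNA-derived records could have influenced the identifications.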
More positively, establishing a more detailed molecular taxonomy for sequences previously only identified to the phylum level is a worthwhile endeavour. Sixty-eight of the 348 sequences were over 1000 nucleotides long, and so are readily usable for more detailed phylogenetic analyses. Of these, six belonged to the Isohypsibiidae and five to the Adorybiotidae/Richtersiidae, which are both historically undersampled families.
Whilst the remaining 280 test 18S sequences are likely unsuitable for serious phylogenetic studies, they remain a useful addition to the nt database. These sequences, with their updated definitions (provided in Table S1 for non-tardigrade sequences and Table S2 for tardigrade sequences), will enable researchers to better understand new genomic data prior to phylogenetic analysis. These definitions are now available for users of GenBank to cross-reference in their own studies.

| The future of GenBank maintenance?
Currently, only the author of a sequence record can revise the submitted definitions. However, as GenBank continues to grow and the field continues to develop, this will rapidly become an unsustainable way of maintaining a functioning database. Some prolific authors have contributed an immense number of records to these public resources, and this personal method not only places a great burden on them, but also leaves no clear best practice once they have retired, resigned or simply no longer wish to take on this additional work. This paper shows that, with a clear, repeatable and transparent methodology, crowd-sourcing revisions is not only achievable using publicly available resources, but potentially capable of unlocking powerful reserves of genomic data for future use.
Our hope is that consistent, vigilant revision of GenBank data, whether from eDNA studies or from voucher specimens, will allow us to further improve our ability to obtain reliable family level definitions through BLAST and HMM (Camacho et al., 2009; Finn et al., 2011), and in turn aid us in filtering out contamination and performing initial sequence identifications through software such as blobtools  and CroCo (Simion et al., 2018).

| CONCLUSIONS
It is imperative that we maintain public databases at the highest quality if we want them to achieve their goal of effectively disseminating scientific information throughout the research community. Whilst the misidentification in datasets such as this may initially be disheartening, through simple methodological protocols such as the one highlighted in this manuscript, we can not only remove problematic data, but use the wealth of reliable resources that are still contributed by these public databases to effectively improve them. In addition, by using only public, open access resources, this study demonstrates that considered revision of GenBank data is something anyone with a basic phylogenetic skillset and some spare time could undertake. Though time itself is a precious resource in our modern world, maintaining and improving these public databases is a task achievable without personal access to expensive computational equipment, thanks to publicly accessible servers. The large-scale revisions to GenBank that are implied by Pappalardo & Osborn in Bachmann et al. (2023) and this study may initially seem daunting, but with simple protocols and a widespread devotion of just a small amount of time, the mountain of data could become far more manageable, and result in not just a "fixed" database, but one fundamentally better than before.

ACKNOWLEDGEMENTS
J.F.F. would like to thank Prof. Bachmann for organising the symposium that gave rise to the experimental concept and Dr. Pappalardo and Dr. Osborn for the work that directly inspired it.

FUNDING INFORMATION
J.F.F. is currently funded by the Invertomics project (Forskningsrådet project number: 300587).