Exhaustive reanalysis of barcode sequences from public repositories highlights ongoing misidentifications and impacts taxa diversity and distribution

Accurate species identification often relies on public repositories to compare the barcode sequences of the investigated individual(s) with taxonomically assigned sequences. However, the accuracy of identifications in public repositories is often questionable, and the names originally given are rarely updated. For instance, species of the Sea Lettuce (Ulva spp.; Ulvophyceae, Ulvales, Ulvaceae) are frequently misidentified in public repositories, including herbaria and gene banks, making species identification based on traditional barcoding unreliable. We DNA barcoded 295 individual distromatic foliose strains of Ulva from the North‐East Atlantic for three loci (rbcL, tufA, ITS1). Seven distinct species were found, and we compared our results with all worldwide Ulva spp. sequences present in the NCBI database for the three barcodes rbcL, tufA and the ITS1. Our results demonstrate a large degree of species misidentification, where we estimate that 24%–32% of the entries pertaining to foliose species are misannotated and provide an exhaustive list of NCBI sequences reannotations. An analysis of the global distribution of registered samples from foliose species also indicates possible geographical isolation for some species, and the absence of U. lactuca from Northern Europe. We extended our analytical framework to three other genera, Fucus, Porphyra and Pyropia and also identified erroneously labelled accessions and possibly new synonymies, albeit less than for Ulva spp. Altogether, exhaustive taxonomic clarification by aggregation of a library of barcode sequences highlights misannotations and delivers an improved representation of species diversity and distribution.


1 | INTRODUCTION
Species identification of biological specimens is paramount for assessing the diversity of ecosystems (Johannesson & Andre, 2006), identifying invasion events (Dunbar et al., 2021; Estoup & Guillemaud, 2010), and qualifying the distribution of species of interest (Mendez et al., 2010). While morphological characteristics can be used for species identification (Dugon et al., 2012), precise species identification often relies on the analysis of "barcode" sequences, which are small standardized genetic loci used for taxonomic identification of the samples (Valentini et al., 2009). Indeed, morphological characters can be a poor indicator of the underlying complexity of the genetic diversity within a genus (Packer et al., 2009).
For example, due to the phenotypic plasticity of the genus Ulva (the type genus of the Ulvophyceae, Ulvales and Ulvaceae) in response to environmental factors, and the relatively subtle morphological differences between species (Hofmann et al., 2010; Malta et al., 1999), DNA barcoding is necessary to attribute species names to specimens, even for the most common species. DNA barcoding for the purpose of identifying specimens relies on the amplification and sequencing of specific loci in the genome. In plants and algae, these are often chloroplast markers such as rbcL and tufA, but also nuclear markers such as parts of the 45S rRNA repeats (most commonly the Internal Transcribed Spacer 1 [ITS1]) (Fort et al., 2018, 2019, 2020; Miladi et al., 2018; O'Kelly et al., 2010).
The sequences obtained from those barcodes are then compared with sequences associated with species names which are publicly available in repositories, such as the National Center for Biotechnology Information (NCBI).
Typically, NCBI sequences with high percentage identity compared with the query sequence are considered as belonging to the same species and used as reference for phylogenetic trees when no statistical inference of species delimitation is used (Heesch et al., 2009; Saunders & Kucera, 2010; Steinhagen et al., 2019). The risk in such cases is that the species attributed to the matching sequences present in the NCBI can be erroneous, leading to the misidentification of the investigated individual. Indeed, the taxonomic information in the NCBI is not always accurate, and often contains "putative" species names (Garg et al., 2019), erroneous classifications (Chowdhary et al., 2019; Nasehi et al., 2019), or nonupdated species names following nomenclature adjustments.
Therefore, improving the nomenclature and taxonomic classification of sequences of any genus of interest requires a careful exhaustive reanalysis of barcode sequences, to ensure accurate classification of new specimens, and to provide an updated list of reannotations.
Here, we deployed such an analytical framework to revisit the phylogeny of Ulva spp., a genetically diverse group of green macroalgal species ubiquitous in the world's oceans, as well as in brackish and even freshwater environments. Over 400 Ulva names have been coined, of which about 90 are currently recognised as taxonomically valid; many of these are uncommon or rare, and only about 25 are frequently reported. The morphology of Ulva species can be grouped into two general types, one containing foliose "sheet-like" species (distromatic foliose blades commonly known as "Sea Lettuce"), and another with tubular or partially tubular thalli (monostromatic tubes formerly recognized as the genus Enteromorpha). However, the phenotypic plasticity between tubular and foliose morphotypes is not solely genetic, but can be based on both abiotic and biotic factors (Wichard et al., 2015). We generated DNA barcodes (rbcL, tufA, ITS1) on 185 strains of distromatic foliose Ulva from the North East Atlantic, and used data and species delimitation from our previous study containing another 110 strains, as a primer for large-scale phylogenetic analysis of all Ulva sequences for the three common barcodes present in the NCBI database. The main aim of this study was to develop an analytical framework allowing the extent of misannotations in the sequences of any taxon of interest to be highlighted, taking as proof of concept the case of distromatic foliose Ulva species. We provide a detailed view of the phylogenetic relationships and possible misannotations between all sequences in the NCBI database, and propose readjustments for misannotated NCBI accessions, a list of appropriate reference vouchers for large foliose species, and a nomenclature adjustment between certain Ulva species. Finally, we employed the same analytical framework for three other seaweed genera, Fucus, Porphyra and Pyropia, and identified clades containing misannotations and potential new synonymies.

KEYWORDS
aquaculture, DNA barcoding, phylogeny, Sea lettuce, Ulva

2 | MATERIALS AND METHODS

| Foliose Ulva sample collection and DNA extraction
We collected individual thalli from foliose Ulva individuals with a thallus area >1,000 mm² at 34 sites in Ireland, Brittany (France), Spain, Portugal, the United Kingdom and the Netherlands between June 2017 and September 2019. The list of strains and associated metadata are available in Table S1. A total of 185 strains were collected for this study. On collection, samples were placed in clip-seal bags filled with local seawater and sent to Ireland in cold insulated boxes.
On arrival, thalli were thoroughly washed with artificial seawater, and a ~50 mm² piece of biomass was collected and placed in screw-cap tubes (Micronic). The tubes were immediately flash-frozen in liquid nitrogen and stored at -80°C. Samples were then freeze-dried and ground to a fine powder using a ball mill (Qiagen TissueLyser II), and ~5 mg of powder was used for DNA extraction, using the magnetic-beads protocol described in Fort et al. (2018).

| DNA amplification and Sanger sequencing
The extracted DNA was amplified using three different primer combinations to obtain partial sequences for the nuclear 45S rRNA repeats (ITS1), as well as the chloroplast rbcL and tufA barcodes.
The primers used in this study are available in Table S2, and originate from Heesch et al. (2009) and Saunders and Kucera (2010) for rbcL and tufA, respectively. The ITS1 primers were designed from the data set obtained in , and used in Fort, Linderhof, et al. (2021). PCR amplification was performed in a 25 μl reaction volume containing 1 μl of undiluted DNA, 0.65 μl of 20 pmol forward and reverse primers, 9.25 μl of MilliQ water and 12.5 μl of MyTaq Red mix (Bioline). The PCR protocol used 35 cycles of denaturation at 95°C for 30 s, annealing at 60°C for 30 s and extension at 72°C for 30 s. PCR products were precipitated using 2.5 volumes of 100% EtOH and 0.1 volume of 7.5 M ammonium acetate and incubated on ice for 30 min. Pellets were centrifuged at 4,000 g for 30 min at 4°C, and washed twice with 75% EtOH. Finally, PCR amplicons were sent to LGC Genomics GmbH for Sanger sequencing using the forward primer for each barcode.

| Data set compilation for phylogenetic analyses
Our phylogenetic analysis aimed to consider all sequences attributed to Ulva species in the NCBI database (foliose as well as tubular and partially tubular species), and to detect any evidence of species misannotation therein. We designed an analysis pipeline that could be used for any other taxon of interest, summarised in Figure 1.
Command line codes and links to download the software used are available in Appendix S1. We downloaded all available sequences in the NCBI for ITS, rbcL and tufA (as of 13 July 2020), in addition to the sequences from our previous study.
The search keywords were as follows: "Ulva (organism) AND internal transcribed" for ITS sequences, "Ulva (organism) AND rbcL (gene) AND plastid (filter)" for rbcL sequences, and "Ulva (organism) AND tufa (gene) AND plastid (filter)" for tufA sequences. This search strategy yielded 1,679 ITS sequences (1,975 in total including this study), 1,432 rbcL sequences (1,732 in total) and 1,114 tufA sequences (1,393 in total). NCBI entries that did not contain species information (containing "Ulva sp." as organism) were then removed from the data set, by selecting all sequences not containing "Ulva sp." in their title, and using Samtools faidx (Li et al., 2009) to extract their corresponding sequences. This filtering yielded 1,726, 1,312 and 1,321 sequences for ITS1, rbcL and tufA, respectively.
Sequences were then aligned using MAFFT (Katoh et al., 2019), with the default settings for rbcL and tufA, and the iterative FFT-NS-i method for the ITS1 alignment, due to the numerous gaps present. Because each study might amplify a slightly different portion of the barcodes due to the use of different primers, we then used trimAl (Capella-Gutiérrez et al., 2009) to remove nucleotide positions that were absent in (i) more than 60% of the sequences for rbcL and tufA (trimal -gt 0.4), and (ii) more than 91% of the sequences for ITS1 (trimal -gt 0.09). This step effectively trimmed the 5′ and 3′ ends of the alignment so as to retain informative nucleotides, thereby avoiding large missing positions due to the use of different primers in different studies. Sequences containing more than 50% unknown bases in the trimmed alignments were then removed for rbcL and tufA (trimal -seqoverlap 50), as were sequences containing more than 70% unknown bases for the ITS1 alignment (trimal -seqoverlap 70). Two different filtering thresholds were used because the ITS1 alignment contains gaps that are biologically relevant (the ITS1 length varies between species), while the rbcL and tufA coding sequences generally do not vary in length, but only in sequence. The filtering steps yielded final alignments containing 1,245 sequences (270 bp), 1,062 sequences (1,231 bp) and 1,320 sequences (801 bp) for ITS1, rbcL and tufA, respectively. The 5′ and 3′ gaps introduced by missing positions in some of the sequences were converted to "n" (i.e., unknown) bases, because these missing terminal nucleotides were due to the use of different primers (or sequencing length), and not to genetically relevant differences.

FIGURE 1 Analysis framework used in this study. The list of scripts and software is available in Appendix S1
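The filtering logic described above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study (which relied on Samtools and trimAl, see Appendix S1). It only mirrors the ideas of excluding "Ulva sp." entries, dropping alignment columns below a non-gap threshold, and masking terminal gaps as "n". The function names and the toy alignment are hypothetical.

```python
def drop_unidentified(records, marker="Ulva sp."):
    """Keep only (title, sequence) pairs whose title does not contain the marker."""
    return [(title, seq) for title, seq in records if marker not in title]


def trim_gappy_columns(aligned, min_non_gap_fraction):
    """Drop alignment columns where too few sequences carry a base.

    A column is kept when the fraction of sequences without a gap is at
    least `min_non_gap_fraction` (the same idea as a gap threshold of 0.4
    keeping columns present in at least 40% of the sequences).
    """
    if not aligned:
        return []
    n_seqs = len(aligned)
    keep = [
        col
        for col in range(len(aligned[0]))
        if sum(seq[col] != "-" for seq in aligned) / n_seqs >= min_non_gap_fraction
    ]
    return ["".join(seq[col] for col in keep) for seq in aligned]


def mask_terminal_gaps(seq):
    """Replace leading and trailing gaps with 'n'; internal gaps are left untouched."""
    without_lead = seq.lstrip("-")
    lead = len(seq) - len(without_lead)
    core = without_lead.rstrip("-")
    trail = len(without_lead) - len(core)
    return "n" * lead + core + "n" * trail


# Toy alignment (hypothetical data, four sequences of six columns).
toy = ["--ACG-", "-TACGT", "--AC--", "ATACG-"]
trimmed = trim_gappy_columns(toy, 0.4)
masked = [mask_terminal_gaps(s) for s in trimmed]
```

Internal gaps are deliberately preserved by `mask_terminal_gaps`, in line with the point made above that internal ITS1 gaps can be biologically relevant whereas terminal gaps reflect primer choice or read length.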
The Fucus and Porphyra + Pyropia data sets were generated as above, using the search terms "Fucus (organism) AND (

| Phylogenetic analyses
We used both maximum likelihood and Bayesian MCMC phylogenetic analyses for the ITS1, rbcL and tufA data sets to create maximum likelihood and Bayesian trees for each barcode. First, the best evolutionary model for each of the three alignments was determined based on their Akaike information criterion (AIC) score using jModeltest 2 (Darriba et al., 2012; Posada & Buckley, 2004). For all three alignments, general time reversible + gamma distribution + proportion of invariant sites (GTR + G + I) was deemed the most appropriate. Maximum likelihood trees were obtained using RAxML-NG (Kozlov et al., 2019) with the "--all" option (20 maximum likelihood inferences, then bootstrap trees). Bootstrapping was stopped automatically using an MRE-based bootstopping test (Pattengale et al., 2010) once convergence values fell below 0.03. Bootstrap values were computed using the "--bs-metric tbe" option, representing transfer bootstrap expectation (TBE) values, which are expected to produce higher support for large trees with hundreds of sequences (Lemoine et al., 2018) than classical Felsenstein bootstrap proportions (FBP). Bayesian MCMC analyses were performed using MrBayes with MPI support (Ronquist et al., 2012), with a number of generations that varied between the three data sets, until the average standard deviation of split frequencies was at most 0.05 and effective sample sizes (ESSs) were higher than 200 for all parameters.
For species delimitation, we used the same method as per Fort et al. (2019) and , with a general mixed Yule coalescent (GMYC) model (Fujisawa & Barraclough, 2013; Pons et al., 2006) in BEAST, and 50 million Markov chain Monte Carlo (MCMC) generations, using the BEAGLE library to reduce computation time (Suchard & Rambaut, 2009). Convergence was confirmed in Tracer (Rambaut et al., 2018), with an ESS score >200 for all relevant parameters.
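As an illustration of AIC-based model choice, the short Python sketch below ranks candidate substitution models using AIC = 2k - 2 ln L. It is not jModeltest 2 and does not reproduce its output. The log-likelihoods are hypothetical placeholders, and the parameter counts assume that branch lengths, which are shared by all models on a fixed tree, are omitted.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: AIC = 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def best_model_by_aic(fits):
    """Return the best model name and all AIC scores.

    `fits` maps a model name to a (log-likelihood, free-parameter count) pair.
    """
    scores = {name: aic(lnl, k) for name, (lnl, k) in fits.items()}
    best = min(scores, key=scores.get)
    return best, scores


# Hypothetical values for illustration only; these are not results from this study.
example_fits = {
    "JC": (-5200.0, 0),        # no free substitution parameters
    "HKY+G": (-5050.0, 5),     # kappa + 3 base frequencies + gamma shape
    "GTR+G+I": (-5000.0, 10),  # 5 rates + 3 frequencies + gamma shape + p-inv
}
best, scores = best_model_by_aic(example_fits)
```

The penalty term 2k is what prevents the most parameter-rich model from winning automatically: a richer model is preferred only when its gain in log-likelihood outweighs the added parameters.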
For detecting putative species disagreement within clades, all species names of the accessions present within GMYC clusters were compared and a percentage agreement metric per cluster was generated.
For each cluster, the number of accessions sharing the most frequent species name was divided by the total number of accessions within the clade. This ratio indicates how divergent species names are within the GMYC clade, and any clade below 100% agreement may indicate a possible misannotation or new synonymies. The R script to generate the species delimitation and this ratio is available in Appendix S2.
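The agreement ratio described above can be sketched in a few lines of Python. This is an illustrative reimplementation under stated assumptions, not the R script of Appendix S2. The input format (one cluster identifier and one species name per accession) and the example labels are hypothetical.

```python
from collections import Counter, defaultdict


def cluster_agreement(assignments):
    """Compute the percentage agreement of species names per cluster.

    `assignments` is an iterable of (cluster_id, species_name) pairs, one
    per accession. Returns {cluster_id: (most_frequent_name, percent)},
    where percent = 100 * (count of most frequent name) / (cluster size).
    """
    names_by_cluster = defaultdict(list)
    for cluster_id, species in assignments:
        names_by_cluster[cluster_id].append(species)

    agreement = {}
    for cluster_id, names in names_by_cluster.items():
        top_name, top_count = Counter(names).most_common(1)[0]
        agreement[cluster_id] = (top_name, 100.0 * top_count / len(names))
    return agreement


# Hypothetical example: cluster 1 mixes two names, cluster 2 is consistent.
example = [
    (1, "Ulva fenestrata"), (1, "Ulva fenestrata"),
    (1, "Ulva fenestrata"), (1, "Ulva lactuca"),
    (2, "Ulva australis"), (2, "Ulva australis"),
]
result = cluster_agreement(example)
flagged = sorted(cid for cid, (_, pct) in result.items() if pct < 100.0)
```

Clusters listed in `flagged` are those that would warrant closer inspection. A value below 100% does not by itself distinguish a misannotation from a synonymy or from a poorly resolved species boundary.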

| Taxonomic assignment of sequence names
Regarding foliose Ulva species, since several species names have been found to be synonymous, we used the species names listed in Table 1 as our reference. Where holotype or lectotype reference sequences were available, we attributed the species name of the reference to all sequences within the same GMYC clade. Where such type sequences were not available, we based our species attribution on comparisons with sequences from the literature and on the GMYC clustering, with the caveat that the nomenclature of the GMYC clade could change once holotype sequences become available. The rationale behind the selection of reference sequences is detailed in Appendix S1.

| Species distribution of distromatic foliose Ulva species
The country of origin, GPS coordinates, specimen name and publication name of all of the NCBI entries in the three data sets were recovered using custom Python scripts (Appendix S3 and S4), restricted to vouchers assigned in our analysis as belonging to the 11 main distromatic foliose Ulva species (namely, U. australis Setchell & N. L. Gardner, U. arasakii Chihara and U. ohiohilulu H. L. Spalding & A. R. Sherwood), and Ulva sp. A. Publications associated with NCBI entries missing GPS coordinates and/or location of origin were manually searched to retrieve GPS coordinates where available. Accessions whose area of origin was uncertain were removed from the analysis. Duplicated specimens (i.e., specimens with more than one barcode sequenced in the NCBI) were removed and only one entry was kept. The complete list of vouchers, specimen, name, publication, GPS coordinates and proposed species attribution is available in Table S3. The world map and pie-chart distribution of Ulva species was created in R using the package rworldmap (South, 2011).

TABLE 1 Names and synonyms used in this study

FIGURE 2 Maximum likelihood tree of the rbcL alignment, rooted on Umbraulva sequences. Coloured clades represent distromatic foliose species found in this study. Shaded clades represent tubular or partially tubular species and/or species with no representative in this study. Numbers, shaded and/or coloured clades represent species clusters determined using GMYC. Full trees including bootstrap values and Bayesian posterior probabilities are available in Figure S1
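The record clean-up used for the distribution analysis (dropping entries of uncertain origin and keeping a single entry per specimen) could be sketched as follows. This is a hedged illustration only and is not the code of Appendix S3 or S4. The field names (`accession`, `specimen`, `country`) and the example records are hypothetical.

```python
def clean_distribution_records(entries):
    """Drop entries without a known origin, then keep one entry per specimen.

    `entries` is a list of dicts with the hypothetical keys 'accession',
    'specimen' and 'country'. The first entry seen for each specimen is
    kept, so a specimen sequenced for several barcodes is counted once.
    """
    seen_specimens = set()
    kept = []
    for entry in entries:
        if not entry.get("country"):
            continue  # origin uncertain: excluded from the distribution map
        key = entry["specimen"].strip().lower()
        if key in seen_specimens:
            continue  # same specimen already recorded via another barcode
        seen_specimens.add(key)
        kept.append(entry)
    return kept


# Hypothetical records: specimen V1 has two barcodes, V3 has no origin.
records = [
    {"accession": "ACC0001", "specimen": "V1", "country": "Ireland"},
    {"accession": "ACC0002", "specimen": "V1", "country": "Ireland"},
    {"accession": "ACC0003", "specimen": "V2", "country": "Spain"},
    {"accession": "ACC0004", "specimen": "V3", "country": None},
]
cleaned = clean_distribution_records(records)
```

Deduplicating on the specimen name rather than on the accession number matters here, because each barcode of a given specimen receives its own accession and would otherwise inflate the apparent number of records at a site.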

3 | RESULTS
Using the analysis pipeline we created, we recovered and analysed all Ulva sequences in the NCBI, as well as 185 additional strains from the North-East Atlantic sequenced in this study, for the three most common barcodes used in Ulva phylogeny, namely rbcL, tufA and ITS1.

| Analysis of all Ulva spp. rbcL sequences from public repositories
We used the rbcL data set generated in this study, that from Fort, Heesch et al., 2009), as well as a single clade containing both U. lactuca and U. ohnoi, and another clade containing U. rigida and U. adhaerens. The full maximum likelihood tree (including bootstrap support), the Bayesian MCMC analysis tree (including probabilities), and the species names of the entries for rbcL can be found in Figure S1 and Table S3. Altogether, we found a relatively low agreement between the species names assigned to the NCBI vouchers and the GMYC clusters for rbcL, with only seven out of 24 GMYC clusters containing 100% of sequences with the same species name annotation (Figure S2). Disagreements between GMYC clades and the species names within them do not necessarily indicate misannotations, because the GMYC analysis may detect species boundaries poorly with this barcode.
Nonetheless, the results show that rbcL sequences are probably poor at defining Ulva species, and that each clade should be investigated in detail, as significant naming discrepancies are present.

| Analysis of all tufA sequences from public repositories
We performed the same analysis using the tufA barcode (Figure 3, Figure S3 and Table S3). We found significantly more species clusters than for the rbcL barcode (40 species clusters, confidence interval 37-46).

| Analysis of all ITS1 sequences from public repositories
Finally, the analysis was repeated on the ITS1 barcode data set (Figure 4, Figure S4 and Table S3). Once again, the results are in general agreement with the previous barcodes, particularly with tufA.
Indeed, species delimitation predicts 42 species clusters (compared with 40 with tufA), with a confidence interval of 34-59.
The U. australis, U. gigantea and U. ohnoi clades are well conserved, with only minor discrepancies (  Figure S2). This shows that a significant number of misannotations are probably present in the ITS sequences of the genus Ulva.

| Impact of NCBI accession reanalysis on species distribution
After reassigning species names for each NCBI entry, we generated a world map of the distribution of the eleven large foliose Ulva species for which there is genetic evidence (Figure S5). Strikingly, no U. lactuca individuals are present in the North Atlantic and the Baltic Sea, outside of a specimen recovered from an aquarium and misannotated as U. laetevirens (Vranken et al., 2018), and a single specimen in Massachusetts, USA. As shown above, the reports of U. lactuca in many regions are all referable to U. fenestrata. Importantly, while the number of misannotations in the NCBI is significant, the problem is even greater in other databases that do not rely on DNA sequencing for reporting species records. For instance, the Ocean Biodiversity Information System (OBIS) contains >4,700 records for U. lactuca, most of which are located in the North Atlantic, in seeming contradiction with our results (Figure 5). Hence, reanalysis of barcode sequences can drastically change species distribution.

For Fucus spp., we used all publicly available sequences for the COX1 and nrRNA-ITS barcodes, and generated a maximum likelihood tree and species delimitation as for the Ulva data sets. The GMYC analysis predicts eight and nine species for COX1 and nrRNA-ITS sequences, respectively (Figure 6), with the Fucus distichus clade being split into six different predicted species by the GMYC analysis of COX1 sequences. For the nrRNA-ITS data set, the clades containing F. serratus and F. vesiculosus species names are separated into two and four predicted clades, respectively. Overall, the species names within the GMYC clusters are well conserved, with 5/8 and 7/9 clusters displaying 100% agreement (Figure S2). However, one clade in both barcode data sets appears problematic. F. vesiculosus and F. spiralis sequences are intertwined in both data sets. This indicates that the two species are frequently misannotated. Indeed, sequences with both names are in some cases indistinguishable, with 100% identity. The full maximum likelihood trees are available in Figure S6.

The Porphyra and Pyropia data set contains 1,296 COX1 sequences, separated into 62 GMYC clusters (Figure 7, full tree available in Figure S7). Unlike Ulva, the species names within GMYC clusters appear remarkably consistent in this data set, with only twelve out of 62 GMYC clusters containing sequences with different species names (Figure S2). Furthermore, most of those relate to clusters containing vouchers with undetermined species names, and hence do not represent misannotations per se. Only one clade is potentially problematic, with sequences named either Porphyra linearis or Porphyra umbilicalis, despite being identical in sequence.

FIGURE 3 Maximum likelihood phylogenetic tree of 1,320 Ulva spp. tufA sequences, and description of the entries belonging to the main distromatic foliose Ulva species. Maximum likelihood tree of the tufA alignment, rooted on Umbraulva sequences. Coloured clades represent distromatic foliose species found in this study. Shaded clades represent tubular or partially tubular species and/or species with no representative in this study. Numbers, shaded and/or coloured clades represent species clusters determined using GMYC. Full trees including bootstrap values and Bayesian posterior probabilities are available in Figure S3

FIGURE 4 Maximum likelihood phylogenetic tree of 1,245 Ulva spp. ITS1 sequences, and description of the entries belonging to the main distromatic foliose Ulva species. Maximum likelihood tree of the ITS1 alignment, rooted on Umbraulva sequences. Coloured clades represent distromatic foliose species found in this study. Shaded clades represent tubular or partially tubular species and/or species with no representative in this study. Numbers, shaded and/or coloured clades represent species clusters determined using GMYC. Full trees including bootstrap values and Bayesian posterior probabilities are available in Figure S4
Altogether, the three additional data sets show a lower extent of potential misannotations than the Ulva data sets, even when using a species-rich family such as the Bangiaceae. We generated a histogram of the percentage of agreement in the species names of all GMYC clusters between the three groups of species investigated here (Figure 8), which shows a significant number of GMYC clusters below 100% agreement in Ulva, compared to Fucus, Porphyra, and Pyropia data sets.

| Limitations of species delimitation using single barcodes
In this study, we endeavoured to assess exhaustively the genetic information available for our taxa of interest. We used all publicly available sequences from the NCBI for three common barcodes. Notably, species delimitation using such a large number of sequences yields relatively large confidence intervals for the number of species clusters. For instance, using rbcL alone did not allow the separation of certain taxa that were previously shown to be separate species (Hiraoka et al., 2004; Hughey et al., 2019), such as U. sp. A and U. lacinulata or U. ohnoi and U. lactuca. This could be due to the use of a shorter alignment for rbcL in this study, as opposed to concatenated rbcL + tufA sequences in  for the U. sp. A/U. armoricana separation.

FIGURE 6 Maximum likelihood phylogenetic tree of Fucus spp. COX1 and nrRNA-ITS sequences. Numbers and shaded clades represent species clusters determined using GMYC. Full ML trees are available in Figure S6
In addition, such a discrepancy is inherent to large-scale species delimitation analyses when using limited genetic information (Leliaert et al., 2014; Tang et al., 2014). Indeed, the presence of possibly spurious sequences in the entire data set can skew the speciation threshold of the GMYC analysis, especially when a single barcode containing a limited number of SNPs between species is used. This probably explains the relatively large confidence intervals we observed for rbcL. In contrast, using tufA alone we were able to separate U. lacinulata and U. sp. A, which is in agreement with previous studies (Hayden & Waaland, 2002; Heesch et al., 2009; Tan et al., 1999). The tufA barcode displays more SNPs than rbcL when comparing those two species (nine vs. two, respectively), allowing for a species delimitation between the two clades. The ITS1 barcode similarly allowed for the separation of those two species. However, while we are able to separate U. lactuca and U. ohnoi using tufA, U. ohnoi is separated into two different clades.
Similarly, U. linza, U. compressa, U. intestinalis and U. prolifera clades are separated into several clades. Finally, while seven U. reticulata vouchers originating from , are included in the U. ohnoi clade using the ITS1 barcode, these probably do not represent erroneous annotation, since in their study, Monotilla et al. (2018) showed that U. ohnoi and U. reticulata are sexually isolated, despite having little to no sequence divergence in this barcode sequence.
Thus, appropriate species delimitation analysis should ideally be performed on a larger amount of genetic information, such as full organellar genomes, or concatenated sequences from the same specimens. Additionally, other species delimitation algorithms are available, such as Poisson tree processes (PTP) or the automatic barcode gap discovery for primary species delimitation (ABGD) (Puillandre et al., 2012; Zhang et al., 2013). It is likely that using different methodologies for species delimitation will yield a different number of species clades in the same data set, and a combination of approaches could be used to delimit all Ulva species precisely. Regardless of precise species delimitation, however, the methodology described here allows putative clades and their associated sequence names to be tested quickly for possible misannotations or new synonymies. Notably, the use of "agreement of species names within clade" (Figure S2, Figure 8) from the GMYC output helps to identify potentially problematic clades and species names. It provides a visual representation of the diversity within the data set and serves as a stepping stone for in-depth reassessment of the taxonomy and diversity of genera of interest.
Regarding our findings with Ulva, the number of "species names" in the entries from the NCBI data set is 56, nine of which are classified as synonyms. Of the 47 unique species names remaining, this analysis, despite its limitations, found ~40 species clusters containing more than two sequences, thus broadly agreeing with the present number of species described in NCBI. These numbers are considerably lower than the number of currently accepted species taxonomically (84 according to ). This apparent discrepancy could be explained by the presence of numerous species entities described morphologically in past studies for which there is no genetic evidence. These specimens should be sequenced if they are available, or their type locality resampled, as the NCBI database probably only contains a subset of all Ulva species.

FIGURE 7 Maximum likelihood phylogenetic tree of Porphyra + Pyropia COX1 sequences. Shaded clades represent species clusters determined using GMYC. Full ML tree is available in Figure S7

FIGURE 8 Distribution of species names agreement per GMYC cluster between Ulva, Fucus and Porphyra + Pyropia data sets

| Nomenclature, taxonomy and species misidentifications in public repositories
The main issue with the use of public repositories to assign species names to sequences is the underlying quality of the species annotation within the repository. Two kinds of issue can be present: nomenclatural issues, where the naming of the taxa is erroneous, and taxonomical issues, where the relationships between taxa are at fault (de Queiroz, 2006). The analytical framework described here allows us to identify clades that contain sequences with different species names, which could represent new synonymies for nomenclatural adjustments, and/or detect problematic taxonomic relationships when sequences of the same species name are present in different clades.
Importantly, neither of those points requires prior knowledge of the nomenclature or taxonomy of the genus. For example, the presence of a significant number of U. lactuca sequences intertwined with U. fenestrata accessions in one clade highlights misannotation of many specimens of U. lactuca, while multiple clades containing only one species name could represent undescribed new taxa.
However, resolving the nomenclatural issues highlighted requires the systematic sequencing of all available types or the designation of epitypes. This work in Ulva is currently underway by Hughey and colleagues, leading to nomenclatural adjustments of several species names (Hughey et al., 2019). For example, the clade described here as U. lacinulata was previously referred to as U. laetevirens and U. armoricana (Kirkendale et al., 2013; Miladi et al., 2018). Following sequencing of the U. laetevirens lectotype, the name U. laetevirens was found to be synonymous with U. australis. Recently, the sequencing of the U. lacinulata lectotype revealed that it was the oldest valid and available name for this clade. We therefore renamed our accessions as U. lacinulata. Furthermore, sequencing of the U. rigida lectotype revealed that it belongs to the clade previously known as U. pseudorotundata, for which the oldest available name is U. rigida. Finally, given that the sequences pre- Here, using all sequences available, we found that this misidentification is indeed significant. Approximately 38% of sequences belonging to U. fenestrata are misannotated (127/334). Hence, caution should be exercised when comparing U. fenestrata sequences using BLAST, since some of the best matches will erroneously be referred to as "U. lactuca". We naturally support the use of the U. fenestrata type as described by Hughey et al. (2019) as the baseline for this species (Table 3). This significant amount of species misannotation leads to a drastic change in the species distribution of U. lactuca (Figure 5) and should not be overlooked. Only Ulva products labelled as containing "Ulva lactuca" are officially authorized for food consumption in Europe outside of France (Barbier et al., 2019). Furthermore, accurate description of the species used in the literature is essential for natural products biodiscovery, nutritional profile and traceability (Leal et al., 2016).
This highlights the need both to improve the identification of Ulva species and to change the European food regulation, either by including the Ulva species that are currently consumed under the name "Ulva lactuca" or by treating "Ulva lactuca" as a commercial name encompassing all foliose Ulva species.
Finally, our study shows that U. "rigida" (now U. sp. A) and U. lacinulata are also commonly misannotated in public repositories, as hinted at by Miladi et al. (2018). This is perhaps not surprising, since the sequences of the two species are relatively close, with only a handful of discriminating SNPs within the three barcodes, and interspecific hybrids are viable (Fort, Linderhof, et al., 2021). However, previous species delimitation analysis on rbcL + tufA using different methodologies (GMYC and bPTP), and the sequence identity differences between the organellar genomes of the two clades, indicate that they are probably two separate species and not a single taxon, as has been postulated. While we consider the U. lacinulata clade fully resolved owing to the presence of the U. lacinulata type within the clade, the sequence of the U. sp. A type specimen is not currently available in public repositories. Hence, sequences of the U. sp. A clade will need to be renamed when a suitable type is found.
Overall, the analysis of large foliose Ulva species showed that ~26% of entries in the NCBI database are misannotated, a percentage that is probably much higher when tubular or partially tubular species are considered. A significant number of the misannotations originate from recent nomenclature changes, which renders the work presented in this study particularly important, as we provide an exhaustive list of reannotations of NCBI sequences. In Table 3, we suggest a list of reference NCBI accessions for all three barcodes of the 11 large foliose Ulva species.
The rationale for this list is available in Appendix S1. As it is simple to update the information associated with NCBI sequences (see https://www.ncbi.nlm.nih.gov/genbank/update/), we encourage authors who have deposited sequences in the NCBI to update the "organism" information of their accession numbers where it is incorrect, thus avoiding the amplification and recurrence of misannotated Ulva species, such as U. lactuca, and to update taxonomic assignments following nomenclatural adjustments.
Concerning tubular and/or partially tubular species, the major hurdle found here lies in the separation of U. linza, U. procera and U. prolifera individuals. This appears to be an ongoing issue with the delimitation of the species within the Linza-Procera-Prolifera (LPP) complex (Cui et al., 2018; Kang et al., 2014; Leliaert et al., 2009), and will require further reanalysis of the NCBI entries after organelle sequencing of holotype specimens. The precise species delimitation of those clusters is outside the scope of this study, but it indicates that caution should also be taken when analysing the sequences of those species, as misidentifications are likely to be present.

Yang, whose genome has been released (Cao et al., 2020), appears to be frequently misannotated, given that sequences with this species name are present in multiple clades containing other species names.
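The pattern of a single species name recurring in several clades can be screened for with a short sketch, again assuming that each accession has already been assigned to a clade. The function name, accession identifiers and placeholder labels are hypothetical and serve only to illustrate the check.

```python
from collections import defaultdict


def names_spanning_clades(records):
    """Return species labels that occur in more than one clade.

    records: iterable of (accession, clade_id, species_label) tuples.
    The result maps each such label to the sorted list of its clades.
    """
    clades_by_name = defaultdict(set)
    for _accession, clade_id, label in records:
        clades_by_name[label].add(clade_id)
    return {
        name: sorted(clades)
        for name, clades in clades_by_name.items()
        if len(clades) > 1
    }


# Toy input with placeholder labels; none of these are real accessions.
toy_records = [
    ("ACC101", "clade_A", "species X"),
    ("ACC102", "clade_B", "species X"),
    ("ACC103", "clade_B", "species Y"),
    ("ACC104", "clade_C", "species Z"),
]
scattered = names_spanning_clades(toy_records)
print(scattered)  # {'species X': ['clade_A', 'clade_B']}
```

A name reported here is a candidate for closer inspection only, since it may reflect misannotation, unresolved species boundaries, or both.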
The striking consistency in the Bangiales data set over the Ulva one (Figure 8) Yang et al., 2020). Perhaps the ubiquitous distribution of Ulva, its phenotypical plasticity, and the slow release of holotype/lectotype specimen sequences contribute to the considerable discrepancies in Ulva taxonomy. We believe that a similar approach to that of the Bangiales order is needed to appropriately revisit Ulva nomenclature and taxonomy, and the analytical framework described here could be used as the first step towards that goal.

| CONCLUSIONS
Due to the increasingly large number of sequences being deposited in public repositories, it is becoming important to regularly reassess the genetic information of taxa of interest, to highlight ongoing species identification issues, update NCBI accessions with new nomenclatures, and potentially reassign names to previously uncharacterised synonymous species. Here, we investigated all Ulva, Fucus and Porphyra/Pyropia sequences in the NCBI public repository for common barcodes, as a contribution to clarifying the species composition and annotation of these four genera. This data set can be used for future species identification, accession validation and classification purposes, to ensure accurate representation of the species names and taxa within the databases. The analytical framework described here in detail could be transferred to any other taxa of interest, particularly those that show subtle morphological differences between taxa and contain large numbers of sequences and suspected misannotations.

ACKNOWLEDGEMENTS
The authors would like to thank Ricardo Bermejo (NUI Galway), Lars McHale and Ronan Sulpice wrote the manuscript. All authors reviewed the manuscript.

CONFLICT OF INTERESTS
The authors declare no conflict of interest.

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are openly available in the NCBI at https://www.ncbi.nlm.nih.gov/, reference numbers MT894471-MT895108. Scripts and pipeline are available in GitHub: https://github.com/FortAnt/BarcodeAnalysis.