Gaps in DNA sequence libraries for Macaronesian marine macroinvertebrates imply decades till completion and robust monitoring

DNA metabarcoding has great potential to improve biomonitoring in island marine ecosystems, which are highly vulnerable to global change and non-indigenous species (NIS) introductions. However, the depth and accuracy of taxonomic identifications depend mainly on reference libraries containing representative and reliable sequences for the targeted species. In this study, we evaluated the gaps in the availability of DNA sequences and their accuracy for macroinvertebrates inhabiting Macaronesia's shallow marine habitats. [...] of suspected hidden diversity further deepen the expected gaps and reinforce the vulnerability of this endemism-rich fauna.

KEYWORDS
biomonitoring, eDNA metabarcoding, Macaronesia, macrozoobenthos, non-indigenous species, reference libraries, species endemism

| INTRODUCTION
Although of great importance, the world's marine ecosystems and biodiversity are increasingly exposed to threats driven by global change, overuse of natural resources, habitat loss and invasion by non-indigenous species (NIS), among other disturbances (Barbier, 2017; Molinos et al., 2016). These disturbances can severely impact ecosystems around the world, creating the need for denser and more accurate biodiversity assessments and monitoring across the planet (Borja et al., 2020; Cardinale et al., 2012; Pereira et al., 2010).
Islands, which contain one-fifth of the world's biodiversity and a profusion of endemic species, are among the most threatened ecosystems (Kier et al., 2009; Lagabrielle et al., 2009). Endemic species often exhibit comparatively small population sizes with limited geographical distribution and habitat availability, making island biodiversity highly vulnerable to global change, particularly to the introduction of NIS (Vitousek, 1990). Macaronesia is a group of volcanic islands composed of five archipelagos (Azores, Madeira, Selvagens, Canaries and Cape Verde), located in the Northeast Atlantic Ocean (NEA), and was defined based on similarities in their flora and fauna (Fernández-Palacios et al., 2011). Macaronesia has a unique and rich biodiversity and is the geographical boundary of many species, in both the terrestrial (Arechavaleta et al., 2010; Borges et al., 2008, 2010) and marine realms (Borges et al., 2010). The conservation of this valuable diversity is complex, and protection programmes are already in place in some areas (e.g. under Natura 2000). However, further research and extensive biomonitoring programmes are still needed to assess more accurately which species are threatened and to provide a more holistic view of these ecosystems' present and changing status (Cacabelos et al., 2020; Iacarella et al., 2020). Thus, the strategic expansion of the network of protected areas and the effective allocation of resources for conservation are highly dependent on accurate and recurrent biodiversity assessments.
As mentioned above, one of the major threats to native island biodiversity is the introduction of NIS, and the Macaronesian archipelagos are no exception (Arechavaleta et al., 2010; Borges et al., 2008; Moro et al., 2003). When introduced to new areas, NIS can spread rapidly and become invasive, modify habitats, compete with native fauna for resources and threaten biodiversity (Bax et al., 2003; Rilov & Crooks, 2009). Their introduction, namely via ships/vessels, canals and aquaculture activities, can cause severe ecological, social and economic impacts (Lenzner et al., 2020; Rilov & Crooks, 2009; Seebens et al., 2013; Wallentinus & Nyberg, 2007). Worldwide, it is predicted that NIS will increase by one-third by 2050, with strong rises projected for Europe. To determine the state of NIS introductions and their impact on ecosystems, and to implement measures to prevent biodiversity loss, data compilation is mandatory, and to that end, regulations were created. Most of the marine waters of the NEA fall under the jurisdiction of the European Union and its member states, including those surrounding the Azores, Madeira, Selvagens and Canary Islands, and thus, they are targeted by the European Marine Strategy Framework Directive (EU-MSFD) (European Commission, 2008; Tsiamis et al., 2019). This directive includes assessment of NIS occurrence and has led to an increase in monitoring programmes and inventories in Europe over the last decades (e.g. Afonso et al., 2020; Chainho et al., 2015; Micael et al., 2014; Tsiamis et al., 2019).
Until recently, biodiversity assessments have been conducted almost exclusively through morphology-based species identifications. However, this approach has several drawbacks, being highly expertise-demanding and time-consuming, and delivering lower taxonomic resolution (Hering et al., 2018; Leese et al., 2016, 2018). With the exponential rise in the power of both DNA sequencers and computational technology, molecular techniques constitute an effective alternative or complement to morphology-based identifications, in particular DNA barcoding (short standardized DNA sequences amplified from a single specimen and used for species identification) and DNA metabarcoding. In the latter, DNA is extracted from bulk organismal samples or directly from the environmental sample matrix such as seawater or sediment (in this case, designated as "environmental DNA" or eDNA). Subsequently, amplicon libraries for target gene regions are generated, high-throughput sequenced and compared to reference sequences to deliver a taxonomic identification (Duarte, Leite, et al., 2021; Fais et al., 2020; Leese et al., 2016; Steyaert et al., 2020). DNA metabarcoding offers potential benefits over morphological assessments, such as (i) increased sensitivity, (ii) discrimination of cryptic species, (iii) identification of species regardless of the life stage (e.g. eggs, larvae), (iv) assessments covering a wide range of taxa, and (v) high-throughput assessments leading to a higher spatial-temporal density of taxa occurrence data (Holman et al., 2019; Leduc et al., 2019; Schroeder et al., 2020; Suarez-Menendez et al., 2020). The taxonomic composition of hundreds of samples can be assessed quickly and at a relatively low cost, facilitating the implementation of more extensive monitoring programmes and providing a more comprehensive view of the present and changing status of island ecosystems. This fast and reliable approach can be highly efficient for the early detection of NIS (Schroeder et al., 2020; Zaiko et al., 2015). Major caveats still include the inability to quantify species abundances and distinguish between life stages (Duarte, Vieira, et al., 2021). The depth and accuracy of DNA metabarcoding-based identifications depend mainly on the availability of reference libraries containing representative and accurate sequences for the targeted species. The existence of gaps and unequal representation of taxonomic groups in reference databases may compromise the accuracy of DNA-based biodiversity assessments (Ardura, 2019; Duarte et al., 2020; Leite et al., 2020; Weigand et al., 2019). Thus, assessing these gaps and the quality of sequence data in reference databases is mandatory for the successful implementation of DNA-based tools in biodiversity assessments.
In this study, we evaluated the gaps in the availability of DNA sequence data and their accuracy to assess macroinvertebrate diversity through DNA-based tools in Macaronesia's shallow marine habitats. As the DNA barcode region (cytochrome c oxidase subunit I, COI) and the gene encoding the nuclear 18S rRNA (18S) have been the most widely used genetic markers in metabarcoding studies targeting marine invertebrates, including NIS (Duarte, Leite, et al., 2021; Duarte, Vieira, et al., 2021), sequence availability was assessed for both. The incorporation of cutting-edge biomonitoring tools is essential for the efficient management of island biodiversity and for developing mitigation strategies to deal with increasing environmental change in these highly vulnerable ecosystems.

| Checklist compilation
The four European Macaronesian archipelagos were used in this study: Azores, Madeira, Selvagens and Canaries (Figure 1). As recent studies based on marine biota suggest that Cape Verde's community structure and biogeographic relationships differ significantly from the remaining Macaronesian islands (Cunha et al., 2014; Freitas et al., 2019; Wirtz et al., 2013), we opted not to include this archipelago in our analysis. A list of native species (marine invertebrates) was compiled (14 May 2020) based on the GBIF (gbif.org, 2020) and WoRMS (WoRMS Editorial Board, 2020) databases. [...] −16.13777 29.87343, −15.77614 29.87343, −15.77614 30.22865, −16.13777 30.22865, −16.13777 29.87343). Data were mined from WoRMS using the geounits for the Azores: "Azores (Archipelago)," for the Canaries: "Canaries (Archipelago)," for Madeira: "Madeira (island)" plus "Porto Santo (island)" and for Selvagens: "Selvagens (Archipelago)." The search was limited to marine and extant animal species, and only valid names were accepted. Species with the annotation "alien" were added to the list of NIS; see below. Then, the lists obtained from GBIF and WoRMS were merged, duplicate entries were removed, and only marine invertebrates were accepted (taxonomy was confirmed on WoRMS).
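The merging step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' script: the function name, the input shapes and the species names are hypothetical, and the sketch leaves out the WoRMS taxonomy check and the restriction to marine invertebrates.

```python
def merge_checklists(gbif_names, worms_records):
    """Merge GBIF and WoRMS species lists into native and NIS lists.

    gbif_names: iterable of species names (strings).
    worms_records: iterable of (name, annotation) pairs, where the
    annotation "alien" marks a non-indigenous species (NIS).
    Both input shapes are hypothetical, chosen for this illustration.
    """
    def norm(name):
        # Collapse whitespace and standardise case to "Genus species".
        return " ".join(name.split()).capitalize()

    # Species annotated as "alien" go to the NIS list.
    nis = {norm(n) for n, note in worms_records if note == "alien"}
    # Building sets removes duplicate entries across both sources.
    everything = {norm(n) for n in gbif_names}
    everything |= {norm(n) for n, _ in worms_records}
    native = everything - nis
    return sorted(native), sorted(nis)


# Made-up placeholder names, not real taxa.
native, nis = merge_checklists(
    ["Examplea nativa", "examplea  nativa", "Examplea aliena"],
    [("Examplea nativa", ""), ("Examplea aliena", "alien")],
)
```

Normalising names before building the sets is what lets spelling variants of the same binomial collapse into one entry; names with subgenera or authorities would need a more careful parser.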

| Genetic data mining and analyses
For each list (native and NIS), COI and 18S genetic data were mined from BOLD (Ratnasingham & Hebert, 2007) and GenBank (Sayers et al., 2019) using the R 3.6.0 software (R Core Team, 2019; www.r-project.org) with the packages "bold" (Chamberlain, 2019) and "rentrez" (Winter, 2017), respectively. In BOLD, the following terms were used to filter the sequences: for COI, "COI-5P"; for 18S, "18S" or "18Sa". In GenBank, the terms used were as follows: for COI- [...] Only sequences with more than 500 base pairs were considered, as this is the minimum length required for a sequence to meet Barcode Compliance standards (Ratnasingham & Hebert, 2007). [...] All GenBank accession numbers detected in BOLD were then manually confirmed on GenBank to double-check the duplicated status.
The geographic origin of specimens and year of submission of the sequences were verified through BOLD metadata.
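The length filter and the removal of records duplicated between BOLD and GenBank can be illustrated with a short Python sketch. The study itself used R; the record layout (dictionary keys "species", "sequence" and "accession") and the function name are assumptions made for this example only.

```python
MIN_LENGTH = 500  # sequences must be longer than 500 bp (from the text)


def combine_records(bold_records, genbank_records, min_length=MIN_LENGTH):
    """Filter by length and drop BOLD records already counted in GenBank.

    Each record is a dict with hypothetical keys: "species", "sequence"
    and "accession" (the GenBank accession, or None when a BOLD record
    has no GenBank counterpart).
    """
    def long_enough(rec):
        # Count bases only; alignment gaps ("-") are ignored.
        return len(rec["sequence"].replace("-", "")) > min_length

    genbank = [r for r in genbank_records if long_enough(r)]
    seen = {r["accession"] for r in genbank}
    bold_only = [
        r for r in bold_records
        if long_enough(r) and r["accession"] not in seen
    ]
    return genbank + bold_only


# Made-up records: one duplicate, one short GenBank entry, two BOLD-only.
gb = [
    {"species": "A", "sequence": "A" * 600, "accession": "X1"},
    {"species": "B", "sequence": "A" * 400, "accession": "X2"},
]
bold = [
    {"species": "A", "sequence": "A" * 600, "accession": "X1"},
    {"species": "C", "sequence": "A" * 501, "accession": None},
    {"species": "D", "sequence": "A" * 500, "accession": None},
]
merged = combine_records(bold, gb)
```

Matching on accession number mirrors the duplicate check described in the text, which the authors additionally confirmed by hand on GenBank.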
The number of Barcode Index Numbers (BINs) (Ratnasingham & Hebert, 2013) for each species within each taxon was retrieved from BOLD, based on the COI marker. Then, to verify the reliability of the genetic data for each species, the auditing and grading software BAGS was used (Fontes et al., 2020; https://github.com/tadeu95/BAGS). This tool relies on COI, the BIN system and the number of records to annotate and grade species according to the quality of their available public sequences. Grade A and grade B are considered concordant (one species = one BIN), grade C indicates multiple BINs for a given species (one species = two or more BINs), grade D is insufficient data (less than three records), and grade E indicates discordances, that is, more than one species is assigned to a single BIN (two or more species = one BIN). More details can be found in Fontes et al. (2020). All the scripts used in this study can be consulted at https://github.com/pedroemanuelvieira/NGB_Macaronesia.

[...] Table S1). Madeira was the archipelago with the highest percentage of NIS (9.5% of the total number of species), while Azores and Canaries displayed similar percentages (4.6 and 4.5%, respectively). No NIS were reported to occur in Selvagens (Table S1).

| Taxonomic composition
Mollusca was the best-represented phylum in the Azores, Madeira and Canaries (31.9% to 63.0% of the total number of species) (Figure 2; Table S1). Other dominant phyla in these regions included Arthropoda (12.1% to 31.0%) and Cnidaria (8.4% to 15.3%).

| Gap analysis and grading system
More records were found on GenBank than on BOLD, with more sequences available for COI than for 18S (Tables 1, S2, S3). When merging the information of both databases and excluding duplicated records, the Azores was the archipelago with the highest percentage of native species (COI: 46.8% and 18S: 30.6%) and NIS (COI: 72.6% and 18S: 54.8%) represented in the genetic databases, while Madeira displayed the lowest, for both native species (COI: 39.8% and 18S: 26.5%) and NIS (COI: 61.5% and 18S: 40.4%). In general, a higher percentage of NIS (65.4%-79.0%) than of native species (38.1%-51.6%) had sequences for at least one of the genetic markers (COI or 18S). However, the percentage of species with both genetic markers was much lower, for both native species (17.8%-25.8%) and NIS (36.5%-52.8%).
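The four coverage figures reported above (COI, 18S, at least one marker, both markers) are simple set operations on a species checklist. A minimal Python sketch follows, with a hypothetical function name and made-up species labels; it is not the authors' code.

```python
def marker_coverage(checklist, coi_species, s18_species):
    """Return the percentage of checklist species with sequence data.

    Assumes a non-empty checklist; species outside the checklist
    are ignored.
    """
    species = set(checklist)
    coi = species & set(coi_species)
    s18 = species & set(s18_species)

    def pct(subset):
        return round(100 * len(subset) / len(species), 1)

    return {
        "COI": pct(coi),
        "18S": pct(s18),
        "at_least_one": pct(coi | s18),  # union of both markers
        "both": pct(coi & s18),          # intersection of both markers
    }


# Ten placeholder species; "other" is not on the checklist.
checklist = [f"sp{i}" for i in range(10)]
coverage = marker_coverage(
    checklist,
    coi_species=["sp0", "sp1", "sp2", "sp3", "sp4"],
    s18_species=["sp3", "sp4", "sp5", "other"],
)
```

The union is never smaller than either single-marker figure and the intersection is never larger, which is the pattern the reported ranges follow.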
In the Azores, Phoronida was the phylum with the highest coverage for COI (100.0%), followed by Arthropoda (61.5%) and Annelida [...]

Considering all archipelagos, the grading system BAGS classified a higher percentage of NIS (~50%) as discordant (grade E) when compared with native species (~41%) (Figure 4). In general, around one-quarter of the species had insufficient records (grade D), and less than 10% were concordant species (grades A and B). More cases of multiple BINs (grade C) were detected in native species (between 12.6% and 36.4%) than in NIS (between 7.7% and 18.5%) (Figure 4). When excluding discordant and insufficient cases (grades D and E), for the native species, all phyla displayed more BINs than species (except Porifera), which for NIS was only observed in half of the phyla (Figure S3). For native species, Arthropoda (228) and Mollusca (128) displayed the highest numbers of BINs, but Cnidaria had the highest BIN/species ratio, with five times more BINs than species. For NIS, Chordata displayed the highest number of BINs (9), with almost three times the number of BINs per species.
More than 40% of the native species displayed two or more BINs, with nine species displaying six or more BINs, while one-third of NIS were single BINs, with only two species having more than two BINs (Figure S3).
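The grade definitions used in this section can be written as a small decision rule. The Python sketch below is a simplified illustration, not the BAGS implementation: grades A and B are collapsed into one concordant class because the text does not state what separates them, and the order in which the rules are checked (E, then D, then C) is an assumption.

```python
def grade_species(species, records, min_records=3):
    """Assign a BAGS-style grade to one species.

    records: list of (species_name, bin_id) pairs for the whole library
    (a hypothetical layout). Returns "A/B", "C", "D" or "E".
    """
    own_bins = {b for s, b in records if s == species}
    n_records = sum(1 for s, _ in records if s == species)
    # All species that occur in any BIN used by the focal species.
    sharing = {s for s, b in records if b in own_bins}

    if len(sharing) > 1:
        return "E"    # discordant: two or more species in one BIN
    if n_records < min_records:
        return "D"    # insufficient data: fewer than three records
    if len(own_bins) > 1:
        return "C"    # multiple BINs for a single species
    return "A/B"      # concordant: one species = one BIN


# Made-up library covering each outcome.
library = (
    [("sp1", "BIN1")] * 3
    + [("sp2", "BIN2"), ("sp2", "BIN2"), ("sp2", "BIN3")]
    + [("sp3", "BIN4")]
    + [("sp4", "BIN5")] * 3
    + [("sp5", "BIN5")] * 3
)
grades = {sp: grade_species(sp, library) for sp in ["sp1", "sp2", "sp3", "sp4"]}
```

Fontes et al. (2020) give the full criteria, including the distinction between grades A and B that this sketch leaves out.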

| DISCUSSION
This study yielded four main findings: (a) reference DNA sequence libraries are still highly incomplete for [...] suggests that some decades will be needed to complete the reference libraries for marine macroinvertebrates.

Despite the contribution of several studies to completing Macaronesia's macrozoobenthos reference libraries (Borges et al., 2016; Gargan et al., 2017; Gomes, 2014; Luz & Keskin, 2019; Silva et al., 2011; Valdés, 2017), we found that their taxonomic coverage is still incipient compared with the diversity of the region. Besides, we found that different archipelagos and taxonomic groups display different degrees of completeness (Figure 3; Tables S2, S3). Despite having more species than other archipelagos (Figure 2), the Azores had the highest percentage of species sequenced (47% for native and 73% for NIS) and Madeira the lowest (40% for native and 62% for NIS). Consistently, a higher number of records were found on GenBank than on BOLD. However, although GenBank contains reference sequences from many different genetic markers and includes all domains of life, it is more prone to errors than BOLD as it contains many non-curated data entries (López-Escardó et al., 2018).
NIS displayed higher levels of completion in all archipelagos (Table 1). These species are generally the focus of a greater number of studies due to their high impact on the environment and thus may experience a higher rate of sequence deposition in genetic databases (Briski et al., 2016; Pyšek et al., 2008; Trebitz et al., 2015). The number of NIS is also much smaller than that of native species; therefore, levels of com- [...]

DNA-based biodiversity assessments in NEA have been limited by poor taxonomic coverage of genetic databases (Hestetun et al., 2020). These limitations are widespread across Europe (Leite et al., 2020; Weigand et al., 2019), which led to the creation of national (Price et al., 2020) and international initiatives aiming to fill the reference libraries for aquatic biota (Leese et al., 2016). Nearby coasts that share many species with Macaronesia, such as the Iberian Peninsula, still have 60% of the spe- [...] is only slightly lower than the one here reported for Macaronesia (globally 63% merging BOLD and GenBank data), we must keep in mind that most of the sequenced specimens were not collected in Macaronesia. Because several studies indicate the occurrence of highly divergent lineages in Macaronesia, to the point of segregating into separate endemic lineages (Vieira, Desiderato, Holdich, et al., 2019; Xavier et al., 2010), various morphospecies may escape DNA-based detection even if they are present in reference libraries. Considering this possibility, we suspect that these completion levels for Macaronesia may be somewhat overestimated, though by how much is still unknown. Therefore, reference libraries must include specimens collected locally.
We also found significant differences between COI and 18S completeness (Tables 1, S2, S3). Despite many gaps in the COI library, 18S still falls behind, and more work should be conducted in populating non-COI reference libraries. If only species sequenced for both markers are considered, these values decrease noticeably (Table 1). This may be a relevant limitation to efficiently detecting some species, as several studies suggest that some taxonomic groups are preferentially detected by different markers and primers (Lacoursière-Roussel et al., 2018; Leduc et al., 2019; Leite et al., 2019). Considering this, it has been argued that DNA metabarcoding, either to detect native species or NIS, should rely on more than one genetic region to ensure detection of the widest possible spectrum of taxa (Duarte, Leite, et al., 2021; Stat et al., 2017).
More than one-third of the species still display discordant records, with higher percentages in NIS than native species (Figure 4).
Incongruencies should be carefully examined to detect the sources of conflict (e.g. misidentifications, incomplete taxonomy or sequences that were deposited under different synonyms) and subsequently curated, so that DNA-based tools can reliably identify these species in bulk or environmental samples. Discordant records raise mistrust because erroneous observations derived from them may easily remain undetected through unsupervised taxonomic assignments of metabarcoding data and quickly propagate across studies.
As such, quality control and quality assurance tools must be implemented to audit and curate reference libraries (Fontes et al., 2020; Leese et al., 2018; Weigand et al., 2019), as the reliability of the reference sequences is at least as essential as their availability.
When considering only concordant species records assigned to multiple BINs, approximately 20% of native species and 10% of NIS fell under this condition (Figure 4). From a taxonomic perspective, specific phyla displayed one to five times more BINs than barcoded species (Figure S3). Therefore, it appears that a very high proportion of species from Macaronesia may incorporate undescribed or cryptic diversity. Indeed, several recent studies report the high incidence of deeply divergent endemic lineages in Macaronesia (Tavares et al., 2017; Vieira, Desiderato, Azevedo, et al., 2019; Xavier et al., 2010). Most of these highly divergent lineages have restricted distributions, frequently even limited to a single island, which makes them potentially more susceptible to global change and NIS impacts, thereby constituting a prime target for conservation measures. DNA-based approaches detect molecular entities (molecular operational taxonomic units, MOTUs), and it is important to connect the different MOTUs to their occurrence in each island/archipelago, as some may be endangered lineages or endemic species, which may only be diagnosed through DNA-based methods. Hence, it becomes imperative to generate more sequence records of specimens collected in the Macaronesian archipelagos.
Although the number of species with COI sequences available on BOLD has been increasing over the last twenty years, fewer than half of the native species present in these Macaronesian archipelagos have sequences available so far. Excluding species discovery or extinction, we estimate it will take another twenty to thirty years to exhaustively complete the reference libraries of DNA barcodes for the species present in these islands, if the rate of production of COI sequences is sustained. However, these projections are probably a best-case scenario, as they do not consider the expected difficulty of obtaining specimens of rarer species. Moreover, these projections do not contemplate the predicted growth of NIS introductions due to the increase in maritime traffic and the absence of legislation to prevent the involuntary transport of these species in hull fouling. More likely, it will take even longer to complete the DNA barcode libraries of all marine invertebrates present in Macaronesia (Figure 5). Many studies based on DNA metabarcoding of marine taxa may also generate sequences that potentially match species still unavailable in the libraries, but these will remain unidentified until matching sequences are finally deposited in reference databases.
Regarding coastal area and number of islands, the Azores and Canaries are the most extensive archipelagos and held the highest numbers of NIS and native species of marine macroinvertebrate fauna compiled in the current study (Figure 1) [...] Figure S1). Madeira displayed the highest percentage of NIS (NIS/total number of species ratio), particularly in Arthropoda and Mollusca, which is also supported by recent literature that considers this archipelago highly impacted by bioinvasions (Bailey et al., 2020). However, we cannot discard the possibility that the highest percentage found in this region is biased by the greater effort employed in conducting NIS-focused studies in Madeira (Canning-Clode et al., 2013; Parretti et al., 2020; Ramalhosa et al., 2014, 2019). To the best of our knowledge, no NIS have been reported in Selvagens. Being a tiny remote archipelago of difficult access and with no permanent human population, it is probably less susceptible to NIS introductions, but, for the same reasons, an updated assessment of NIS may also be more challenging to accomplish.

[...] for cryptic taxa and ultimately be more responsive to environmental management needs while also enabling the early detection of NIS.

| CONCLUSIONS
Santos et al. (2016) "advocate a continuing effort to build comprehensive island data for multiple taxa, to serve the wider scientific community in the coming decades." We extend this plea, as current rates of accretion of reference DNA sequence data for Macaronesia are too slow to materialize the benefits of DNA-based monitoring for enhancing biodiversity conservation efforts in this region.
By our predictions, completeness will be accomplished only after 2040, considering the current rate of accretion of 1.9%-2.9% per year.
Researchers must, at least, triple the current efforts if this goal is to be achieved in the next decade. To this end, initiatives such as "BIOSCAN" (Hobern, 2020), which involves more than 1,000 researchers from over 30 countries and aims to generate barcode coverage for 2.5 million species, may be decisive to fill up the gaps across the planet.
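The arithmetic behind these projections can be reproduced with a short calculation. The Python sketch below rests on two assumptions that are not stated in the text: that the accretion rate is expressed in percentage points of the checklist per year and stays constant, and that current coverage is 45%, an illustrative value standing in for "fewer than half".

```python
def years_to_complete(current_pct, rate_pct_per_year, target_pct=100.0):
    """Years needed to reach the target at a constant linear rate."""
    return (target_pct - current_pct) / rate_pct_per_year


coverage = 45.0                 # illustrative starting coverage, in percent
low_rate, high_rate = 1.9, 2.9  # accretion rates from the text, % per year

slow = years_to_complete(coverage, low_rate)
fast = years_to_complete(coverage, high_rate)
# Tripling the effort is modelled as tripling the yearly rate.
tripled_slow = years_to_complete(coverage, 3 * low_rate)
tripled_fast = years_to_complete(coverage, 3 * high_rate)

print(f"current effort: {fast:.0f} to {slow:.0f} years")
print(f"tripled effort: {tripled_fast:.0f} to {tripled_slow:.0f} years")
```

Under these assumptions the current rates give roughly 19 to 29 years, and tripled rates roughly 6 to 10 years, in line with the two- to three-decade horizon and the "next decade" target discussed in the text. A real projection would also need to account for newly arriving NIS and the growing difficulty of sampling rare species.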
Robust monitoring will allow a more comprehensive view of the status of the island populations, helping to mitigate the ongoing pressures (e.g. climate change, fisheries) these populations experience and, therefore, contributing to preserve the invaluable ecosystem services these islands provide. If this goal cannot be reached due to lack of taxonomic expertise, sampling bottlenecks (e.g. inability to get specimens from rare species) and the high levels of cryptic and endemic diversity that are expected, other approaches based on reverse taxonomy, MOTUs/BINs or taxonomy-free methods (Cordier et al., 2017, 2018; Ratnasingham & Hebert, 2013; Weigand et al., 2019) may be an option, although far from ideal. MOTUs/BINs can be used provisionally and associated with an identification to the lowest possible rank, but always with the final goal of eventually reaching a true identification and recognition of species. If the intention is to use DNA-based tools to detect non-indigenous species, then identifications at the species level are mandatory, and consequently, populating reference libraries with DNA barcodes becomes paramount.

ACKNOWLEDGEMENTS
We would like to thank S. L. Azevedo for the feedback and suggestions on the figures. This work was supported by the "Contrato-Programa"

PEER REVIEW
The peer review history for this article is available at https://publons.com/publon/10.1111/ddi.13305.

DATA AVAILABILITY STATEMENT
DNA sequences, raw files and scripts can be found at https://github.com/pedroemanuelvieira/NGB_Macaronesia or https://doi.org/10.5061/dryad.sf7m0cg63.