Biodiversity assessment of tropical shelf eukaryotic communities via pelagic eDNA metabarcoding

Abstract Our understanding of marine communities and their functions in an ecosystem relies on the ability to detect and monitor species distributions and abundances. Currently, the use of environmental DNA (eDNA) metabarcoding is increasingly being applied for the rapid assessment and monitoring of aquatic species. Most eDNA metabarcoding studies have either focussed on the simultaneous identification of a few specific taxa/groups or have been limited in geographical scope. Here, we employed eDNA metabarcoding to compare beta diversity patterns of complex pelagic marine communities in tropical coastal shelf habitats spanning the whole Caribbean Sea. We screened 68 water samples using a universal eukaryotic COI barcode region and detected highly diverse communities, which varied significantly among locations, and proved good descriptors of habitat type and environmental conditions. Less than 15% of eukaryotic taxa were assigned to metazoans, most DNA sequences belonged to a variety of planktonic “protists,” with over 50% of taxa unassigned at the phylum level, suggesting that the sampled communities host an astonishing amount of micro‐eukaryotic diversity yet undescribed or absent from COI reference databases. Although such a predominance of micro‐eukaryotes severely reduces the efficiency of universal COI markers to investigate vertebrate and other metazoans from aqueous eDNA, the study contributes to the advancement of rapid biomonitoring methods and brings us closer to a full inventory of extant marine biodiversity.

There are many potential benefits that whole-community metabarcoding of eukaryotic marine eDNA, using multiple or even single-assay approaches, could bring to biodiversity assessment and monitoring, such as using direct measurements of biodiversity, instead of relying on biodiversity indicators (Aylagas, Borja, Irigoien, & Rodríguez-Ezpeleta, 2016;Djurhuus et al., 2017;Lindenmayer & Likens, 2011;Rees, Maddison, Middleditch, Patmore, & Gough, 2014). Environmental DNA analysis can also be used for the detection of "hidden diversity," without a priori knowledge of the composition of species assemblages in a particular water body (Boussarie et al., 2018;Lindeque, Parry, Harmer, Somerfield, & Atkinson, 2013). As such, eDNA can be a powerful tool in the early detection of alien species (Zaiko, Samuiloviene, Ardura, & Garcia-Vazquez, 2015) and of community changes in response to environmental disturbances or regime shifts (Bik, Halanych, Sharma, & Thomas, 2012;Bucklin, Lindeque, Rodriguez-Ezpeleta, Albaina, & Lehtiniemi, 2016). Gains in cost-effectiveness, reproducibility, comprehensiveness, and the potential for multiple trophic levels to be evaluated simultaneously also make eDNA an attractive tool for large-scale monitoring of biodiversity trends (Bourlat et al., 2013).
Few recent studies have employed the mitochondrial cytochrome c oxidase subunit I (COI) marker region to offset the limited taxonomic resolution of ribosomal genes Lacoursière-Roussel et al., 2018;Leray & Knowlton, 2017). The COI "barcode" region (Hebert, Ratnasingham, & deWaard, 2003) is one of the most commonly used DNA fragments for the analysis of species diversity among marine animals (Bucklin, Steinke, & Blanco-Bercial, 2011). While the use of COI as a metabarcoding marker has been criticized, arguing that the high rates of sequence variability impair the design of truly universal primers and hamper the bioinformatic analysis (Deagle et al., 2014), the high mutation rate of COI may ensure unequivocal identification at the species level across a vast majority of taxa. Moreover, no other genetic region is currently represented in taxonomically verified databases to the same extent as the COI barcode region (Andújar, Arribas, Yu, Vogler, & Emerson, 2018).
Although no "gold-standard" truly universal metabarcoding primer set has been identified for highly variable markers such as COI (Coissac, Riaz, & Puillandre, 2012;Deagle et al., 2014;Riaz et al., 2011), it has been shown that the taxonomic coverage and resolution provided by degenerate COI primers (primer sets that have one or more degenerate positions incorporated in either one or both of the forward and reverse primers), indeed make them valuable metabarcoding markers for biodiversity assessment (Clarke, Beard, Swadling, & Deagle, 2017;Elbrecht & Leese, 2016). Most recently, a degenerated version of the established COI internal primer set (Leray et al., 2013), amplifying a 313 bp region, has been described . This primer set (henceforth referred to as "Leray-XT") features a high number of degenerate positions, including five deoxyinosine nucleotides (a nucleotide that complements any of the four natural bases) in the fully degenerated sites of the sequence, enhancing universality in the amplification of the COI fragment in most eukaryotic groups. Compared to 18S primers, these primers have been shown to reveal greater biodiversity at the species level when applied to the same samples .
Even so, a number of questions remain unanswered: (a) What portions and segments of the total eukaryotic biodiversity present in tropical seawater is detectable through universal COI metabarcoding? (b) Is the method of using one universal metabarcoding marker powerful enough to distinguish between geographic regions and habitat types? (c) Are we close to introducing this approach to an applied, operational biomonitoring context? Here, in an attempt to answer these questions, we aimed to assess the potential for describing marine eukaryotic biodiversity from five distinct areas in the Caribbean, using the Leray-XT primer set on eDNA obtained from taxonomically complex water samples. We evaluate the potential and scope of this primer set for the characterization of eukaryotic community structure, profiling of biodiversity, and for the assessment of spatial patterns between locations and habitats.

| Water sampling
Within each of five Caribbean sampling locations (Figure 1)

| Sample processing and DNA extraction
After collection, water samples were individually covered and stored in the dark and on ice until further processing. Vacuum filtration was carried out within 2 hr of collection. The sterile mixed cellulose esters (MCE) filters (Merck Millipore; 47 mm diameter; 0.45 µm pore size) containing sample filtrates were stored in 2-ml screw-cap tubes containing silica beads. The silica beads function as a desiccator, drying out the filters and preventing the DNA from degrading (Bakker et al., 2017). The advantages of using silica beads instead of a liquid for DNA preservation include improved filter preservation, the prevention of leakages and the complications related to shipping/traveling with flammables (e.g., ethanol).
The sample filters were subsequently stored, for approximately one month, at −20°C until extraction. DNA was extracted from the filters with the DNeasy PowerSoil DNA Isolation Kit (Qiagen), following the manufacturers' protocol. Purified extracts were assessed for DNA concentration in a Qubit fluorometer (Thermo Fisher Scientific).

| Contamination control
Contamination of samples may occur anywhere from preparing sampling equipment and collecting the samples in the field (target DNA being carried unintentionally from one locality to another), to every subsequent step of sample preparation, extraction, and analysis in the laboratory. Hence, strict adherence to contamination control was followed at all field and laboratory stages in order to prevent the occurrence of contamination, including the use of disposable gloves and single-use sterile collection bottles F I G U R E 1 Map of Caribbean sampling locations. Bahamas, Belize, British Virgin Islands, Jamaica, and Turks and Caicos and filtration equipment, and the bleaching (50% bleach) of sampling devices and laboratory equipment and surfaces. Additionally, a dedicated controlled eDNA laboratory at the University of Salford, with separate rooms designated for the physical separation of eDNA extraction, pre-PCR preparations, and post-PCR procedures, was used for all laboratory work. Moreover, to identify potential contamination, negative field samples (filtration of drinking water purchased at local supermarkets), DNA extraction blanks (elution buffer from extraction kit), and PCR blanks (laboratory grade water) were included.

| Library preparation and sequencing
We used a highly degenerated primer set, amplifying a 313-bp segment from the COI region, which comprised the reverse primer jgHCO2198 5′-TAIACYTCIGGRTGICCRAARAAYCA-3′ (Geller, Meyer, Parker, & Hawk, 2013), and a modified forward primer mlCOIintF-XT 5′-GGWACWRGWTGRACWITITAYCCYCC-3′ . Samples were multiplexed by using primers with attached 8-base sample-specific oligo-tags differing in at least 3 bases (Guardiola et al., 2015). In order to increase variability of the amplicon sequences, a variable number (2, 3, or 4) of fully degenerate positions (Ns) were added at the beginning of each primer (Wangensteen & Turon, 2017

| Bioinformatic and statistical analysis
The bioinformatic analysis was based on the OBITools metabarcoding software suite (Boyer et al., 2016). The pipeline used for data analysis is summarized in Table S2. Paired-end reads were aligned using illuminapairedend, retaining alignments with a quality score >30. The aligned dataset was demultiplexed, and primer sequences were removed with ngsfilter. A length filter (obigrep) was applied to the assigned reads in order to select only the fragments with the correct target size (300-320 bp). Reads containing ambiguous bases were also removed. The reads were subsequently dereplicated using obiuniq, grouping all the identical sequences, while keeping track of their abundances. A chimera removal step was performed using the uchime-denovo algorithm ( of the taxonomic tree, which is particularly pertinent with highly biodiverse samples such as those analyzed in this study. Taxonomic assignment of the representative sequences for each MOTU was performed using the ecotag algorithm (Boyer et al., 2016), which uses a custom reference database and a phylogenetic approach for assigning unmatched sequences to the last common ancestor of the most closely related sequences in the reference database. Ecotag infers a reference set of sequences including the best match, and other sequences present in the reference database, which are equally or more similar to the best match than the query sequence. Then, the query is assigned to the lowest rank taxon including all sequences in the reference set. As a result, taxonomy assignment by ecotag will yield different taxonomic ranks with different levels of uncertainty for different branches of the tree of life, depending on the density of the reference database for each branch. Thus, with each detected taxon, the percentage of identity with the reference sequence (%ID) is given as a measure of accuracy of the taxonomic identification. The custom COI reference database (db_COI_MBPK) contains 191.295 eukaryote sequences , retrieved from the BOLD database (Ratnasingham & Hebert, 2007) and the EMBL repository (Kulikova, 2004), and is available from http://github.com/ metab arpar k/Refer ence-datab ases. After taxonomic assignment, putative pseudogene sequences were removed using LULU (Frøslev et al., 2017) to obtain reliable MOTU richness estimations. Finally, the dataset was refined by taxonomy clustering of MOTUs assigned to the same species, and minimal abundance filtering; all MOTUs with <2 reads were discarded.
All statistical analyses were performed in R v 3.3.0 (https ://www.R-proje ct.org/). Sample groups were represented in nMDS diagrams (function isoMDS) with Bray-Curtis dissimilarities, using the package MASS (Venables & Ripley, 2010). Using the adonis function in Vegan (v 2.5-1; Oksanen et al., 2016), group distances between and within locations were formally tested with PERMANOVA, using fourth-root transformed relative abundances. SIMPER analyses were also performed with the package Vegan, to identify the MOTUs that contribute most to the differentiation between the sampling locations.  Figure S1) show that saturation in the total number of MOTUs is not achieved with this sequencing depth for most samples, indicating that the true extent of the MOTU richness assessed using this method may be higher than the values obtained in this study.

| Sequence read abundance and MOTU richness
The negative controls contained a negligible number of reads, all of which were classified as human DNA, and hence were excluded from further analyses. Table 1

| Community structure
The ordination of the sampling locations is visualized by a nonmetric multidimensional scaling (nMDS) plot ( Figure 3). Pair-wise For each of the five sampling locations, individual nMDS plots are displayed in Figure 4, showing the ordination patterns within areas.
In the Bahamas (Figure 4a), the two samples with the largest number of dinoflagellate reads (see also  In Belize, samples were collected from both back-and fore-reef sites, which is reflected in its ordination plot (Figure 4b). Here, the samples are converged into separate fore-and back-reef ellipses (F = 3.54, df = 1, p < .05). The back-reef samples are characterized by higher numbers of dinoflagellate, Bacillariophyta, Viridiplantae, and chordate (tunicate) reads, while the samples from the fore reef contain more "unassigned Eukarya," Oomycetes and Cnidaria reads ( Figure 2b). SIMPER analysis (Data S1) indicates that the unknown eukaryote MOTU 703 is responsible for almost 23% of the differentiation between the back-and fore-reef samples, being more abundant in the fore-reef sites. Additionally, the same unknown dinoflagellate sequences (sequence number 152) as in the Bahamas are the second most important MOTU, occurring more abundantly in the back-reef samples. The additional differentiation within the back-reef ellipse can be explained by the fact that these samples were collected from two different back-reef sites (Table S1).
The sampling sites in the BVI are separated into four different groups (Figure 4c). The samples that were collected directly from the shore (without the use of a boat, n = 4) are most different from the samples that were collected from reef sites (F = 1.56, df = 3, p < .01).

SIMPER analysis indicates that both the unknown eukaryote 703
and dinoflagellate MOTUs contribute most to the dissimilarity in MOTU diversity between the shore and the reef sites. Together with an unidentified mollusc MOTU, they contribute to almost 34% of the total dissimilarity (Data S1).
In Jamaica, samples were collected from two distinct sites, Discovery Bay and Montego Bay, which are grouped separately in  diversity. These results are in line with a recent study where 18S ribosomal DNA was used to assess eukaryotic diversity in photic-zone plankton communities worldwide. It was estimated that eukaryotic ribosomal diversity saturated at ~150,000 operational taxonomic units, of which one-third could not be assigned to known eukaryotic groups . As both the sample size and sequencing depth from de  are much larger compared to the present study, the spectrum of species diversity detected is also much greater (about 10 times). Additionally, instead of water samples, plankton nets were used to concentrate plankton density in each sample in order to recover complete local eukaryotic biodiversity . However, the 18S rDNA gene has limited taxonomic resolution power compared to COI used in this study (albeit with a smaller sample size). Therefore, it is very likely that for both studies, the level of species richness is still highly conservative, and could see a significant increase when a combination of high-resolution primers and large sample size will be used.

| D ISCUSS I ON
Compared to more targeted metabarcoding applications, the marker used in this study allows coverage of the largest taxonomic breadth and the most powerful taxon delineation for any one marker.
This results in the paradox of uncovering a substantial amount of diversity, much of which remains hidden. Irrespective of the level of identification, abundant, putatively ecologically important sequences are included in the overall biodiversity analysis, and the effect of these important taxa on β diversity may still be evaluated, even when these sequences cannot currently be taxonomically assigned to a known morphological group or species. Much of this diversity would remain undetectable if taxonomic assignment was solely based on a high percentage identity match. This is abundantly highlighted by the fact that SIMPER analyses (Data S1) suggest that currently unidentifiable MOTUs may actually be largely responsible for the differentiation of certain communities. Of those analyzed in this project, 50% of the differentiation between the different locations/communities is caused by less than 30 MOTUs, of which most have a best ID of <90% with a reference sequence present in the representative sequence of this MOTU is sequence number 703 to appear in the MOTU database, which indicates that it is highly abundant (608.941 reads) and is thus likely to be a relatively important component of the sampled communities. This novel sequence may either belong to a yet completely undescribed species, or to a species that has already been described morphologically and/or genetically identified with other markers, but for which the COI marker has yet to be sequenced. The ecological importance of this, and other undescribed MOTUs, to these areas is unknown but should not be overlooked.
A key group in our dataset that is defined by many unidentified MOTUs is the "Other Protists," a typically diverse and heavily undersampled group Foissner, 2008), that may display a wide range of trophic modes (Vaulot, Romari, & Not, 2002) and includes a high diversity of parasites and photosymbiotic taxa . Our dataset also contains a relatively high All sampling methods are subject to methodological limitations, and different sampling methods will capture different subsets of biodiversity (Kelly et al., 2017;Shelton et al., 2016). Like more traditional survey methods, eDNA metabarcoding has a certain level of taxonomic selectivity, which may be the result of primer bias.
The use of COI as a metabarcoding marker has been criticized in the past (Deagle et al., 2014) owing to its high sequence variability, which may impair the design of truly universal primers and complicate bioinformatics analysis. However, the use of primers with high degeneration rates and including deoxyinosines has contributed to mitigate these universality issues .
Additionally, when using COI as a metabarcoding marker, particularly when applied to samples containing highly complex signals, false-negative detection errors are likely to occur. Conversely, the number of species (MOTUs) will often be overestimated due to the detection of pseudogenes, such as "numts," nuclear sequences of mitochondrial origin (Bensasson, Zhang, Hartl, & Hewitt, 2001;Vamos, Elbrecht, & Leese, 2017) include multiple assays to improve detection probability, in order to provide an improved approximation of eukaryotic community diversity, derived from eDNA samples (Djurhuus et al., 2018;Günther et al., 2018;Stat et al., 2017;Villarino et al., 2018). These studies aim to overcome the occurrence of false negatives due to primer bias, at the expense of the potential for quantitative inference, so that only presence/absence methods can be used in downstream analyses.
The vast majority of marine life is physically small and dominant in both numbers and diversity (Guil, 2011;Snelgrove, 1999).
Consequently, in a water sample, eDNA from small eukaryotic organisms far outnumbers that of any vertebrate species, rendering the detection of eDNA from species such as teleosts in unfractionated water samples with a broad-spectrum primer, similar to an eDNA needle in the proverbial haystack (Collins et al., 2019). Even more so, at least part of the DNA from microscopic eukaryotes will have originated from entire individuals, as opposed to exclusively extracellular DNA from larger species, potentially drowning out part of the eDNA signal from these larger organisms. Nonetheless, the Leray-XT primer set was capable of detecting 54 teleost and 6 elasmobranch species. The use of a combination of multiple metabarcoding markers is essential to guarantee sufficient sequencing depth for capturing all the main components of eukaryotic diversity, including the better known metazoans groups upon which most conservation initiatives are based (Boussarie et al., 2018). Of course, the use of a single marker will significantly reduce operational costs, but a universal COI approach from aqueous eDNA samples, in the fashion of what is presented here, will always inevitably emphasize the micro-eukaryotic component.
While biodiversity loss has been exhaustively documented for macro and mega fauna (which only represent a small fraction of total marine biodiversity), to our knowledge, no research has addressed specifically this issue pertaining to microscopic eukaryotic plankton communities, which is most likely due to the difficulties in characterizing and quantifying the diversity of these communities (Bouchet, Lozouet, Maestrati, & Heros, 2002;Fonseca et al., 2010;Hirai, Kuriyama, Ichikawa, Hidaka, & Tsuda, 2015). However, marine plankton is essential for biological and geochemical processes, fixing CO 2 and other elements into biological matter, which subsequently enters the food web Field, Behrenfeld, Randerson, & Falkowski, 1998;Pernice et al., 2016). As such, micro-eukaryotic plankton plays essential roles in the structure and function of marine ecosystems globally (Liu et al., 2017). Additionally, planktonic communities are often used as indicators of ecosystem change due to their ability to rapidly respond to environmental shifts (Abad, Albaina, Aguirre, & Estonba, 2017; Taylor, Allen, & Clark, 2002). This makes the understanding of the composition, dynamics and position in food webs of planktonic communities across space and time, essential.
Interestingly, despite the thousands of taxa detected in our study, only tens of those were enough to distinguish samples at both regional and local scales, which suggests that in spite of the current major gaps in reference databases, COI metabarcoding of unfractionated water samples can be useful for describing coarse biodiversity trends, even when taxonomic assignment for key MOTUs is yet unavailable (Cordier et al., 2017). In fact, for certain ecological applications, a DNA-based monitoring approach may be streamlined by using just a few "pelagic indicator" taxa, although such applications would require the formal description of such indicator species. Species-level analysis of pelagic biodiversity is critical for understanding impacts of climate change, detecting invasive species, and the design of management objectives (Bucklin et al., 2016;Leray & Knowlton, 2016), hence the unraveling of the composition of hidden eukaryotic diversity and its subsequent description remain essential tasks, as comprehensive reference databases are critically needed for the taxonomic designation of eukaryotic DNA sequences. And while it has been estimated that between 24% and 98% of marine eukaryotic species are yet to be described (Goodwin et al., 2017;Leray & Knowlton, 2016;Mora et al., 2011), the advent of high-throughput sequencing and DNA metabarcoding is rendering the huge task of uncovering this hidden diversity using a "reverse taxonomy" approach (Markmann & Tautz, 2005), and in particular taxonomic assignment with COI metabarcoding (applying either single or multi-assay approaches), less insurmountable.
Simultaneously, studies aimed at identifying areas harboring high numbers of potentially important micro-eukaryotes could serve to direct targeted sampling to examine the most abundant microplankton, using powerful microscopy, in order to verify the identity of these taxa, which potentially hold great ecological importance.
As this study merely represents an initial exploration of the possibilities of using a COI metabarcoding assay for the description of marine eukaryotic community diversity, it by no means captures the full potential of eDNA metabarcoding for this purpose. However, it does clearly display the future possibilities and benefits of the method, underlining the importance of multiple assays as opposed to one single "universal" metabarcoding assay for the description of whole eukaryotic community assessment.
Moreover, it particularly highlights the effect and significance of currently unidentified micro-eukaryotic diversity on marine community structure and advocates for its inclusion in biodiversity assessments.

ACK N OWLED G M ENTS
We thank all staff, students, and volunteers from Glover's Reef Unite, and the University of Salford R&E strategy funding.

CO N FLI C T O F I NTE R E S T
None declared.

AUTH O R CO NTR I B UTI O N S
JB and SM conceived the study. Samples were collected by JB, DB, DDC, AJG, TLG, and HH. JB and OSW carried out laboratory work and analyzed the data with CB. All authors interpreted and discussed the results. JB drafted the manuscript, with contributions from OSW and SM.

DATA AVA I L A B I L I T Y S TAT E M E N T
Full sequence data can be found here: https ://doi.org/10.5061/ dryad.hqbzk h1bj.