Blind assessment of vertebrate taxonomic diversity across spatial scales by clustering environmental DNA metabarcoding sequences

-diversity, r = 0.98). This work paves the way towards extending the use of eDNA metabarcoding in community ecology and biogeography despite major gaps in genetic reference databases.


Introduction
In the new era of the Anthropocene, most ecosystems are experiencing severe human impacts and environmental changes with major consequences on species diversity (McCauley et al. 2015, Hughes et al. 2017, Isbell et al. 2019. Nevertheless, the ongoing reorganization of biodiversity is still poorly quantified and understood (but see Blowes et al. 2019) for two major reasons. First, the losses or gains of species are scale dependent with complex results emerging at the local or regional spatial scale (Vellend et al. 2013, Dornelas et al. 2014. For instance, several studies show that local species diversity is on average constant over time (Dornelas et al. 2014, Magurran et al. 2018, even under human impacts, while other studies report alarming species losses regionally or globally (Galetti et al. 2014, Doherty et al. 2016, Finderup Nielsen et al. 2019. Thus, any biodiversity monitoring should be spatially explicit (McGill et al. 2015) with three major components 1) local or α-diversity for the number of species within a given site, 2) spatial variation or β-diversity in species composition among sites and 3) regional or γ-diversity for the number of species within a geographical area containing all sites (Whittaker 1972). Second, biases and gaps in biodiversity inventories prevent accurate and comparable assessments across space and time (Hortal et al. 2015). This is particularly problematic when species are rare, small, cryptic or elusive or when ecosystems are either species-rich like in the tropics or hardly accessible like the deep sea (Mora et al. 2008, Menegotto andRangel 2018). Hence, there is an urgent need for standardized and accurate biodiversity monitoring methods across spatial scales allowing reliable inter-study comparisons.
The metabarcoding of environmental DNA (eDNA) has the potential to fill this gap as it has been shown to surpass most traditional methods in species detection for both terrestrial and aquatic ecosystems (Bohmann et al. 2014, Valentini et al. 2016, Ruppert et al. 2019, Sales et al. 2020. Indeed, all organisms shed cells containing DNA in their environment, as intra or extra-cellular material, and can be retrieved for up to a few days (Dejean et al. 2011, Collins et al. 2018, Harrison et al. 2019. Amplification and high-throughput eDNA sequencing followed by bioinformatic analyses produce a list of sequences with the ultimate goal to assess species composition in a given site. This bioinformatic step requires the completeness of a reference database, i.e. a list of sequences attached to each species in the regional pool, to accurately assign each eDNA sequence to a given species. Yet, reference databases are often incomplete (Weigand et al. 2019). An estimated 91% of eukaryotic species inhabiting the ocean are yet to be described (Mora et al. 2011a) while only 13% of all described Teleostean fish species are referenced in public reference databases like the European Nucleotide Archive (ENA) (Leinonen et al. 2011) for the 12S ribosomal DNA fragment amplified by the teleo primers (Valentini et al. 2016), limiting the extent of species diversity revealed by eDNA metabarcoding.
Currently, completing reference databases would require massive sampling and sequencing efforts since many species still remain undiscovered due to their intrinsic nature (rare, small or elusive) or their unexplored habitat (e.g. deep sea) (Menegotto and Rangel 2018). Moreover, polymerase chain reaction (PCR) and sequencing generate numerous errors, overestimating the true number of species by several orders of magnitude (Edgar andFlyvbjerg 2015, Flynn et al. 2015). Thus, accurate methods able to assess biodiversity without complete reference databases while considering PCR and sequencing errors are urgently needed.
The microbial field pioneered methodological advances to infer biological diversity without a complete reference by clustering similar sequences into molecular operational taxonomic units (MOTUs) (Huse et al. 2010). However, these approaches focus mainly on fungi or unicellular organisms where the concept of species remains challenging (Pawlowski et al. 2018, Lladó Fernández et al. 2019. Even if clustering-based analyses are increasingly used in eDNA studies targeting vertebrates (Andruszkiewicz et al. 2017, Bakker et al. 2017, Closek et al. 2019, Sales et al. 2019, using the diversity of MOTUs as a reliable proxy for species diversity has yet not been evaluated. For instance, Closek et al. (2019) reported a large overestimation with more than 1300 MOTUs for 92 fish taxa only in the Californian Current upwelling ecosystem. The extent to which the metabarcoding of vertebrate eDNA can provide a reliable blind estimation of species diversity across spatial scales is unknown.
Here we evaluate how clusters of vertebrate eDNA sequences can predict species diversity across spatial scales. More precisely, we quantify how MOTUs can accurately predict local (α) and regional (γ) species diversity but also composition species dissimilarity between samples (β-diversity). For this, we focused on teleost fishes which are highly vulnerable to anthropogenic threats (Mora et al. 2011b) and represent the main group of vertebrates studied with eDNA (Tsuji et al. 2019). First, we highlight the geographic and taxonomic gaps in the reference database for the 12S mtDNA fragment, which is known to perform well with the teleo primer (Collins et al. 2019) designed by Valentini et al. (2016). Then, we assemble a metabarcoding bioinformatic pipeline based on sequence clustering using SWARM (Mahé et al. 2015), post-clustering using LULU (Frøslev et al. 2017) and stringent quality filters to analyze eDNA sequences from 196 samples along 500 km of the Rhône river (France). From the composition of MOTUs in each sample, we estimate α-, βand γ-diversity and compare them to their analogs obtained with groundtruth assignment of all sequences using the complete reference database without clustering. Finally, we discuss strengths and weaknesses of this approach based on eDNA sequence clustering to assess taxonomic diversity across spatial scales, even when lacking exhaustive reference databases.

Global taxonomic and spatial gap analysis for fish
Recent fish metabarcoding studies indicate that primers located on the 12S ribosomal rRNA locus (12S rDNA) perform better (i.e. detect more species, with less bias and more specific amplification) than primers based on alternatives loci (Ribosomal locus 16S, the cytochrome c oxidase I gene (COI)) (Collins et al. 2019, Weigand et al. 2019. Although the COI gene and associated primers might cover a larger proportion of fish species in the reference database and have a higher interspecific variability, their lack of suitable conserved region complicates the definition of taxa-specific primers. COI primers exhibit a clear lack of consistency across replicates, have a low specificity leading to a low amplification of target organisms with often less than 5% of cleaned reads assigned to fish (Collins et al. 2019) resulting in a low detectability power (Deagle et al. 2014, Bylemans et al. 2018, Collins et al. 2019). Among the fish eDNA 12S markers, we selected the teleo marker (forward primer-ACACCGCCC-GTCACTCT, reverse primer-CTTCCGGTACACTTAC-CATG) (Valentini et al. 2016) given its high ability to detect fish species even in highly diverse ecosystems (Civade et al. 2016, Valentini et al. 2016, Bylemans et al. 2018, Pont et al. 2018, Cantera et al. 2019, Cilleros et al. 2019. We first assessed the global taxonomic coverage of the teleo primers by performing in silico PCR using ecoPCR (Boyer et al. 2016) on the entire public database ENA (Leinonen et al. 2011) (release 138, January 2019). To build our reference database, we allowed a maximum of three mismatches and compared the results with the complete fish taxonomy from FishBase (Froese and Pauly 2019). For the spatial analysis, we extracted freshwater fish checklists of all drainage basins from the most recent and comprehensive data at the global scale (Tedesco et al. 2017), covering about 80% of inland waters. We obtained marine checklists from OBIS (OBIS Ocean Biogeographic Information System) at 1° resolution (Albouy et al. 2019), and used them to estimate fish composition within marine ecoregions globally (Spalding et al. 2007).

eDNA sampling and sequencing
We downloaded the sequence data from a previous study by Pont et al. (2018). The complete dataset encompasses 196 eDNA samples collected along 500 km of the Rhone River (France, Supplementary material Appendix 1 Fig. A1), corresponding to 103 distinct sites with field replicates (between 1 and 4 samples per site) in 2016. Among those, the original study used only 118 samples corresponding to 59 sites, but all samples were collected and processed in parallel. For each sample, 30 l of freshwater water were filtered, extracted, amplified and sequenced (Pont et al. 2018).

Clustering methods
Accurately delineating 'true' biological sequences from PCR and sequencing noise has been an ongoing challenge since the emergence of next generation sequencing (NGS) technologies. Clustering sequences into molecular operational taxonomic units (MOTUs) or defining exact sequence variants (ESVs) as proxies for species is a common practice in the prokaryote microbial field but also to study unicellular eukaryotes or fungi (Huse et al. 2010, Schmidt et al. 2013, Zimmermann et al. 2015, Callahan et al. 2017) and more recently eDNA of vertebrates (Closek et al. 2019, Sales et al. 2019. While clustering has been historically limited to the creation of MOTUs based on a fixed similarity threshold, usually 97% (Stackebrandt andGoebel 2008, Edgar 2018), it poorly generalizes across markers or biological models (Edgar and Flyvbjerg 2015, Mahé et al. 2015, Nguyen et al. 2015, Callahan et al. 2017. As an alternative, new methods generate either ESV like the divisive amplicon denoising algorithm (DADA2) (Callahan et al. 2016) or MOTUs from de novo clustering algorithms based on sequence distribution and abundance to correct errors, like SWARM (Mahé et al. 2015). SWARM is an agglomerative unsupervised de novo single-linkage-clustering algorithm, building networks to define MOTUs based on sequence proximity and relative abundance (Mahé et al. 2015). While a threshold-based algorithm simply groups sequences together according to a fixed value, SWARM forms chains linking sequences based on their similarity and analyses the pattern to optimally break the network and delineate MOTUs (Mahé et al. 2014(Mahé et al. , 2015. So, the 'true' sequence is expected to be the most abundant while less abundant but close sequences are considered as erroneous as they are more likely to accumulate errors. This process avoids the dependence on a fixed value, which is not recommended in eDNA metabarcoding with short barcodes where only one mismatch can imply a different species (Miya et al. 2015).

Pipelines workflow
We based our analysis on two different pipelines: one where each unique sequence is independently assigned to a given species (called the Species pipeline) and the other one which clusters sequences into MOTUs using the SWARM algorithm (called the MOTU pipeline). In the Species pipeline, a complete reference database is required to assign a taxa to each sequence. In the MOTU pipeline, each MOTU also requires a taxonomic assignment but the completeness of the reference database is not required, as a partially complete reference database is sufficient to exclude MOTUs representing non-specific amplification, in our case, all non-fish taxa.
First, pre-processing steps were common for both pipelines ( Fig. 1). Reads were assembled using VSEARCH (Rognes et al. 2016), demultiplexed at the PCR replicate level and primers trimmed using CUTADAPT (Martin 2011) adapted from an existing metabarcoding pipeline (<https://github.com/frederic-mahe/swarm/wiki/Fred'smetabarcoding-pipeline>). No mismatches were allowed in tags for demultiplexing while sequences containing ambiguous nucleotides were discarded. Two additional steps were applied in the pre-processing for the MOTU pipeline. First unsupervised clustering was performed with SWARM, using a minimum distance of one nucleotide between each MOTU (d = 1), as one mismatch can separate two distinct species with Figure 1. Illustration of the entire pipeline with three main steps: pre-processing, clustering, application of thresholds and post-clustering. Programs used are in blue and thresholds or requirements in red. Blue lines represent the classical alternative paths for the ground-truth method (Species pipeline), i.e. with the complete reference database and no clustering, whereas yellow lines represent the MOTU-based pipeline (MOTU pipeline), while black lines represent shared steps. 5 our primer. Taxonomic assignments of all unique sequences or MOTUs were then performed by ecotag, a lowest common ancestor (LCA) algorithm from the Obitools toolkit relying on the National Center for Biotechnology Information (NCBI) phylogeny tree (Boyer et al. 2016). Then, a set of custom and already published thresholds were applied on unique sequences for both the Species and MOTU pipelines ( Fig. 1) (Valentini et al. 2016). All sequences or MOTUs with less than 10 reads, too short (< 20 bp), too long (> 75 bp) (Valentini et al. 2016) or not assigned to a fish phylum were discarded.
Each site usually has 2 samples as field replicates (except for 13 sites where the number of samples ranges from 1 to 4), and each sample has 12 PCR replicates, so most sites are represented by 24 individual PCRs (range: 12-48 PCRs replicates). For each site, we discarded all MOTUs or sequences present in only one PCR replicate (Civade et al. 2016). To avoid tag-jump noise (Schnell et al. 2015), all sequences with an abundance frequency of less than 0.001 per taxon/MOTU and per library were discarded. For the MOTU pipeline only, we then used the LULU algorithm (Frøslev et al. 2017) to clean MOTUs identified as erroneous based on sequence identity between MOTUs, abundances and patterns of co-occurrence. We used the blastn command line with the megablast algorithm to create the file matching all pairwise MOTUs to infer their similarity percentage. Then, to apply LULU, we used the 84% identity threshold (Frøslev et al. 2017) but also ran a sensitivity analysis with changes in the main parameters, i.e. the cross influence of identity threshold percentage and co-occurrence percentage (Supplementary material Appendix 1 Fig. A3).

Taxonomic assignments
For both pipelines, taxa assignments were performed on both our local database, exhaustive for resident species of the regional pool, and ENA (release 138, January 2019). For the Species pipeline, associating the local database with ENA (Leinonen et al. 2011) detected 24 extra species, among which 12 matches at 98% to our local database but at 100% in a public database to a foreign species (Supplementary material Appendix 1 Table A1). Those foreign species were unlikely to be present in the river, and most likely resulted from PCR or sequencing errors of local species randomly matching with foreign species. To avoid artificially inflating regional diversity from incorrect assignments, we only considered ENA assignments when our local database performed poorly (< 98% similarity). Among the 12 remaining species detected only by ENA and matching at < 98% to our local database, all were marine species from the Mediterranean Sea but 11 have records indicating a tolerance for brackish water while 6 were clearly known to enter estuaries (Supplementary material Appendix 1 Table A2). Most of those were also commonly consumed by humans, and DNA could have been transported into the river from sewage waters. Those extra species were hence kept for further analyses as they were unlikely to be errors generated at the PCR or sequencing step and they unlikely represent a methodological artefact.
Before analysis, assignments from ecotag were corrected to be more stringent as the algorithm can sometimes validate genus or family-level assignments to sequences with low similarity, which we chose to not trust blindly. This is due to the functioning of the ecotag algorithm (Boyer et al. 2016) and can happen in clades with a low species coverage in the reference database. We decided to add a level of standardization and only validate assignments at the species level for sequences matching at > 98% similarity, at 96-98% for the genus level, at 90-96% for the family level and at less than 90% similarity for the order or higher level for all sequences matching following a pilot study on phylogenetic signal for this marker (Supplementary material Appendix 1 Fig. A2).

Controlling taxonomic redundancy
When a sequence has a low percentage of similarity (< 98%), it can correspond to 1) a species absent from databases, 2) noise from PCR/sequencing errors from actual sequenced species or 3) rare but strong intra-specific variation at this non-coding locus which is prone to rapid mutations or insertions (Leinonen et al. 2011, Valentini et al. 2016. A common NGS metabarcoding issue is that for one species sequence matching at 100%, it can generate several noise variants matching at less than 100% (Frøslev et al. 2017). Hence, when counting the total number of taxa to infer the level of diversity, there is always a clear overestimation. For example, one Salmo trutta sequence with 100% similarity to a reference database would likely be accompanied by sequences matching at 97%, assigned at the Salmo genus and 95% assigned at the Salmonidae family. Where one species is present, the total taxa count can be three. To correct the number of taxa while being conservative, we created an estimated species count based on taxonomic correction for redundancy. A genus, family or order assignment can only be kept if there is no species already belonging to that rank, otherwise it would be more likely to be an error since the genetic databases are exhaustive for local resident species, the rest representing only a minority of rare sequences.
To evaluate the performance of LULU in the MOTU pipeline, we grouped taxa following this logic up to the family level. If a MOTU is assigned to a family for which a species representative is also detected, we assumed an error for this species and taxonomic redundancy. If a MOTU is assigned to an order only, it was not considered to represent an additional species.

Diversity comparison across scales
To assess the performance of our MOTU-based approach we calculated regional (γ) diversity, local (α) or sample diversity and dissimilarity between samples (β) with each pipeline. For the Species pipeline we retained all sequences matching at > 98% similarity cleaned for taxonomic redundancy to count the number of distinct species. For the MOTU pipeline, we retained all MOTUs assigned to a fish taxa regardless of their similarity percentage. We used the software R ver. 3.6, where sample or α-diversity was computed as richness, i.e. plain species count. β diversity was computed using the Sorensen index, with the beta.temp and beta.multi functions from betapart package (Baselga and Orme 2012). A low value of dissimilarity between samples indicates similar communities, on a scale from 0 (identical) to 1 (totally dissimilar so no species or MOTU is common). We used the Mantel correlation test for pairwise sample comparisons.

Global gaps in fish reference databases
Our analysis reveals that only 4243 out of 33 124 teleostean fish species (13%) are sequenced in the region amplified using the teleo primers, for both marine and freshwater environments (Fig. 2a). At higher taxonomic rank, we show that 38% of genera have at least one representative species sequenced for the 12S on the teleo fragment, this percentage reaching up to 80% for families. Next, we highlight a strong spatial heterogeneity between marine and freshwater environments but also among freshwater basins and marine ecoregions (Fig. 2b-c). For freshwater ecosystems, the proportion of fish species being referenced for the 12S fragment ranges from 0 to 100%, with tropical basins having an overall lower coverage than their temperate counterparts, except for Oceania where the proportion of sequenced species is among the highest. South America and Africa have by far the lowest coverage among all continents. For marine ecosystems, disparities are less pronounced but coverage varies between 10 and 53%. Ecoregions in Europe and Northern America have the highest coverage whereas tropical and southern ecoregions are the least covered.

γ-diversity assessment after filtering and clustering processes
In the 196 samples along the Rhone river, we obtain 60 689 053 reads of 299 225 distinct sequences with a mean of 309 617 reads per sample prior to any filtering (Table 1). First, we analyzed the eDNA metabarcoding data with the complete reference database (local database and ENA combined) with the Species pipeline (Fig. 1). We detect a total of 63 fish species (Table 1). Our new assembled MOTU pipeline applied on the same raw dataset identifies 67 MOTUs out of which 61 (91%) could be subsequently identified at the species level, i.e. matching at least at 98% of similarity with a species in the reference database.
We find that 98% of unique sequences and 96% of unique MOTUs correspond to either low abundant (< 10 reads) or non-fish species, so represent artefacts, noise or unspecific amplifications (Table 1), while only accounting for 12.5% and 4.4% of total reads, respectively. Sequence length filtering has a low influence, removing only 1 MOTU and no species. While removing only 0.004% of the total read count, our PCR filter removing all reads found in only one PCR replicate per site eliminates 45 MOTUs assigned to species (from 108 to 63) among which only 4 are possibly resident to the area (Supplementary material Appendix 1 Table A3). All other eliminated taxa are absent in the river and likely result from errors, contaminations from sewage waters or methodological artefacts. This PCR replicate filter also discards more than half of the detected MOTUs (86 out of 189, Table 1) representing mainly taxonomic redundancy and low-quality reads.
Following the PCR replicate filtering step, only 50 out of 86 MOTUs are represented by one taxon (Fig. 3), revealing either redundancy, with several MOTUs corresponding to the same taxa, or a lack of identification at the species level for the 36 remaining MOTUs. The application of LULU decreases the total number of MOTUs from 86 to 67 (Fig. 3). In particular, the number of taxa represented by more than 1 MOTU decreases from 15 (up to 6 MOTUs per taxa) to 8 after cleaning with LULU. Following this step, the lost MOTU representing a real taxa corresponds to a complex of two cyprinid fish species (Ctenopharyngodon idella and Hypophthalmichthys molitrix) for which teleo marker is not resolutive at the species level.
Finally, the regional pool (γ-diversity) of our fish Rhone dataset is comprised of 67 MOTUs among which 61 can be assigned to a species with 98% similarity while the ground-truth value is 63 fish species using the Species pipeline (Table 1).

Estimates of α and β species diversity using MOTUs
For each sample, we calculated the local (α-) diversity obtained by each pipeline so in terms of species and MOTUs. Overall the correlation between the number of MOTUs and the number of species is high and significant (r = 0.98; p < 0.001; Fig. 4a). The mean difference in local diversity across samples between the two pipelines is of 1.02 (SD = 1.5) with the MOTU-based approach underestimating true α-diversity. The maximum difference in local diversity is 5 (Fig. 4a), meaning that for one sample five less MOTUs are detected compared to the number of species identified with the reference database.
Since a similar value of α-diversity detected by the two pipelines does not necessary imply the same community composition, we performed a dissimilarity analysis (β-diversity) between samples pairs for both methods using the Sorensen index. We detect a high and significant correlation (r = 0.98, p < 0.001, Fig. 4b) between pairwise sample dissimilarity estimated with the Species and MOTU pipelines. We highlight no over or underestimation of dissimilarity by one pipeline compared to the other. Overall, the MOTU pipeline generates lower dissimilarity for 71% of pairs of samples compared to the Species pipeline but in 95% of all cases, the inferred level of dissimilarity has less than 0.1 difference between the two pipelines.

Discussion
While eDNA metabarcoding represents a promising tool for scaling-up biodiversity inventories (Berry et al. 2019, Ruppert et al. 2019, its strong dependence on genetic reference databases limits its application in many regions of the world, as well as for some taxonomic groups or some habitats (Weigand et al. 2019). Indeed, even diverse yet well-studied ecosystems such as coral reefs do not have exhaustive genetic references for most lineages and the majority of commonly used primers in eDNA metabarcoding (DiBattista et al. 2017, West et al. 2020). 8 Some reference-free tools exist, but their application remains mostly limited to unicellular or fungi organisms, with different aims and constraints compared to eDNA studies targeting vertebrates. Moreover, such tools do not provide plausible diversity levels for most applications on vertebrate eDNA (Andruszkiewicz et al. 2017, Closek et al. 2019, Siegenthaler et al. 2019). Further, a proper testing of whether those approaches provide reliable diversity estimates is lacking (Pedrós-Alió 2006, Huse et al. 2010, Lladó Fernández et al. 2019) beyond controlled mock communities (Frøslev et al. 2017, Alberdi et al. 2018. In our study, we assembled a set of bioinformatic tools to generate fish MOTUs and assess the level of diversity across spatial scales based on the use of eDNA metabarcoding, using a well-known river system as a case study. We show that, at the regional level, our MOTU-based pipeline provides a comparable estimate of species diversity with 67 MOTUs when 63 species are detected. However, some MOTUs represent either errors or unreferenced species, and 8 species remain undetected due to clustering and stringent filtering. Such weakness arises as many species have close sequences to each other and co-occur. So, it remains impossible for any algorithm to distinguish close species from errors. This dilemma -distinguishing rare MOTUs from errors -is inherent to clustering techniques (Huse et al. 2010, Frøslev et al. 2017, Pawlowski et al. 2018. Despite numerous attempts to solve this issue, there is still a trade-off between allowing false positives and creating false negatives (Reeder and Knight 2009). Among the MOTUs representing taxonomic redundancy, at least 3 taxa (Gobio gobio, Alosa sp., Table 1. Numbers (#) of reads, sequences, species and MOTUs identified and retained at each step of our Species or MOTUs pipelines ( Fig.  1) with # Species representing the number of taxa corrected for taxonomic redundancy (see Methods). Details for each step are presented in Methods and Fig. 1 Figure 3. Distribution of the number of MOTUs per fish taxa (a) before LULU cleaning and (b) after LULU cleaning for taxonomic redundancy (Frøslev et al. 2017).
Phoxinus phoxinus) are known to hybridize (Alexandrino et al. 2006) or are under taxonomic revision with the potential existence of multiple species displaying genetic variations (Kottelat andPersat 2005, Collin andFumagalli 2011) while for one species (Dicentrarchus labrax), genetic public databases (Sayers et al. 2019) highlight a marked intra-specific variability.
Sequencing and PCR errors are common in metabarcoding datasets (Siegwald et al. 2017), but as eDNA barcodes are usually short to enhance detectability (Bohmann et al. 2014, Deiner et al. 2017, one mismatch generated randomly can easily correspond to a distinct but closely related species. This poses the risk of false-positive detection, like in the present study, where several foreign species were detected (Supplementary material Appendix 1 Table A2). Yet, none of the false positive species detected with the Species pipeline were retained as a MOTU after the clustering process, highlighting the strength of our clustering approach to clean false positive errors when they likely arise from PCR and sequencing errors. When using a classical metabarcoding pipeline without a stringent cleaning or clustering step to infer diversity from short sequences, those false positive species might remain in the global pool of detected species which would require special care to flag and exclude such errors (i.e. manual alignment of sequences and verification of species geographical distribution). We also show the extent to which SWARM is able to assign the correct sequence as the representative of each MOTU, since 61 (out of 67) MOTUs perfectly match to a species from the reference database. Our results stress the importance to combine post-clustering filters based on PCR replicates and a cleaning algorithm to remove spurious MOTUs.
Since our MOTU-based pipeline slightly overestimates regional diversity with 67 MOTUs obtained compared to 63 species identified, a key question is how it can impact local diversity assessment. We found a slight tendency for MOTUs to underestimate species richness, with less than 2 MOTUs of difference compared to the number of species for most samples. The underestimation of diversity stemming from missed species (8 species so 13% of the regional pool) is not totally compensated by the overestimation caused by taxonomic redundancy in the regional pool. Further, no outliers were identified over all 196 samples. We also show that most of mentioned pitfalls do not impact patterns of dissimilarity at the community scale, as results are similar whether they are based on blind MOTUs or species identification. In summary, the assessment of local diversity is nearly not impacted by the absence of a complete reference database, both estimates are highly correlated (98%) with a mean difference of one species between pipelines. While these results are valid using the teleo marker (12S rDNA, ~60 bp long), we could not validate our pipeline using other primer sets due to time and financial constraints. This pipeline can still be applied to other markers, but it would require a marker with a similar level of taxonomic specificity and limited intra-specific variation, to avoid an over-estimation of taxonomic diversity due to haplotype diversity. An application with another primer would require more investigation to test if threshold adjustments are necessary to match its specificities (i.e. PCR replicates number, minimum number of reads, LULU parameters, minimum distance in SWARM clustering). We suggest the design of a small pilot study in a well-known system to validate its blind predictive power before larger-scale applications.
We show that our approach using MOTUs delivers robust estimates of species diversity at the three geographic scales, unlocking new potential for biodiversity monitoring through eDNA. With more than 75% of fish families potentially detectable, our approach can go beyond the simple delineation of sequences within clusters when further assigning taxonomy to our MOTUs. In particular, the use of assignment algorithms such as the Lowest Common Ancestor (LCA) algorithm (Boyer et al. 2016, Gao et al. 2017) is well suited for taxonomic assignment in eDNA studies with incomplete reference database. We can then estimate the potential number of species per family when the sequence coverage within families is sufficient for such assessment. While a family assignment has limitations, ecological characteristics are generally well conserved for species within a given family (Brandl et al. 2018) and allow relevant metrics of ecological analyses to be computed at this scale. As the minimum coverage within family necessary for robust detection using LCA is likely to vary across taxa and goes beyond the scope of this study, a complete coverage is not requested and our approach can provide an accurate estimate of species diversity within family for ecological studies. Yet, we highlight some limitations when it comes to conservation policies for which unnamed MOTUs will not be satisfying. As conservation programs usually focus on few taxa which are mostly rare, threatened, invasive or emblematic (Pimm et al. 2018, Enquist et al. 2019, Hannah et al. 2020, achieving the complete sequencing of those target species is urgent but realistic in the near future, as opposed to the sequencing of most vertebrates. The current filling of global DNA databases is sufficient for our approach to work globally and across scales. Diversity indices derived from this method are shown to be reliable at α, β and γ scales to infer similar ecological conclusions as those based on classical species identification.

Conclusion
While it has widely been reported that molecular biodiversity inventories outperform classical inventories (videos, acoustic) in the open environment (Thomsen et al. 2016, Boussarie et al. 2018, we demonstrate that, in the absence of a complete genetic reference database, a bioinformatic pipeline using Molecular Operational Taxonomic Units is able to provide robust estimates of species diversity across spatial scales. Even if some species cannot being distinguished after the clustering step, a common issue due to genetic proximity between close taxa (Fahner et al. 2016), the geographic biodiversity patterns are highly similar to those obtained with a species-based method. As false negatives are inherent to any inventory method in ecology (Field et al. 2007) and while false positives are rarer but to avoid at all cost (Chambert et al. 2015), we suggest a precautionary approach where some 'true' observations could be lost in order to reduce the risk of false observations. Given the current state of genetic database coverage, a species-based eDNA approach is only achievable in freshwater ecosystems located in the Northern hemisphere, where the coverage exceeds 50% of fish species (Fig. 2). For all other ecosystems, our study is the proof of concept demonstrating that, given an appropriate primer set as well as filtering and cleaning processes, MOTUs can be used to accurately assess the level of biodiversity at all scales: local, turnover and regional. We thus advocate the need to focus sequencing efforts in priority towards 1) families with no genetic coverage so presently virtually undetectable with our approach and 2) conservation-important like invasive species or IUCN Red List species for which unassigned MOTUs cannot substitute. This work paves the way towards extending the use of eDNA in community ecology and biogeography even for poorly known ecosystems or lineages, and install eDNA as a standard monitoring tool (Jarman et al. 2018). It also reinforces its initial goal of versatility and high comparability to monitor any kind of ecosystem and compare communities across wide environmental gradients.