Biodiversity inventory of the grey mullets (Actinopterygii: Mugilidae) of the Indo‐Australian Archipelago through the iterative use of DNA‐based species delimitation and specimen assignment methods

Abstract DNA barcoding opens new perspectives on the way we document biodiversity. Initially proposed to circumvent the limits of morphological characters to assign unknown individuals to known species, DNA barcoding has been used in a wide array of studies where collecting species identity constitutes a crucial step. The assignment of unknowns to knowns assumes that species are already well identified and delineated, making the assignment performed reliable. Here, we used DNA‐based species delimitation and specimen assignment methods iteratively to tackle the inventory of the Indo‐Australian Archipelago grey mullets, a notorious case of taxonomic complexity that requires DNA‐based identification methods considering that traditional morphological identifications are usually not repeatable and sequence mislabeling is common in international sequence repositories. We first revisited a DNA barcode reference library available at the global scale for Mugilidae through different DNA‐based species delimitation methods to produce a robust consensus scheme of species delineation. We then used this curated library to assign unknown specimens collected throughout the Indo‐Australian Archipelago to known species. A second iteration of OTU delimitation and specimen assignment was then performed. We show the benefits of using species delimitation and specimen assignment methods iteratively to improve the accuracy of specimen identification and propose a workflow to do so.

At the core of all DNA barcoding initiatives lies the construction and validation of reference libraries that can be further used to assign unknown specimens to known species (Hubert et al., 2008;Ratnasingham & Hebert, 2007). However, reference libraries available in public repositories such as BOLD or GenBank can host a substantial portion of taxonomic misidentifications (Bridge, Roberts, Spooner, & Panchal, 2003;Vilgalys, 2003) leading to ambiguous identifications at the species level (Ardura, Planes, & Garcia-Vazquez, 2013;Bortolus, 2008). Several recent studies have evidenced that our taxonomic knowledge of living organism is still  Hubert et al., 2012;Kadarusman et al., 2012;Meyer & Paulay, 2005;Smith, Wood, Janzen, Hallwachs, & Hebert, 2007;Winterbottom, Hanner, Burridge, & Zur, 2014). These conflicting cases have promoted the use of DNA barcoding for the morphological identifications are usually not repeatable and sequence mislabeling is common in international sequence repositories. We first revisited a DNA barcode reference library available at the global scale for Mugilidae through different DNAbased species delimitation methods to produce a robust consensus scheme of species delineation. We then used this curated library to assign unknown specimens collected throughout the Indo-Australian Archipelago to known species. A second iteration of OTU delimitation and specimen assignment was then performed. We show the benefits of using species delimitation and specimen assignment methods iteratively to improve the accuracy of specimen identification and propose a workflow to do so.
We here exemplify the benefits of the iterative use of species delimitation and specimen identification methods based on a carefully crafted DNA barcode library used as a test case with the fish family Mugilidae (grey mullets) in the Indo-Australian Archipelago (Durand, Hubert, Shen, & Borsa, 2017). This fish family illustrates the stakes associated with inventorying complex and diverse groups with difficult taxonomy and systematics (Durand & Borsa, 2015;Durand et al., 2012). Currently scattered across 30 genera distributed at a global scale (Froese & Pauly, 2019), the 78 recognized species show strikingly conserved morphological attributes, making species identification challenging (Thomson, 1997). As a consequence, grey mullets are often under-represented in field guides and specimen identifications are usually extremely difficult for most nonspecialists. Grey mullets, however, constitute a valuable source of protein and income for local communities in many tropical countries through either artisanal fisheries or aquaculture (Bacheler, Wong, & Buckel, 2005;Crosetti & Blaber, 2016;Whitfield, Panfili, & Durand, 2012).
In the present study, we aim to demonstrate the benefits of an iterative use of species delimitation and specimen identification methods to tackle the inventory of a complex taxonomic group, the Indo-Australian mugilids. We first re-examined a publicly available and curated DNA barcode reference library for Mugilidae across their distribution range (827 sequences, Durand et al., 2017)  In order to detect potential new OTUs or the impact of incomplete coverage of the coalescent trees of each OTUs, the four species delimitation methods were applied to the entire dataset consisting of 1,072 DNA barcodes. In case new OTUs were detected or OTU delimitation was revised, an updated reference library was build including representative sequences of each new OTU. A second specimen assignment analysis was further performed using this updated library. Using an iterative procedure of re-examination through species delimitation and specimen assignment methods on the whole dataset (1,072 sequences), we generated a fit-for-use reference library and quantified the impact of incomplete sampling on specimen assignment. The benefits of iteratively using DNA-based species delimitation and specimen assignment methods are discussed.

| DNA barcode reference library
The baseline reference library used in this study originates from Durand et al. (2017). This library consists of 827 DNA barcode records trimmed to 538 base pairs representing 102 known taxa (Table S1). Mugilidae systematics follows that of Durand et al. (2012) and Xia, Durand, and Fu (2016) established on the basis of multilocus molecular phylogenies. Species nomenclature follows that of Durand and Borsa (2015).

| Sampling, sequencing, and data repository
A total of 245 specimens were collected by visiting fish markets and scuba diving using polespears or captured using various gears including seine nets, cast nets, and gill nets across 25 sites in the Indo-Australian Archipelago (23 sites in Indonesia and 2 sites in Papua New Guinea, Figure 1). Specimens were photographed and individually labeled, and voucher specimens were preserved in a 5% formalin solution or a 70% ethanol solution. A fin clip or a muscle biopsy was taken for each specimen and fixed in a 96% ethanol solution for further genetic analyses.
Both tissues and voucher specimens were deposited in the National The sequences and collateral information (photographs, voucher collection number, and collection data) are publicly available in BOLD (Ratnasingham & Hebert, 2007) and are available in the projects BIFV and WPRFM (Table S2) and as a dataset (https ://doi.org/10.5883/ DS-BIFMU ). DNA sequences were submitted to GenBank; accession numbers are accessible directly at the individual records in BOLD.

| Genetic distances, OTU delimitations, and specimen assignments
Kimura 2-parameter (K2P; Kimura, 1980) pairwise genetic distances were calculated using the R package Ape 4.1 (Paradis, Claude, & Strimmer, 2004). Maximum intraspecific and nearest neighbor genetic distances were calculated from the matrix of pairwise K2P genetic distances using the R package Spider 1.5 (Brown et al., 2012). We checked for the presence of a barcoding gap, that is, the lack of overlap between the distributions of the maximum intraspecific and the nearest neighbor genetic distances (Meyer & Paulay, 2005), by plotting both distances and examining their relationships on an individual basis instead of comparing both distributions independently (Blagoev et al., 2016). A neighbor-joining (NJ) tree based on K2P distances was built using ape 4.1 to visually inspect genetic distances and DNA barcode clusters ( Figure S1).
For the sake of clarity, species identified based on morphological characters are referred to as species while species delimited by DNA sequences are referred to operational taxonomic unit (OTU), defined as diagnosable molecular lineages (Avise, 1989;Moritz, 1994;Vogler & DeSalle, 1994). OTUs were delimitated using four different algorithms: GTR + Γ substitution model. The ultrametric and fully resolved tree for mGMYC was reconstructed using the Bayesian approach implemented in BEAST 2.4.8 (Bouckaert et al., 2014). Two Markov chains of 50 million each were run independently using a Yule pure birth model tree prior, a strict-clock model, and a GTR + I + Γ substitution model. Trees were sampled every 10,000 states after an initial burn-in period of 10 millions, both runs were combined using LogCombiner 2.4.8, and the maximum credibility tree was constructed using TreeAnnotator 2.4.7 (Bouckaert et al., 2014). Duplicated sequences were pruned prior to the Bayesian analysis.
Three specimen assignment methods implemented in the R pack- to the nearest neighbor reference DNA barcode sequence (Zhang, Muster, et al., 2012); and (c) the alignment-free kmer-based method (FZKMER), suitable for both coding and noncoding portions of the genomes using machine learning (Zhang, Feng, et al., 2012) as the optimal kmer length is first estimated followed by an FZ specimen identification. We included one representative sequence, randomly selected, for each OTU of the reference libraries to perform the assignments. We used the barcoding.spe.identify function specifying "bpNewTrainingOnly" in the first run and "bpUseTrained" in the second run of the function to perform BP. We used the barcoding.spe.identify function with the option "fuzzyId" to perform FZ and the barcoding.
spe.identify2 function specifying a search for a kmer length up to 5 to perform the FZKMER method. Both FZ and FZKMER methods assign to each potential identification an "FMF value" in the range of 0-1, indicating likelihood of the assignment (Zhang, Feng, et al., 2012), while BP assigns to each identification a "bp.probilities" also in a range of 0-1. For the sake of clarity, we will refer hereafter to both the bp.probabilities and the FMF values as "probabilities." As the different methods can lead to conflicting results, a consensus has been established for the multiple methods using the consensus.identify function of the barcodingR package. We considered that a consensus emerged if at least two methods were converging. Finally, we computed the ratio of the intraspecific K2P genetic distance of a selected OTU to the nearest neighbor K2P genetic distance for each method to test whether the selected OTU was the least distant possible and spot potential false positives (i.e., specimens incorrectly assigned).
The delimitation and the assignment methods were used iteratively: Species delimitation methods were first applied to the 827 DNA barcode reference library, and specimen assignment methods applied to the 245 newly generated DNA barcodes. A second round of species delimitation was performed on the whole dataset, that is, the 827 DNA barcode reference library and the 245 newly generated DNA barcodes (1,072 sequences), in order to check for the consistency of the delimitation schemes. We added one representative sequence of each new OTU retrieved and removed/added the corresponding sequences when OTUs were merged/split following this second iteration of delimitation to perform a second round of specimen assignment of the newly generated DNA barcodes. The comparison of the two rounds of species delimitations and specimen assignments also allowed us to appraise the impact of the taxonomic coverage of the reference library on the accuracy of the specimen identifications, that is, whether OTU attributed by specimen assignment methods corresponds to the OTU given by the species delimitation methods, and to evaluate the behavior of the probabilities associated with the true (correctly assigned) and false (erroneously assigned) positives.

| First round of species delimitation-identifying OTUs in the published reference library
The first round of species delimitation methods using the DNA barcode reference library of Durand et al. (2017)  The maximum K2P intraspecific distances ranged from 0.00000 to 0.02828 (Figure 3a), while the nearest neighbor distances ranged from 0.00000 to 0.18403 (Figure 3b). The median nearest neighbor distance (0.03478) was 8.76-fold higher than the median intraspecific distance (0.00397; Figure 3c). A single lineage (OTU 90) displayed a lower nearest neighbor K2P distance than its maximum intraspecific distance. Finally, nearest neighbor K2P distances below one percent of pairwise genetic distance were observed for 5 species, with 2 species displaying a K2P genetic distance of 0 to their nearest phylogenetic relative (OTU 16, OTU 113).

| First round of specimen assignment-assigning new unknown specimen to known species
All these sequences were above 500 bp, and no stop codons were detected. Probabilities associated with each specimen assignment varied significantly between the three methods, FZKMER median: 0.57; Figure 4a and Table S3). We also compared the probabilities associated with each assignment method to the distance to the assigned OTU (Figure 4b). Both the FZ and FZMER methods systematically attributed probabilities values lower than 0.5 for sequences displaying a distance to the selected OTU higher than 0.015 (Figure 4b). In 16 cases, BP and FZKMER assigned more F I G U R E 2 Bayesian maximum credibility tree of the 326 unique haplotypes of the 827 DNA barcode reference library (round 1 of species delimitation) including 95% HPD intervals for node age estimates and OTU delimitation schemes according to the four species delimitation methods implemented (blue) and the resulting consensus (green) at such ratio levels. Finally, despite these differences between the three methods, they were generally in agreement regarding the species assigned; a consensus emerged for most cases (95.9%) with the three methods of assignment converging in 77.1% of the cases and 2 out of the 3 methods converging in 18.8% (Table S3). The median probability associated with such consensus was over 0.77 (Figure 7), while the mean probability associated with cases where no consensus emerged (n = 10) was 0.20.

| Second round of species delimitationrevising OTU delimitation by incorporating the new unknown specimens
The second round of species delimitation applied to the joint dataset of 1,072 DNA barcodes yielded more OTUs in general with 113 using RESL, 161 using ABGD, 82 using mPTP, and 110 using mGMYC (vs. 105, 148, 70, and 120 for the first round; Figure S1 and  Distance to NN (K2P)

| Second round of specimen assignment-final assignment of unknowns to knowns
Similarly to the first round of specimen assignment, the BP method attributed higher probabilities than the two other methods. Yet, a shift toward higher probabilities was observed for all three methods, with a median value of 0.94 for BP, 0.86 for FZ, and 0.84 for FZKMER. In comparison with the first round, the range of distances to selected OTUs is smaller for all three methods with the FZ now being the only method with distances to selected OTUs not larger than 0.020 (Figure 4b). FZ always selected the nearest F I G U R E 4 Results of the first round of specimen assignment. (a) Distribution of probabilities associated with the three assignment methods for the first (1) and the second (2) iteration of specimen assignment; and (b) Distance to selected OTU for the different assignment methods for both iteration of species assignment (first at the left side, second at the right side). The lower and the upper hinges of the box plots correspond to the first and the third quartiles; the lower whiskers correspond to the smallest and observation greater than or equal to lower hinge − 1.5 × interquartile range, while upper whiskers correspond to the largest observation less than or equal to upper hinge + 1.5 × interquartile range Pr OTU, while both BP and FZKMER still assigned several sequences to OTUs more distantly related than the nearest OTU available in the library (eight and one sequences respectively; Figure 5).
Interestingly, all these nine cases correspond to misidentifications (assignment to an OTU different from the OTU attributed by the delimitation methods) while only two out of 40 cases reported for the first assignment round corresponded to misidentifications. The remaining 38 cases represented not yet delineated new OTUs. The ratio between the K2P genetic distances of the selected OTUs and distance to the nearest neighbor was smaller for the three methods in the second round ( Figure 6). A consensus between the 3 assignment methods emerged in 99.5% of the cases, with the three methods converging in this second round for 97% of the cases and 2 out of the 3 methods converging in 2.5% (Table S3). The median probability associated with a consensus was over 0.87 (Figure 7). Finally, we evaluated the potential pres-  I G U R E 5 K2P genetic distances to the selected OTUs and K2P genetic distances to the nearest OTUs for the different specimen assignment methods for the first round (1) and second round (2) of specimen assignments Probability (OTU 114 (Crenimugil sp. C) and OTU 120 (Planiliza sp. E)), two in Lombok (OTU 109 (Planiliza sp. G) and OTU 113 (Planiliza subviridis)), and one in Ambon (OTU 34 (C. buchanani)), while OTU 36 (Crenimugil sp. A) has been collected in five different islands, from Sumatra to New Guinea (Figure 9a-c). Finally, at the island level, the number of OTUs collected ranged from one (Lembeh) to 10 (Java and New Guinea). No new taxa were collected from Ambon, Bali, and Lembeh, while the other islands hosted up to four new taxa (Sumatra; Figure 9d).

| D ISCUSS I ON
DNA-based automated specimen identification methods, such as DNA barcoding, open new perspectives to inventory, and monitor biodiversity (Deiner et al., 2017). In the last fifteen years, the success of the DNA barcoding initiative has given rise to the development of multiple approaches to not only assign unknown specimens to known species (see review of Bazinet & Cummings, 2012) but also to automatically delineate species through DNA-based approaches (Brown et al., 2012;Munch, Boomsma, Huelsenbeck, Willerslev, & Nielsen, 2008;Pons et al., 2006;Puillandre, Lambert, et al., 2012;Ratnasingham & Hebert, 2013). The present study highlights the benefits of using species delimitation and specimen identification methods jointly and iteratively.
Our test case was a family of shore fishes that are among the most complex for morphological species identification and which is plagued by major taxonomic gaps. The first iteration of species delimitation applied to the reference library revealed the presence of more than 10% cryptic diversity because 113 OTUs were extracted from 102 known taxa. However, mitonuclear discordances due to incomplete lineage sorting and introgression are known to limit the robustness of mtDNA species delimitation (Hinojosa et al., 2019;Pedraza-Marrón et al., 2019; but see review of Toews & Brelsford, 2012). In that case, the use of multiple independent nuclear markers, biparentally inherited, should be used to confirm the mtDNA-derived cryptic lineages detected here (Fennessy et al., 2016;Fišer, Robinson, & Malard, 2018). Despite using a reference library with objectively delineated OTUs for a first round of specimen assignment, the second iteration of species delimitation showed that 14% of 245 consensus identifications made during the first iteration of specimen assignment were false positives, most of which actually corresponded to new OTUs.
The iterative use of species delimitation and specimen identification methods also resulted in a substantial decrease in the proportion of false positives and a shift toward higher probabilities values for true positives. The resulting delimitation scheme during the second iteration benefitted from an increase in the taxonomic coverage and the number of sequences per OTU in the reference F I G U R E 6 Distribution of the specimen assignment probabilities across the ratio of the K2P genetic distances to the selected OTU upon the K2P genetic distances to the nearest neighbor in the reference library (selected OTU/ NN) for each of the specimen assignment methods for the first round (1) and second round (2) of specimen assignment Probability F I G U R E 7 Distribution of the average probability of the three assignment methods for each specimen assigned during the first round (1) and second round (2)  The present study also clearly emphasizes the importance of developing comprehensive reference libraries to overcome the susceptibility of specimen assignment methods to spurious OTU delimitation resulting from insufficient taxonomic and/or intraspecific genetic diversity coverage. The accuracy of the specimen assignment procedure, that is used to assign an unknown specimen to a known species (Hubert et al., 2008) (Durand et al., 2017). Yet, most specimen assignment methods require a well-parameterized reference library, locally or globally accessible. Not all international repositories, unlike BOLD, have been conceived to handle taxonomic updates in the long term, a task that requires extensive collateral data (Ratnasingham & Hebert, 2007; F I G U R E 8 Specimen assignment probabilities for each specimen assignment method for the first round (1) and second round (2)  (c) (d) Ward, Hanner, & Hebert, 2009). Ideally, local, limited, and self-generated DNA barcode reference libraries should be the starting point as they allow to perform a variety of species delimitation and specimen identification methods, but at the same time provide control over the accuracy of the identifications (DiBattista et al., 2017;Olds et al., 2016;Sonstebo et al., 2010;Willerslev et al., 2014).
The combined application of several species delimitation methods allows for the normalization of over-or underestimation that can occur with each of these methods (Blair & Bryson, 2017;Huang et al., 2018;Kekkonen & Hebert, 2014;Kekkonen et al., 2015).
Similarly, the possibility to compute different specimen identification methods within the same framework allows for the improvement of the confidence in the inferences by application of consensus methods. In fact, specimen assignment methods will always assign a unknown specimen to a known species, with a certain level of confidence, leading potentially false positives. Despite using a well-curated library, we showed that the issue of false positive could not be put aside.
We suggest two approaches to improve the accuracy of specimen assignment and to avoid false positive identifications. The first approach is to use probability thresholds, and the second approach involves the use of several assignment methods to establish a consensus. After our first assignment round, we observed the presence  (2016): 29 species). We found that the 245 specimens studied here belonged to 27 OTUs spreading across five different genera. Less than half of these OTUs correspond to known species, cryptic diversity was found for three of these species, hosting up to three different lineages, and eight (30%) potentially new species have been collected (i.e., unknown and newly detected), enlightening the importance of integrative approaches to disclose hidden diversity (Hebert, Penton, et al., 2004;Janzen et al., 2005;Smith et al., 2008Smith et al., , 2007. Such results represent a new example of the benefits of using the DNA barcoding standards in taxonomy (Butcher et al., 2012;Hubert et al., 2008;Ratnasingham & Hebert, 2007, 2013 especially when confronted to such complex group and advocate toward covering the largest area possible to depict the diversity of species and complex of species. Indeed, our starting reference library (Durand et al., 2017) contained no specimens from the Indo-Australian archipelago despite a rather large taxonomic coverage. The occurrence of these new OTUs after adding specimens from these regions is not surprising given they are located in the Coral Triangle (CT) region, a region characterized by a large still not fully explored diversity of shore fishes (Allen & Erdmann, 2012). To date, no endemic species of grey mullets have been described in this region so far; our results suggest that regional species richness of grey mullets is underestimated. The very restricted distributions of these endemic species among the IAA call for a thorough investigation of the underlying mechanism that led to such pattern, making the mullets a great candidate to investigate the origin of endemism in the IAA region (Connolly, Bellwood, & Hughes, 2003;Hughes, Bellwood, & Connolly, 2002;Mora et al., 2003;Reaka, Rodgers, & Kudla, 2008). As mullets represent a target for local communities in this region, such high proportion of very restricted endemic species retrieved here is also of importance for conservation program.

| CON CLUS ION
DNA barcoding has prompted the development of a wide range of genomic tools. As concerning the use of bad taxonomy can be in ecology (Bortolus, 2008), the use of incomplete reference libraries can have dramatic consequences too. We demonstrated the benefits of working with curated libraries and proposed a workflow to minimize the effect of incomplete sampling on specimen identifications, proposing the iterative use of species delimitation and specimen assignment methods. This iterative approach showed that despite extensive effort to clarify the taxonomy of Mugilidae (Durand & Borsa, 2015;Durand et al., 2017;Shen & Durand, 2016;Xia et al., 2016), cryptic diversity can still be found in this group. Our inventory of the biodiversity of grey mullets in the IAA region also led to the discovery for the first time of a large portion of endemic species of mugilids with very restricted range size, a result of importance in term of both conservation and evolutionary process.

ACK N OWLED G M ENTS
The authors wish to thank Siti Nuramaliati Prijono, Bambang Mareen Wellenreuther and three anonymous reviewers for providing constructive reviews of earlier versions of the manuscript. This publication has ISEM number 2020-009-SUD.

CO N FLI C T O F I NTE R E S T
None declared.

DATA AVA I L A B I L I T Y S TAT E M E N T
These sequence and collateral data have been deposited in BOLD (projects BIFV and WPRFM) and are publicly available as a dataset