DNA Barcoding and geographical scale effect: The problems of undersampling genetic diversity hotspots

Abstract DNA barcoding identification needs a good characterization of intraspecific genetic divergence to establish the limits between species. Yet, the number of barcodes per species is many times low and geographically restricted. A poor coverage of the species distribution range may hamper identification, especially when undersampled areas host genetically distinct lineages. If so, the genetic distance between some query sequences and reference barcodes may exceed the maximum intraspecific threshold for unequivocal species assignation. Taking a group of Quercus herbivores (moths) in Europe as model system, we found that the number of DNA barcodes from southern Europe is proportionally very low in the Barcoding of Life Data Systems. This geographical bias complicates the identification of southern query sequences, due to their high intraspecific genetic distance with respect to barcodes from higher latitudes. Pairwise intraspecific genetic divergence increased along with spatial distance, but was higher when at least one of the sampling sites was in southern Europe. Accordingly, GMYC (General Mixed Yule Coalescent) single‐threshold model retrieved clusters constituted exclusively by Iberian haplotypes, some of which could correspond to cryptic species. The number of putative species retrieved was more reliable than that of multiple‐threshold GMYC but very similar to results from ABGD and jMOTU. Our results support GMYC as a key resource for species delimitation within poorly inventoried biogeographic regions in Europe, where historical factors (e.g., glaciations) have promoted genetic diversity and singularity. Future European DNA barcoding initiatives should be preferentially performed along latitudinal gradients, with special focus on southern peninsulas.


| INTRODUC TI ON
A decade and a half ago, DNA barcoding was presented as a novel system to provide wide-scale and quick species identification using certain gene sequences as molecular species-specific tags (Hebert, Cywinska, Ball, & de Waard, 2003). Since then, the number of species sequenced has increased exponentially, and DNA barcodes are available for almost 200K named species in international databases such as the Barcode of Life Data Systems BOLD (Ratnasingham & Hebert, 2007). DNA barcoding often allows for the identification of morphologically cryptic species and individuals at life stages difficult to determine morphologically (e.g., insect larvae) (Ahrens, Fabrizi, Šípek, & Lago, 2013;Bonal, Muñoz, & Vogler, 2011). It has boosted biodiversity inventories and environmental monitoring and constitutes a useful tool in taxonomy, ecology, agriculture, and conservation as well as for customs, police, food, and feed control (Bergsten et al., 2012;Jinbo, Kato, & Ito., 2011;Savolainen, Cowan, Volger, Roderick, & Lane, 2005).
The Barcoding of Life initiative constitutes a historical feat, but this success does not mean that the method is free of shortcomings (Bergsten et al., 2012;Berthier, Chapuis, Moosavi, Tohidi-Esfahani, & Sword, 2011;Dubey, Michaux, Brünner, Hutterer, & Vogel, 2009;Nicholls, Challis, Mutun, & Stone, 2012). One of them is the potential decline of identification accuracy as intraspecific genetic divergence increases along with the geographical scale (Meyer & Paulay, 2005;Bergsten et al., 2012, but see Lukhtanov, Sourakov, Zakharov, & Herbert, 2009. In this study, we approach whether this problem aggravates when genetic diversity hot spots are undersampled. In animals, a 648-bp section of the universal mitochondrial gene encoding for the protein cytochrome c oxidase subunit I (COI) has been adopted as the standard barcode (Hebert et al., 2003;Ratnasingham & Hebert, 2007). The logic behind DNA barcoding relies on the structure of genetic variability above and below the species level. Individuals of the same species display lower levels of genetic divergence among themselves compared with heterospecific individuals (Hajibabaei, Singer, Hebert, & Hickey, 2007;Hebert et al., 2003). Any genetic threshold used for identification of queries is arbitrary, ideally optimized for the dataset in question, and may for instance be 1%, 2%, or 3% (Collins & Cruickshank, 2012;Lemos, Fulthorpe, Triplett, & Roesch, 2011;Ratnasingham & Hebert, 2007). BOLD identification engine for instance uses 1% for species-level taxon assignment (Ratnasingham & Hebert, 2007). However, a plethora of methods have been proposed for identification of unknowns against a reference library, tree-based, distance-based, character-based, but few improve noticeably on a standard "best close match" sequence distance strategy (Spouge, 2016). Many of the more sophisticated methods are also too slow to be applicable to the growing needs of taxon assignment from the DNA metabarcoding community.
While the presence of pseudogenes (Berthier et al., 2011;Dubey et al., 2009), former hybridization, or incomplete lineage sorting (Nicholls et al., 2012) may mislead identification, one of the main caveats of DNA barcoding is not related with the evolutionary history of the genes, but with the geographical distribution of the samples. When the geographical scale increases, intraspecific divergence increases, and the distance to the closest related taxa decreases, which results in more ambiguous specimen identification (Bergsten et al., 2012).
The geographical scale effect on intraspecific divergence is based on the well-known concept of genetic isolation by distance (Wright, 1943), but the relationship between genetic divergence and distance may differ geographically. Taking Europe as an example, studies with different types of organisms have demonstrated that, far from being homogeneously distributed, genetic diversity is concentrated in certain areas of the continent (Avise, 2000;Hewitt, 1996;Schmitt, 2007). Thus, for a given spatial distance between the sampling sites of two DNA barcodes, the genetic distance could be higher if at least one of them comes from a genetic diversity hot spot.
In this study, we analyzed the geographical scale effect on intraspecific genetic distance using as study model a group of Heteroceran Lepidoptera (i.e., moths) whose caterpillars feed on oak (Quercus spp.) leaves. These moth species are widely distributed over most parts of Europe (Camus, 1936(Camus, -1954. We could thus download a high number of DNA barcodes from the public repository BOLD that were pooled in the analyses with newly sequenced Iberian samples. Previous reports on Lepidoptera have shown little intraspecific divergence at a large geographical scale between Central and Northern Europe (Huemer et al., 2014). In this study, we included individuals from the south of the continent to assess the effect of genetic diversity hot spots on intraspecific genetic distance and identification success. Our concrete objectives were as follows: 1. To analyze to which extent the availability of DNA barcodes is biased toward central and northern Europe.
2. To know whether, in pairwise sequence comparisons, for any given spatial distance, the genetic divergence is higher if at least one of the sequences comes from a southern European peninsula.
3. To reconstruct a COI gene tree to assess the geographical distribution of genetic diversity on the continent and the presence of monophyletic clades (intraspecific distinct lineages and potential cryptic species) exclusive of southern Europe.

| Study system and field sampling
The Heteroceran Lepidoptera (moths) whose caterpillars feed on Quercus spp. leaves were used as study model. There is a high diversity of lepidopterans linked to the Quercus genus as host plant.
The most abundant oak herbivore families are noctuids, tortricids, erebids, geometrids, and nolids, which start feeding on the new shoots in early April (Elkinton et al., 1996). The community of caterpillars also changes during the season, with tortricids the first to feed on the new shoots and geometrid species the last (Soria, 1988).
Herbivory damage on oak differs among years too, being able to have outbreaks under specific conditions of temperature and climate (Schroeder & Degen, 2008). These insects are widely spread over Europe and have been recorded feeding on different oak species (Führer, 1998;Soria, 1988). The number of DNA barcodes available for the southern European peninsulas (Balkans, Italy, and Iberia) at public repositories was very low compared with the rest of continental Europe (see below). Thus, we carried out a field campaign in Spain to reduce such an imbalance. We sampled 10 oak forests along a latitudinal gradient ( Figure 1) from April to June 2017, the period of maximum activity of oak defoliating caterpillars in the Iberian Peninsula (Soria, 1988). Oak branches were shaken and the falling caterpillars collected on a white cloth of a fixed surface placed beneath (see Ruiz-Carbayo, Bonal, Espelta, Hernández, & Pino, 2017 for a detailed description of the sampling methodology). The caterpillars collected at each oak were placed within a plastic box and taken to the laboratory, where they were housed individually in Petri dishes and fed with fresh oak leaves. Caterpillars were identified to species level based on morphological characters following guides and dichotomous keys (Gaytán, Canelo, González-Bornay, Pérez-Izquierdo, & Bonal, 2018;Gomez de Aizupura, 2002). In those few cases in which the caterpillar could not be identified, it was raised to the adult stage and then identified using guides (Fibiger, 1997;Goater, Ronkay, & Fibiger, 2003;Sihvonen & Skou, 2015). In total, 21 species of 9 families were collected (Table 1). All specimens (caterpillars and adults) were stored in 1.5-ml Eppendorf tubes filled with 96% alcohol for further molecular analyses.

| Molecular laboratory work
DNA of 384 individuals of different species and Iberian localities was extracted using commercial extraction kit (EZNA ® Tissue DNA Kit) according to manufacturer's instructions. For all of them, a fragment of the mitochondrial gene COI was amplified using the universal primer pair LCO1490/HCO2198, common in DNA barcoding (see Folmer, Black, Hoeh, Lutz, & Vrijenhoek, 1994 for details on the primer sequences and PCR protocols). Sequencing was performed using Big-Dye (Perkin-Elmer) technology and an ABI3700 sequencer. We obtained good quality DNA sequences for 375 individuals (21 species); the DNA from the remaining 9 individuals could not be further used due to sequencing failures. Sequence chromatograms were assembled, inspected individually, and edited using Sequencher 4.6 (Gene Codes Corp.). The final alignment length after trimming was 658 bp. To rule out the presence of nuclear mitochondrial insertions (numts), we translated all the sequences into amino acids using the software MacClade (Maddison & Maddison, 2005).
After translation, we did not find any stop codons indicating the F I G U R E 1 Geographical distribution of sampled localities including BOLD records (Europe: red; Iberian Peninsula: blue; and Italy: green) and new localities sampled for this study (yellow) presence of numts (no stop codons can be present in the intronless mitochondrial COI gene). Moreover, there were additional evidences against the presence of numts. In no case, the sequences of the same species were paraphyletic and intraspecific genetic divergence was lower than that expected if some of the sequences were numts ).

| Preparation of databases of DNA barcodes
Sequences of the 375 individuals collected during our field sampling were queried in BOLD search engine (Ratnasingham & Hebert, 2007) to double-check the species identity in those specimens previously determined according to morphology. We used the option All Barcode Records in BOLD, as all our sequences were longer than 500 bp and doing so the query sequences are compared with a larger number of reference barcodes. Besides, we mined BOLD searching for COI DNA barcodes of the same species that we had sequenced in the Iberian Peninsula; in total, we could download 277 public sequences (available in January 2018), which were used in further analyses (see Appendix S1 for BOLD process IDs). For the data downloaded from BOLD, we considered two sequences to come from the same locality when the name for that field site was identical in the database or when the coordinates showed that the specimens had been collected within a distance lower than 5 km. We chose this threshold because, given that these species of oak-feeding moths are not migratory and have limited dispersal abilities (Ruiz-Carbayo et al., 2017), individuals within this range may belong to the same population.

| Calculation of intraspecific genetic distance and geographical distance among populations
For each species, the maximum intraspecific genetic divergence and the geographical distance between localities were needed for further analyses (see below the section on statistical tests). For measuring genetic divergence, we first aligned all the sequences of each species separately using MUSCLE software (Edgard, 2004) as implemented in MEGA 7 (Kumar, Strecher, & Tamura, 2016;default values). For each separate species alignment, all pairwise genetic distances were calculated using the Kimura 2-parameter model (K2P%; Kimura, 1980) as implemented in MEGA 7 (Kumar et al., 2016); we used K2P% because this is the method more frequently used in DNA barcoding studies (Bergsten et al., 2012;Gunay, Alten, Simsek, Aldemir, & Linton, 2015;Shen, Guan, Wang, & Gan, 2016). The maximum genetic divergence for each pair of populations was the maximum genetic distance (K2P%) between any pair of individuals of those two populations. Pairwise geographical distances between populations were calculated using QGIS 2.18.9 (QGIS Development Team, 2009) and later corroborated using the cosine-haversine formula (Robusto, 1957).  Note: NA = not applicable parameter combination according to ABGD web version. Automatic barcode gap discovery (ABGD) uses genetic distances to identify a barcode gap between intra-and interspecific distances, but needs a prior on maximum intraspecific divergence. The prior is used to calculate an estimate of the population mutation rate parameter θ. This estimate is calculated as the average pairwise of all distances below the specified prior. The estimate of θ is used to set a distance limit above which there is a 5% risk of a barcoding gap being intraspecific, directly inferred from a coalescent model of a single nonstructured panmictic population. Only barcode gaps found above this distance limit are used to infer a partition. Gaps are inferred based on peak slope values of ranked ordered distances with a dynamic window size for calculating the slope. ABGD performs recursive partitioning, using the same prior, on each group from the initial partition until no further splitting occurs. Apart from the prior, which can be set to a range of values, the method requires parameter X to be specified which is the minimum gap width relative to any gap in the prior intraspecific divergence (default X = 1.5). X relates to the sensitivity of the method to the size of the barcoding gap relative to intraspecific gaps. Genetic distances were calculated under two models (JC69 and K80).

FAMILY SPECIES
Under the K80 model, we used two values (2 and 4) on the transition/transversion ratio (TS/TV) after this parameter was estimated for the dataset to 3.06 (95% HPD: 2.66-3.43) in a MrBayes 3.2.6 run under a HKY85 model with one million MCMC generations. We explored four values for X (0.5, 1.0, 1.5, and 2.0) and ten steps in prior space between 0.005 and 0.05. The ABGD analyses yielded very stable results under different models used to calculate distances. At prior maximum intraspecific divergence values from 0.03 and above, all analyses divided the dataset into 17 groups, corresponding to named species. At lower prior values (0.023-0.005) between 19 and 25 groups were delimited and the recursive partitioning function split the data into 2-3 additional groups compared to the primary partition. This is indication of a dataset with multiple distance thresholds throughout included taxa. Primary partitions make a jump from 22 to 17 delimited groups somewhere between a prior value of 0.008 and 0.018, depending on the gap width parameter. The recursive partitioning makes a smoother transition over this prior space with an intermediate number of delimited groups from 24 at 0.008 to 20 at 0.018. At the default value of X (1.5) and at a prior value of 0.01 indicated by Puillandre et al. (2016) as a suitable value to match species hypotheses based on independent data across a majority of tested empirical datasets, 23 groups were delimited. This partitioning scheme involved subdivision of Euproctis chrysorrhoea (2 groups), Peribatodes ilicaria (2 groups), Ennomos quercaria (2), Catocala nymphagoga (2 groups), and Colotois pennaria (3 groups). In the maximum division at the lowest prior value (0.005) also Watsonalla uncinula was subdivided into 2 groups and Ennomos quercaria into

| Molecular single-gene species delimitation
The sequences of each species alignment were collapsed into unique haplotypes using the online.fasta sequence toolbox FaBox (Villesen, 2007). We discarded four species either because no sequences were available from any of the southern European peninsulas (Orthosia cerasi) or elsewhere on the continent (Eupithecia massiliata, Phycita torrenti, and Dryobota labecula). We removed them because those missing data precluded any pairwise comparison between populations from at least one southern Peninsula and the rest of the continent. Our final database included a total of 18 taxa, 17 species plus the subspecies Colotois pennaria ssp. carbonii, which was treated as distinct species in the analyses on intraspecific genetic divergence (Table 1). We did so to be conservative and avoid inflating intraspecific genetic divergence in C. pennaria. The ssp. carbonii is morphologically distinct (Sihvonen & Skou, 2015), and so the intraspecific divergence would be expected to be higher than in those species in which no subspecies have been described  (Fujisawa & Barraclough, 2013;Pons et al., 2006). Some OTUs can correspond to cryptic species or isolated genetic lineages; thus, this sort of analyses may well illustrate the patterns of geographical genetic variability of these insects in Europe. GMYC requires an ultrametric gene tree as input, which was built using the software Bayesian Evolutionary Analysis Sampling Trees (BEAST  (Fujisawa & Barraclough, 2013). The multiple-threshold GMYC relaxes the assumption that speciation events must be older than all coalescent events in the gene tree. The method iteratively splits and fuses existing species clusters starting from the single-threshold solution, until no further improvement in the maximum likelihood occurs (Fujisawa & Barraclough, 2013).
We performed both and compared the results in terms of the number of OTUs delimited. Besides these tree-based methods, we used two distance-based approaches for delimiting the number of OTUs, namely ABGD (Automatic Barcode Gap Discovery) (Puillandre, Lambert, Brouillet, & Achaz, 2016) and jMOTU (Jones, Ghoorah, & Blaxter, 2011). ABGD uses genetic divergences between sequences and a coalescent model to identify a barcode gap between intra-and interspecific distances and defines OTUs accordingly (Puillandre et al., 2016). jMOTU takes one, or a range of, user-supplied distance cutoff values and clusters sequences using single-linkage clustering so that no member of one OTU is closer than the cutoff to any member of any other cluster (Jones et al., 2011). A detailed explanation of the basis and principles of each method is provided in Table 2 and

| Statistical analyses
Our main goal was to assess whether the inclusion of sequences from genetic diversity hot spots (in our case the Iberian and Italian Peninsulas) increases intraspecific genetic divergence. We did not consider the Balkan Peninsula, as barely any sequences were available in BOLD and we had not sampled in that geographic region.
In order to test whether southern Europe was underrepresented in the Barcode of Life Data System relative to its species richness, we checked the geographical distribution of all the study species at the GBIF website (GBIF.org, 2017). The study species are common ones and we could thus assess their distribution range reliably on GBIF records. In parallel, we checked another database (Lepidoptera Mundi, lepidoptera.eu) based on records and bibliographical data to confirm the species geographical distribution. We took the southernmost and northernmost European records for each species of the study group and assumed that these were the limits of its geographical distribution in Europe; in between them, the species would be present. We then counted to which extent the number of species European population (not Italian) and one Iberian (abbreviation F I G U R E 2 Number of molecular operational taxonomic units (MOTU) delimited by jMOTU 4.1 over a range of cutoff values in unit of base pair differences. jMOTU is a distance-based method that calculates pairwise distances between sequences and then clusters the sequences into MOTUs using single-linkage clustering for each user-supplied cutoff value (Jones et al., 2011). Resulting clusters are not affected by input sequence order and no members of a given OTU will be closer than the cutoff to any member of any other cluster. The Java program is speed-optimized by a preclustering step of exact subsequences and can therefore (in contrast to ABGD) be run on a reduced haplotype-collapsed dataset without any effect on the final clusters. jMOTU does not use a substitution model that corrects for homoplastic substitutions but calculates only exact base pair differences. We ran the jMOTU analysis with the haplotype-collapsed dataset (cropped to 624 bp and filtered for sequences >500 bp long due to the sensitivity of jMOTU to sequence length variation in the dataset (see Stahlhut et al., 2013, https://doi.org/10.1186/1472-6785-13-2), which we also noted), a low BLAST identity filter of 95%, and a percentage of minimum sequence length overlap of 90% following the software manual's recommendations. We calculated how the number of delimited MOTUs varied over a cutoff range between 3 and 23 bp, which for 624 bp equals to 0.5%-3.7% in genetic divergence. At all cutoff values above 18 bp (>2.9%), jMOTU delimits 17 clusters corresponding to named species, similar to ABGD at prior maximum intraspecific divergence values from 0.03 and above (See Table 2). At decreasing cutoff values from 17 to 6 bp (2.7%-1.0%), there is a gradual increase in the number of delimited clusters from 18 to 26 MOTUs. The transition mirrors well the recursive partitioning results from ABGD from 19 to 25 clusters in prior space between 0.023 and 0.005 (See Table 2). The 25 clusters delimited by jMOTU at a 7 bp (1.1%) cutoff are identical to the 25 units delimited by single-threshold GMYC and differ from the 25 groups delimited by ABGD at a prior of 0.005 only by the same two exceptions as between GMYC and ABGD: Ennomos quercaria is subdivided into two and not three groups and instead Tortricodes alternella is subdivided into two groups We performed three LMM tests: the first one to assess whether the genetic divergence differed between EUEU and EUIB pairwise population contrasts; the second to calculate the same but between EUEU and EUIT; and the third one to assess it within the same geographical area (EUEU vs. IBIB contrasts). In all the analyses, the genetic divergence (measured as K2P% distance) was the dependent variable and the type of population contrast the independent factor; the pairwise spatial distance between populations was the covariate. Additionally, the largest number of sequences at each pairwise comparison between populations was also included as covariate to control for the potential effect that sample size could have on genetic divergence. In the EUEU versus IBIB analysis, the spatial range was reduced to 1,000 km, as the maximum distance between any pair of Iberian populations was lower than that. In the three tests, the species of Lepidoptera was included as a random factor.

| Geographical distribution of DNA barcodes in Europe
The geographical distribution of the DNA barcodes of the study species showed a strong bias with an underrepresentation of southern Europe. If the barcodes available in BOLD are divided between those from countries located at southern European Peninsulas (Italy, Iberia, and the Balkans) and the rest of the continent, the number is clearly imbalanced (73 vs. 204). Germany is the country in which most DNA barcodes are available (60), followed by some of its neighbor countries (Austria, Czech Republic, France, and the Netherlands) with more than 15 each (Table 3). The highest number in the northern countries is not related to their area.   The higher intraspecific genetic divergence in the pairwise comparisons including Iberian populations (EUIB and IBIB) held true also when the spatial distance was included as a covariate. In the analyses comparing EUEU versus EUIB contrasts, the distance between localities had a significant positive effect on genetic divergence (Table 5; Figure 4). However, the type of contrast (EUEU or EUIB) had an independent and significant effect too as, for any pairwise spatial distance, the average genetic divergence was higher if one of the two populations compared was Iberian (EUIB divergence higher than EUEU) (Table 5; Figure 4). The same happened in the analyses of EUEU versus IBIB over a maximum range of 1,000 km. The genetic divergence between Iberian populations (IBIB) was higher than in the pairwise comparisons between European ones (EUEU) independently of the effect of the geographical distance between localities (Table 5; Figure 5). In the analyses comparing EUEU versus EUIT contrasts, the distance between localities had a significant positive effect on genetic divergence too (Table 5; Figure 4). The effect of including Italian populations was not as high as in the case of Iberia, but the pairwise genetic divergence was higher between EUIT contrasts than between EUEU ones (Table 5; Figure 4). As expected, the maximum sample size of each pairwise comparison had a significant effect in all LMMs but, after controlling for it, the type of contrast remained highly significant in all cases.

| Intraspecific genetic divergence and biogeography
The species delimitation using the GMYC single-threshold model defined 25 operational taxonomic units (OTUs) or putative species (Table 6), of which 18 corresponded to recognized Linnean species or (1) subspecies ( Figure 6). The rest correspond to Iberian haplotype clusters classified as independent OTUs mostly within those species with a maximum intraspecific genetic divergence over 2% ( with respect to the rest of Europe was higher than in the case of Italy. There were fewer Italian OTUs/monophyletic clades and more haplotypes were shared between Italy and the countries north of the Alps than between Iberia and the rest of the continent (Figure 6).
Haplotype sharing is also frequent between populations located in the central and northern European countries ( Figure 6).
The number of putative species delimited by the multiple-threshold GMYC largely exceeded that of the single-threshold model (Table 6). All the putative species delimited by the latter were also different OTUs according to the former. However, according to the multiple-threshold GMYC, more Iberian monophyletic clades were classified as different putative species (Figure 7). Both ABGD and jMOTU produced very similar result to the GMYC single-threshold approach, both in terms of the number and in terms of identity of the OTUs retrieved (Table 6). jMOTU delimited the exact same 25 OTUs as single-threshold GMYC at a cutoff value of 7 bp (equivalent to 1.1% genetic divergence) ( Table 6; Figure 2). ABGD delimited 23 of the same 25 OTUs at a recommended prior value on maximum intraspecific divergence of 0.01, the sole difference being no subdivision of T. alternella and W. uncinula (Tables 2 and 6).

| D ISCUSS I ON
The number of DNA barcodes for oak-feeding Lepidoptera is lower in southern Europe, despite the higher species richness. As expected, the effect of the geographical scale on the genetic divergence depended on the latitude. In pairwise sequence comparisons, for any given spatial distance the genetic divergence was higher when at least one of the sequences came from one of the southern European peninsulas included in the study (Italy and Iberia). This made identification of some southern query sequences problematic, as the genetic distance with respect to the reference barcodes in BOLD was above the maximum intraspecific threshold allowed. The effect of the latitude is due to the presence of southern haplotypes with a reduced geographical distribution. Accordingly, the COI gene TA B L E 5 Results of the Linear Models testing the effects on the intraspecific genetic distance (dependent variable) of the pairwise spatial distance between populations (Dis.Geo, covariate) and the type of pairwise geographic comparison between populations (Compar, independent factor), considering the effect of maximum sample size (Nmax, covariate  (Gemeinholzer et al., 2011).
Previous studies have shown that the larger the sampling scale, the greater the intraspecific divergence and the more likely finding overlapping closely related taxa (Bergsten et al., 2012). In this study, we did not analyze the effects on the barcoding gap, because the closest relatives of the study species often feed on other host plants (Cates, 1981;Thompson & Pellmyr, 1991). Rather, we focused on intraspecific genetic divergence but considering not only the effect of the spatial distance alone, but its interaction with the latitude.
Doing so we found that intraspecific genetic divergence was higher in pairwise comparisons that included at least one DNA sequence from a southern peninsula than when both came from elsewhere in the continent. The peninsulas of southern Europe are hot spots of species and genetic diversity (Geiger et al., 2014;Murienne & Giribet, 2009;Pinto et al., 2012); thus, when they are undersampled, intraspecific genetic divergence is underestimated more than expected by the mere reduction of the geographical scale. The low availability of DNA barcodes or their reduced geographic distribution is a main concern in DNA barcoding (Bergsten et al., 2012;Dincă et al., 2015;Geiger et al., 2014;Savolainen et al., 2005). Our results show that, to capture as much intraspecific genetic variability as possible, sequencing efforts should be concentrated in southern F I G U R E 5 Relationship between geographic distance (x-axis, km) and intraspecific genetic divergence (y-axis, K2P%) (Mean ± SD) in pairwise contrasts between Iberian populations (IBIB, green line) and only European populations (EUEU, red line)

TA B L E 6
Subdivided taxa and number of operational taxonomic units defined by GMYC (single and multiple threshold), jMOTU (at a cutoff value of 1.1%) and ABGD (at a 0.01 prior on maximum intraspecific divergence and gap width (X) as default) See Table 6 and Figure 2 for an exploration of the effect of different parameter values in ABGD and jMOTU analyses European genetic diversity hot spots (Dincă et al., 2015;Murienne & Giribet, 2009;Pinto et al., 2012) where, paradoxically, the number of available DNA barcodes is lower.
The disproportionately strong effect of the Iberian samples is largely related with the distribution patterns of genetic diversity determined by Pleistocene glaciations (Hewitt, 1996;Schmitt, 2007). The southern European Peninsulas were refugia that hosted a large number of plant and animal taxa when the ice sheet covered large extensions of the continent. This was the case of our study group (insects associated with oaks and other species of broad-leaved trees). In the Iberian Peninsula, where a greater geographic isolation is observed than, for example, in the Italian peninsula, deciduous and evergreen oak forests were restricted to a few refugia close to the coast or at the south-facing slopes of some mountainous systems (Koster, 2005;Magri et al., 2007).
When the ice retreated, not all haplotypes spread northwards but just some of them. This "founder effect" is responsible for the higher species richness in the south and the genetic homogeneity of the recently colonized areas in the central and northern parts of the continent (Hewitt, 1996(Hewitt, , 1999Taberlet, Fumagalli, Wurst-Saucy, & Cosson, 1998 (Dincă et al., 2015;Geiger et al., 2014). Some alpine butterflies, for example, show higher intraspecific genetic distance among the populations in the Alps and the nearby Pyrenees than among the Alps and Scandinavia (Dincă et al., 2015). However, it is well known that the Alps have a deep impact on the genetic variation of other groups of animals (Arntzen, 2001;Cornetti et al., 2016;Leys, Keller, Räsänen, Gattolliat, & Robinson, 2016).
Four Iberian clades were retrieved as different OTUs by the GMYC single-threshold model, which confirmed its utility for species delimitation within poorly inventoried biogeographic regions in Europe.
According to Fujisawa and Barraclough (2013), GMYC single-threshold model is more reliable than the multiple thresholds one, which overestimates the number of OTUs. Moreover, ABGD and jMOTU retrieved identical or very similar results to the single-threshold GMYC, while the latter was considered the most reliable of the three in a comparison of performances (Ratnasingham & Herbert, 2013).
The presence of different putative species in southern Europe conditions the efficacy of species identification on the basis of DNA barcoding (Derkarabetian & Hedin, 2014;Dincă et al., 2015;Fossen, Ekrem, Nilsson, & Bergsten, 2016;Geiger et al., 2014). In the hypothetical case that there had not been any Iberian barcode in BOLD, in 7 out of 15 species (possible comparisons between Iberia   and Europe, Table 2), there would have been at least one haplotype that would have not been determined to the species level (EUIB K2P distance above 1%). Even including the Iberian barcodes available in BOLD before the present study, the same still happened in three species. The case of T. alternella is specially remarkable: from 16 new haplotypes recorded in this study, none of them could not be matched to any reference sequence in BOLD using the 1% threshold (Table 4). This lack of identification due to the absence of Iberian reference sequences is not exclusive of the study species, having been reported for other insect taxa as well (e.g., Cerambyx cerdo, Coleoptera) (Torres-Vila & Bonal, 2019).
If well the present dataset shows a clear trend, we have to be cautious before generalizing, as it is restricted to a limited number of species of oak-feeding moths. The occurrence of putative cryptic species in southern Europe in other taxa (Dincă et al., 2015;Geiger et al., 2014) suggests that the pattern may be widespread, but further studies are needed to confirm it. Future large-scale DNA barcoding initiatives in Europe should cover latitudinal gradients, rather than large distances at the same latitude, to avoid neglecting genetic diversity hot spots like the Iberian Peninsula.

RB was supported by a contract of the Atracción de Talento
Investigador Programme (Gobierno de Extremadura TA13032).
This work was financed by the projects PPII-2014-01-P from Junta de Comunidades de Castilla-La Mancha/European Social Fund and AGL2014-54739-R from Spanish Ministry of Economy and Competitiveness and the European Social Fund. We thank the comments of three anonymous reviewers that helped to improve this work.

CO N FLI C T O F I NTE R E S T
The authors declare that there is no conflict of interests regarding the publication of this article.