The impact of genetic diversity on the accuracy of DNA barcoding to identify species: A study on the genus Phellodendron

Abstract DNA barcoding is widely used in species identification, but there is considerable controversy regarding the extent of sampling required. Some scholars have proposed that small sample sizes underestimate intraspecific genetic diversity, which would reduce the accuracy of DNA barcoding for species identification. In this study, we selected all Phellodendron species (including P. amurense Rupr., P. chinense Schneid., and P. chinense var. glabriusculum Schneid.) as materials, collected 59 P. amurense samples from 35 populations to represent the genetic diversity as fully as possible, and analyzed haplotypes, genetic distances, barcoding gaps, and Neighbor-Joining (NJ) trees based on psbA-trnH and internal transcribed spacer (ITS) sequences. Additionally, a sampling simulation was conducted to assess the correlation between genetic diversity and the number of populations. Finally, an analysis of critical geographical populations was performed. Based on the analyses of haplotypes, genetic distances, barcoding gaps, and NJ trees, we found that eight P. amurense samples affected the effectiveness of DNA barcoding and that their genetic information was very important for identifying Phellodendron species. Moreover, the NJ trees built from small-scale P. amurense samples did not completely match the phylogenetic relationships in Phellodendron. In the simulation sampling analysis, the dispersion of the genetic diversity indexes at the same population level gradually decreased and stabilized as the number of simulated sampling populations increased. We found that 1–2 samples from over 24 populations, chosen on the basis of uniform geographical distribution, could represent 80% of the genetic diversity of P. amurense and ensure the authenticity and reliability of DNA barcoding. Thus, we propose that adequate sampling covering infraspecific genetic diversity is particularly important for ensuring the identification accuracy of DNA barcoding.


| INTRODUCTION
DNA barcoding, a technique for species identification and characterization based on one or more common, standard DNA sequences, is widely used in biodiversity surveys and inventories, species identification, and the discovery of new species (Hebert, Cywinska, Ball, & Dewaard, 2003; Hollingsworth, Graham, & Little, 2011; Kress, Wurdack, Zimmer, Weigt, & Janzen, 2005). Cytochrome oxidase I (COI) gene sequences from the mitochondrial genome can be used to identify multiple animal groups, making COI an ideal animal barcode sequence (Tavares & Baker, 2008; Ward, Zemlak, Innes, Last, & Hebert, 2005). For plants, no single barcode is as universal as the animal COI gene. For example, Yao, Song, and Ma (2009) reported that the psbA-trnH sequence can be used to identify Dendrobium species. Liu, Zhang, et al. (2012) reported 100% identification using the psbA-trnH sequence in 38 species from the genus Rhododendron. Li, Chen, Wang, and Xiong (2012) found that the identification efficiency of the ITS sequence was 70% when examining 63 species in the genus Ficus. Yang, Wang, Möller, Gao, and Wu (2012) used ITS + psbA-trnH sequences to identify 30 species from the genus Parnassia with 90% efficiency.
However, almost all DNA barcoding studies have neglected the genetic diversity within each species, as reflected in their sampling strategies. The International Barcode of Life Project calls for at least 10 samples for each species. When funding is limited, obtaining sequences for more species means decreasing the number of samples per species (Matz & Nielsen, 2005). Currently, most species have only 5-10 sequences in the established DNA barcode database, and several species have only 1-2 sequences (http://www.barcodinglife.org/views/ligin.php), which is far from sufficient to estimate the number of samples needed for a database to represent the genetic diversity of each species. The efficiency and accuracy of DNA barcoding depend on the degree of sampling per species, because a sufficiently large sample size is needed to provide a reliable estimate of genetic polymorphism and to delimit species (Luo et al., 2015). Incomplete sample surveys, errors in sample identification, and weak taxonomy are obstacles (Young, McKelvey, Pilgrim, & Schwartz, 2013). A small range and unevenness in sampling could cause differences in the threshold between intraspecific and interspecific variation (Meyer & Paulay, 2005). We believe that the representation of genetic variation in most studies is deficient, as many studies collect and analyze only 1 or 2 samples per species. Therefore, we chose the genus Phellodendron as a case study to discuss the impact of genetic diversity on the accuracy of DNA barcoding for species identification.
glabriusculum Schneid., are distributed in China (Huang, 1997). As relict plants of the Tertiary paleotropical flora, Phellodendron species have scientific value for studying ancient flora, paleogeography, and Quaternary glacial climate (Huang, 1958). Many wild populations have been drastically reduced, especially since the late 19th century, because the cortex is a precious traditional Chinese medicine and the wood is widely used for its hard texture and beautiful grain and color (Qin, Wang, & Yan, 2006).
The endangered plant P. amurense is distributed in northeastern China, and P. chinense is distributed in southwestern China (Huang, 1997; State Bureau of Environmental Protection of China & Institute of Botany, Chinese Academy of Sciences, 1987); the gap between their distribution areas is approximately 1,000 km. Previous genetic studies mainly focused on intraspecific genetic diversity and concerned the species in China (Wang, Bao, Wang, & Ge, 2014; Yan, Zhang, Zhang, & Yu, 2006; Yu et al., 2013); however, interspecific and intergeneric analyses have not been reported in the genus Phellodendron. The genus Phellodendron is not taxonomically controversial, and its species are geographically isolated.
Thus, it is an ideal study system. To assess the impact of genetic diversity on DNA barcoding, we selected the genus Phellodendron as a model and collected numerous samples to represent intraspecific diversity in this study.
In addition, previous data showed no variation in rbcL or matK among individuals of Phellodendron, whereas psbA-trnH and ITS were polymorphic. Thus, we assessed the accuracy of DNA barcoding for identifying Phellodendron species using ITS and psbA-trnH.

| Plant materials
We collected 1 or 2 samples from each population, based on the view that DNA barcode variation within a population is usually less than that between populations (Liu, Provan, Gao, & Li, 2012). In total, 59 P. amurense cortex samples were densely collected from 35 populations throughout the distribution area, which gave us a large and representative sample set to ensure the credibility of this study.
The survey found that P. chinense and P. chinense var. glabriusculum populations included rare wild populations as well as cultivated ones; thus, we collected one wild population from P. chinense and P. chinense var. glabriusculum. Fourteen P. chinense and P. chinense var. glabriusculum samples were collected from eight populations to ensure that the samples represented the entire distribution area.

Highlights
• Used a case study to demonstrate the impact of genetic diversity on the identification accuracy of DNA barcoding.
• Analyzed the relationship between sample size and genetic diversity parameters of P. amurense by simulation sampling.
• Proposed that adequate sampling covering the infraspecific variation of a species is the key to DNA barcoding.

| DNA extraction and amplification
All samples (40 mg) were ground for 2 min at a frequency of 30 r/s. Total genomic DNA was extracted using the Plant Genomic DNA Kit (Tiangen Biotech Co.) according to the manufacturer's instructions.
The extracted genomic DNA was amplified by polymerase chain reaction (PCR) using the ITS (ITS5F and ITS4R) and psbA-trnH (fwdPA and revTH) primers (Chen, 2012). The PCR mixtures and conditions were as described by Chen (2012). PCR products were separated and detected by 1.5% agarose gel electrophoresis. Purified products were sequenced in both directions using the PCR primers on a 3730XL sequencer (Applied Biosystems).

| Statistical analysis
Sequences were assembled and aligned with CodonCode Aligner 3.7.1 (CodonCode Co.), and base quality was evaluated to avoid technical errors. The inter- and intraspecific genetic distances and the barcoding gap were analyzed with the P Language based on Kimura's 2-parameter model (Tamura et al., 2011). Variable sites were identified and the bootstrap Neighbor-Joining (NJ) tree was constructed with MEGA (version 4.0) according to Kimura's 2-parameter method with 1,000 bootstrap replicates (Tamura et al., 2011).
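As an illustration of the distance measure named above, the following minimal sketch computes the Kimura 2-parameter (K2P) distance between two aligned sequences, d = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of transitional and transversional differences. It is not the software used in this study; the function name `k2p_distance` and the pairwise-deletion treatment of gaps and ambiguous bases are assumptions made for the example.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1, seq2):
    """Kimura 2-parameter distance between two aligned sequences.

    Assumption for this sketch: sites with a gap or an ambiguous base in
    either sequence are skipped (pairwise deletion).
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguity codes
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1    # A<->G or C<->T
        else:
            transversions += 1  # purine<->pyrimidine
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    # math.log raises ValueError if the sequences are too divergent
    # for the K2P correction to be defined.
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))
```

Identical sequences give a distance of zero, and for a given number of differences a transition and a transversion contribute slightly different corrected distances, which is the point of the two-parameter model.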
To assess the relationship between the number of populations and the genetic diversity of P. amurense, a sampling simulation was conducted in this study. In total, 3, 6, 9, 12, 15, 18, 23, 28

| Haplotype analysis
The GenBank accession numbers for the ITS and psbA-trnH contig sequences of all samples in this study are shown in Table 1. The average base quality value (QV) of each forward or reverse sequence was ≥30, and the forward and reverse sequences agreed 100%. The haplotypes and variable sites in the ITS sequence are shown in Table 2. The QV of the variable sites was verified as ≥30 by tracing the sites back to the original sequences. The data revealed that P. amurense had six haplotypes, P. chinense had four haplotypes, and P. chinense var. glabriusculum had three haplotypes. P. chinense differed from P. amurense and P. chinense var. glabriusculum by a T at bp 173; likewise, P. chinense var. glabriusculum differed from P. amurense and P. chinense by a T at bp 208, except in haplotypes A12 and A13.
The haplotypes and variable sites in the psbA-trnH sequence are shown in Table 3.
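The haplotype and variable-site tabulation described in this section can be sketched as follows. This is only an illustrative outline under stated assumptions, not the procedure used in the study: the function names `collapse_haplotypes` and `variable_sites` are hypothetical, the input is assumed to be a dictionary of already aligned sequences of equal length, and gaps are treated as an ordinary character state.

```python
def collapse_haplotypes(aligned):
    """Group sample names by identical aligned sequence.

    aligned: dict mapping sample name -> aligned sequence (equal lengths).
    Returns a dict mapping each distinct sequence -> list of sample names.
    """
    haplotypes = {}
    for name, seq in aligned.items():
        haplotypes.setdefault(seq.upper(), []).append(name)
    return haplotypes


def variable_sites(aligned):
    """Return 1-based alignment positions at which more than one state occurs."""
    seqs = [s.upper() for s in aligned.values()]
    return [i + 1 for i, column in enumerate(zip(*seqs)) if len(set(column)) > 1]


# Toy, made-up sequences purely for illustration (not data from this study).
toy = {"s1": "ACGTAC", "s2": "ACGTAC", "s3": "ATGTAC"}
toy_haplotypes = collapse_haplotypes(toy)
toy_sites = variable_sites(toy)
```

In this toy alignment the three samples collapse into two haplotypes, and the only variable site is position 2.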

| Genetic distance and barcoding gap analysis
Six parameters were used to analyze intraspecific variation and interspecific divergence for the two barcodes (Table 4). The maximum intraspecific distance was higher than the minimum interspecific distance for both barcodes, which indicated that neither barcode performed well in discriminating Phellodendron species.
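The comparison of the maximum intraspecific distance with the minimum interspecific distance can be expressed compactly. The sketch below is a simplified illustration rather than the analysis pipeline of this study; the function name `barcoding_gap`, the data layout (pairwise distances keyed by unordered sample pairs), and the toy values are all assumptions. It expects at least one within-species pair and one between-species pair.

```python
from itertools import combinations


def barcoding_gap(distances, species):
    """Compare intra- and interspecific pairwise distances.

    distances: dict mapping frozenset({sample_a, sample_b}) -> distance.
    species:   dict mapping sample name -> species label.
    Returns (max_intraspecific, min_interspecific, gap_exists).
    """
    intra, inter = [], []
    for a, b in combinations(sorted(species), 2):
        d = distances[frozenset((a, b))]
        if species[a] == species[b]:
            intra.append(d)
        else:
            inter.append(d)
    max_intra = max(intra)
    min_inter = min(inter)
    # A barcoding gap exists only if every interspecific distance
    # exceeds every intraspecific distance.
    return max_intra, min_inter, min_inter > max_intra


# Toy, made-up distances purely for illustration (not data from this study).
toy_species = {"a1": "sp_A", "a2": "sp_A", "b1": "sp_B"}
toy_distances = {
    frozenset(("a1", "a2")): 0.010,
    frozenset(("a1", "b1")): 0.020,
    frozenset(("a2", "b1")): 0.005,
}
toy_result = barcoding_gap(toy_distances, toy_species)
```

In the toy data the largest within-species distance (0.010) exceeds the smallest between-species distance (0.005), so no gap is reported, which mirrors the overlap situation described in the text.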
The barcoding gap represents the separation between interspecific and intraspecific variation and shows whether the distributions of intra- and interspecific distances are separate or overlapping. In this study (Figure 2), the ITS sequence did not exhibit a gap between the intra- and interspecific variation distributions. In contrast, the psbA-trnH sequence displayed an indistinct barcoding gap, with overlapping intra- and interspecific variation distributions. Through calculation and tracing back to the samples, we found that eight P. amurense samples, carrying haplotypes B2 and B3, fell in the overlapping region.

| NJ tree analysis
When we performed the NJ tree analysis using small sets of randomly selected P. amurense samples, two typical NJ tree patterns emerged. One pattern was reciprocal monophyly (Figure 4), in which the members of P. amurense and of the other species each shared a unique common ancestor. The other was paraphyly (Figure 5), in which P. amurense is monophyletic but nests within another recognized species. Therefore, adequate sampling to cover infraspecific variation is essential for DNA barcoding.

| Genetic diversity parameters in the simulation sampling analysis
Since psbA-trnH performed better than ITS for identifying the is used to perform nonlinear regression (curve fit), which satisfied the requirements in this study. The haplotype discovery curve (HDC) is presented in Figure 6 with the theoretical equation f(x) = 7.072x/(5.756 + x) and r² = .8082. The results showed that the number of haplotypes (H) gradually increased with the number of simulated sampling populations and approached the level of the overall sample, as shown in Figure 6. The scatter plots of population number against haplotype diversity (Hd) and nucleotide diversity (Pi) are shown in Figures 7 and 8, respectively. These plots show that the dispersion of the genetic diversity indexes at the same population level gradually decreased and stabilized as the number of simulated sampling populations increased.
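To make the idea of simulation sampling concrete, the following sketch repeatedly draws a fixed number of populations at random, pools their samples, and records the number of haplotypes and the haplotype diversity. The exact simulation design of this study is not reproduced here: the function names, the use of sampling without replacement, the number of replicates, and the toy data are assumptions, and haplotype diversity is computed with the standard formula Hd = n/(n - 1) × (1 - Σ p_i²).

```python
import random
from collections import Counter


def haplotype_diversity(haplotypes):
    """Haplotype diversity Hd = n/(n-1) * (1 - sum(p_i^2)) for a list of labels."""
    n = len(haplotypes)
    if n < 2:
        return 0.0
    counts = Counter(haplotypes)
    return n / (n - 1) * (1 - sum((c / n) ** 2 for c in counts.values()))


def simulate_sampling(populations, k, replicates, seed=0):
    """Draw k populations at random (without replacement) and summarize them.

    populations: dict mapping population name -> list of haplotype labels.
    Returns a list of (number_of_haplotypes, haplotype_diversity) tuples,
    one per replicate.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    names = sorted(populations)
    results = []
    for _ in range(replicates):
        chosen = rng.sample(names, k)
        pooled = [h for name in chosen for h in populations[name]]
        results.append((len(set(pooled)), haplotype_diversity(pooled)))
    return results


# Toy, made-up populations purely for illustration (not data from this study).
toy_populations = {"p1": ["B1", "B1"], "p2": ["B1", "B2"], "p3": ["B3"]}
toy_runs = simulate_sampling(toy_populations, k=2, replicates=10)
```

Plotting the spread of such replicate values against k is one way to visualize how the dispersion of a diversity index narrows as more populations are included.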

| Theoretical key number of sample sizes in the simulation sampling analysis
The number of haplotypes (H) was a pivotal criterion for estimating the genetic diversity of P. amurense in simulation sampling. We arrived at the haplotype discovery curve (HDC) in Figure 6 with the theoretical equation f(x) = 7.072x/(5.756 + x). We focused on the following two theoretical key parameters for sample sizes in this study: (a) the threshold of sample sizes where new haplotypes become considerably more difficult to identify with extra sampling efforts (the first-order derivative of the HDC curve is equal to zero) and (b) the sample size that includes the majority of haplotypes (indicating that 80% of haplotypes are found).
The theoretical key parameters are presented in Figure 6, calculated from the theoretical equation of the HDC.
The figure showed that the theoretical threshold of the sample size is 5.756, which meant that discovering a new haplotype becomes much harder beyond about six populations in the simulation sampling.
Furthermore, 80% of P. amurense haplotypes could be obtained at a theoretical sample size of 23.024 populations according to the theoretical equation, that is, with no fewer than 24 populations in actual sampling. To obtain a higher percentage of P. amurense haplotypes, the sample size should be increased.
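The two key numbers can be checked directly from the fitted equation f(x) = 7.072x/(5.756 + x). Setting f(x) equal to a fraction q of the asymptote 7.072 and solving for x gives x = q × 5.756/(1 - q); for q = 0.8 this is 4 × 5.756 = 23.024, which rounds up to 24 populations. The same algebra shows that x = 5.756 is the point at which the curve reaches half of its asymptote. The short sketch below reproduces this arithmetic; the constant and function names are chosen for illustration only.

```python
import math

A = 7.072  # asymptotic number of haplotypes in the fitted equation
B = 5.756  # second fitted constant; f(B) equals A / 2


def hdc(x):
    """Haplotype discovery curve f(x) = A*x / (B + x)."""
    return A * x / (B + x)


def populations_for_fraction(fraction):
    """Solve A*x/(B + x) = fraction * A for x, i.e. x = fraction*B/(1 - fraction)."""
    if not 0 < fraction < 1:
        raise ValueError("fraction must be strictly between 0 and 1")
    return fraction * B / (1 - fraction)


x_80 = populations_for_fraction(0.8)  # theoretical number of populations
n_80 = math.ceil(x_80)                # whole populations needed in practice
```

Because the required number of populations grows as q/(1 - q), capturing 90% of the haplotypes under the same equation would need 9 × 5.756 populations, more than twice the number needed for 80%.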

| Critical analysis on populations
Since psbA-trnH performed better than ITS for identifying Phellodendron species, we analyzed the key populations based on psbA-trnH. Combining the haplotype analysis with the NJ tree analysis, we discovered eight distinctive samples from seven populations among all the P. amurense populations. The variable sites and genetic relationships of these samples were similar to those of P. chinense and P. chinense var. glabriusculum based on our data (Table 3 and Figure 1 as red circles.
We observed that they were uniformly distributed across the whole P. amurense area. This meant there was no key geographical group of P. amurense. Thus, it is highly meaningful to assess the sampling size on the basis of a uniform geographical distribution and the genetic diversity.

FIGURE 2 Relative distribution of the interspecific and intraspecific variation using the two barcodes based on the K2P genetic distance

| DISCUSSION
It is a fact that plant species vary in the amount of infraspecific variation they contain, because of variation in the length of time since divergence, and variation. Speciation is a gradual process, meaning that discussing species on the basis of a single cross section of evolutionary history has significant limitations (Hennig, 1966).
Intraspecific genetic variation is not only an integral part of evolutionary history but also changes continuously. Objectively, each species possesses abundant intraspecific genetic variation, and the intraspecific variation of ITS and psbA-trnH sequences is natural.
It is therefore necessary to assess the intraspecific genetic diversity of species. Furthermore, lacking the genetic information of key populations will lead to DNA barcoding gaps, and the number of samples collected is crucial for establishing a reliable reference database for species identification (Meyer & Paulay, 2005; Wiemer & Fiedler, 2007).
Dasmahapatra and Mallet (2006) argued that many studies collect and analyze only 1 or 2 samples per species within a limited geographical scope, which seriously underestimates intraspecific variation and may lead to false-positive results.

FIGURE 3 Phellodendron species NJ tree with the psbA-trnH sequence (The bootstrap scores [1,000 replicates] are shown for each branch)

In this study, we paid close attention to the genetic diversity of P. amurense and collected adequate samples throughout the distribution area to cover infraspecific variation. We conducted a (Bergsten et al., 2012; Meyer & Paulay, 2005). The B2 and B3 haplotype populations were uniformly distributed across the entire P. amurense area, which meant there were no key geographical groups (Figure 1).
P. amurense has clearly undergone genetic variation among the existing populations. Large-scale uniform sampling covering the entire distribution area was therefore a successful strategy in this study.
The most important challenge for DNA barcoding is the identification of closely related and recently differentiated species (Newmaster, Fazekas, Steeves, & Janovec, 2008). DNA barcoding has been applied to crude drug identification (Kool et al., 2012; Techen, Parveen, Pan, & Khan, 2014), but the results of our study raise doubts about its identification accuracy. Whether the sample size for each species in the current database represents the actual level of genetic diversity needs to be studied (Chen, 2015; Chen et al., 2014).
In addition, the lack of key plant specimens will narrow the genetic range represented in the database, resulting in DNA barcoding failures.
Furthermore, the ratio of input to output is an important factor restricting the implementation of large-sample strategies for each species in DNA barcoding databases.
To ensure the identification accuracy of DNA barcoding as a tool for species identification, it is particularly important to collect adequate samples covering the infraspecific genetic diversity of each species.

ACKNOWLEDGMENTS
The authors are grateful for the financial support provided by the National Natural Science Foundation of China (No. 81473305) and the National Science and Technology Foundational Special Project (2015FY111500).

CONFLICT OF INTEREST
The authors declare no conflicts of interest.

DATA AVAILABILITY STATEMENT
ITS and psbA-trnH sequence data can be accessed in GenBank. The