DNA barcoding of Corydalis, the most taxonomically complicated genus of Papaveraceae

Abstract The genus Corydalis is recognized as one of the most taxonomically challenging plant taxa. It is mainly distributed in the Himalaya–Hengduan Mountains, a global biodiversity hotspot. To date, no effective solution for species discrimination and taxonomic assignment in Corydalis has been developed. In this study, five nuclear and chloroplast DNA regions, ITS, ITS2, matK, rbcL, and psbA‐trnH, were preliminarily assessed based on their ability to discriminate Corydalis to eliminate inefficient regions, and the three regions showing good performance (ITS, ITS2 and matK) were then evaluated in 131 samples representing 28 species of 11 sections of four subgenera in Corydalis using three analytical methods (NJ, ML, MP tree; K2P‐distance and BLAST). The results showed that the various approaches exhibit different species identification power and that BLAST shows the best performance among the tested approaches. A comparison of different barcodes indicated that among the single barcodes, ITS (65.2%) exhibited the highest identification success rate and that the combination of ITS + matK (69.6%) provided the highest species resolution among all single barcodes and their combinations. Three Pharmacopoeia‐recorded medicinal plants and their materia medica were identified successfully based on the ITS and ITS2 regions. In the phylogenetic analysis, the sections Thalictrifoliae, Sophorocapnos, Racemosae, Aulacostigma, and Corydalis formed well‐supported separate lineages. We thus hypothesize that the five sections should be classified as an independent subgenus and that the genus should be divided into three subgenera. In this study, DNA barcoding provided relatively high species discrimination power, indicating that it can be used for species discrimination in this taxonomically complicated genus and as a potential tool for the authentication of materia medica belonging to Corydalis.


| INTRODUCTION
The Himalaya-Hengduan Mountains represent a global biodiversity hotspot with high levels of biodiversity and endemism and have recently become a priority conservation area due to the negative effects of climate change and intensive human activities in this region (Yan et al., 2014). Corydalis DC., the largest genus of Papaveraceae (Zhang, Su, & Liden, 2008), is an important component of the biodiversity in the Himalaya-Hengduan Mountains.
This genus originated from the Hengduan Mountains and recently dispersed from this region to the Qinghai-Tibet Plateau (Linden, Fukuhara, & Axberg, 1995; Wu, 1996). Due to the complicated geological history of this region and the dramatic variations in its local climates and topography (Yan et al., 2014), as well as the reticulate evolution and intensive differentiation in the phylogenesis of Corydalis (Wang, 2006; Wu, 1996), this genus exhibits high levels of morphological and habitat diversity. Some species grow in specialized habitats that are difficult to access, such as dry limestone cliffs (Figure 1) and alpine hillsides (Zhang et al., 2008).
Most species exhibit complicated morphological characteristics.
The leaves, subterranean organs, fruits, seeds, and particularly the floral structures of Corydalis species are very complex and show high variability, which seriously hampers accurate species discrimination and taxonomic assignment. Species identification is a precondition of biodiversity conservation and is also fundamental to almost all disciplines of botany (Chen et al., 2016). However, due to its complicated morphological characteristics, arduous procedures for sample collection, the absence of seasoned specialists, and the limitations of traditional morphology-based taxonomy, Corydalis is recognized as one of the most taxonomically complicated plant taxa.
Various regions in the nuclear and chloroplast genomes have been proposed as DNA barcodes for plants. The psbA-trnH intergenic spacer region has been proposed as a DNA barcode for land plants (Kress et al., 2005). Portions of the plastid coding genes rbcL and matK have been suggested as the core barcodes to establish a barcoding database for plant species (CBOL Plant Working Group, 2009). The nuclear internal transcribed spacer (ITS), which has a high rate of nucleotide substitution and thus relatively high discrimination power, has been proposed to be incorporated into a core barcode for seed plants (China Plant BOL Group, 2011). The ITS2 region, a subregion of ITS, has been selected as a valuable sequence tag for the identification of medicinal plants and materia medica (Han et al., 2013; Yao et al., 2010).
However, no single barcode for plants can perform as well as COI does in animals (Hollingsworth, Graham, & Little, 2011). In fact, individual barcodes usually exhibit unequal species discriminatory ability in different plant groups, and therefore, it is necessary to select appropriate barcodes for Corydalis. Furthermore, several different analytical methods, such as tree-based, distance-based, sequence similarity-based, and character-based methods, have been used for the assessment of species discrimination ability (Austerlitz et al., 2009; Frezal & Leblois, 2008; Sandionigi et al., 2012; Yan et al., 2014). Different analytical methods typically show dissimilar species discrimination power on the same datasets (Kool et al., 2012; van Velzen, Weitschek, Felici, & Bakker, 2012; Yan et al., 2014), but the discrimination ability of different analytical methods in Corydalis remains unknown.
FIGURE 1 The specialized habitats of Corydalis saxicola in dry limestone cliffs
An ideal DNA barcode should have a highly universal single primer pair, provide high-quality bidirectional sequences, have a high discriminatory power among species (CBOL Plant Working Group, 2009; Kress et al., 2005; Lahaye et al., 2008), and exhibit a "barcode gap" between intraspecific and interspecific genetic divergences (Lahaye et al., 2008; Meyer & Paulay, 2005). We preliminarily evaluated the discrimination ability of the five most commonly used regions, ITS, ITS2, matK, rbcL, and psbA-trnH, for Corydalis. The sequence data for this preliminary evaluation were downloaded from NCBI and have been used in previous molecular phylogenetic studies of Papaveraceae and Corydalis (Linden, Fukuhara, Rylander, & Oxelman, 1997; Zhang et al., 2015; Zhang, Wang, & Yang, 2016). The results showed that the rbcL and psbA-trnH regions were too conserved to discriminate sufficiently among species of Corydalis (Supporting Information Figures S2 and S3), whereas ITS, ITS2, and matK exhibited high rates of nucleotide substitution and a relatively high species discrimination rate.
In this study, we chose three regions (ITS, ITS2, and matK) as

| Sampling strategy
A total of 131 individuals representing 28 Corydalis species, including four Pharmacopoeia-recorded medicinal plants (C. yanhusuo, C. decumbens, C. saxicola, and C. bungeana) and their crude drugs (nine leaf specimens of C. saxicola and three tuber specimens of each of the other three species), and two outgroups (Lamprocapnos spectabilis and Papaver somniferum) were used in this study (Table S1).
Sequences for 118 individuals of 22 species were obtained de novo, and sequences for 13 individuals were obtained from GenBank.
The GenBank sequences are derived from published articles, and we rechecked the sequences through BLAST with conspecifics or closely related species to ensure their correctness. All speci-

| DNA extraction, PCR amplification, and sequencing
Genomic DNA was extracted from silica gel-dried leaves using a Plant Genomic DNA Kit (Tiangen Biotech, Beijing, China) according to the manufacturer's recommended protocol, which was optimized with slight modifications. These modifications included the addition of more lysis buffer GP1 and cleaning the products one to three times using a nucleus separation liquid until the supernatant layer became lightly colored or colorless.
The PCR amplification of the ITS and matK regions was conducted using a Peltier Thermal Cycler PTC2000 (Bio-Rad) with approximately 30 ng of genomic DNA as the template in 25 μl of 2×  Figure S1). The PCR products were run on a 1.0% agarose gel in 0.5× TBE buffer to assess the success of the amplification and purified with a 1.0% agarose gel using the

| Data analysis
All raw sequences, excluding primer regions, were assembled and edited with CodonCode Aligner V 5.1.5 (CodonCode Co., USA). The ITS2 sequences were obtained by removing the con-
To evaluate the success of species discrimination, the three markers and their possible combinations were analyzed using three widely used methods, namely, tree-based, similarity-based, and distance-based methods. For the tree-based methods, three different phylogenetic trees, namely, neighbor-joining (NJ) tree, maximum parsimony (MP) tree, and maximum likelihood (ML) tree, were evaluated to select the most suitable tree. The three trees were constructed with MEGA6.06 according to published protocols for species-level discrimination within closely related groups (Liu et al., 2012; Tamura, Dudley, Nei, & Kumar, 2007; Yuan et al., 2015).
Species discrimination was considered successful if all conspecific individuals formed a single clade. For the similarity-based method (BLAST), NCBI BLAST 2.2.29+ (Tao, 2010; Yan et al., 2014) was used to build local reference databases, and all the sequences were then queried using the blastn command. Species discrimination was considered successful if all individuals of a species had a top matching hit of only a conspecific individual (Yan et al., 2014). For the distance-based analysis, we used the K2P-distance method (Little & Stevenson, 2007). Species discrimination was considered successful if the minimum interspecific K2P distance involving a species was larger than the maximum intraspecific distance for that species (CBOL Plant Working Group, 2009; China Plant BOL Group, 2011).
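The distance-based criterion described above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the function names and input dictionaries are hypothetical, gaps and ambiguous bases are simply skipped (pairwise deletion, an assumption), and species represented by a single individual are left out because they have no intraspecific distance. The K2P distance is computed as d = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of transitions and transversions.

```python
import math
from itertools import combinations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1, seq2):
    """Kimura 2-parameter distance between two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # assumption: skip gaps and ambiguous bases
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    # Raises ValueError for saturated pairs where the log argument is <= 0.
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


def species_with_barcode_gap(sequences, species_of):
    """Species whose minimum interspecific distance exceeds their
    maximum intraspecific distance.

    sequences:  hypothetical dict, individual id -> aligned sequence
    species_of: hypothetical dict, individual id -> species name
    """
    max_intra = {}
    min_inter = {}
    for id1, id2 in combinations(sequences, 2):
        d = k2p_distance(sequences[id1], sequences[id2])
        sp1, sp2 = species_of[id1], species_of[id2]
        if sp1 == sp2:
            max_intra[sp1] = max(max_intra.get(sp1, 0.0), d)
        else:
            for sp in (sp1, sp2):
                min_inter[sp] = min(min_inter.get(sp, math.inf), d)
    # Only species with at least two individuals appear in max_intra.
    return {sp for sp in max_intra
            if min_inter.get(sp, math.inf) > max_intra[sp]}
```

The identification success rate for a barcode would then be the number of species returned divided by the number of species with more than one individual.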

| Barcode universality and sequence characteristics
The success rates of PCR amplification and of sequencing, evaluated separately, were both 100% for the ITS and matK regions. With regard to primer universality, ITS exhibited high amplification success with the commonly used primer pair ITS5F/4R, but high-quality bidirectional sequences could not be generated for 14 samples (11.0%) using this primer pair. Nevertheless, these samples were successfully amplified and sequenced using the primer pair ITSa and ITSb (Tables 1 and 2). Thus, the ITS barcode could be successfully amplified and sequenced from all the samples, and ITS2 sequences were then obtained from the ITS sequences. All matK sequences obtained in this study were generated by direct sequencing.

| Genetic distance and DNA barcoding gap assessment
Of the three regions, ITS2 exhibited the greatest intraspecific and interspecific distances, followed by ITS, and matK showed the lowest values (Table 2). The relative distribution of K2P distances based on single barcodes and barcode combinations is shown in Figure 2.
In general, the mean interspecific distances were greater than the mean intraspecific distances for all three barcodes (Table 2).
The rank of the three sequences in terms of mean sequence divergences in Corydalis was ITS2 > ITS > matK. The barcoding gaps obtained with the single barcodes and barcode combinations are shown in Figure 2.

| Comparison of species resolution with different barcodes and their combinations
Among the single barcodes, ITS exhibited the highest success rates for the identification of Corydalis species, ITS2 showed a lower identification success rate, and matK provided the lowest identification success rate (Table 3). Three of the four Pharmacopoeia-recorded medicinal plants (C. yanhusuo, C. decumbens, and C. bungeana) and their crude drugs were identified successfully using the ITS and ITS2 regions (Figure 4). The Pharmacopoeia-recorded species C. saxicola and its close relative C. tomentella were discriminated using the QR code (two-dimensional code) of the ITS2 region (Figure 3).
Among the barcode combinations, ITS + matK provided a higher discrimination success rate than ITS2 + matK. In addition, the barcode combinations exhibited a higher discrimination success rate than any single barcode (Table 3).
Among all single barcodes and their combinations used in this study, ITS + matK provided the highest identification success rate, and matK alone provided the lowest identification success rate (Table 3).

| Comparison of different analytical methods for species resolution
The species discrimination ability depends on the analytical method used. A comparison of different analytical methods using data obtained with a single barcode revealed that the BLAST method yielded the highest discrimination success rate (Table 3 and S2-S6).
The barcode combinations yielded different results. When applied to data obtained from the combination ITS + matK, BLAST provided the highest identification success rate, but when applied to data obtained with ITS2 + matK, the NJ tree- and K2P distance-based methods showed the highest identification success rate (Table 3). The comparisons of different analytical methods using data obtained with all the single barcodes and their combinations used in this study showed that BLAST tended to provide the highest discrimination success for all barcodes with the exception of ITS2 + matK. The results obtained using the NJ tree-based method were similar to but slightly better than those obtained using the K2P distance-based method (Table 3). Overall, regardless of the method used, the barcode combinations nearly always resulted in improved species resolution, and in this study, the barcode combination ITS + matK (69.6%) with the BLAST method provided the highest species resolution.
FIGURE 2 Relative distributions of intraspecific and interspecific K2P distances
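The similarity-based criterion used for BLAST in this study (every individual of a species has only conspecific individuals as its top matching hit) can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the scripts used here: the hit lists are assumed to have been parsed beforehand from tabular blastn output into (subject id, bit score) pairs, self-hits are ignored, tied top scores are all treated as top hits, and the function and variable names are hypothetical.

```python
from collections import defaultdict


def species_identified_by_top_hit(hits, species_of):
    """Species for which every individual's best non-self hit(s)
    are exclusively conspecific.

    hits:       hypothetical dict, query id -> list of (subject id, bit score)
    species_of: hypothetical dict, individual id -> species name
    """
    per_species = defaultdict(list)
    for query, query_hits in hits.items():
        # Assumption: the query itself is in the reference database,
        # so its self-hit is removed before ranking.
        others = [(s, score) for s, score in query_hits if s != query]
        if not others:
            ok = False  # no reference hit at all counts as a failure
        else:
            best = max(score for _, score in others)
            top = [s for s, score in others if score == best]
            ok = all(species_of[s] == species_of[query] for s in top)
        per_species[species_of[query]].append(ok)
    return {sp for sp, flags in per_species.items() if all(flags)}
```

Treating ties as failures when any tied hit is heterospecific is a conservative reading of "a top matching hit of only a conspecific individual"; a more permissive rule would give higher success rates.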

| Phylogenetic analyses of chloroplast and nuclear DNA regions
The ML tree constructed using the data obtained with ITS + matK recovered Corydalis as a monophyletic group. C. rupestris was strongly supported as the basal taxon that diverged first, and two major clades were then recognized (Figure 4). The first clade in-
The ML tree constructed using the data obtained with ITS was slightly distinct from that constructed using the data obtained with ITS + matK. The ITS-based ML tree recovered Corydalis as a monophyletic group and revealed two well-supported clades. The division of the two clades into two subclades was weakly supported.
Furthermore, compared with the ITS + matK-based tree, the phylogenetic positions of some species were changed in the ITS-based tree. Specifically, C. rupestris and C. capnoides formed a subclade that diverged from the first clade, and C. decumbens and C. ochotensis formed a clade that diverged from the second subclade (Supporting Information Figure S4).
The ML tree constructed using the data obtained with matK was similar to the ITS + matK-based tree. It also recovered Corydalis as a monophyletic group, C. rupestris was strongly supported as the basal taxon and diverged first, and the remaining species formed two well-supported clades. The species composition of the first clade was consistent with that found with the ITS + matK-based tree. However, the second clade did not divide into three subclades as in the ITS + matK-based tree; instead, C. nobilis of sect.
Capnogorium formed its own subclade that diverged first as a sister subclade to the rest of the clade, and the second subclade was then further divided into two clades (Supporting Information Figure S5).
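Statements about monophyly, and the tree-based criterion used in this study (a species is resolved when all of its individuals form a single clade), can be checked mechanically on a rooted tree. The sketch below is illustrative only and makes several assumptions: the tree is given as nested Python tuples with leaf names as strings (a hypothetical representation, not MEGA output), it is rooted, and species with a single individual are excluded because they are trivially monophyletic.

```python
from collections import defaultdict


def clade_leaf_sets(tree):
    """Return (leaves of this subtree, leaf sets of every clade in it)."""
    if isinstance(tree, str):
        leaves = frozenset([tree])
        return leaves, [leaves]
    leaves = frozenset()
    clades = []
    for child in tree:
        child_leaves, child_clades = clade_leaf_sets(child)
        leaves |= child_leaves
        clades.extend(child_clades)
    clades.append(leaves)  # the subtree itself is also a clade
    return leaves, clades


def monophyletic_species(tree, species_of):
    """Species (with at least two individuals) whose individuals
    form exactly one clade of the rooted tree."""
    _, clades = clade_leaf_sets(tree)
    clade_set = set(clades)
    members = defaultdict(set)
    for leaf, sp in species_of.items():
        members[sp].add(leaf)
    return {sp for sp, leaves in members.items()
            if len(leaves) > 1 and frozenset(leaves) in clade_set}
```

For example, with the tree `((("a1", "a2"), "b1"), ("b2", ("c1", "c2")))`, species A and C each form a clade, whereas the two individuals of species B are separated and B is counted as unresolved.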

| Evaluation of DNA barcodes for Corydalis
Corydalis is one of the most taxonomically complicated plant genera, and the discrimination of species within this genus has always been recognized as a great challenge. Jiang et al. (2018)
The evaluation of different analytical methods showed that BLAST exhibited the highest species discrimination ability. The superior performance of the BLAST approach compared with other methods has been observed in several previous studies (Kool et al., 2012; Van Velzen et al., 2012; Yan et al., 2014). The species discrimination power of an approach is related to the theory and algorithm used by the approach. BLAST often shows significantly higher identification rates than other approaches (Sandionigi et al., 2012), and it appears to be the best choice for the identification of Corydalis species.
The NJ tree usually shows low species resolution, which significantly reduces its usefulness (Liu et al., 2012; Van Velzen et al., 2012; Yan et al., 2014), and in this study, the NJ tree provided a lower species resolution than BLAST. However, because of its advantages of faster speed and a more intuitive display of genetic relationships, which facilitates understanding and analysis, we still advocate that the NJ tree-based method is useful for the discrimination of Corydalis species.
Compared with the species resolution obtained with ITS for other large genera, such as Rhododendron (12.2%, 15.3%), Angelica (73.9%), Pedicularis (86.2%), and Primula (88.2%) (Yan et al., 2014; Yuan et al., 2015), that obtained in this study corresponds to a medium level of identification efficiency.
FIGURE 4 The maximum likelihood tree of 28 Corydalis species and two outgroup species of Papaveraceae based on ITS + matK regions
However, a previous study showed that the ITS region is not suitable for the molecular analysis of Corydalis due to low PCR amplification and sequencing success rates (61.9% and 28.6%, respectively) and that matK provides the highest species resolution (100%) and can thus be considered an ideal barcode for Corydalis.
In this study, we initially used the universal primer pair ITS5F/4R and obtained low PCR amplification and sequencing efficiency.
Therefore, we designed two pairs of primers to perform a fractional amplification of the full-length sequence of ITS and thus obtain a complete ITS sequence (Table 1, Supporting Information Figure S1), and our resulting PCR amplification and sequencing success rates were both 100%. The ITS and matK regions were then evaluated using 131 samples representing 28 species of Corydalis, which included 23 species with more than three samples and 16 species that were not included in previous studies (two species are herbal species in the Pharmacopoeia of China). Our results showed that ITS exhibited a higher species resolution (65.2%) than matK (56.5%): 15 of the 23 species could be successfully identified by ITS, 13 of the 23 species could be successfully identified by matK, and one herbal species in the Pharmacopoeia of China could not be successfully identified by matK. The high species resolution of DNA barcoding is likely due to the relatively small sample size and wide taxonomic sampling employed (Yan et al., 2014), and the species identification resolution usually decreases with increases in the sample size. Thus, sufficient sampling for a taxon-based DNA barcoding study is a pivotal issue that should be considered (Yan et al., 2014). Based on the highest species resolution obtained with a single barcode in this study and the successful optimization of PCR amplification and sequencing methods, we considered ITS to be the most appropriate barcode for the discrimination of species belonging to the genus Corydalis.
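The success rates quoted above follow directly from species counts over the 23 species represented by multiple samples. The short Python sketch below reproduces the arithmetic; the counts for ITS and matK are those stated in the text, whereas the count of 16 species for ITS + matK is an assumption inferred from the reported 69.6% and is not given explicitly.

```python
def success_rate(identified, total):
    """Percentage of species identified, rounded to one decimal place."""
    return round(100 * identified / total, 1)


TOTAL_SPECIES = 23  # species represented by multiple samples (from the text)

rates = {
    "ITS": success_rate(15, TOTAL_SPECIES),   # 15 of 23, as stated
    "matK": success_rate(13, TOTAL_SPECIES),  # 13 of 23, as stated
    # Assumption: 16 of 23 is inferred from the reported 69.6%.
    "ITS + matK": success_rate(16, TOTAL_SPECIES),
}
print(rates)
```

Under the same assumption, the remaining 7 of 23 species correspond to the 30.4% that could not be identified with the best barcode combination.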
The combination of DNA barcodes usually improves species identification (CBOL Plant Working Group, 2009; Yan et al., 2014; Yuan et al., 2015). In this study, any combination of the barcodes yielded higher discrimination rates. Although matK provided the lowest species identification ability when used alone, combinations including matK exhibited significantly increased discrimination power. Thus, matK can be used as an additional barcode for Corydalis, and ITS + matK was identified as the best barcode combination for Corydalis.

| Phylogenetic analysis of Corydalis based on DNA barcoding
Wu (1996) first classified this genus into 40 sections and two subgenera, subg. Corydalis and subg. Pistolochia, based on morphological characteristics and geographical distributions. Linden (1997) claimed that this genus should be divided into 25 sections and three  (Wu, 1996), Linden's subg. Sophorocapnos (Linden et al., 1995), and Wang's subg. Sophorocapnos (Wang, 2006)
In the second major clade, the further divisions created by the ML tree using different loci exhibited a number of discrepancies.
The matK- and ITS + matK-based trees were strongly supported, whereas the ITS-based tree was weakly supported; therefore, we will mainly refer to the ITS + matK-based tree. The second major clade tended to be divided into two larger subclades and one smaller subclade. Sections Pes-gallinaceus Irmisch and Chinenses formed the first larger subclade; sections Asterostigma, Duplotuber, and Ramoso-sibiricae formed the second larger subclade; and sect. Capnogorium alone formed the smaller subclade. In Wang's division of subgenera (Wang, 2006), subg. Sophorocapnos was recognized in the first major clade, but the remaining subgenera in the second major clade could not be recognized, and our results, therefore, do not support Wang's division. In the second larger subclade, sect. Ramoso-sibiricae and sect. Asterostigma are placed in subg. Corydalis, as in Wu's classification (Wu, 1996). Although our results support the division of this genus into two major subgenera, we do not agree with the species coverage of subgenera in Wu's division.
Overall, Wu (1996) classified Corydalis into two subgenera based on morphological characteristics and geographical distributions. Linden (1997) claimed that this genus was divided into three subgenera based on morphological characteristics and the rps16 locus.
Using molecular systematics (rps16 and matK) and palynology, Wang (2006) advocated that the most appropriate division of Corydalis was five subgenera. Our molecular evidence is consistent with Linden's proposal, and we tended to divide this genus into three subgenera.
Sophorocapnos, sect. Racemosae, and sect. Aulacostigma, which are recognized in the division proposed by Linden (1997) and Wang (2006), have been classified within subg. Sophorocapnos, and our molecular results support this view. Furthermore, this study provides the first molecular analysis of sect. Corydalis, and its inclusion in subg.
Sophorocapnos is strongly supported by our phylogenetic analysis.
Therefore, we suggest that sect. Thalictrifoliae, sect. Sophorocapnos, sect. Racemosae, sect. Aulacostigma, and sect. Corydalis should be classified as one subgenus. Of course, more molecular and morphological evidence will be required to test this finding. Our current results based on molecular data are expected to provide some reference for future research.
More than 30 species are used in folk medicine or are recorded in the Pharmacopoeia (Pharmacopoeia Commission of People's Republic of China, 2015;Sang, 2002), and these species are often closely related.
To ensure the safety, efficacy, and legality of these medicines, accurate identification is essential. The authentication of materia medica is time-consuming and knowledge-intensive because they are always presented as crude drugs or decoctions, which are air-dried or processed via various methods and thus exhibit modified morphological and anatomical features (Yuan et al., 2015). DNA technology exhibits an outstanding advantage in that it is not affected by morphological characteristics, developmental stage, environmental factors, or harvesting period (Heubl, 2010). Since Chen et al. (2010) proposed ITS2 as a standard DNA barcode for the identification of medicinal plants, DNA barcoding has been increasingly used in the authentication of medicinal plants and materia medica (Xin et al., 2012;Yuan et al., 2015). In this study, three Pharmacopoeia-recorded medicinal plants and their crude drugs were identified successfully, and another Pharmacopoeia-recorded species was discriminated from closely related plants. Based on the good performance of DNA barcoding in Corydalis, it can be used as a potential tool for the authentication of the medicinal plants and materia medica belonging to this genus.
The short-region DNA barcodes showed a relatively high species discrimination power for Corydalis species in this study; however, 30.4% of the species, most of which are closely related, could not be identified successfully. Recent barcoding studies have placed high emphasis on the use of whole-chloroplast genome sequences, which are now more readily available due to improvements in sequencing technologies (Li et al., 2015). The whole-chloroplast genome is termed a "super-barcode" and provides more abundant informative sites for species identification.
These super-barcodes exhibit higher discrimination power for closely related species and have been applied for the identification of species in various taxa, such as the genera Fritillaria (Li, Zhang, Yang, & Lv, 2018), Epipremnum (Tian, Han, Chen, & Wang, 2018), and Papaver (Zhou et al., 2017). The use of super-barcodes is a good option for identifying closely related Corydalis species and obtaining improved species discrimination power. The use of data inferred from DNA barcodes and whole-chloroplast genomes, along with data on morphological characteristics and geographical distributions, might result in more precise species discrimination and resolve phylogenetic disputes, which will aid the reconstruction of a more integrated taxonomic system of this taxonomically complicated genus.

ACKNOWLEDGMENTS
We

CONFLICT OF INTEREST
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

AUTHOR CONTRIBUTIONS
The study was conceived by FR and JS; sample collection was performed by FR, YW, and YQ; data generation was carried out by YW; data were analyzed by ZX, YL, JZ, and TX; and the manuscript was written by FR and JS.

DATA ACCESSIBILITY
DNA sequences: GenBank accessions MH349216-MH349333.