Development of high‐resolution DNA barcodes for Dioscorea species discrimination and phylogenetic analysis

Abstract The genus Dioscorea is widely distributed in tropical and subtropical regions, and is economically important in terms of food supply and pharmaceutical applications. However, DNA barcodes are relatively unsuccessful in discriminating between Dioscorea species, with the highest discrimination rate (23.26%) derived from matK sequences. In this study, we compared genic and intergenic regions of three Dioscorea chloroplast genomes and found that the density of SNPs and indels in intergenic sites was about twice and seven times higher than that of SNPs and indels in the genic regions, respectively. A total of 52 primer pairs covering highly variable regions were designed and seven pairs of primers had 80%–100% PCR success rate. PCR amplicons of 73 Dioscorea individuals and assembled sequences of 47 Dioscorea SRAs were used for estimating intraspecific and interspecific divergence for the seven loci: The rpoB‐trnC locus had the highest interspecific divergence. Automatic barcoding gap discovery (ABGD), Poisson tree processes (PTP), and generalized mixed Yule coalescence (GMYC) analysis were applied for species delimitation based on the seven loci and successfully identified the majority of species, except for species in the Enantiophyllum section. Phylogenetic analysis of 51 Dioscorea individuals (28 species) showed that most individuals belonging to the same species tended to cluster in the same group. Our results suggest that the variable loci derived from comparative analysis of plastid genome sequences could be good DNA barcode candidates for taxonomic analysis and species delimitation.


| INTRODUCTION
The genus Dioscorea (family Dioscoreaceae) comprises approximately 630 species, which are distributed across Southeast Asia, Africa, Central America, South America, and other tropical and subtropical regions. The genus is economically important for its tubers, which provide starch as a dietary staple as well as cortisone and other steroid hormones, such as dioscin (Aumsuwan et al., 2016; Cho et al., 2013; Jeon et al., 2006). However, Dioscorea species are hard to identify because of high morphological diversity, dioecy, small flowers, and morphological similarities between species in the genus (Raman et al., 2014; Wilkin et al., 2005). Distinguishing Dioscorea species based on morphological traits is unreliable, and DNA barcodes (matK, rbcL, psbA-trnH, trnL-F) have previously shown relatively low discrimination success for Dioscorea species identification, with the highest rate of 23.26% derived from matK sequences (Gao et al., 2008; Mukherjee & Bhat, 2013; Sun et al., 2012). Currently, chloroplast genome sequences of four species in the genus Dioscorea are available (Mariac et al., 2014; Wu et al., 2016; Zhou, Chen, Hua, & Wang, 2016), and thorough sequence comparison between these genomes could provide candidate regions for developing useful barcodes.
In the past decade, seven plastid DNA regions (atpF-atpH spacer, matK gene, rbcL gene, rpoB gene, rpoC1 gene, psbK-psbI spacer, and trnH-psbA spacer) and 2-locus combinations were frequently used to distinguish land plants (Hollingsworth, Graham, & Little, 2011). To date, these DNA barcodes, along with others such as ycf5, psbK-I, psbM, trnD, and rps16, are still widely used for identifying varieties and analyzing their provenance (Lee, Wang, Yen, & Chang, 2017; Techen, Parveen, Pan, & Khan, 2014). Moreover, new barcodes have been developed based on increasingly available sequence data and on deep mining for highly variable regions. Dong et al. (2015) analyzed available plastid genomes, designed suitable primers for the most variable regions, and found that ycf1b generally performed better than any of the matK, rbcL, and trnH-psbA barcodes. Among 18 Oryza chloroplast genomes, five variable regions (rps16-trnQ, trnTEYD, rpoC2, and rbcL-accD) were analyzed for species discrimination (Song et al., 2017). However, systematic comparisons of plastid genome sequences have not been conducted between Dioscorea species; such comparisons would provide useful information for identifying better-performing DNA barcodes for Dioscorea species.

| Sequence analysis for plastid genome of Dioscorea species and primer design
The chloroplast genome sequences for D. rotundata, D. elephantipes, and D. zingiberensis were downloaded from the National Center for Biotechnology Information (NCBI; https://www.ncbi.nlm.nih.gov/).
We used BLAST to align the genic and intergenic regions of the three plastid sequences. Divergence hotspots were identified, and a set of primers was designed to cover these plastid regions (Table S2).
Primers were designed with Primer Premier 5.
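As a rough illustration of how SNP and indel densities might be tallied for one aligned region, the sketch below counts mismatches and gap runs in a gapped pairwise alignment. The function name `variant_density`, the toy sequences, and the per-kb normalization are assumptions made for illustration; they are not the procedure or the data used in this study.

```python
def variant_density(ref: str, qry: str) -> dict:
    """Count SNPs and indel events in a gapped pairwise alignment.

    Assumes both strings are already aligned: equal length, '-' for gaps.
    A contiguous run of gap columns on one side counts as one indel event.
    """
    if len(ref) != len(qry):
        raise ValueError("aligned sequences must have equal length")

    snps = 0
    indels = 0
    columns = 0       # alignment columns that are not gap-only
    gap_side = None   # which sequence currently carries an open gap run

    for a, b in zip(ref.upper(), qry.upper()):
        if a == "-" and b == "-":
            continue  # gap-only column carries no information
        columns += 1
        if a == "-" or b == "-":
            side = "ref" if a == "-" else "qry"
            if side != gap_side:
                indels += 1  # a new gap run starts here
            gap_side = side
        else:
            gap_side = None
            if a != b:
                snps += 1

    if columns == 0:
        raise ValueError("alignment has no informative columns")

    per_kb = 1000.0 / columns
    return {
        "snps": snps,
        "indels": indels,
        "snp_per_kb": snps * per_kb,
        "indel_per_kb": indels * per_kb,
    }


# Toy alignment (hypothetical, not real Dioscorea sequence)
example = variant_density("ACGTAC--GTTA", "ACCTACGGGT-A")
```

Running the same function separately on genic and intergenic alignments and comparing the per-kb values would give the kind of density ratio reported in the abstract.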

| DNA extraction, amplification, and sequencing
DNA was extracted following a cetyl trimethylammonium bromide (CTAB) protocol modified from Paterson, Brubaker, and Wendel (1993). One individual from each of the 10 Dioscorea species we sampled was used to select primers and test the amplification efficiency. [...] using the amplification primers. The nucleotide sequence data were deposited in the European Nucleotide Archive database (Table S1).

| Sequence alignment and data analysis for DNA barcode
All sequences were aligned and adjusted manually in MEGA 7.0 (Kumar, Stecher, & Tamura, 2016), and all variable sites in the sequences obtained by sequencing in this study were rechecked against the original trace files for final confirmation. Both the concatenated dataset and single-locus sequences were used for phylogenetic tree construction. [...]
We set parameters as follows: P min = 0.001, P max = 0.01, Steps = 50, X = 1.0, and Nb bins = 20. We performed PTP analyses on the bPTP web server (http://species.h-its.org/ptp/) with the RAxML topology (Kozlov, Darriba, Flouri, Morel, & Stamatakis, 2019) and used the 50% majority-rule consensus topology resulting from the BI analysis as output files. We ran PTP analyses for 400,000 MCMC generations, with thinning = 100 and burn-in = 0.25. We visually confirmed the convergence of the MCMC chain as recommended by Zhang et al. (2013).
We used ultrametric trees generated with BEAST 1.10.4 for GMYC analyses. The ultrametric trees were constructed as follows: a coalescent tree prior was used, and heterogeneity of the mutation rate across lineages was modeled with an uncorrelated lognormal relaxed clock. The analysis was run for 100 million generations with a sampling frequency of 10,000. After checking adequate mixing and convergence of all runs with Tracer 1.7.1 (Rambaut, Drummond, Xie, Baele, & Suchard, 2018), the first 25% of trees were discarded as burn-in. The maximum clade credibility tree was computed using TreeAnnotator 1.10.4. The resulting ultrametric tree was imported into R 3.6.0 (R Core Team, 2018), and GMYC analyses were run using the Splits (Ezard, Fujisawa, & Barraclough, 2009) and Ape (Paradis, Claude, & Strimmer, 2004) libraries.
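The chain lengths, sampling intervals, and burn-in fractions quoted above determine how many posterior samples each analysis retains. The following sketch works through that arithmetic; the helper name `retained_samples` is hypothetical, and it assumes the sample count is simply generations divided by the sampling interval (whether a program also logs the initial state is not addressed here).

```python
def retained_samples(generations: int, sample_every: int, burnin_fraction: float):
    """Return (total, discarded, kept) sample counts for an MCMC run.

    Assumption: one sample is logged every `sample_every` generations and
    the first `burnin_fraction` of the logged samples is discarded.
    """
    if sample_every <= 0:
        raise ValueError("sample_every must be positive")
    if not 0.0 <= burnin_fraction < 1.0:
        raise ValueError("burnin_fraction must be in [0, 1)")
    total = generations // sample_every
    discarded = int(total * burnin_fraction)
    return total, discarded, total - discarded


# PTP settings stated in the text: 400,000 generations, thinning 100, burn-in 0.25
ptp = retained_samples(400_000, 100, 0.25)

# BEAST settings stated in the text: 100 million generations, sampled every 10,000, 25% burn-in
beast = retained_samples(100_000_000, 10_000, 0.25)
```

Under these assumptions the PTP run keeps 3,000 of 4,000 samples and the BEAST run keeps 7,500 of 10,000 trees for the maximum clade credibility summary.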

| Seven highly variable regions for candidate DNA barcodes
To develop DNA barcodes for Dioscorea species discrimination, 52 primer pairs covering highly variable regions were designed (Table S2), including the top variable regions in Table 2 [...] Table S3). The ePCR results showed that the seven primer pairs had ePCR success rates of 77%-100%, which were similar to the PCR results for the 10 Dioscorea species (Table 3). The primers for psaA-ycf3 and ycf4-cemA had the highest ePCR success rate, followed by trnD-trnT and clpP-psbB, each with an ePCR success rate of 94% (17/18). Combined with the above PCR results, psaA-ycf3 and ycf4-cemA remained the top primers, with a 100% ePCR success rate (Table 3).
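To make the idea of an ePCR success rate concrete, here is a minimal sketch of an in silico PCR check: a primer pair "succeeds" on a template if the forward primer and the reverse complement of the reverse primer are both found, in that order, within a mismatch tolerance. The function names, the toy sequences, and the single-strand, mismatch-count matching rule are assumptions for illustration; the ePCR software and settings used in this study are not described in this section.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_site(template: str, primer: str, max_mismatches: int, start: int = 0):
    """Index of the first window within `max_mismatches` of `primer`, else None."""
    for i in range(start, len(template) - len(primer) + 1):
        window = template[i:i + len(primer)]
        mismatches = sum(1 for x, y in zip(window, primer) if x != y)
        if mismatches <= max_mismatches:
            return i
    return None


def in_silico_pcr(template: str, fwd: str, rev: str, max_mismatches: int = 1):
    """Return the predicted amplicon on the given strand, or None if no product."""
    template = template.upper()
    f = find_site(template, fwd.upper(), max_mismatches)
    if f is None:
        return None
    # The reverse primer anneals to the opposite strand, so search for its
    # reverse complement downstream of the forward primer site.
    r = find_site(template, revcomp(rev), max_mismatches, start=f + len(fwd))
    if r is None:
        return None
    return template[f:r + len(rev)]


def success_rate(templates, fwd: str, rev: str, max_mismatches: int = 1) -> float:
    """Fraction of templates that yield a predicted amplicon."""
    hits = sum(in_silico_pcr(t, fwd, rev, max_mismatches) is not None for t in templates)
    return hits / len(templates)


# Toy templates (hypothetical sequences, not Dioscorea data)
toy_templates = ["TTACGTACGGGCCCTGATCCAA", "TTACGTACGGGCCCAAAAAAAA"]
toy_rate = success_rate(toy_templates, "ACGTAC", "GGATCA", max_mismatches=0)
```

With a rate defined this way, 17 successful templates out of 18 corresponds to the 94% figure quoted above. A fuller check would also search the opposite strand and bound the amplicon length.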
We aligned sequences from PCR amplification and the as[...] (Table 3 and Figure S1).
The noncoding region (trnD-trnT) had the smallest average interspecific divergence (0.013). Wilcoxon signed-rank tests demonstrated that rpoB-trnC had significantly higher interspecific divergence than the other loci, and the locus with the second highest interspecific divergence was atpF, while rpl14-rpl16, ycf4-cemA, and psaA-ycf3 had similar interspecific divergences (Table 5).
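The intraspecific and interspecific divergences compared above can be illustrated with a small sketch that splits pairwise distances by whether the two samples share a species label. It uses the uncorrected p-distance as a stand-in, since the distance model is not stated in this section; the function names and the toy sequences are likewise assumptions and not data from this study.

```python
from itertools import combinations
from statistics import mean


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites, ignoring columns with a gap in either sequence."""
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper()) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def divergence_summary(samples: dict) -> dict:
    """Summarize pairwise distances within and between species.

    `samples` maps a sample id to a (species, aligned_sequence) tuple.
    """
    intra, inter = [], []
    for (sp1, s1), (sp2, s2) in combinations(samples.values(), 2):
        d = p_distance(s1, s2)
        (intra if sp1 == sp2 else inter).append(d)
    return {
        "mean_intra": mean(intra) if intra else 0.0,
        "mean_inter": mean(inter) if inter else 0.0,
        "max_intra": max(intra, default=0.0),
        "min_inter": min(inter, default=0.0),
    }


# Toy alignment for one locus (hypothetical sequences and species labels)
toy = {
    "a1": ("sp_A", "ACGTACGTAC"),
    "a2": ("sp_A", "ACGTACGTAT"),
    "b1": ("sp_B", "ACCTACGAAC"),
    "b2": ("sp_B", "ACCTACGAAC"),
}
summary = divergence_summary(toy)
# A locus shows a barcoding gap when the smallest interspecific distance
# exceeds the largest intraspecific distance.
has_gap = summary["min_inter"] > summary["max_intra"]
```

Computing such a summary per locus gives the values that a paired test, such as the Wilcoxon signed-rank test mentioned above, would then compare between loci.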

| Applicability for species discrimination
A total of 73 individuals belonging to 10 Dioscorea species (Table S1), a set of 18 Dioscorea species with available SRAs (Table S3), and the three Dioscorea species with complete plastid genomes were used for estimation of species discrimination efficiency of the above loci.

| DISCUSSION
(2) have more than one species. We selected seven pairs of primers (7/52) for further analysis, which had high PCR amplification rates and distinct sequence variations between species. The intraspecific and interspecific variation analysis, along with different methods of species discrimination, indicated that these loci differ in species discrimination efficiency.

FIGURE 2 Relative distribution of interspecific divergence between congeneric species and intraspecific variation

DNA barcodes show variable species discrimination efficiency in different plants (Gogoi & Bhau, 2018; Hollingsworth et al., 2011; Liu et al., 2017; Sun et al., 2012, 2015), and more DNA barcodes for species-level resolution have been developed and tested (Dong et al., 2015; Song et al., 2017). At present, the development of universal primers for highly variable regions relies on the availability of sequences for different species. New primers for plant ITS regions with improved universality and specificity were designed based on 1,264,929 sequences of 18S, 5.8S, and 26S from the plant and fungus kingdoms (Cheng et al., 2016). The comparison of genic and intergenic regions of the chloroplast genomes of three Dioscorea species indicated that intergenic regions had more variable loci than genic regions and that conserved genic regions were suitable for primer design (Table 1). However, the primer sequences conserved between the three Dioscorea species still had low rates of universal amplification success across different species (Table S2).

FIGURE 3 The phylogenetic tree constructed using maximum likelihood for Dioscorea species based on ycf4-cemA + psaA-ycf3 + clpP-psbB + rpl14-rpl16 (on the left) and summary of putative species delimitation drawn by BLAST, ABGD, PTP, and GMYC (on the right, one column per method)

The small number of available Dioscorea chloroplast genome sequences may limit the efficiency of primer design. As more chloroplast genome sequences become available, more efficient primers could be designed in silico. [...]tions were included in the analysis in this study (Figure 3 and Tables S1 and S3). Phylogenetic analysis based on the ycf4-cemA, psaA-ycf3, clpP-psbB, and rpl14-rpl16 loci clustered most species clearly into their sections, but discrimination of species belonging to the Lasiophyton and Enantiophyllum sections was not very accurate (Figure 3). This may be caused by the close evolutionary relationships between Dioscorea species in these sections.
With the growing availability of sequence information, species discrimination through molecular evidence is becoming both feasible and reliable. Plastid markers, such as rbcL, matK, and trnH-psbA, have been widely used with high amplification success in these regions (Hollingsworth et al., 2011). The internal transcribed spacers from nuclear ribosomal DNA, complete plastid genomes, and single-copy nuclear genes have also been used in species discrimination (Cheng et al., 2016; Duarte et al., 2010; Song et al., 2017). In this study, we selected primers covering highly variable regions in the Dioscorea chloroplast genome. Although only seven pairs of primers had good amplification success, the success rates for species discrimination using these primers were high. Along with other research in which primers for DNA barcodes have been designed based on available sequences, our results suggest that the growing amount of sequence information will greatly enhance the development of suitable DNA barcodes for taxonomic analysis and species delimitation.

CONFLICT OF INTEREST
None declared.

AUTHOR CONTRIBUTIONS
W.X., J.W., and D.H. conceived the presented idea. B.Z., W.X., and J.S. performed the data analysis and did the experiments. Y.D., W.W., W.T., and J.Z. collected the samples and did the DNA extraction. X.H., Y.X., and J.X. revised the manuscript. B.Z. and W.X. wrote the manuscript. [...] Table S3.