Population genetic structure, migration, and polyploidy origin of a medicinal species Gynostemma pentaphyllum (Cucurbitaceae)

Abstract Gynostemma pentaphyllum, a member of family Cucurbitaceae, is a perennial creeping herb used as a traditional medicinal plant in China. In this study, six polymorphic nSSR and four EST‐SSR primers were used to genotype 1,020 individuals in 72 wild populations of G. pentaphyllum. The genetic diversity and population structure were investigated, and ecological niche modeling was performed to reveal the evolution and demographic history of its natural populations. The results show that G. pentaphyllum has a low level of genetic diversity and high level of variation among populations because of pervasive asexual propagation, genetic drift, and long‐term habitat isolation. Results of the Mantel test demonstrate that the genetic distance and geographic distance are significantly correlated among G. pentaphyllum natural populations. The populations can be divided into two clusters on the basis of genetic structure. Asymmetrical patterns of historical gene flow were observed among the clusters. For the contemporary, almost all the bidirectional gene flow of the related pairs was symmetrical with slight differences. Recent bottlenecks were experienced by 34.72% of the studied populations. The geographic range of G. pentaphyllum continues to expand northward and eastward from Hengduan Mountains. The present distribution of G. pentaphyllum is a consequence of its complex evolution. Polyploidy in G. pentaphyllum is inferred to be polygenetic. Finally, G. pentaphyllum is a species in need of protection, so in situ and ex situ measures should be considered in the future.

of studies of population genetics have used the Himalaya-Hengduan Mountains (HHM) areas and the Qinghai-Tibetan Plateau (QTP) to examine the effects of orographic uplift and climatic perturbation on plant speciation and population demography (Du, Hou, Wang, Mao, & Hampe, 2017;Liu et al., 2013). In contrast, few studies have been conducted in subtropical China (Sun, Hu, Huang, & Vargas-Mendoza, 2014;Wang et al., 2015), which consists of the hills and mountains of the Qinling Mountains-Huai River area and the south tropical region of China (Qiu, Fu, & Comes, 2011). Subtropical China is thought to have acted as a refugium for many ancient species during the Pleistocene glacial and interglacial cycles (e.g., Wang, Gao, Kang, Lowe, & Huang, 2009). Many species of this area have unique haplotypes with high levels of genetic diversity. Moreover, the level of genetic differentiation among glacial refugia should be high because of the random fixation of alleles (Hewitt, 2000;Zhang et al., 2015).
Gynostemma pentaphyllum is a perennial creeping plant found in subtropical China, Japan, Myanmar, and India (Chen & Gilbert, 2006).
In China, it mainly grows near rivers and in the shade of the forests that cover the Yangtze River basin and its southern areas (Chen, 1995). Gynostemma pentaphyllum belongs to the Cucurbitaceae family and has 5-7 foliolate leaves. It can reproduce sexually or by clonal growth of rhizomes or bulbils (Gao, Chen, Gu, & Zhao, 1995).
However, it is difficult to determine the ploidy based on the morphological features (Gao et al., 1995). At present, it is not known if the polyploid complex of G. pentaphyllum is autopolyploid or allopolyploid, and the genetic signature and origin of populations with different ploidies are still unclear. As a traditional Chinese medicinal herb, G. pentaphyllum is useful in medical science because it can inhibit the reproduction of tumor cells, regulate lipid metabolism, decrease blood sugar, and enhance immunity (Xie et al., 2010). Thus, most studies of this species have focused on the extraction (Yin, Hu, & Pan, 2004), chemistry, and pharmacology (Razmovski-Naumovski et al., 2005;Tsai, Lin, & Chen, 2010) of its bioactive components.
However, the wild populations of G. pentaphyllum have decreased and become fragmented as a consequence of the increased use of natural medicinal herbs and habitat destruction, to the extent that G. pentaphyllum has been listed as a Grade II Key Protected Wild Plant Species by the Chinese Government (Yu, 1999). It is therefore imperative to investigate the wild populations of G. pentaphyllum, including analysis of their genetic diversity and population structure, to formulate an effective conservation strategy. Existing genetic studies of G. pentaphyllum (Jiang, Qian, Guo, Wang, & Zhao, 2009;Pang, Zou, & Xiao, 2006) used RAPD and ISSR molecular markers on relatively small sample sets that did not cover the spatial distribution of G. pentaphyllum in subtropical China. The simple sequence repeat (SSR) molecular markers, also known as microsatellites, are codominant molecular markers with putative neutral evolutionary history.
In the current study, SSR markers were used to investigate the genetic diversity and population structure of G. pentaphyllum, and ENM was used to investigate the history of the evolution and demographic structure of natural G. pentaphyllum populations in subtropical China. The main objectives of our study were to: (a) assess the level of genetic diversity in natural populations; (b) evaluate the degree of differentiation and structure among populations; (c) explore the origins and migration of G. pentaphyllum; (d) speculate on the origin of polyploidy; and (e) provide basic information that can be used to formulate a conservation strategy.

| Plant sampling
Wild G. pentaphyllum samples were collected from most of the georeferenced sampling sites; the sample set covers the full longitudinal and latitudinal extent of G. pentaphyllum in China ( Figure 1; Table A1).
Five to twenty-four individuals were collected randomly from each population, with the number of samples taken dependent on population size. A total of 1,093 individuals in 72 wild populations were collected. Five individuals from each of two Gomphogyne populations were selected as outgroups. Fresh leaf materials were dried in silica gel. Root cusp samples were immersed in FAA solution (50 ml of 50% alcohol + 5 ml of glacial acetic acid + 5 ml of 37% formaldehyde) and reserved for further laboratory analysis. A handheld GPS (Garmin eTrex Handheld GPS; Garmin) was used to determine the latitude and longitude of each site. Voucher specimens for the samples were deposited at the Northwest University (Xi'an, Shaanxi).

| DNA extraction, amplification, and microsatellite genotyping
Total genomic DNA was extracted using Plant Genomic DNA Kit (TIANGEN Biotech, Beijing Co., Ltd.) following the manufacturer's protocol. Preliminary analyses investigated 14 nSSR and 16 EST-SSR primers developed in G. pentaphyllum (Liao et al., 2011;Zhao, Zhou, Li, & Zhao, 2015), most of them were monomorphic among the populations. At last, six polymorphic nSSR and four EST-SSR primer pairs (Table 1) were tested to genotype the samples. Polymerase chain reaction (PCR) amplifications were performed using a MyCycler™ Thermal Cycler (Bio-RAD). A Biometra Thermocycler was used with the following cycling conditions: 94°C for 5 min, 32 cycles of 94°C for 30 s, annealing temperature (Table 1) for 45 s, 72°C for 45 s and an extension step of 72°C for 5 min, and then a final holding temperature of 10°C.
The protocols of Sullivan were employed for the nSSR genotyping (Sullivan, 2013). The products of PCR were separated on 12% non-denaturing polyacrylamide gel (280 V, 50 W, 3 hr) and visualized using 0.1% silver nitrate stained with a PBR322 DNA marker ladder (TIANGEN Biotech, Beijing Co., Ltd.) to assess the length of the DNA bands. Software Quantity One version 4.6.2 (Bio-Rad Technical Service Department) was used for quantification. Bands were corrected by capillary electrophoresis, based on several individuals for each primer. Capillary electrophoresis was used for the EST-SSR genotyping. Sample analyses were carried out using the GeneMarker genotyping software (Hulce, Li, Snyder-Leiby, & Liu, 2011). The raw data were transformed into 1,0 data for further analysis.

| Chromosome counts and DNA ploidylevel estimation
The method of Sang (2002) was used for chromosome counts.
However, the chromosomes of this species are very small in size and difficult to identify even under a high-power microscope. Thus, the PloidyInfer v1.1 (Huang, Ritland, Dunn, & Li, 2019) software was used to confirm and test the ploidy level of every individual of ambiguous genotype in mixed-ploidy populations. Confounding individuals were removed to make a single ploidy level for each population.  using FSTAT v2.9.3 (Goudet, 2002). Significance levels were corrected by the sequential Bonferroni method (Rice, 1989), repeated 740 times. The BayeScan v2.1 (Foll, 2012) program was used to detect outlier loci using the data converted by the PGDSpider v 2.0.1.3 (Lischer & Excoffier, 2011) software.

| Genetic analysis and population structure
The polymorphism information content (PIC) of each primer was calculated to estimate the allelic variation of SSRs according to the formula: F I G U R E 1 Regional and estimated genetic structure for K = 2 for 72 populations of G. pentaphyllum. (a) Individual assignment to two clusters for all 72 populations was visualized as pie charts. Each population was partitioned into several colored parts proportionally to its membership in a given cluster; colored rings around the pie charts represented the ploidy of each population (gray: diploid; light blue: tetraploid; orange: hexaploid; purple: octaploid). (b) STRUCTURE plot presented for K = 2. Each vertical bar represents a population and its assignment proportion into one of two (colored) population clusters (K). The arrows represented the migration paths where Pi is the frequency of the ith allele for a given SSR marker, and n is the total number of alleles detected for that SSR marker (Botstein, White, Skolnick, & Davis, 1980). The genetic diversity indices (mean number of alleles; Na, number of effective alleles; Nae, allelic richness; Ar, observed heterozygosity; Ho, expected heterozygosity over all loci; He, gene diversity with unordered alleles; h, and individual inbreeding coefficient; Fi) of each locus and population were estimated by SPAGeDi (Hardy & Vekemans, 2002), and GenALEx 6.5 (Peakall & Smouse, 2006) (Pritchard, Stephens, & Donnelly, 2000), which employs Bayesian clustering analysis, was used to analyze the genetic structure, analysis followed the admixture model with independent allele frequencies. Ten independent simulations were run for K from 1 to 12 with 100,000 burn-in steps followed by 1,000,000 MCMC steps. Two alternative methods were utilized to estimate the most likely number (K) of genetic clusters with the online program STRUCTURE HARVESTER (Earl, 2012) by tracing the change in the average of log-likelihood L(K) as suggested by Pritchard et al. (2000) and by calculating delta K (ΔK) according to Evanno, Regnaut, and Goudet (2005). The ArcMap v10.0 and DISTRUCT v1.1 (Rosenberg, 2004) software packages were used to create the distribution of pie charts and bar charts for the data derived from the STRUCTURE analysis.
Analysis of molecular variance (AMOVA) and the fixation indices calculation in Arlequin 3.5 (Excoffier & Lischer, 2010) were used to investigate the extent of genetic differentiation among populations. Calculations were made using four levels of data grouping: (1) species level; (2) ploidy level; (3) two clusters; and (4) five clusters, based on the results of the STRUCTURE analysis, respectively. The significance of the fixation indices was tested using 10 4 permutations. StAMPP (Pembleton, Cogan, & Forster, 2013), which is an R package for calculation of genetic differentiation and structure of populations with mixed-ploidy level, was used to calculate the genetic distance and pairwise F ST .
The online software Isolation by Distance Web Service version 3.23 (http://ibdws.sdsu.edu; Bohonak, 2002;Jensen, Bohonak, & Kelley, 2005) was used to perform a Mantel test (Mantel, 1967) with 10,000 permutations to detect the relationship between genetic distance and geographic distance among populations, and to determine the possible role of isolation by distance (IBD) in the formation of the current population structure. Principal coordinate analysis (PCoA) was performed based on the genetic distance between pairwise populations.

| Effective population size and migration
The software Migrate-n v3.6 (Beerli, 2005)  where N e is the effective population size). Uniform priors and metropolis sampling with 10 short and 1 long chain with 50,000 and 500,000 iterations, respectively, were used to investigate genealogies. Genealogies were sampled 100 steps apart, and the first 1,000 were discarded. The gene flow and number of migrants per population (N m ) were estimated from the values of M and Θ. Before running the program, the results of STRUCTURE were used to define 2 and 5 clusters. The effective population size (N e ) per population was estimated using an average mutation rate for microsatellites of 5 × 10 −4 (Schlötterer, 2000;Selkoe & Toonen, 2006). The Bayesianbased program BAYESASS v3.0 (Wilson & Rannala, 2003) was used to estimate contemporary migration rates among the clusters (over the last few generations, mc), with a sampling frequency of 1,000. Model goodness of fit was evaluated using the area under the receiver operating characteristics curve (AUC). An AUC score above 0.7 was considered to indicate good model performance (Fielding & Bell, 1997).

| Samples and loci assessment
The ploidy of every individual was tested, and mixed ploidies were recognized in a few populations. Confounding individuals were removed to give a single ploidy for each population. Further analyses were performed on the remaining 1,020 individuals from 72 populations.
After Bonferroni corrections, significant deviation from HWE induced by homozygote excess was detected in most populations (Table A2), and the excess was mainly distributed in loci SSRE3, SSR2, and SSRE2. There was no evidence for LD, but 27 null alleles were found to exist in all loci. The null alleles were regarded as missing data in subsequent analysis. The PIC value of 10 loci ranged from 0.141 to 0.830, with an average value of 0.595 (Table 1). Among these, the values of three loci (SSR1, SSR5, and SSR6) were less than 0.5, indicating that the other seven primers are suitable for identification purposes. The mean values of Ar, He, Ho, F ST , and G ST were 3.543, 0.595, 0.334, 0.491, and 0.508, respectively ( Table 2, Table A2). These loci have a high level of genetic diversity and differentiation.

| Genetic structure and divergence
STRUCTURE analysis clearly differentiated the populations into two clusters: north cluster (N) and south cluster (S) with little admixture ( Figure 1). There is a clear peak in the value of DK at K = 2 and a small peak at K = 5 ( Figure 4b). Some populations also form clusters at K = 3 or K = 4, but these are inconsistent with high variance (Figure 4c,d).
The south cluster (S) was stable at higher K values, but the north cluster (N) showed some evidence for partitioning into further clusters: a northwest cluster (NW), a north-central cluster 1 (NC1), a north-central cluster 2 (NC2), and a northeast cluster (NE). Most of the ploidy populations fell into clusters with similar genotypes, rather than clusters with any other groups or a single group, except for two tetraploid populations in the northeast cluster (NJ and ZS).
The relationships between populations based on PCoA plots of pairwise Euclidean distances are consistent with the results of the STRUCTURE analysis. 31.38% is accommodated by the first three components, which separate all populations into their respective groups ( Figure 5). Components PC1, PC2, and PC3 account for 13.54%, 11.24%, and 8.60% of the total variance, respectively. The UPGMA tree based on a matrix of Nei's genetic distance among the 72 populations divided the accessions into five main branches, with three additional subclusters within those branches ( Figure 6). The branches and clusters are consistent with those of the STRUCTURE analysis.
The results of AMOVA reveal that the genetic variation is mostly within populations (

| Effective population size and population history
Our study indicated an asymmetrical pattern of historical gene flow among clusters. When the 72 populations are divided into two clusters, the mean migration rate (M) from the north (N) to the south (S) clusters is 11.323 and from S to N is 92.507. The gene flow (N m ) between the two clusters is also asymmetrical ( Table 5).
The value of M between pairs of clusters varies from 9.104 to 36.299 migrants when the populations are divided into five clusters. The effective population size ranges from 245 individuals in the south cluster (S) to 415 individuals in the northeast cluster (NE; Table 6). The highest value of N m was 4.680, calculated for migration from the south (S) to north-central 1 (NC1) cluster.
Bidirectional contemporary gene flows of the related pairs were symmetrical, with slight differences. The highest migration rate (0.157) was calculated for migration from the NW cluster to the S cluster; the S cluster provided less immigrations.

| Ecological niche modeling
Six bioclimatic variables were selected out of 19 for the ENM ( significant correlation, so these variables could be used for further analyses.
The AUC value, based on 10 times repeat, was 0.987, with a stan-

| Genetic divergence and diversity
Genetic diversity of a species reflects its evolutionary potential and allows for evolution and adaptation. The more abundant the genetic variation of a species is, the more adaptable it is. Thus, it is necessary to study the genetic diversity of a species to understand its biological properties (Grant, 1985). Subtropical China was a Pleistocene refugium for many ancient species during the Pleistocene glacial and interglacial cycles (e.g., Wang et al., 2009). Species in this region commonly have unique haplotypes, and the level of genetic differentiation among glacial refugia is usually high because of random allele fixation (Hewitt, 2000;Zhang et al., 2015).
In the current study, six SSR and four EST-SSR markers were used   (Ellstrand & Elam, 1993). A species with low genetic diversity lacks the evolutionary flexibility to cope with a changing ecological environment and is passive in longer-term evolutionary processes (Genton, Shykoff, & Giraud, 2005). Thus, a species with low level of genetic diversity is relatively more vulnerable to get extinction (Chen et al., 2015;Knox, Bezold, Cabe, Williams, & Simurda, 2016;Nolan, Noyes, Bennett, Hunter, & Hunter, 2010). In contrast, a high level of genetic diversity tends to be associated with successful ecological adaptation (Ortego, Noguerales, Gugger, & Sork, 2015).

| Genetic structure of G. pentaphyllum
Genetic distance is commonly used to describe the genetic structure of a population and the differences among populations (Nei, 1972 Patterns in genetic structure are produced by evolutionary and demographic processes at different temporal scales (Morris, Ickert-Bond, Brunson, Soltis, & Soltis, 2008). Factors such as mutation, migration, natural selection, and genetic drift, as well as the evolutionary history and biological characteristics of the species, combine to produce a nonrandomly distributed pattern of genetic variation in space and time. The evolutionary potential of a species or populations depends to a large extent on the genetic structure of the population (Loveless & Hamrick, 1984). The results of the STRUCTURE analysis performed for this study indicate that the most likely genetic structure of the 72 studied populations is either two or five clusters (Figures 1 and 4). With F I G U R E 4 Genetic structure for K = 3 to K = 5 for 72 populations of G. pentaphyllum. Bayesian inference analysis for determining the most likely number of clusters (K) for the distribution of (a) the likelihood L(K) values and (b) ΔK values was presented for K = 1-12 (10 replicates per K-value). (c-e) Individual assignment to 3-5 clusters for all 72 populations was visualized as pie charts. Each population was partitioned into several colored parts proportionally to its membership in a given cluster; colored rings around the pie charts represented the ploidy of each population (gray: diploid; light blue: tetraploid; orange: hexaploid; purple: octaploid). (f) STRUCTURE plots were presented for K = 3 to K = 5, respectively. Each vertical bar represents a population and its assignment proportion into one of three to five (colored) population clusters (K)

| Gene flow, migration, and diffusion
Mutation and genetic drift lead to genetic differentiation of local populations, and gene flow might promote evolution by spreading new genes (Slatkin, 1987) and producing changes in the spatial dis- The effective size of the populations that experienced bottlenecks is usually small, and the number of alleles and heterozygosity is expected to be correspondingly decreased. However, the observed heterozygosity is greater than the heterozygosity calculated from the number of alleles using the mutation-drift equilibrium; this phenomenon is known as heterozygosity excess (Piry et al., 1999).
The third trajectory was from the source to Hainan Island, through northern Vietnam and south China.
Results of the ENM analysis also suggest that G. pentaphyllum has expanded its distribution range continuously since the last interglacial period (LIG; Figure 9). Indeed, the genus Gynostemma is thought to have originated in "West Sichuan Central Yunnan old land," while southwest China is its modern center of distribution and diversity (Chen, 1995). Moreover, the recent expansion of G. pentaphyllum populations in China is from the southwest to the east and north. The north-south asymmetrical gene flow documented by this study is significantly greater than the flow from south to north. The largest recent gene flow was from the NW cluster to the S cluster; this observation suggests that the southern populations are of recent origin.

| Origin of polyploidization
Polyploidization is one of the most important evolutionary characteristics of plant species and a major driving force for the high diversity of angiosperms (Otto & Whitton, 2000;Soltis & Soltis, 1999). Approximately 47% of the angiosperm species and 80% of ferns have undergone polyploidization processes in their evolutionary history (Cui et al., 2006;Soltis, 2005). Compared with the diploid species, polyploid species may have broader niches and/ or larger distribution ranges (Ehrendorfer, 1980;  species originated from its diploid ancestor at least eight separate times and that it had undergone at least one geographic expansion in the origin of the polyploidy complex (Wu, Cui, Milne, Sun, & Liu, 2010). In general, the derivation of polyploidies from different diploid ancestors induces a high level of genetic variation and population differentiation in the polyploid species, which increases the genetic diversity of polyploidies through hybridization and genomic recombination events from the autopolyploid. The  The genus Gynostemma might have originated from the "West Sichuan Central Yunnan old land" in early Tertiary (Chen, 1995). Similar result was also found in the study of Galax urceolata (Servick, Visger, Gitzendanner, Soltis, & Soltis, 2015). Some polyploid populations are separated from adjacent diploid populations, such as the NJ and ZS tetraploid populations. It is speculated that a primitive genotype was preserved and doubled, and then adapted through the process of polyploidization. Such processes explain the geographic distribution pattern of coexisting polyploidy and diploid populations.
Polyploid species have inherent advantages because they can adapt readily to environmental changes and/or occupy new environments.
However, whether polyploidization occurred once or multiple times has not yet been determined. Therefore, further work on the origin and evolution of polyploidies in G. pentaphyllum, using modern phylogeography based on molecular methods, is necessary. The use of plant sequence fragments to construct geographic genetic distribution patterns for the different genetic backgrounds, and simulations of evolution using statistical population analysis, have the potential TA B L E 7 Information of the six ecological variables F I G U R E 9 Species distribution modeling using maximum entropy modeling of G. pentaphyllum. Predicted distributions were shown for two periods, that is, (a) the present time and (b) the LGM (21,000 years before present) periods. Color scales refer to logistic probability of occurrence, and black dots indicate the sampling sites

| Implications for conservation
Studies of the genetic diversity and genetic structure of species are important components of biodiversity conservation. Preferential conservation of populations with high diversity optimizes the potential of a species to adapt. However, populations with low genetic diversity should be protected from the threats that arise from evo-

CO N FLI C T O F I NTE R E S T
None declared.

AUTH O R S' CO NTR I B UTI O N S
G.Z. and Z.L. conceived the ideas; X.Z., H.S., J.Y., and L.F. contributed to the sample collection; X.Z. did the experiments, analyzed the data, and written the manuscript. All authors read and approved the final manuscript.

DATA AVA I L A B I L I T Y S TAT E M E N T
Details of all species used in this study including sampling information (longitude and latitude) and genetic diversity values are available in Tables A1 and A3. All the microsatellite genotyping data used in this study are available in Dryad.  Note: NA, alleles; Nae, effective alleles; AR (k = 8), allelic richness (expected number of alleles among eight gene copies); He, gene diversity corrected for sample size; Ho, observed heterozygosity; Fi, individual inbreeding coefficient; h, gene diversity with UNORDERED alleles; I, Shannon's Information Index = −1 × Sum (pi × Ln (pi)) where pi is the frequency of the ith allele for the population; Bn, bottleneck probability of Wilcoxon one-tailed test for heterozygote excess under PTM; L, normal L-shaped distribution; S, shifted mode.

O RCI D
a Populations with one private allele.
b Populations with two private alleles.