Genetic diversity and evolutionary patterns of Taraxacum kok‐saghyz Rodin

Abstract Taraxacum kok‐saghyz Rodin (TKS) is an important potential alternative source of natural inulin and rubber production, which has great significance for the production of industrial products. In this study, we sequenced 58 wild TKS individuals collected from four different geography regions worldwide to elucidate the population structure, genetic diversity, and the patterns of evolution. Also, the first flowering time, crown diameter, morphological characteristics of leaf, and scape of all TKS individuals were measured and evaluated statistically. Phylogenetic analysis based on SNPs and cluster analysis based on agronomic traits showed that all 58 TKS individuals could be roughly divided into three distinct groups: (a) Zhaosu County in Xinjiang (population AB, including a few individuals from population C and D); (b) Tekes County in Xinjiang (population C); and (c) Tuzkol lake in Kazakhstan (population D). Population D exhibited a closer genetic relationship with population C compared with population AB. Genetic diversity analysis further revealed that population expansion from C and D to AB occurred, as well as gene flow between them. Additionally, some natural selection regions were identified in AB population. Function annotation of candidate genes identified in these regions revealed that they mainly participated in biological regulation processes, such as transporter activity, structural molecule activity, and molecular function regulator. We speculated that the genes identified in selective sweep regions may contribute to TKS adaptation to the Yili River Valley of Xinjiang. In general, this study provides new insights in clarifying population structure and genetic diversity analysis of TKS using SNP molecular markers and agronomic traits.

D exhibited a closer genetic relationship with population C compared with population AB. Genetic diversity analysis further revealed that population expansion from C and D to AB occurred, as well as gene flow between them. Additionally, some natural selection regions were identified in AB population. Function annotation of candidate genes identified in these regions revealed that they mainly participated in biological regulation processes, such as transporter activity, structural molecule activity, and molecular function regulator. We speculated that the genes identified in selective sweep regions may contribute to TKS adaptation to the Yili River Valley of Xinjiang.
In general, this study provides new insights in clarifying population structure and genetic diversity analysis of TKS using SNP molecular markers and agronomic traits.

K E Y W O R D S
adaptation, population genetic diversity, population structure, rubber dandelion, SNP

| INTRODUC TI ON
Natural rubber (NR) is an important high-performance material, which has incomparable advantages over petroleum-derived synthetic rubbers in many applications requiring abrasion, heat dispersion, resilience, and other desirable properties (van Beilen & Poirier, 2007). As a valuable biopolymer, NR can be used to manufacture many rubber products, including latex gloves and tires (Cherian et al., 2019). To date, the production of NR in the world mainly comes from the Brazilian rubber tree Hevea brasiliensis (Cornish, 2017), and NR is also faced with fluctuating prices and increasing demand with the rapid economic development in many countries (van Beilen & Poirier, 2007). Therefore, we need to find some renewable NR materials to replace petroleum-derived products. Russian dandelion Taraxacum kok-saghyz Rodin (TKS) is considered to be a promising renewable NR material to replace petroleum-derived products (Clement-Demange et al., 2007).
It is native to Xinjiang areas in China and Kazakhstan, and often grows in salinized meadows, flood plain meadows, and farmland canals (Krotkov, 1945;Whaley & Bowen, 1947). After recognizing the importance and urgency of developing alternative sources for NR, many countries have collected a large number of TKS germplasm resources and strengthened the basic research on TKS (Luo et al., 2017). For example, Lin et al. (2018) had sequenced the reference genome of TKS.
Inulin content, rubber content, and root biomass of TKS population were also evaluated based on agronomic characteristics and molecular markers (Arias et al., 2016). Moreover, genetic resources play an important role in germplasm identification, breeding strategy and of the genetic diversity and evolutionary patterns in TKS is extremely important for the constitution of repository fields, obtained from the selection of genotypes collected from wild plants; the genetic diversity needs to be assessed and eventually integrated; finally, a rational conservation program for TKS germplasm resources needs to be associated with a genetic diversity evaluation.
High-throughput molecular markers, such as amplified fragment length polymorphism and random amplified polymorphic DNA, are effective in elucidating and identifying genetic backgrounds (Abdollahi et al., 2015;Bhagyawant, 2016;Cheng et al., 2015;Fu et al., 2015). These marker systems are economical, simple, and automated compared with whole-genome resequencing, but not numerous enough to saturate large populations. With the rapid development of deep sequencing technologies, the aforementioned shortcomings of these traditional markers are being resolved.
These emerging sequencing technologies make it possible for highthroughput identification of SNPs in many species, including those without reference genomes (Rimbert et al., 2018;Zhou et al., 2017).
For example, SNPs, an emerging molecular marker type, are efficient and powerful for population genetics studies. These approaches have facilitated the whole-genome resequencing for several hundred lines, marker genotyping platforms, the development of high-density genetic maps, and markers related to agronomic traits (Varshney et al., 2019). Nowadays, we can identify causal genetic features for breeders to perform biological interventions by integrating genotype and phenotype data effectively (Ramstein et al., 2018;Wallace et al., 2018;Zhang et al., 2018), such as genome-wide association study (GWAS). Moreover, high-throughput SNP molecular markers have not been used to evaluate the genetic diversity and evolutionary patterns of TKS germplasm resources worldwide so far (Cherian et al., 2019).
In this study, a population evolutionary analysis of 58 TKS individuals collected from different distribution regions was implemented to (a) characterize TKS population genetic structure and (b) identify candidate genes under natural selection for TKS germplasm adaptation to the Yili River Valley based on SNP markers developed using Illumina GBS approach.

| Germplasm collection and DNA extraction
Fifty-eight wild TKS individuals were collected from four different geographical regions worldwide (

| GBS sequencing
The extracted DNA was quantified by a Nanodrop 2000 UV-Vis spectrophotometer (Thermo Fisher Scientific) and then incubated with MseI (New England Biolabs), T4 DNA ligase (NEB), ATP, and the Y-adapter N containing a barcode. The digestion was conducted at 37°C and heated at 65°C to inactivate the enzymes. Restriction digestion-ligation reactions were completed in the same tube and then further digested with NlaIII (NEB) and EcoRI (NEB) at 37°C.
The restriction digestion-ligation samples were purified using the Agencourt AMPure XP System. Each clean read was checked using a Perl script to identify whether a read begins with a TAA site that can be recognized by the restriction enzyme MseI. The percent completeness of enzyme digestion equals the number of clean reads that contain a TAA site divided by the total number of clean reads times 100. The efficiency of enzymatic digestion for each sample was calculated in this manner. PCR amplifications were carried out in a single tube with purified samples and Phusion Master Mix (NEB) after adding universal primer and index primer to each sample. The PCRs were purified using Agencourt AMPure XP (Beckman) and pooled, then run out on a 2% agarose gel. A Gel Extraction Kit (Qiagen) was used to isolate 220-450 bp fragments (with indexes and adaptors).
These fragments were then purified using the Agencourt AMPure XP System, and the resulting products were diluted for sequencing. Finally, paired-end sequencing was performed on the selected tags using an Illumina 2500 platform (Illumina) by Novogene Bioinformatics Institute.

| Genotype calling and SNP identification
Raw reads were estimated for GC percentage (%) and phred score (Q30), and then quality-filtered using Stacks v2.55 with min_ maf = 0.05 and max_obs_het = 0.70 (Catchen et al., 2013). After quality control, clean reads were mapped to the TKS reference genome sequence (dandelion line 1151) (Lin et al., 2018) using BWA 1.0 with parameters "mem -t 4 -k 32 -M" (Li & Richard, 2009). Then, sequence alignment SAM files were further converted into binary BAM files with SAMtools . Picard were used to remove the potential PCR duplications for reducing the mismatches generated.
The sorted BAM files were eventually used to perform SNP calling by using SAMtools and GATK4.0 softwares. A filter was performed to ensure the accuracy of SNP variants by using VCFtools with the following parameters: --maf 0.01 --max-missing 0.7 --min-alleles 2.
Finally, only these high-quality SNPs were taken for further analysis.

| Population structure analysis
The filtered high-quality SNPs (total 524,812) were used to perform the TKS population genetics analyses. The phylogenetic tree of 58 TKS individuals was reconstructed using the MEGA-X  by neighbor-joining algorithm under the p-distance model with 1,000 bootstrap. To carry out the population structure analysis, admixture software (Zhou et al., 2011) was applied based on the number of the most likelihood populations (K value) which was set from 2 to 10 with five iterations for each value of K. Both, length of burn-in period and the number of Markov Chain Monte Carlo (MCMC) repeat after burn-in were set at 100,000. The crossvalidation (CV) error rate of K value was analyzed and the K value corresponding to the minimum value of CV error is considered as the best-fit K value based on Evanno's method (Evanno et al., 2005).
PCA (principal component analysis) of total 58 individuals were performed using GCTA v1.93 (Yang et al., 2011). Stacks v2.55 was used to compute the patterns of genetic differentiation and nucleotide diversity with the weighted average of nucleotide diversity (p) and fixation index value (F ST ) in 150-kb windows. The genomic regions with simultaneous top 5% p ratios and top 5% F ST values were selected as selective region signals. Treemix (Pickrell & Pritchard, 2012) was used to infer the patterns of TKS population splits and mixtures in the history of populations. The molecular variance (AMOVA) test was calculated using GenAlEx software (Peakall & Smouse, 2012).

| Identification of candidate genes related to geographic differentiation
To identify potential selective signatures during TKS evolutionary process, we scanned genomic regions using fixation index value and nucleotide diversity methods. F ST values (window size = 10 kb and step = 5 kb) and P value (θπ-sample/θπ-control) were adopted to discern the candidate regions responsible for the differentiation among the TKS populations with software vcftools v4.2 (Danecek et al., 2011) in this study. The candidate genome regions in the top 5% of empirical distribution of p and significantly higher F ST were considered as strong selective sweep that may associate with geographic differentiation or natural selection. Genes overlapped with these strong selective sweeps were then defined as candidate genes associated with geographic differentiation. The corresponding Tajima's D values of each selective genomic region were estimated by ANGSD (Korneliussen et al., 2014). The GO and KEGG pathwayenriched analysis of candidate genes were carried out using AgriGO v2.0 (Tian et al., 2017) and KOBAS 2.0 (Xie et al., 2011), respectively.

| Agronomic traits diversity and cluster analysis
The ANOVA results of eight agronomic traits showed that 58 TKS individuals were significantly different in terms of their agronomic traits. The highest range of phenotypic variance (60.37) was observed in BF, whereas the least value (0.21) was attributed to LT (Table 1, Table S1). The highest phenotypic coefficient of variation (42%) was denoted to SN (Table 1, Table S1). Scape yield of per plant ranged from 3 to 30 (Table 1, Table S1). The high values of CD and LL were in TS location, and the high BF value was found in Kazakhstan.
Locations TS and Kazakhstan also showed the highest LW and LT.
For SN, SL, and SD, the values were evenly distributed among the individuals in different locations ( Table 1)

| Genomic variants of TKS population
In this study, 58 TKS individuals from four different regions in the world, including Zhaosu County of Xinjiang in China, Tekes County of Xinjiang in China, and Kazakhstan, were selected to explore the genomic diversity by GBS sequencing (Table S1) (Table S2).

| Candidate genes of natural selection regions during TKS evolution
To identify potential selective sweep during wild TKS evolutionary process in geographically distant Kazakhstan and Xinjiang, the distribution of p and F ST were used and the genome regions with extremely low or high p ratio and high F ST value (top 5%   Table S6).

| D ISCUSS I ON
Diversity evaluation of TKS germplasm in different regions provides important breeding information for the utilization of genetic diversity. However, compared with other crops, the population genomic research of TKS has been largely limited due to lack of applicable abundant molecular markers. The combination of SNP molecular markers and agronomic traits for evaluating the genetic diversity of species is helpful to precisely analyze genetic diversity and population structure within species (Ambreen et al., 2018;Arzani & Ashraf, 2016). This method has been widely used in genetic diversity analysis in various crops such as safflower (Golkar et al., 2011), Nigella sativa L. (Golkara & Nourbakhshb, 2019), and Triticum urartur , whereas the genetic diversity of TKS has not been studied by this method nowadays. Therefore, 58 TKS individuals were collected in this study and divided into three different groups based on SNP molecular markers and agronomic traits. 80 candidate genes under natural selection for TKS germplasm adaptation to the Yili River Valley were further identified.
Based on the agricultural morphological data, there is a logical similarity between the TKS individuals assigned to the same group and their geographic locations, although a few individuals in C and D population were joined into population AB. It was consistent with the results of phylogenetic tree and principal component analysis based on SNPs molecular marker. And gene introgression from C population to A population and D population to B population also occurred in this study. Hence, we speculate the phenomenon could be due to the influence of various factors, such as stable genetic mutations, substitution or mixture of germplasm across the areas that migrate over long distances, inter-regional plant material exchange, gene flow, climate adaptation and environmental influences on genetic variation (Ramanatha & Hodgkin, 2002) or the profound influence of environmental factors, similar to reports of wheat (Najaphy et al., 2012) and safflower (Golkar et al., 2011). Previous research found that common origin, convergent evolution, and subsequent natural selection may result in accessions from separate regions clustered in a common group (Reeves et al., 2012). Additionally, mutations, recombination, the number of active alleles, genetic drift, and genetic structure also could influence the amount of variation within a population (Ambreen et al., 2018). It is important to obtain a lot of individuals representing the highest possible genetic dis-  (Hu et al., 2017;Kurowska et al., 2020;Wingler et al., 2020). Therefore, the genetic diversity and selection pressures may enable the TKS to adapt to a variety of environmental conditions in different geographical regions.

| CON CLUS I ON S AND OUTLOOK
The results in this study revealed the potentials of SNP markers in evolutionary studies, including genotype distinctness and population genetics. Meanwhile, it points out the scope and direction of further TKS research, such as genome-wide association study of rubber contents traits based on high-throughput sequencing by using a wide range of accessions from each geographical location.
The good varieties of TKS individuals could improve the rubber production capacity worldwide. With the increasing demand for NR and limitations of H. brasiliensis production systems, genetic engineering approaches to generate NR-enriched genotypes of alternative NR plants based on the identification of rubber candidate genes are of great importance.

ACK N OWLED G M ENTS
We thank Dr Xiaoxian Chen for helping to polish the English language.

CO N FLI C T O F I NTE R E S T
None of the authors have conflicts of interest.