Genome and evolution of the shade‐requiring medicinal herb Panax ginseng

Summary Panax ginseng C. A. Meyer, reputed as the king of medicinal herbs, has slow growth, long generation time, low seed production and complicated genome structure that hamper its study. Here, we unveil the genomic architecture of tetraploid P. ginseng by de novo genome assembly, representing 2.98 Gbp with 59 352 annotated genes. Resequencing data indicated that diploid Panax species diverged in association with global warming in Southern Asia, and two North American species evolved via two intercontinental migrations. Two whole genome duplications (WGD) occurred in the family Araliaceae (including Panax) after divergence with the Apiaceae, the more recent one contributing to the ability of P. ginseng to overwinter, enabling it to spread broadly through the Northern Hemisphere. Functional and evolutionary analyses suggest that production of pharmacologically important dammarane‐type ginsenosides originated in Panax and are produced largely in shoot tissues and transported to roots; that newly evolved P. ginseng fatty acid desaturases increase freezing tolerance; and that unprecedented retention of chlorophyll a/b binding protein genes enables efficient photosynthesis under low light. A genome‐scale metabolic network provides a holistic view of Panax ginsenoside biosynthesis. This study provides valuable resources for improving medicinal values of ginseng either through genomics‐assisted breeding or metabolic engineering.


Introduction
Roots of Asian/Korean ginseng have been used for thousands of years, today being an important Asian agricultural commodity with markets (together with P. quinquefolius, American ginseng) estimated at over 2 billion USD (Baeg and So, 2013). Panax species are shade-requiring perennials (Court, 2000). Most diploid Panax such as P. notoginseng, P. vietnamensis, P. bipinnatifidus, P. stipuleanatus and P. pseudoginseng grow at high altitudes in warm freeze-free areas from the Eastern Himalayas through Southern China to north and central highlands of Vietnam. Tetraploid P. ginseng and P. quinquefolius overwinter and are broadly distributed in Northeast Asia and North America, respectively.
Therapeutic effects of P. ginseng on neurodegenerative disorders (Cho, 2012;Radad et al., 2006), cardiovascular diseases (Zheng et al., 2012), diabetes (Xie et al., 2005) and cancer (Jung et al., 2012;Wong et al., 2015) are often attributed to unique saponins called ginsenosides, glycosylated triterpenes classified as either dammarane-(Panax-specific) or oleanane-type based on aglycone skeletal structure. Ginsenosides are accumulated in roots, leaves, stems, flower buds and berries, in quantities varying with tissue Shi et al., 2007), age (Shi et al., 2007;Xiao et al., 2015), environment Kim et al., 2014) and cultivar (Lee et al., 2017b). Limited genomic resources and genetic populations due to slow growth (~4 years/generation), sensitivity to environmental stresses and low seed yield (40/generation) hamper developmental and genetic studies and breeding. Therefore, less numbers of ginseng cultivars were developed and those cultivars were not pure inbred line, containing some heterogeneity because seeds were multiplied by pedigree selection.
Here, we report a draft genome sequence of P. ginseng cultivar (cv.) Chunpoong (ChP), which is the first cultivar officially registered in Korea Seed and Variety Service and showed relatively uniform genotypes . Investigation of the P. ginseng genome and comparative analyses with carrot (Daucus carota; Iorizzo et al., 2016) and other plants allowed us to gain new insights into evolution and speciation, also clarifying the origin and regulation of ginsenoside accumulation. These discoveries provide a valuable foundation for improving therapeutic effects, understanding shade plant biology and empowering Araliaceae genomic studies.

Genome assembly and annotation
Paired-end (PE) reads covering 746 Gbp (2069) and mate-pair (MP) reads covering 365 Gbp (1019) from ChP (Table S1) were assembled into 9,845 scaffolds covering 2.98 Gbp with N50 of 569 Kbp and longest scaffold of 3.6 Mbp (Table 1; Table S2). The predicted P. ginseng genome size ranged from 3.3 to 3.6 Gbp through flow cytometry and k-mer frequency, slightly bigger than the reported 3.12 Gbp (Hong et al., 2004). Assembly accuracy and completeness was indicated by correct read mapping of four MP libraries revealing proper span size (Table S3); alignment to 13 finished bacterial artificial chromosome (BAC) sequences (Choi et al., 2014;Jang et al., 2017) showing 99% homology with perfect contiguity (Table S4); and Benchmarking Universal Single-Copy Orthologs (BUSCO_v2) analysis finding 1339 (93%) of 1440 conserved orthologous angiosperm genes assembled completely (Table S5). Evidence-based de novo annotation revealed 2181 Mbp (79.52%) of repetitive elements (REs) including long terminal repeat retrotransposons (LTR-RTs) being most abundant with LTR/Gypsy accounting for 49% and one PgDel family alone occupying 30% of the genome (Table S6).

Genome structure and evolution
Analysis of P. ginseng paralogs revealed two WGD events, at 2.2 Million Years Ago (MYA, called Pg-a), and 28 MYA (Pg-b), consistent with previous reports (Choi et al., 2013(Choi et al., , 2014Kim et al., 2017; Figure S4). Genes and flanking regions were highly collinear between paralogous Pg-a WGD scaffolds (Figures 1a, S5 and S6), with 95% sequence similarity except in repeat-mediated InDel regions (Figure 1a), whereas only genic regions have collinearity between Pg-b WGD scaffolds. Utilizing Pg-a WGD paralogs, a zigzag approach to identify putative contiguous counterpart scaffolds for ordering ( Figure S5), established 453 recently duplicated collinear blocks. The collinear blocks consist of 1344 scaffolds covering 601.2 Mb with 29 953 genes. Our updated sequence scaffold (v0.8 to v1.0) and FISH analysis validated the zigzag approach for combining adjacent scaffolds, with some exceptions due to chromosomal rearrangement (Figures 1b and S7).
Gene sets of P. ginseng and four dicots (Arabidopsis, grape, tomato, carrot) were characterized by OrthoMCL (Li et al., 2003) and 1697 common orthologous gene clusters used to calculate Ks value for clarifying major evolutionary events ( Figure S4). Ginseng was estimated to have diverged from carrot (Apiaceae)~51 MYA and Pg-a and Pg-b occurred independently from a WGD in the carrot lineage (Dc-a) (Figure 1c). Manual ordering of P. ginseng scaffolds based on the carrot genome enabled us to construct 18 artificial counterpart superscaffolds ( Figure 1d). Four paralogous ginseng blocks show collinearity with two carrot chromosomes ( Figure 1e). The 18 ginseng superscaffolds showed the genomewide Pg-a WGD and biased distribution of genes and repeats. More SNPs are also identified from gene-rich regions (Figure 1f). To understand evolution and speciation of Panax and its Araliaceae relatives, we obtained the complete chloroplast genomes and 45S nrDNA sequences from ten species as well as carrot. We included our previous data (Kim et al., 2015b(Kim et al., , 2016a(Kim et al., , b, 2017, and added newly generated chloroplast genomes and 45s nrDNA sequences from two more Panax species, (Table S14; Figure S8). Phylogenomic analysis of those 10 chloroplast genomes and 45S nrDNA genes indicated that the Panax-Aralia lineages diverged~7.50-7.97 MYA, following Panax speciation (Figures 2a and S9). Tetraploids P. ginseng and P. quinquefolius formed~2.59 MYA (Figure 2a). Five uniquely enriched LTR-RT families (PgDel, PgTat, PgAthila, PgTork and PgSire;Choi et al., 2014;Jang et al., 2017;Lee et al., 2017a) occupy >50% of the genome. PgDel LTR-RTs largely account for genome size variation among seven Panax species (Figure 2b,e). P. ginseng has doubled orthologous sequences compared to diploid P. notoginseng (Zhang et al., 2017), and their modal Ks value (0.035, Figure 2c,d) implies divergence 2.62 MYA, which is similar value to chloroplast analysis (Figure 2a). A unique high copy En/Spm-like CACTA transposon (PgCACTA1) encoding two transposase genes Figure 1 Panax ginseng genome structure and evolution. (a) Relationship between four paralogous blocks resulting from two WGD events. Block 1 composed of Pg_scaffold0266 and Pg_scaffold2259, Block 2 contained Pg_scaffold0762 and Pg_scaffold0978, Block 3 had just one scaffold, Pg_scaffold0701, and Block 4 comprised reverse of Pg_scaffold2798 and Pg_scaffold1958. (b) FISH analysis to confirm the chromosomal locations of scaffolds inferred to be adjacent by zigzag alignment from counterpart scaffolds. FISH probes designed to validate adjacent scaffolds, and applied to pachytene chromosomes indicated that Pg_scaffold0266 and Pg_scaffold2259 (i, ii) were adjacent but Pg_scaffold0762 and Pg_scaffold0978 were on different chromosomes (iii). (c) Evolutionary history of P. ginseng. The pale blue triangles signify species number in the Araliaceae (1500) and Apiaceae (3700). (d) Construction of 18 virtual superscaffolds based on Daucus carota. The artificial counterpart superscaffolds of P. ginseng were twice the number of the corresponding D. carota superscaffolds, because of Pg-a WGD. (e) Syntenic analysis between P. ginseng and D. carota. The seven scaffolds described illustrated chromosomal rearrangements relative to two D. carota regions. (f) Circular map of 18 virtual superscaffolds of P. ginseng and distribution of SNPs with cv. YuP (A), repeats (B) and genes (C). Total identified repeats (red lines) and major LTR-RT family, PgDel (blue lines).  [1904][1905][1906][1907][1908][1909][1910][1911][1912][1913][1914][1915][1916][1917] and maintaining 31-bp conserved terminal inverted repeats (TIR) has highly diverse copy number of Pg167TR (Waminal et al., 2016) in the last intron of the second transposase (Figures 3a and S10), providing a molecular barcode for identification of individual chromosomes ( Figure S11; Waminal et al., 2016). The genome proportions of PgCACTA1 and Pg167TR are diverse among Panax species and richer in tetraploids, especially P. ginseng ( Figure 3b). Comparative FISH analysis with P. notoginseng using probes from PgCACTA1 and Pg167TR transposase regions showed clear proliferation of Pg167TR in P. ginseng (Figure 3c) but little difference for the transposase regions between the two species.

Ginsenoside biosynthesis
Ginsenosides, the major pharmacologically active compounds of ginseng, are triterpene saponins, of which more than 150 have been isolated from Panax plants (Christensen, 2009;Jia and Zhao, 2009). To characterize ginsenoside biosynthetic machinery and metabolic utilization, a genome-scale metabolic network was newly constructed based on established procedures (Thiele and Palsson, 2010), covering 4946 genes catalysing 2194 reactions and 2003 unique metabolites (Data S1) with a global overview in Figure S12.
Phylogenetic analysis of OSC families found DDS to be specific to P. ginseng (Figure 4b), suggesting that DDS and production of dammarane-type ginsenosides originated in Panax. Of 383 P. ginseng cytochrome P450 genes, two candidate protopanaxadiol synthase (PPDS) and two protopanaxatriol synthase (PPTS) genes were identified by homology search against curated PPDS (Han et al., 2011) and PPTS, respectively. In the last glycosylation step, 226 UGTs were annotated and eleven identified as candidate UGTs associated with elevated expression pattern upon methyl jasmonic acid (MeJA) treatment ( Figure S13), which is well-known elicitor for inducing secondary metabolites (Han et al., 2011(Han et al., , 2012. These candidate UGTs could be involved in synthesis of PPD-type ginsenosides, as MeJA triggers mainly PPDtype . The high ginsenoside contents for which older (above 4-6 years) P. ginseng roots are harvested might reflect transportation from shoot tissues rather than active biosynthesis. Downstream genes (SQE, DDS, PPDS and PPTS) in the ginsenoside biosynthetic pathway showed higher expression in leaves (1 year old and 5 years old) than roots (1-and 6-year-old main body roots, lateral roots and rhizomes; Figure 4c). Co-expression analysis across RNA-Seq samples from ChP showed that three highly expressed DDS genes among 20 OSC are co-regulated with several SQE genes, and disrupting function of either DDS or SQE affects P. ginseng ginsenoside production (Han et al., 2010;Tansakul et al., 2006). This implies that DDS and SQE may be important enzymes with which ginsenoside production coevolved. Indeed, higher expression of downstream genes in P. ginseng cultivars Cheongsun (CS) and Sunhyang (SH) than Sunun (SU; Figure 4c) is associated with higher ginsenoside content (Table S16). While many CPY450 and UGTs are not yet characterized with respect to different types of ginsenosides (Figure 4a), dynamic changes in expression of various genes were observed across the metabolic network ( Figure S14), providing a foundation for in silico analysis and ultimately empirical metabolic engineering.

(c) FISH analysis with
PgCACTA domain (i, iii) and Pg167TR (ii, iv) in P. notoginseng (i, ii) and P. ginseng (iii, iv). PgCACTA gene regions showed clear signals in both diploid and tetraploid species (i, iii), while Pg167TR showed very faint signals in P. notoginseng (ii) but showed highly abundant and distinct signals in P. ginseng (iv). Gene families responsible for environmental adaptation Differentially expressed genes (DEGs) were identified with two/ three biological replicates abiotic stress-treated RNA-Seq samples. In detail, 703, 152 and 23 genes were shown different expression in response to drought, cold and salt, respectively ( Figure S15). DEG analysis was also performed between nonheat-treated leaves and heat-treated (1 and 3 weeks) leaves of three replicates. In total, 1409 genes were identified as DEG after 1 and 3 weeks of heat treatment ( Figure S15). Altogether, 1880 genes were found to be differentially expressed (DE) and the numbers of DE genes including up-or down-regulated genes are represented in Figure S15. Majorly, fatty acid desaturase (FAD) and light-harvesting chlorophyll a/b binding (CAB) proteins were highly responsive to abiotic stresses including drought, salt, cold and heat. An unprecedented 85 FAD genes were found in P. ginseng, almost three times as many as in model annual plants (Table S17). Phylogenetic analysis revealed diverged FAD gene structures to include a Panax-specific subgroup (acetylenic FADs), and a carrot-and Panax-specific FAD-like subgroup ( Figure S16).
Many FAD orthologs (Figure 5a) in tetraploid P. ginseng cv. Yunpoong (YuP) were not found in diploids P. vietnamensis ( Figure 5b) and P. notoginseng, with only thirty-six P. notoginseng genes having orthologous relationships to P. ginseng FADs. The newly evolved P. ginseng FADs showed higher expression in Figure 4 Ginsenoside biosynthesis model and related genes in P. ginseng. (a) Overview of the ginsenoside biosynthetic pathway in P. ginseng. The blue coloured uridine 5 0 -diphospho-glucuronosyltransferases (UDP-glucuronosyltransferase, UGTs) are unknown enzymes involved in the glycosylation of ginsenosides. Reaction and metabolite abbreviations can be found in Data S1. (b) A phylogenetic tree of oxidosqualene cyclases (OSCs) in P. ginseng. OSC genes, including dammarenediol synthase (DDS), b-amyrin synthase (b-AS), lanosterol synthase (LSS) and cycloartenol synthase (CAS), were identified from P. ginseng (red), D. carota (black), S. lycopersicum (green), A. thaliana (blue) and V. vinifera (cyan) by KEGG and BLASTP searches. (c) Heatmap shows TMM normalized expression values of putative downstream genes involved in ginsenosides biosynthesis. Expression in above-ground tissue (S1: immature fruit, S2: mature fruit, S3: flower, S4: 1-year-old leaves, S5: 5-year-old leaves, S6: 6-year-old stem) and subterranean parts (R1: 1-year-old main body roots, R2: 6-year-old main body roots, R3: 6-year-old lateral roots, R4: 6-year-old rhizomes, R5: 6-year-old dormant roots) are depicted. Similarly, expression of downstream genes is shown between adventitious roots of P. ginseng cultivars, CS, SH and SU. An unprecedented 49 CAB genes were found in P. ginseng, with family expansion due to retention of whole genome duplicated copies Figure S17). All 49 CAB genes showed expression, albeit in various tissues, with significantly increased expression in leaves and decreased expression during abiotic stresses, especially drought and heat ( Figure S18). All CAB orthologs were found in both P. ginseng and P. vietnamensis ( Figure S19). The expansion of Panax CABs is consistent with shade adaptation, enabling efficient photosynthesis in low light. Some TF families showed P. ginseng-specific expansion, notably, FAR1 (far-red-impaired response), HRT (Hordeum repressor transcription) and CSD (cold-shock domain) families ( Figure S20; Table S11), indicating that expanded regulatory capacity also contributed to shade and cold adaptation.

Discussion
The genome sequence of P. ginseng opens a route to functional and molecular breeding of economically important herbaceous perennials within the Araliaceae family. The genome sequence covers~80% of the estimated genome size (~3.6 Gbp) and identified two rounds of WGDs unique in the Araliaceae family. The recent, 2.2 MYA, WGD event (Pg-a) contributed substantially to duplicated genes and genome structure of P. ginseng, with gene number about twice that of diploid P. notoginseng and other diploid plants. Following this recent WGD, 99% and 95% of homology showed between paralogous genes and its flanking regions, respectively, except TE-mediated sites complicating genomic analysis in P. ginseng. Like other plants, LTR-RTs were most abundant in ginseng genome in which LTR/Gypsy accounted for 49%, especially one PgDel family extremely abundant occupying 30% of whole genome sequence. Cytogenetic mapping of major P. ginseng TEs revealed hybridization of different repeat families to different chromosomal niches ( Figure S21). PgDel1 hybridized to the entire chromosomes, supporting their predominant abundance in the ginseng genome (Choi et al., 2014).
Insertion time estimation using LTR sequence of intact major LTR-RTs indicated that most of LTR-RTs were expanded recently after Pg-a WGD in P. ginseng ( Figure S22). We also assumed that one more expansion of major LTR-RTs occurred around 5-6 MYA, according to repeat GP of P. stipuleanatus and P. trifolius being half of the others diploid Panax species (Figure 2a,b). Although Class II TEs generally have lower genome proportion (GP) than Class I, they are known to be important gene regulators in a genome (Gao et al., 2016). We identified a novel En/Spm (CACTA) element PgCACTA1 in the ginseng genome and its insertion at high AT regions and conservation of the TIR sequences with other PgCACTA elements indicate its relatively recent insertion. Comparative analysis of PgCACTA abundance among Panax species showed preferential expansion of the Pg167TR in tetraploids, particularly in P. ginseng, whose genome contains 1.3% of PgCACTA (Figure 3b). Comparative FISH data between P. ginseng and a diploid relative, P. notoginseng, supported the expansion of Pg167TR in P. ginseng (Figure 3c). The amplification pattern of PgCACTA and Pg167TR suggests a tetraploid lineage-specific evolutionary pathway associated with the recent Pg-a WGD. These data imply that during the Pg-a WGD, PgCACTA amplification could have been triggered in response to genomic shock as in other plants (Fedoroff and Bennetzen, 2013;Kalendar et al., 2000). Concomitant to this amplification was the amplification of Pg167TR, which led to the distinct chromosomal loci in tetraploid ginseng. Further, we postulate that major evolution events in Panax species including two rounds of WGD and intercontinental species migrations were related to recurrent glaciations (ice ages) and global warming. The estimated 51 MYA Araliaceae-Apiaceae divergence falls early in the Eocene (56-34 MYA) global warming (Figure 1c), with the 1500 Araliaceae species (Gao et al., 2013) proliferating following Pg-b WGD 28 MYA. Complete chloroplast genome-based molecular clocks suggest the history of recent divergence of Panax species. The common diploid Panax ancestor, which was a heat-susceptible shadeloving plant, was distributed over the Qinghai-Tibetan Plateau by divergence with the Aralia genus~7.5 MYA (Li and Wen, 2016). Panax trifolius is the unique Panax diploid in North America and was estimated to diverge~6.6 MYA, prior to divergence of the other four Panax diploid species in Asia, suggesting that diploid Panax species proliferated to North Asia and crossed into North America during that period (Figure 2f). Pliocene (5.33-2.58 MYA) speciation of diploid Panax was associated with global warming, while Pleistocene (~2.58 MYA) glaciation was associated with their extinction. Allotetraploidization between these diploids occurred sequentially, and an allotetraploid ancestor of the current P. ginseng may have survived in Northeast Asia by gaining overwintering ability. Cold-susceptible Panax diploids may have been isolated at high altitude in warm Southern Asia, favouring speciation (heat-island effect), with~10 extant diploid Panax species at risk of extinction from global warming. The ancestor of P. quinquefolius may have migrated to North Americã 1.2-0.8 MYA in a glacial period (Figure 2f; Choi et al., 2013). For a better understanding of the metabolic paradigm in P. ginseng, a genome-scale metabolic network was reconstructed in this study, which leads to in silico metabolic engineering that could predict candidate genes associated with overproduction of desired metabolites and thus accelerate overall metabolic engineering process. Only plants in the genus Panax actively biosynthesize various types of ginsenosides (Kim et al., 2015d), which was explained by the taxonomic specific origin of DDS genes. We have also demonstrated the candidate genes including DDS and SQE controlling the accumulation of ginsenosides with transcriptome and metabolome data. These results provide essential targets to increase the production of ginsenosides through latest biotechnological approaches.
The recent allotetraploidization event (Pg-a) might have promoted environmental adaptation such as survival of freezing temperatures. A well-characterized phenomenon demonstrates that temperature modulates membrane fluidity, which is the major site of freezing injury (Shewfelt, 1992;Thomashow, 1999). It is also known that the role of FADs in cold acclimation in various plant species (Khodakovskaya et al., 2006;Rom an et al., 2012;Thomashow, 1998). In addition, the divergent FAD genes have been associated with synthesis of divergent fatty acid structures that play major role against biotic/abiotic stresses (Cao et al., 2013). As compared to diploid ginseng, the polyploid ginseng species such as P. ginseng and P. quinquefolius have been commonly found in the habitat of Northeast Asia and North America, respectively, where freezing temperature prevails in the winter. Therefore, the expansion of FAD genes with diverse FAD structures in P. ginseng or polyploidization of ginseng species might have led to freezing tolerance.
Light is a limiting factor for the ginseng cultivation and plays role in ginsenoside production. Ginseng has been grown under canopy or artificial shade; however, the reason behind this process is largely unexplored. It is obvious that the ginseng plant should have acquired a novel mechanism to ensure an efficient photosynthesis under low-light conditions. The light-harvesting chlorophyll a/b binding proteins (LHCPs or CAB) are the key components of the photosynthesis antennae complexes, which transfer the light energy to the reaction centres of photosystem I (PS I) and photosystem II (PS II) where the light energy is converted to form chemical bond energy (i.e. NADPH and ATP). Intriguingly, P. ginseng genome contains more CAB genes than any plant species to date, which was supported by RNA-seq expression ( Figure S18). Equivalently, a total of 53 genes (including pseudogenes) were also identified in the genome of brown algae (Ectocarpus siliculosus; Cock et al., 2010) and that expansion was attributed to adapt to variable or dim light conditions. We have also deduced the ability of ginseng plants to cope with low-light environments is related to its as-yet-unprecedented expansion in number of CAB genes, with decreased expression during drought and heat stresses. Intriguingly, estimation of presence/absence of orthologous gene copies in P. vietnamensis revealed the abundance of CAB genes in both shade plants, tetraploid and diploid ginseng species ( Figure S19).

Conclusion
The genome sequence clarifies the evolution, shade adaptation, and medicinal properties of P. ginseng. Two Araliaceae-specific WGDs played key roles in environmental (shade and freezing) adaptation and medicinal importance (dammarane-type ginsenoside production), the former also providing information that might apply to improvement of other cultigens. The widespread importance of collecting and cataloguing crop relatives is especially urgent in Panax, in which extant diploid relatives are at risk of extinction from global warming, progenitors of cultivated tetraploids are already extinct, and wild tetraploids are endangered by over-harvesting (Baeg and So, 2013;Court, 2000).

Methods
De novo sequencing, assembly and quality evaluation DNA from leaves of 4-year-old ChP, an elite Korean cultivar, was used for sequencing and assembly. The ChP was cultivated in a ginseng experimental field of research farm (College of Agriculture and Life Science, Seoul National University, Suwon, Korea) and used for isolation of genomic DNA and total RNA. To reduce heterogeneity, we used DNA from three individuals. Whole genome shotgun reads of ChP were generated using Illumina platform (HiSeq2000 and MiSeq) at National Instrumentation Center for Environmental Management (NICEM), Macrogen Co. (Seoul, Korea), and LabGenomics Co. (Seongnam, Korea). The five paired-end (PE) libraries (with 200-600 bp insert sizes) were sequenced into 746 Gbp for primary assembly, and the 365 Gbp was sequenced from four mate-pair libraries with 1.5 kb, 3 kb, 5 kb and 10 kb insert for scaffolding. First, low-quality reads and duplicated reads were eliminated using SOAPfilter 2.0 of SOAPdenovo package (Luo et al., 2012) with default parameter. Furthermore, low-frequency reads were eliminated based on kmer frequency by SOAPec 2.0 with KmerFreq_HA 2.0 and Corrector_HA 2.0, which cannot support for initial contig assembly. Genome size estimation was conducted by flow cytometry and 23 bp k-mer frequency analysis with JELLYFISH (Marc ßais and Kingsford, 2011). Taken together, the genome size of P. ginseng was estimated to range between 3.3 and 3.6. The k-mer frequency-based genome size, 3.6 Gbp, was used for further analysis and discussion for the genome composition. The genome assembly was mainly conducted using SOAPdenovo2. The contigs containing length over 1 kb and filtered mate-pair reads were used for scaffolding with SSPACE followed by error correction by in-house Perl scripts (Phyzen, Seongnam, Korea).

Validation of genome assembly
The assembled draft sequence was validated by mapping of MP reads and alignment with reported bacterial artificial chromosome (BAC) sequences (Choi et al., 2014;Jang et al., 2017). First, the 1382 million (M) of 1385 M filtered MP reads were mapped to assembled sequence through BWA (Li and Durbin, 2009) (v0.7.12) with default parameter, of which 536 M reads were mapped with paired ends. The assembled genome sequences were compared to 13 BACs composed of 15 contigs, which were sequenced using PacBio RSII platform and ABI3730 sequencer. Each scaffold matched with BAC clones sequence was identified through BLAST analysis and visualized with dotplot by PipMaker (Schwartz et al., 2000). Furthermore, the genome assembly completeness was validated using Benchmarking Universal Single-Copy Orthologs (BUSCO_v2; Simão et al., 2015).

Transcriptome sequencing and analysis
Tissues and cultivars used in this study were described in Table S7. Dormant roots with healthy rhizomes of 1-year-old cv. ChP plants were obtained from the Ginseng Research Division, National Institution of Horticultural and Herbal Science, Rural Development Administration (Eumseong, Korea). After storage for more than 1 month at 4°C to break dormancy breaking, the roots were planted in soil and grown for 4 weeks to become plants with fully expanded leaves under normal growth condition (24°C, relative humidity 60%, and continuous light of 40 lE/m 2 /s). These plants grown for 4 weeks were sampled as controls, immediately before stress treatment. For cold treatment, the plants were held at 4°C for 24 h (with relative humidity and light conditions the same as the normal growth condition). For salt treatment, pots with plants were submerged in 100 mM NaCl solution for 24 h to treat only root parts with salt stress (temperature, relative humidity and light condition were the same as the normal growth condition). For drought stress, plants were removed from soil and air-dried on 3MM paper for 24 h under the normal growth condition. For heat treatment, the plants were treated with 30 (AE1)°C for 1 week and 3 weeks (relative humidity and light conditions were the same as the normal growth conditions). After stress treatment, whole plant (leaves, stems and roots) were sampled, immediately frozen using LN 2 , and stored at À70°C before total RNA isolation.
Total RNAs from each sample were extracted using RNeasy Plant kits (QIAGEN, Hilden, Germany) and/or Hybrid-R kits (GeneAll, Seoul, Korea) according to the manufacturer's instructions, and used for construction of 300-bp PE libraries using an Illumina TruSeq RNA sample preparation kit according to the manufacturer's instructions. These libraries were pooled and sequenced by Illumina HiSeq2000 and NextSeq500 platforms (Table S7). The resulting RNA-Seq reads were mapped to the P. ginseng draft genome and assembled using HISAT (Kim et al., 2015a) and StringTie (Pertea et al., 2015), respectively. De novo assembly was performed using Trinity (Grabherr et al., 2011) to obtain full-length transcripts. All RNA-Seq samples were normalized using Trimmed Mean of M values (Dillies et al., 2013) (TMM).
Analysis of differential gene expression was performed using edgeR (Robinson et al., 2010) with false discovery rate (FDR)adjusted P-value of 0.01. Transcriptomes of 22 ChP samples including normal tissues and abiotic stress-treated samples were also analysed using 26 SMRT cells with P6-C4 chemistry of the PacBio RSII platform. Generated sequences were classified and clustered by the PacBio Iso-Seq analysis procedure (ver. 0.1) with default parameters (www.pacb.com) to generate high-quality (HQ) consensus isoform sequences (99% consensus accuracy based on Quiver). The HQ sequences were further processed to remove PCR chimeras and redundant sequences by cd-hit-est (Li and Godzik, 2006), and final HQ nonredundant (nr) isoform sequences were obtained based on genome positional coordinates.

Genome annotation
The IPGA pipeline was used for genome annotation, incorporating evidence from protein and RNA-Seq mapping and ab initio gene prediction to determine consensus gene models by EVM (Haas et al., 2008), that were curated using PacBio transcript sequences. The alternative splicing transcripts for the final curated protein-coding genes were identified using reference-based assembly generated by PacBio and Illumina sequencing data. Then, the reference-guided transcripts and annotated proteincoding genes were compared to identify novel isoforms using cufflink utility. Further, those novel isoforms were used to find the specific splicing events (i.e. skipping exon, mutually exclusive exons, alternative 5 0 or 3 0 splice site, retained intron and alternative first and last exon) using SUPPA (Alamancos et al., 2015). LncRNAs were identified from the reference-guided transcriptome assembly. From the total transcripts, transcripts with ORF ≥ 100 amino acids and length ≤200 nucleotides, having homology hit to the swiss-prot protein database, Pfam domains and other types of noncoding RNAs (tRNA, rRNA, snRNA, snoRNA), were discarded. Further, transcripts that span over 40% to repeat-masked genomic region and contained partially at protein-coding genes (IPGA v1.1) were discarded. Finally, the coding potential was accessed for the remaining transcripts using CPC with score ≤À1.0 and CPAT with score <0.39. A total of 19 495 lncRNA transcripts were identified using the above criteria. Transcription factor genes in the P. ginseng genome were identified and compared with corresponding genes in other plant genomes using iTAK 1.6b standalone (Zheng et al., 2016) with default parameters. A P. ginseng small RNA library generated by Mathiyalagan et al. (2013) was used for conserved miRNA prediction using mireap v0.2 (https://sourceforge.net/ projects/mireap/). The predicted miRNAs with match to miRBase (v21) (http://www.mirbase.org/) were referred to as conserved miRNAs. The target prediction was performed for conserved miRNAs using psRNATarget (Dai and Zhao, 2011). Functional descriptions were assigned to annotated genes using BLASTP search (E-value: 1E-05) to NCBI Nr, Arabidopsis, tomato and INTERPRO protein databases (Zdobnov and Apweiler, 2001). GO enrichment analysis was performed using Fisher's exact test with multiple testing correction of FDR with cut-off 0.05. The P. ginseng repeat library was constructed from eight reported transposable elements and consensus repeats characterized with pre-identified LTR-RTs (Choi et al., 2014;Jang et al., 2017) and RepeatModeler (Smit and Hubley, 2010), and genomewide repeat content of the assembled genome was calculated with RepeatMasker (Smit et al., 1996).

Genome evolution
A reciprocal all-vs.-all BLAST hit approach was used to identify homologous P. ginseng genes, which were clustered using an inhouse Python script. Synonymous substitution (Ks) was calculated for duplicated genes using codeml in PAML package (Yang, 2007). Paralogous and syntenic collinear blocks were characterized using MCScanX (Wang et al., 2012). Sequence comparison was conducted using BLASTZ (Schwartz et al., 2003), PipMaker (Schwartz et al., 2000) and SynMap (Lyons et al., 2008). A zigzag approach was designed that assigned the adjacent scaffolds based on the counterpart paralogous scaffold information. Collinear blocks consisted of several scaffolds which sharing more than five paralogous genes with Ks values <0.2. Gene sets from Arabidopsis, grape, tomato and carrot were used for ginseng orthologous gene family clustering by OrthoMCL (Li et al., 2003). The complete chloroplast genomes and 45S nrDNA were newly assembled for P. trifolius and P. stipuleanatus based on dnaLCW method (Kim et al., 2015c). Phylogenetic trees of the chloroplast genome and 45S nrDNA of Panax-related species were generated by Bayesian Inference using BEAST (version 1.8.1) (Drummond and Rambaut, 2007), and divergence time was calculated using the root age between P. ginseng and D. carota. The collinear region between P. ginseng and D. carota was characterized using MCScanX. The number of identified collinear P. ginseng scaffolds was 970 that harboured 25 091 genes and occupied 476 692 022 bp, of which 13 356 genes had collinear orthologous relationship with 10 381 of D. carota genes. Manual ordering of P. ginseng scaffolds based on gene order of D. carota chromosomes and constructed 18 artificial superscaffolds of P. ginseng. The genomic proportion of the major repeats was calculated as the total amounts of nucleotides mapped to the major repeats divided by the sum of all nucleotides in each data set (GP (%) = (masked read length/total read length) 9 100). Repeat amount was used as an indicator of the actual amounts of repeats in a total genome which calculated as repeat amount = (masked read length 9 total read length) 9 (actual genome size/ amount of WGS data used).

Genome-scale metabolic network reconstruction
A P. ginseng-specific metabolic network was assembled using information from KEGG (Kanehisa et al., 2012) and BioCyc (Caspi et al., 2014) databases based on sequence homology of P. ginseng with tomato, rice and Arabidopsis. The consensus network was then curated by removing duplicated reactions and verifying elemental balances using the COBRA toolbox (Schellenberger et al., 2011). Thermodynamic reversibility of the reactions was assessed using databases such as MetaCyc and BRENDA (Chang et al., 2014). Isoenzyme and subunit information for reactions were added to the network based on gene annotations. Secondary metabolic pathways of P. ginseng including the ginsenosides were added based on the combined evidence from gene annotations and literature-based biochemical information. Enzyme compartmentalization of the P. ginseng network was inferred from information available in a rice genome-scale metabolic network (Lakshmanan et al., 2015) and Plant-mPLoc (Chou and Shen, 2010). Dead-ends in the network were filled by the GapFind algorithm (Kumar et al., 2007) in the COBRA toolbox.

Identification of genes in ginsenoside biosynthetic pathway
Genes involved in ginsenoside biosynthesis were identified using KEGG annotation and BLASTP against reference enzyme genes retrieved from KEGG and MetaCyc databases (http://metacyc. org/) with E-value cut-off of 1E-05. The key candidate genes were identified by co-expression analysis across RNA-Seq samples from ChP with Pearson correlation coefficients (PCC). MeJA treated RNA-seq data sets in P. ginseng cv. CS were used from Lee et al. (2017b).

Identification of FAD and CAB genes
Fatty acid desaturase genes in P. ginseng were identified using Pfam accessions PF00487, PF11960 and PF03405 from an INTERPRO scan. InterPro analysis was also used to identify FADs in other selected plant species. CAB genes in P. ginseng and other species were identified using Pfam domain PF00504 from Interpro annotation. Phylogenetic trees were generated by MEGA 6.0 (Tamura et al., 2013).

Supporting information
Additional Supporting Information may be found online in the supporting information tab for this article: Figure S1 Integrated pipeline for genome annotation (IPGA). Figure S2 Number of coding exons (CDS) comparison between plant species. Figure S3 Alternative splicing (AS) events in P. ginseng. Figure S4 The Ks distribution of paralog gene pairs and orthologs of five dicot plants. Figure S5 An example of zigzag extension of scaffold sequence. Figure S6 Comparative analysis of four homoeologous blocks in P. ginseng. Figure S7 Chromosomal mapping of genic regions from two adjacent contiguous scaffolds. Figure S8 Chloroplast (cp) genome maps of P. stipuleanatus and P. trifolius. Figure S9 Dotplot and mimetic diagram between scaffolds of P. ginseng and P. notoginseng. Figure S10 Characterization of PgCACTA. Figure S11 Karyotype idiogram of P. ginseng showing repetitive elements previously described as well as the Pg167TR elements. Figure S12 Global metabolic map for P. ginseng. Figure S13 Heat map for major ginsenoside pathway genes and 11 differentially expressed UGTs in response to methyl jasmonate (MeJA) in P. ginseng cv. Cheongsun (CS) adventitious roots. Figure S14 Visualization of global metabolic changes based on RNA-seq expression. Figure S15 The number of differentially expressed (DE) genes among drought, salt, cold and stress samples. Figure S16 A phylogenetic relationship of FAD genes. Figure S17 A phylogenetic relationship of CAB family genes. Figure S18 Expression profiling of CAB genes in P. ginseng. Figure S19 Classification and estimation of CAB orthologs gene copies. Figure S20 P. ginseng specific expansion of TF family genes. Figure S21 Chromosomal distribution of major P. ginseng REs in P. ginseng chromosomes. Figure S22 Estimation of LTR-RT insertion time in P. ginseng. Table S1 Whole genome sequencing (WGS) data generated in this study.  [1904][1905][1906][1907][1908][1909][1910][1911][1912][1913][1914][1915][1916][1917] standalone sw (http://bioinfo.bti.cornell.edu/cgi-bin/itak/index.c gi). Table S13 Protein kinase (PK) genes identified in the P. ginseng genome and other 18 plant genomes, using iTAK 1.6b standalone sw (http://bioinfo.bti.cornell.edu/cgi-bin/itak/index.cgi). Table S14 Chloroplast genomes and 45S nrDNA sequences used for comparative analysis in this study. Table S15 Downstream genes involved in ginsenosides biosynthesis comparison with relative plant species.
Table S16 Ginsenosides quantification between P. ginseng cultivars. Table S17 The number of members in FAD gene family in plant genomes. Data S1 List of metabolites and metabolic reactions in ginseng genome-scale metabolic network.