High quality genome of Erigeron breviscapus provides a reference for herbal plants in Asteraceae

Abstract Erigeron breviscapus is an important medicinal plant in Compositae and the first species to realize the whole process from the decoding of the draft genome sequence to scutellarin biosynthesis in yeast. However, the previous low‐quality genome assembly has hindered the optimization of candidate genes involved in scutellarin synthesis and the development of molecular‐assisted breeding based on the genome. Here, the E. breviscapus genome was updated using PacBio RSII sequencing data and Hi‐C data, and increased in size from 1.2 Gb to 1.43 Gb, with a scaffold N50 of 156.82 Mb and contig N50 of 140.95 kb, and a total of 43,514 protein‐coding genes were obtained and oriented onto nine pseudo‐chromosomes, thus becoming the third plant species assembled to chromosome level after sunflower and lettuce in Compositae. Fourteen genes with evidence for positive selection were identified and found to be related to leaf morphology, flowering and secondary metabolism. The number of genes in some gene families involved in flavonoid biosynthesis in E. breviscapus have been significantly expanded. In particular, additional candidate genes involved in scutellarin biosynthesis, such as flavonoid‐7‐O‐glucuronosyltransferase genes (F7GATs) were identified using updated genome. In addition, three candidate genes encoding indole‐3‐pyruvate monooxygenase YUCCA2 (YUC2), serine carboxypeptidase‐like 18 (SCPL18), and F‐box protein (FBP), respectively, were identified to be probably related to leaf development and flowering by resequencing 99 individuals. These results provided a substantial genetic basis for improving agronomic and quality traits of E. breviscapus, and provided a platform for improving other draft genome assemblies to chromosome‐level.


| INTRODUC TI ON
The Compositae is a large plant family containing 25,000-30,000 species that accounts for approximately 10% of all angiosperms (Reyes-Chin-Wo et al., 2017). Many species in Compositae have considerable medicinal, edible, ornamental, and economic importance (Vidic et al., 2016), such as Centaurea cyanus and Artemisia annua, which were the first large medicinal plants used in Europe and America, respectively. Erigeron breviscapus is a medicinal plant with great potential in Compositae. Scutellarin, a major active component of flavonoids abundant in its leaves, is widely used as a prescription drug in treating cardiovascular and cerebrovascular diseases (PPRC, 2015). Breviscapine injection (>90% scutellarin) and E. breviscapus injection (about 16.7% scutellarin) currently prepared from E. breviscapus extracts (PPRC, 2015; Renwei et al., 2011), with a total annual output value over 5 billion yuan, have considerable economic benefits and application value.
Based on the previous draft genome assembly and engineering yeast for the production of breviscapine (scutellarin and apigenin-7-O-glucuronide) by genomic analysis and synthetic biology Yang et al., 2017), E. breviscapus has becoming the first medicinal plant accomplishing the process from genome to biosynthesis. However, due to the high heterozygosity and repeatability in the E. breviscapus genome, Illumina short-read sequencing used in the previous version is not sufficient for high quality genome assembly, thus hindering the complete capture of candidate genes involved in scutellarin synthesis and development of molecular-assisted breeding based on the genome.
To date, genomes from many species in Asteraceae have been released, including horseweed (Conyza canadensis), globe artichoke milk thistle (Silybum marianum; Peng et al., 2014;Scaglione et al., 2016;Bowers et al., 2016;Yang et al., 2017;Reyes-Chin-Wo et al., 2017;Badouin et al., 2017;Shen et al., 2018;Hirakawa et al., 2019; https://www.ncbi.nlm.nih.gov/genom e/?term=Aster aceae). Among them, only sunflower and lettuce in Compositae have reached the chromosome level (Badouin et al., 2017;Reyes-Chin-Wo et al., 2017). PacBio RSII sequencing combined with Hi-C technology provides the ability to generate long reads that effectively bridge complex regions, such as repeat sequences, and generate long-range haplotypes to distinguish genes within gene families, thus improving the quality of assembly and helping to anchor a large number of genomic fragments at the chromosome level. Therefore, it is necessary to update the genome of E. breviscapus using third-generation sequencing and Hi-C technology.
Flavonoids are a large class of plant secondary metabolites (Jaakola & Hohtola, 2010). Accumulation of flavonoids is a strategy to cope with adverse environmental stresses in adapting to special environments, with flavonoids functioning to protect plants from oxidation, ultraviolet radiation and pathogen invasion (Petrussa et al., 2013;Singh et al., 2017;Wang et al., 2011). E. breviscapus is widely distributed in middle and high-altitudes, and shows high accumulation of scutellarin, as well as many flavonoids including luteolin, kaempferol, quercetin and hesperetin in leaves (Chu et al., 2005;Su et al., 2001). As far as the plant itself is concerned, it will be meaningful to explore whether this accumulation is related to changeable climate factors in middle and high-altitude habitats, such as low oxygen, ultraviolet radiation, and temperature differences between day and night, and is also reflected in the evolution of E. breviscapus genome. In addition, the leaf number is the basis for ensuring high yield during the planting of E. breviscapus . The lack of knowledge on mechanisms regulating flowering, self-incompatibility and leaf number in E. breviscapus restricts breeding late flowering and leafy varieties . Therefore, resequencing based on a high quality reference genome is a potential and effective method to develop markers for molecular assisted breeding (Cao et al., 2016;Huang et al., 2010).
In this study, we used the third generation of genome sequencing technology to update the genome assembly of E. breviscapus. On the basis of the obtained high-quality assembled genome, we resequenced 99 individuals to identify single nucleotide polymorphisms (SNPs) and candidate genes controlling late flowering and number of leaves, re-captured candidate genes involved in major flavonoids biosynthesis, and established a stable genetic transformation system. The results will provid valuable technical support and genetic resources for genetic improvement and optimization of scutellarin biosynthesis in E. breviscapus in the future, and also provide a template for improvement of other draft genomes to chromosome-level. of sheared DNA was used to construct SMRT Cell libraries with an insert size of 17 kb. These libraries were sequenced in SMRT DNA sequencing cells with P6/C4 chemistry. Together with the long reads obtained from the previous study , about 157.41 Gb of raw data was obtained on a PacBio RSII instrument.
Illumina paired-end sequencing of 100 bp with insert sizes ranging from 150 to 800 bp and mate pair sequencing with insert sizes from 2 to 20 kb were carried out as previously reported .
After filtered by minimum length of 50 bp and trimmed of adapter sequence by home-made scripts, de novo assembly of the PacBio reads was performed with the default wtdbg pipeline (https://github.com/ruanj ue/wtdbg). Illumina reads were trimmed by software trimmomatic-0.36 with default parameters and were used to correct base-calling by proovread with the default parameters (Hackl et al., 2014).

| Hi-C assisted contig clustering
The Hi-C library was prepared with the standard procedure of Novogene as follows. Nuclear DNA was cross-linked in situ, extracted and then digested with a restriction enzyme. The sticky ends of the digested fragments were biotinylated, diluted and then ligated to each other randomly. After being enriched and sheared again, biotinylated DNA fragments ranging from 400 to 500 bp were PE-100 sequenced using Illumina HiSeq platform, producing 106.32 Gb of raw data. Raw Hi-C reads were first trimmed by software trimmomatic-0.36 with parameters "PE -threads 40 -phred33 TAILCROP: 70 MINLEN: 45 LEADING: 3 TRAILING: 3 SLIDINGWINDOW: 4:15", the cleaned data were then aligned to the assembled contigs by bowtie2 (2.2.6) integrated in HiC-Pro (2.9.0; Servant et al., 2015).

| Transcriptome sequencing
To investigate the expression patterns of candidate genes involved in flavones and caffeoylquinic acids biosynthesis, the tissues collected from the roots, stems, leaves and flowers of 90-day-old plants were used for tissue specificity analysis. Each plant represents a biological repeat, and three biological replicates were performed. To examine the effect of abscisic acid (ABA), salicylic acid (SA) and gibberellin (GA) hormones on gene expression, 60-day-old plants were treated with 200 μmol/l ABA, SA and GA by foliar spray, and then leaves harvested at 4, 12, and 24 hr after treatment, respectively. Five to eight leaves from different plants were mixed and represent a biological repeat, and three biological replicates were performed. All collected samples were immediately frozen in liquid nitrogen and stored at −80°C until use. Total RNA was extracted using Q iagen RNeasy Plant Mini Kits. According to the manufacturer's instructions, total RNAseq libraries were prepared using TruSeq RNA Library Preparation Kit,v. 2 (Illumina), and were subsequently pair-end sequenced with a read length of 150 bp on the HiSeq 4000 platform. In total, about 688.9 billion RNA-seq reads were obtained, representing about 103.3 Gb of raw data. According to the following metrics: (a) to remove adaptors; (b) to trim 3′ end bases with a quality score below 20; and (c) to remove low-quality reads with base Ns or more than 10% of bases with a quality score below 20, raw RNA-seq reads were trimmed and filtered by Trimmomatic (Bolger et al., 2014

| Estimation of the genome heterozygosity and repeat content by k-mer analysis
The quality-filtered short fragments from the Illumina platform were subjected to 17-mer frequency distribution analysis with Jellyfish
In the present study, we combined three different gene-model prediction methods, homology-based predictions, de novo predictions, and transcriptome-based predictions. In the homologue-based gene-prediction model, protein sequences of A. annua, Cynara cardunculus, H. annuus, L. sativa, Arabidopsis thaliana, Nicotiana tabacum, Oryza sativa and Vitis vinifera downloaded from the National Centre for Biotechnology Information (https://www.ncbi.nlm.nih.gov/) and were subjected to TBLASTN analysis to the E. breviscapu assembled genome with a cutoff E-value of 1e −5 (Altschul et al., 1990). BLAST hits corresponding to reference proteins were concatenated by Solar after low-quality records were removed. The genomic sequence of each reference protein was extended upstream and downstream by 2,000 bp to represent a protein-coding region. For de novo prediction, we used AUGUSTUS (v. 2.5.5; Stanke et al., 2006), GENSCAN (v. 1.0;Cai et al., 2014), SNAP (released 29 November 2013; Korf, 2004), and glimmerHMM (v. 3.0.2) on the repeat-masked genome, with parameters trained from A. thaliana. In the transcriptome-based prediction, unigenes identified by TopHat (v. 2.0.10; Trapnell et al., 2012) were first aligned to the genome assembly and were then integrated with PASA (Haas et al., 2003) with default parameters. All predicted gene structures were integrated into a consensus set with EVidenceModeler (EVM; Haas et al., 2008). Genes were then annotated according to homologous alignments with BLAST (E-value ≤ 1e −5 ) against several databases including the nr  (Kanehisa & Goto, 2000) was used for KEGG pathway annotation.

| Gene families analysis
To construct the phylogenetic tree and conduct the divergence time estimation and gene expansion/contraction analysis, OrthoMCL (v. 2.0.9; Li et al., 2003) pipeline with the settings (BLASTP E-value < 1e −5 ) was applied to identify the potential orthologous gene families among E. breviscapus, A. annua, C. cardunculus, H. annuus, L. sativa, A. thaliana, N. tabacum, O. sativa and V. vinifera. To construct the phylogenetic tree, MUSCLE (v.3.8.31;Edgar, 2004) with default settings was used to align single-copy orthologous gene sequences from nine species, and PhyML (v. 3.0) with default parameters was subsequently used to construct the tree. In addition, the known divergence time from the public resource TIMETREE (http://www.timet ree.org) was provided for calibration and the program MCMCtree from the PAML package (Yang, 2007) was applied to estimate the divergence time. Based on the phylogeny and gene family size, CAFE (v.2.1;De Bie et al., 2006) was applied to identify gene families which had undergone expansion and/or contraction with the parameters "p = .05, number of threads = 10, number of random = 1,000, and search for lambda".
To detect genes under positive selection, we used the coding DNA sequence (CDS) libraries of H. annuus and A. annua to run BLASTN against the E. breviscapus CDS library, respectively. The best hits were analysed in KaKs_Calculator v.2.0 (Zhang et al., 2006) with default parameters.
The specific gene families of seven species (V. vinifera, L. sativa, H. annuus, E. breviscapus, A. annua, A. thaliana, and O. sativa) were identified using the HMMER3 (http://hmmer.janel ia.org/) software and the Pfam-A data sets was downloaded from PFAM (http://pfam.janel ia.org/). The seed file for CHI domain (PF02431) was obtained, and Pfam-A data sets for six other types, namely PAL, CHS, 4Cl, F6H, FSII, and F7GAT were transferred from fasta sequences downloaded from NCBI to stockholm file by perl script "fasta2sto.pl". The domain file was used as the first template to scan the gene family whereby the output genes were filtered out with an E-value below 1e-10. The filtered genes were used as second templates for a second round of scanning of the target gene families. Similarly, the second phase of output genes was filtered out with an E-value of 1e-10.
The putative genes from the gene family were identified. Finally, the gene families of all seven species were filtered by blast with the downloaded sequence with cutoff: identity ≥ 70%.  (Table S16). Raw reads were filtered using NGSQCToolkit_v2.3.3 (Patel & Jain, 2012), where reads containing adapter or poly-N, and low-quality reads (reads with >30% bases having Phred quality ≤25) were removed.

| Sequence alignment, variation calling and annotation
All the clean reads for each sample were mapped to the newly up- RealignerTargetCreator was firstly used to identify regions where realignment was needed; package IndelRealigner was then used to realign the regions found in the first step, which produced a realigned BAM file for each accession.
The variation detection followed the best practice workflow recommended by GATK. In brief, variants were called for each sample by the GATK HaplotypeCaller (Emanuelli et al., 2013). A joint genotyping step for comprehensive variations union was performed on the GVCF files. Raw SNPs were filtered by VCFtools (v0.1.13;Danecek et al., 2011) with the following parameters "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < −12.5 || ReadPosRankSum < −8.0". Indels that shorter than or equal to 10 bp were filtered with "QD < 2.0 || FS > 200.0 || ReadPosRankSum < −20.0". SNPs and Indels that were not biallelic, had >5% missing calls and MAF < 0.05 were removed. The identified SNPs and Indels were further annotated with ANNOVAR software  and divided into groupings of variations occurring in intergenic regions, coding sequences and introns, on the basis of newly updated E. breviscapus genome annotation information.

| Population genetic analysis
The whole-genome SNPs were used to construct the ML (Maximum likelihood method) phylogenetic tree with 100 bootstrap using SNPhylo (Ver. 20140701; Clark et al., 2007). The tool iTOL (Letunic & Bork, 2019; http://itol.embl.de) was used to colour the phylogenetic tree. Principal component analysis (PCA) was performed with the Genome-wide Complex Trait Analysis (GCTA, Ver. 1.25.3) software (Yang et al., 2011), and the first three eigenvectors were plotted.

| Genome-wide association study analysis
A total of 4,255,459 high-quality SNPs and 1,646,738 Indels identified in 99 E. breviscapus samples were used to perform SNP-level and Indel-level genome-wide association study (GWAS) for three traits, respectively. GWAS was performed with a linear mixed model (LMM) in genome-wide efficient mixed model association (GEMMA, Version: 0.98.1) software (Zhou & Stephens, 2012), whereas the estimated standardized relatedness matrix (-gk 2) estimated by GEMMA (Zhou & Stephens, 2012) was used as a random effect to correct the population structure. To control the genomewide type I error rate, the effective number (N) of independent SNPs and Indels were calculated using the Genetic Type I Error Calculator (GEC, v0.2; Browning & Browning, 2016). Significant (0.05/N, Bonferroni correction) and suggestive (1/N) p-value thresholds were set as 2.09e-8 and 4.19e-7 for SNPs and 5.48e-8 and 1.09e-6 for Indels, respectively. The genomic inflation factor (lambda) was calculated by R package qqman (Turner, 2014). All software parameters used in our study are summarized in Table S19.

| Annotation and analysis of UDPglycosyltransferase genes
The UDP-glycosyltransferase (UDPGT) genes in the E. breviscapus genome were predicted by hmmsearch with the UDPGT hmm model PF00201 (E-value < 1e −10 ) from Pfam  and the putative UDPGT proteins were screened by amino acid length between 400 and 650. A total of 144 UDPGT genes were predicted in the E. breviscapus genome. Based on the previous study, the maximum-likelihood tree was constructed by MEGA (Kumar et al., 2016), 144 UDPGT genes were then clustered into several classes.

| Overexpression vectors construction and transformation of E. breviscapus
The full-length open reading frames (ORFs) of PAL, C4H, 4Cl, CHS, CHI, and FSⅡ were cloned from leaves of E. breviscapus. The target fragments and pCambia1301-35SN were cut by the restriction enzymes as shown in Table S14. The purified DNA fragments were inserted into the linearized pCambia1301-35SN with the T4 DNA ligase (NEB, Kunming, China). The resulting constructs were introduced into Agrobacterium tumefaciens strain EHA105 by electroporation. E. breviscapus transgenic plants were generated as described previously (Zhang et al., 2007).

| Gene expression and high-performance liquid chromatography analyses
The relative transcript levels of six genes located upstream of bre- for 2 min, followed by 40 cycles of 95°C for 20 s, 58°C for 20 s, and 72°C for 20 s. The actin gene was chosen as a reference gene to control for normalization. Samples for high-performance liquid chromatography (HPLC) analysis were prepared as described previously . The scutellarin standard was obtained from Solarbio. The experiments were carried out with three biological replicates.

| Assembly of reference genome of E. breviscapus
According to the standard 17-mer curves, the heterozygosity of E.
breviscapus was approximately 2.04% ( Figure S1). In the previous study, a total of 320.50 Gb of Illumina paired-end sequencing data (210.8X) were obtained . Here, 157.41 Gb of PacBio raw data (103.6X), and 106.32 Gb of Hi-C data (69.9X) were generated to update this complex genome assembly (Table S1).
The total genome assembly amounted to 1.41 Gb, consisting of 18,973 contigs with the longest length of 1,416,127 bp and contig N50 of 140,946 bp (Table S2). With the aid of Hi-C sequence data, 99.2% of the sequences were anchored and oriented onto nine pseudo-chromosomes, with a total size of 1.43 Gb, covering 94.1% of the genome size estimated by flow cytometry . The completeness of the genome assembly showed that 88.5% of the plant sets were identified as complete (1,274 out of the 1,440 BUSCOs; Table S3) (BUSCO; Simao et al., 2015).
By integrating homology-based and de novo approaches, 67.42% of the genome was predicted as transposable elements, among which long terminal repeats (LTRs) were the most abundant characterized elements, accounting for 40.15% of the genome, while 25.69% of that could not be classified into any known cluster (Tables S4 and S5). Next, combined with the transcriptome data from four tissues (leaf, flower, stem, and root) and three plant hormone treatments (ABA, GA, and SA), 43,514 protein-coding gene models were obtained, with an average coding-sequence length of 1.14 kb and an average of 5.3 exons per gene (Tables S6 and   S7). Also identified were 906 miRNAs, 854 tRNAs, 266 rRNAs, and 818 snRNAs (Table S8). The previously published E. breviscapus genome was approximately 1.2 Gb, with contig and scaffold N50 sizes of 18.8 kb and 31.5 kb, respectively .
By comparing assembly statistics for the genome of E. breviscapus with Asteraceae, the updated genome increased in size to 1.43 G, with a scaffold N50 of 156.82 Mb and contig N50 of 140.95 kb (Table 1). The contig N50 of updated E. breviscapus genome was shorter than that of sunflower, but longer than those of other Asteraceae genomes, including L. sativa, A. annua, C. Nankingense, C. seticuspe and so on ( Table 1). The genomic characterization of the E. breviscapus genome including chromosome, GC content, gene number, repeat content, and SNP density is shown in Figure 1.

| Evolution analysis
Both ortholog clustering and gene family clustering analyses were performed using OrthoMCL on all of the protein-coding genes of E. breviscapus, A. thaliana,O. sativa,N. tabacum,V. vinifera,H. annuus,L. sativa,C. cardunculus,A. annua,and G. max. In E. breviscapus,43,514 protein-coding genes were comprised of 4,538 single-copy orthologues, 9,372 multiple-copy orthologues, 10,113 unique paralogues, 9,536 other paralogues, and 9,955 unclustered genes ( Figure S2). A total of 33,559 protein-coding genes can be clustered into 14,045 gene families, among which 2,024 were unique gene families (Table S9).
Based on a concatenated sequence alignment of single-copy genes shared by the Asteraceae family and four other green plant species, a phylogenetic tree was constructed ( Figure S3) Table S10).

| Positive selection of genes in E. breviscapus
An increased rate of nonsynonymous substitution (Ka) relative to synonymous substitution (Ks) within certain genes may account for the adaptive evolution of organisms at the molecular level (Q iu  (Table S11).
These genes revealed candidates with putative functions related to leaf morphology, flowering and secondary metabolism. In particular, Cys(2) His(2) zinc finger transcription factors regulate leaf morphology (Chen et al., 2010) and MADS-Box proteins control flowering in Arabidopsis (Favaro et al., 2003). Auxin-responsive proteins regulate floral meristem maintenance and termination by repressing cytokinin biosynthesis and signaling (Zhang   Noguchi et al., 2009).

| Discovery of genes involved in flavones and caffeoylquinic acids biosynthesis
The major active component of E. breviscapus is breviscapine, mainly scutellarin, along with a small amount of apigenin 7-O-glucuronide.
To investigate the expression patterns of candidate genes in- Our previous research had successfully identified two key enzymes involved in breviscapine biosynthesis: 1. F7GAT, which converts apigenin into apigenin-7-O-glucuronide; and 2. F6H, which functions together with F7GAT in produce scutellarin from apigenin and synthesizes breviscapine de novo using engineered yeast Jiang et al., 2014;Liu et al., 2018). Surprisingly, it mainly produced apigenin 7-O-glucuronide instead of scutellarin. The reason may be that only one EbF7GAT belonging to the UGT88X subfamily was identified from the previously published genome, which converted scutellarein to scutellarin . EbF7GAT has no strict specificity and thus recognizes many other flavonoids as its substrates. Here, we reanalysed the UGT gene family from the updated genome. A total of 144 UDPGT genes were identified and assigned to 19 gene families ( Figure S4). Among these, three genes (evm.ugt.ctg7693.8, evm.ugt.ctg3868.3 and evm.ugt.ctg1608.6) belong to the UGT88 gene family Nagashima et al., 2000;Noguchi et al., 2009;Ono et al., 2010) (Figure S4a; Table S13). We subdivided these genes with genes in the UGT88 gene family reported in other plants, finding that two genes (evm. ugt.ctg7693.8 and evm.ugt.ctg3868.3) were clustered together into the UGT88X branch (evm.ugt.ctg3868.3 was the same with previously published genome), while another gene (evm.ugt.ctg3868.3) belongs to the UGT88F subfamily ( Figure S4b).

| Establishment of genetic transformation of E. breviscapus
Both overexpression and CRISPR/Cas9 gene editing need a stable genetic transformation system. Therefore, we established an Agrobacterium-mediated genetic transformation system which involves inoculation, cocultivation, selection, differentiation and regeneration. To test the reliability of the genetic transformation system, six enzyme genes, including PAL, C4H, 4Cl, CHS, CHI, and FS II that locate in upstream of breviscapine biosynthesis were cloned from the updated genome and ligated each into a pCAM-BIA1301-35SN vector (Table S14). Transgenic plants were generated by Agrobacterium (EHA105)-mediated leaf disc transformation ( Figure S5a,b). The expression level of PAL, C4H, 4Cl, CHS, CHI and FS II was dramatically increased in the transgenic E. breviscapus plants, reaching 4.0-to 12.8-fold higher than wild-type ( Figure S5c).
Meanwhile, scutellarin content significantly increased in all overexpressing transgenic lines, ranging from 0.23% to 0.41%, 1.92-3.42 times higher compared to that in wild-type (0.12%; Figure S5d).  Figure S6; Table S15). The phenotypic distribution showed the three traits in multileaf and sparse leaf groups were obviously distinct, especially leaf number, which illustrated the reliability of these phenotypes to distinguish samples.

| GWAS analysis based on high quality reference genome
By whole-genome resequencing of 99 E. breviscapus individuals, we generated a total of 1640.77 Gb of clean reads, representing an average sequencing depth of 11.58× (Table S16) (Table S17). The ratio of the number of nonsynonymous to synonymous SNPs was 1.61, which was higher than that of Arabidopsis (0.83) and O. sativa (1.29;Clark et al., 2007;Xu et al., 2012).

| D ISCUSS I ON
As a self-incompatible species in Compositae, E. breviscapus has a high heterozygosity genome of approximately 2.04% in the 17-mer analysis. Similarly, another self-incompatible species in Compositae, A. annua also has a high heterozygosity of 1.0%-1.5% (Shen et al., 2018). It is a huge challenge to assemble a complex diploid genome using only short-read sequencing (Illumina) as demonstrated by the previously published E. breviscapus genome .
Nonetheless, the PacBio RSII sequencing platform generating much longer reads, greatly facilitates the sequence assembly and enhances the assembly quality (Shen et al., 2018). Here, an updated E. breviscapus genome was achieved by combining Illumina pairedend sequencing data, PacBio RSII sequencing data and Hi-C data.
The new genome assembly increased in size from 1.2 Gb to 1.43 Gb, which was closer to the estimated size of 1.52 Gb by flow cytometry . The new genome assembly was anchored and oriented onto nine pseudo-chromosomes, and showed significantly longer scaffold N50 and contig N50, more protein-coding genes and higher completeness of the genome assembly compared to the previous version (Table 1). In Asteraceae, the genomes of sunflower, sweet wormwood and chrysanthemum have been released (Badouin et al., 2017;Shen et al., 2018;. Compared with other species, both contig N50 and scaffold N50 of E. breviscapus are significantly longer than those of A. annua and chrysanthemum, though shorter than those of sunflower (Table 1). Additionally, E.
breviscapus is the third species in Asteracea whose genome has been assembled to chromosome level, while sunflower is the first one that has been anchored onto 17 pseudo-chromosomes (Badouin et al., 2017). These results pave the way for further genome study on E. breviscapus as well as Compositae in future.
E. breviscapus has been domesticated and selected from alti- We previously decoded the biosynthetic pathway of breviscapine and produced scutellarin and apigenin-7-O-glucuronide with contents reaching to 108 and 185 mg/L, respectively, by engineered yeast . However, scutellarin is the most abundant component in E. breviscapus, while the content of apigenin-7-O-glucuronide is very low and almost undetectable. We speculated that this may be due to the poor quality of previously published E. breviscapus genome assembly, which lead to the incomplete capture of candidate genes and was unable to effectively convert scutellarein to scutellarin. Excitedly, we obtained two candidate genes belonging to subfamily UGT88X and one belonging to UGT88F by reanalysing the updated genome ( Figure S4b). Also, more candidate genes involved in the apigenin biosynthesis were captured. Further transgenic test of PAL, C4H, 4Cl, CHS, CHI, and FS II indicated that the expression level and scutellarin contents significantly increased in the transgenic E. breviscapus plants compared to WT, confirming the involvement of six enzymes in apigenin biosynthesis ( Figure S5c,d).
Transcriptomic analysis of E. breviscapus roots, stems, leaves and flowers showed that most genes involved in scutellarin biosynthesis had the highest expression in leaves, while key enzymes involved in CQA biosynthesis exhibited the highest expression levels in roots (Figure 3b). This is consistent with the fact that scutellarin is abundant in leaf and CQAs is abundant in root . Those genes that locate in downstream of luteolin, kaempferol, quercetin, and hesperetin biosynthesis were differentially expressed in the roots, stems, leaves, and flowers (Figure 3b), which coincides with the finding that diverse flavonoids in E. breviscapus are widely distributed in different organs . In addition, plant hormone treatment resulted in the upregulation of the most genes involved in flavonoids biosynthesis (Figure 3c). Previous reports have been demonstrated that methyl jasmonates (MeJA), ABA, SA and GA regulate secondary metabolism including flavonoids biosynthesis (Li et al., 2014;Mai et al., 2014;Xiao et al., 2009). Also it showed MeJA could promote scutellarin biosynthesis in E. breviscapus .
Thus, plant hormone treatment is useful for screening candidate genes involved in various metabolic pathways. The genome and transcriptome data provided abundant genetic information for improving scutellarin yield from optimizing precursor genes and F7GAT.
Recent advances in high-throughput sequencing technologies have enabled rapid and accurate resequencing of a large number of genomes (Wu et al., 2019;Yano et al., 2016). Meanwhile, the wide adoption of GWAS allows the identification of genes associated with agronomic traits in crop species (Cao et al., 2016;Huang et al., 2010). For example, Huang et al. (2012) have identified loci associated with flowering time and grain yield traits using GWAS in rice. In maize, the genetic architecture of oil biosynthesis was dissected by GWAS . In this study, phenotypic data indicated that for most individuals, the higher branches number, the more leaves (Table S15). Furthermore, leaf number was positively correlated with plant weight, showing leaf weight largely contributes to plant weight (45% of dry weight) (Table S15).
Based on GWAS analysis of 99 resequenced individuals, the candidate genes significantly associated with the branch number, leaf number and plant weight of E. breviscapus were YUC2, ZOG, SCPL18, PP1, LEA, and FBP genes (Figure 4b). Among them, auxin biosynthesis by the YUC2 controls the formation of rosette leaves in A. thaliana (Cheng et al., 2007). SCP regulates brassinosteroids signaling that affects leaf shape, delayed flowering, and senescence in A. thaliana (Li et al., 2001).
In plants, FBPs influence a variety of biological processes, such as leaf senescence, branching, flowering, self-incompatibility, and responses to biotic and abiotic stresses (Yang et al., 2008). Considering the candidate gene functions and the significance of GWAS signals, YUC2, SCPL18 and FBP may be the most possible candidate genes involved in the leaf development and flowering of E. breviscapus. These results provide a substantial genetic basis for genome-assisted breeding in E. breviscapus.
In conclusion, among 10,000 medicinal plants, E. breviscapus is the only one that meets all of the following criteria: (a) Active ingredients have been decided, and the curative effects are clear. (b) There is an efficient transgenic system which provides a good technical platform for functional gene validation and gene editing. (c) De novo biosynthesis of scutellarin in engineered yeast has been accomplished . (d) There are also small amount of other flavonoids in E. breviscapus such as luteolin, kaempferol, quercetin and hesperetin (Chu et al., 2005), making it as an ideal material for elucidating the biosynthesis of flavonoids. Therefore, E. breviscapus is an ideal model medicinal plant in traditional Chinese medicine research, and updated genome assembly and identified candidate genes are definitely helpful for further improvement and utilization of this medicinal herb.

DATA AVA I L A B I L I T Y S TAT E M E N T
The whole genome sequencing and assembly have been deposited at the Sequence Read Archive (SRA) under Bioproject PRJNA525743.
Transcriptome and resequencing sequence reads have been deposited under PRJNA637961 and PRJNA525744, respectively.