Chromosomal‐level reference genome of the incense tree Aquilaria sinensis

Abstract Trees in the genus Aquilaria (Thymelaeaceae) are known as lign aloes, and are native to the forests of southeast Asia. Lign aloes produce agarwood as an antimicrobial defence. Agarwood has a long history of cultural and medicinal use, and is of considerable commercial value. However, due to habitat destruction and over collection, lign aloes are threatened in the wild. We present a chromosomal‐level assembly for Aquilaria sinensis, a lign aloe endemic to China known as the incense tree, based on Illumina short‐read, 10X Genomics linked‐read, and Hi‐C sequencing data. Our 783.8 Mbp A. sinensis genome assembly is of high physical contiguity, with a scaffold N50 of 87.6 Mbp, and high completeness, with a 95.8% BUSCO score for eudicotyledon genes. We include 17 transcriptomes from various plant tissues, providing a total of 35,965 gene models. We reveal the first complete set of genes involved in sesquiterpenoid production, plant defence, and agarwood production for the genus Aquilaria, including genes involved in the biosynthesis of sesquiterpenoids via the mevalonic acid (MVA), 1‐deoxy‐D‐xylulose‐5‐phosphate (DXP), and methylerythritol phosphate (MEP) pathways. We perform a detailed repeat content analysis, revealing that transposable elements account for ~61% of the genome, with major contributions from gypsy‐like and copia‐like LTR retroelements. We also provide a comparative analysis of repeat content across sequenced species in the order Malvales. Our study reveals the first chromosomal‐level genome assembly for a tree in the genus Aquilaria and provides an unprecedented opportunity to address a variety of applied, genomic and evolutionary questions in the Thymelaeaceae more widely.


| INTRODUC TI ON
The genus Aquilaria (family Thymelaeaceae) contains fifteen tree species, commonly known as "lign aloes", native to the forests of southeast Asia. A special feature of lign aloes is their production of agarwood, which is also known as aloeswood or gharuwood.
Agarwood is produced as an antimicrobial defence mechanism, after infection of the tree with a fungal pathogen, and involves the saturation of infected heartwood with a dark aromatic resin.
The main active compounds present in agarwood are terpenoids, specifically sesquiterpenes and derivatives of flindersiachromone (Chen et al., 2012), and the composition of oil extracted from agarwood is exceedingly complex, including over 150 chemical compounds (Naef, 2011). As a consequence of the unique fragrant properties of agarwood, it has long been traded as a highly prized cultural, religious, and medical commodity. For example, the use of agarwood as a fragrant product is recorded in Sanskrit Vedas dating to 1,400 B.C., while the Greek physician Pedanius Dioscorides recorded medical uses for agarwood in his De Materia Medica from 65 A.D., and agarwood is also highly revered as an icense in Islamic, Buddhist and Hindu ceremonies (Lopez-Sampson & Page, 2018).
Today, the demand for agarwood remains great, not least in the form of oud oil which is distilled from agarwood for perfumery, and high grade agarwood products can reach prices as high as US$100,000/kg, with global trade estimated at US$6-US$8 billion (Akter, Islam, Zulkefeli, & Khan, 2013). A driver of the expense of agarwood products is the depletion of wild lign aloes, which has led to their inclusion on Appendix II of the Convention on International Trade in Endangered Species of Wild Fauna and Flora (CITES), in an attempt to control and monitor international trade and help limit impact, while species in the genus have been categorized as "vulnerable" and "critically endangered" by the IUCN Red List of Threatened Species (Harvey-Brown, 2018;Newton & Soehartono, 2001).
The incense tree, Aquilaria sinensis (Lour.) Spreng, is a lign aloe endemic to the south eastern part of China (530,906 km 2 , Harvey-Brown, 2018), including Hainan Province, Fujian, Guanxi, Guangdong Provinces, and Hong Kong. The tree is relatively slow growing, taking 50-100 years to reach its maximum height of ~15 m (CITES, 2015 (Yin, Jiao, Dong, Jiang, & Zhang, 2016), and annual planting of 10,000 seedlings in the country park and other suitable habitats in Hong Kong (Agriculture, Fisheries, & Conservation Department, 2018).
In addition to its ecological and scientific importance, A. sinensis holds particular cultural significance in Hong Kong, as it is commonly believed that the species provided the region's name, which translates from Chinese as "Incense Harbour" or "Fragrant Harbour". Meanwhile, the trading of agarwood or Chen Xiang/Cham Heong (translated from Mandarin/Cantonese) was an important industry since the Sung Dynasty (610-970) (Agriculture, Fisheries, & Conservation Department, 2013;Iu, 1983). Although the cultivation of A. sinensis for the agarwood industry in Hong Kong ceased during in the last century, remaining populations continue to persist in the countryside of Hong Kong, including lowland and broad-leaved forests (Yip & Lai, 2004).
Despite the great scientific and cultural importance of A. sinensis, a high-quality genome is lacking, hindering further understanding of the species, and scientifically driven conservation measures. More widely, genomic resources for the genus Aquilaria are poor, with only a chloroplast genome of A. sinesis and a draft genome of A. agallocha available currently (Chen et al., 2014;Wang et al., 2016). To address this issue, here we provide a high quality chromosomal-level genome assembly for A. sinesis together with a large number of accompanying transcriptomes from diverse plant tissues.

| Sample collection and genome sequencing
Tissue samples used for genome sequencing (mature leaf) and transcriptome sequencing (young and mature leaves, young shoot, flower buds, flower, fruit, seed, aril, intact and wounded stem) were collected from incense tree individuals on the campus of The Chinese University of Hong Kong during the flowering and fruiting period in June 2019. During sample collection, healthy leaves without visible symptoms of fungal infection (e.g., leaf spot, rust or wilt) were collected for DNA extraction. Both surfaces of the leaves were cleaned and rinsed with double-distilled water prior to DNA extraction to minimise the potential contamination. Genomic DNA (gDNA) was extracted from A. sinensis using a Dneasy Plant Mini Kit (Qiagen) following the manufacturer's protocol. Extracted gDNA were subjected to quality control using Nanodrop spectrophotometer (Thermo Scientific) and gel electrophoresis. High molecular DNA was extracted with CTAB method and had a mean molecular weight of 85 kbp. Qualifying samples were sent to Novogene, and Dovetail Genomics for library preparation and sequencing. The resulting library was sequenced on Illumina HiSeq X platform to produce 2 × 150 paired-end sequences. The length-weighted mean molecule length is 23,590.15 bases, and the raw data can be found at NCBI's Sequence Read Archive (SRR10737433). Details of the sequencing data can be found in Table S1.

| Chicago library preparation and sequencing
A Chicago library was prepared as described previously (Putnam et al., 2016). Briefly, ~500 ng of HMW gDNA (mean fragment length = 85 kbp) was reconstituted into chromatin in vitro and fixed with formaldehyde. Fixed chromatin was digested with DpnII, the 5′ overhangs filled in with biotinylated nucleotides, and free blunt ends were ligated. After ligation, crosslinks were reversed and the DNA purified from protein. Purified DNA was treated to remove biotin that was not internal to ligated fragments. The DNA was then sheared to ~350 bp mean fragment size and sequencing libraries were generated using NEBNext Ultra enzymes and Illumina-compatible adapters. Biotincontaining fragments were isolated using streptavidin beads before PCR enrichment of each library. The libraries were sequenced on an Illumina HiSeq X to produce 211 million 2 × 150 bp paired end reads, which provided 146.61 × physical coverage of the genome (1-100 kb pairs).

| Dovetail HiC library preparation and sequencing
A Dovetail HiC library was prepared in a similar manner to that described previously (Lieberman-Aiden et al., 2009). Briefly, for each library, chromatin was fixed in place with formaldehyde in the nucleus and then extracted fixed chromatin was digested with the restriction enzyme DpnII, the 5′ overhangs were filled in with biotinylated nucleotides, and free blunt ends were ligated. After ligation, crosslinks were reversed and the DNA purified from protein. Purified DNA was treated to remove biotin that was not internal to ligated fragments.
The DNA was sheared to ~350 bp mean fragment size and sequencing libraries were generated using NEBNext Ultra enzymes and Illuminacompatible adapters. Biotin-containing fragments were isolated using streptavidin beads before PCR enrichment of each library. The libraries were sequenced on an Illumina HiSeq X sequencer to produce 193 million 2 × 150 bp paired-end reads, which provided 66,743.81 × physical coverage of the genome (10-10,000 kb pairs).

| Transcriptome sequencing
Transcriptomes of multiple developmental stages were sequenced at Novogene ( Table S1.

| Genome assembly
Linked-Read data were assembled using the supernova assembler (v2.1.1, Marks et al., 2019;Zheng et al., 2016), using the default recommended settings (https://suppo rt.10xge nomics.com/ de-novo-assem bly/softw are/overv iew/lates t/perfo rmance) to produce the a pseudohaplotype assembly outputs (--style = pseudohap). The Supernova output pseudohaplotype assembly, shotgun reads, Chicago library reads, and Dovetail HiC library reads were used as input data for HiRise, a software pipeline designed specifically for scaffolding genome assemblies using proximity ligation data (Putnam et al., 2016). An iterative analysis was conducted as follows. First, Shotgun and Chicago library sequences were aligned to the draft input assembly using a modified SNAP read mapper (http://snap.cs.berke ley.edu). The separations of Chicago read pairs mapped within draft scaffolds were analysed by HiRise to produce a likelihood model for genomic distance between read pairs, and the model was used to identify and break putative misjoins, to score prospective joins, and make joins above a threshold. After aligning and scaffolding Chicago data, Dovetail HiC library sequences were aligned and scaffolded following the same method. After scaffolding, shotgun sequences were used to close gaps between contigs.  (Palmer & Stajich, 2017), the softmasked assembly were used to run "funannotate train" with parameters "

| Repetitive elements annotation
Repetitive elements were identified using an in-house pipeline.
Firstly, elements were identified using repeatmasker version 4.0.8 (Smit, Hubley, & Green, 2013) with the eukarya RepBase repeat library (Jurka et al., 2005). Low-complexity repeats were not masked and the sensitive search parameter was specified. Following this, a de novo repeat library was constructed using repeatmodeler version 1.0.11 (Smit & Hubley, 2015), including recon version 1.08 (Bao & Eddy, 2002) and repeatscout version 1.0.5 (Price, Jones, & Pevzner, 2005). Novel repeats identified by repeatmodeler were analysed using a blast, extract, extend process to improve the characterization of elements along their entire length (Platt, Blanco-Berdugo, & Ray, 2016). Consensus sequences were generated for each family, along with classification information. The resulting de novo repeat library was utilised to identify repetitive elements using repeatmasker. Longer repeats were constructed using repeatcraft (Wong & Simakov, 2018) using ltr _ finder version 1.0.5 (Xu & Wang, 2007) to defragment repeat segments. At loci where RepeatMasker annotations overlapped (i.e., where the same sequence was annotated as different repeat families), only the longest repeat was kept.
This conservative approach helps avoid TE content estimates being inflated by counting the same bases multiple times and ensures a one-to-one matching of sequence with repeat family identity. A revised summary table was constructed with the final repeat counts.
All plots were generated using rstudio version 1.

| Gene family annotation and gene tree building
Potential gene family sequences involved in the mevalonic pathway, methylerythritol phosphate pathway, and jasmonic acid biosynthesis pathway were first identified by homology searching in the KEGG database, and were retrieved from the genome using tblastn.
Identity of each putatively identified gene was then tested by comparison to sequences in the NCBI nr database using blastx.

| High quality Aquilaria genome
Genomic DNA was extracted from single individuals of Aquilaria sinensis (Figure 1a), and sequenced using the Illumina short-read and 10X Genomics linked-read sequencing platforms (Table S1) This high-quality Aquilaria genome provides an unprecedented opportunity to address the issue of a variety of applied, genomic and evolutionary questions in the Thymelaeaceae more widely.

| Transposable elements
Transposable elements (TEs) make up a proportion of most eukaryote genomes, but the contribution of TEs to genome size varies greatly across lineages. In plants, TEs may dominate the host genome, for example comprising 85% of the maize genome (Schnable et al., 2009) and potentially as much as 90% of the crown imperial (Fritillaria imperialis) genome (Ambrožová et al., 2010). At the other end of the spectrum, TEs may account for a very small proportion of the genome, such as in the carnivorous bladderwort, where TEs represent just 3% of total genomic DNA (Ibarra-Laclette et al., 2013).
However, in most plant genomes TE content lies somewhere between these extremes.
We estimated that the repeat content of the incense tree accounts for more than half of its total genome size (61.22%) (Figure 3a,  Table S3). By far the main contribution of TEs to the incense tree genome is from LTR elements, which account for over one third of the total assembly (36.47%, Figure 3a, Table S3). DNA transposons represent 6.4% of the incense tree genome, and there is just a small contribution from LINEs (2.1% of the genome), and very few SINEs present in the genome (0.01%) (Figure 3a, Table S3).
In Figure 3a, we summarise recently reported TE analyses for sequenced members of the order, specifically: diploid cotton (Wang et al., 2012), red silk cotton tree (Gao et al., 2018), durian , and cacao (Matina 1-6) (Motamayor et al., 2013). This provides a comparative overview of TE content in species closely related to the incense tree. Differences in the sequencing strategies employed to generate available Malvales genomes, accompanying variation in assembly quality, and differences in the approaches applied to generate TE annotations all complicate comparative analysis of TEs among members of the Malvales. However, it is clear that the incense tree shows the same general trends as other species in the order (Figure 3a). In summary, TEs make up a large proportion of total genomic content across the Malvales, with LTR retrotransposons representing by far the greatest contribution to repeat content (32%-48%, Table S4). As with other members of the Malvales, gypsy-like and copia-like LTR elements were the most abundant superfamilies among the LTR elements identified in the incense tree genome (Motamayor et al., 2013; (Figure 3a, Table S4). DNA transposons make up the next largest contribution to Malvales genomes, but they represent a considerably lower proportion of the genome (1%-9%) (Figure 3a, Table S4). Among DNA transposons, MULE elements were the most abundant superfamily, followed by En/Spm elements from the CMC superfamily (Table S4).
LINEs make up a very small proportion of total TE content (0.17%-2.7%), with SINEs almost completely absent from Malvales genomes (0.09% in G. raimondii, 0.01% in A. sinensis) (Figure 3a). Overall, the observed patterns demonstrate that TEs represent a major force for shaping genome size among members of the Malvales, with LTR elements being by far the most important repeat class.
A Kimura distance-based copy divergence analysis for the incense tree revealed that the most frequent TE sequence divergence relative to TE consensus sequence was 1%-10%, mainly due to increased distances among LTR elements (and unclassified elements), suggesting a relatively recent burst of sustained activity, followed by a gradual decline (Figure 3b).

| Genes involved in plant defence and agarwood production
Formation of agarwood is usually associated with fungal infection or physical wounding, where resin composed of mixtures of sesquiterpenes and 2-(2-phenylethyl)chromones (PECs) are secreted by the tree as a defence mechanism (Chen et al., 2012;Naef, 2011). Over time, the accumulation of volatile compounds and sesquiterpenoids lead to the formation of fragrant agarwood (Fazila & Halim, 2012;Hashim, Ismail, & Abbas, 2014).
In the A. sinensis genome, genes involved in the biosynthesis of sesquiterpenoids from isoprenoid precursors using the mevalonic acid (MVA) pathway in the cytosol and the 1-deoxy-D-xylulose-5-phosphate (DXP) or methylerythritol phosphate (MEP) pathway in the plastid were all found to be present ( Figure 4, Table   S5). Further, these two pathways biosynthesise C5 homoallylic isoprenoid precursors including isopentenyl pyrophosphate (IPP) and dimethylallyl pyrophosphate (DMAPP), and the genes encoding enzymes for production of IPP and DMAPP could all be identified ( Figure 4, Table S5). These C5 isoprene units are later converted to C10 geranyl pyrophosphate (GPP) and C15 farnesyl pyrophosphate (FPP) in the presence of the key-limiting enzymes GPP synthase and FPP synthase (Figure 4, Table S5). In the final stage of sesquiterpenes production, the necessary sesquiterpene synthase enzymes (SesTPS) can also be identified in 74 loci (Figure 4 Tables S5 and S6). Further, the jasmonic acid (JA) signalling pathway has been reported to be involved in plant defence and it plays a significant role in the wound-induced signalling mechanism in regulating SesTPS expression (Tan, Isa, Ismail, & Zainal, 2019;Xu et al., 2017), and genes involved in its biosynthesis pathway could all be identified (Figure 4, Table S5). Interestingly, we have also found that genes from the all of the aforementioned pathways, including the MVA pathway, DXP or MEP pathway, and JA biosynthesis pathway, are all expressed in at least one of the wounded stem transcriptomic data sets (stem 02 and 03) ( Figure 4). These data represent the first complete set of genes documented to be involved in sesquiterpenoid production, plant defence, and agarwood production in a single species of Aquilaria.

F I G U R E 4 Genes involved in plant defence and agarwood production in
In conclusion, this study presents the first high-quality genome assembled for a plant in the genus Aquilaria. The A. sinensis genome holds important scientific, commercial and conservation relevance, and our work provides details of the first complete set of genes involved in sesquiterpenoid and agarwood production in Aquilaria.
More broadly, this high quality A. sinesis genome provides a useful reference point for further understanding of Thymelaeaceae biology and evolution.

DATA AVA I L A B I L I T Y S TAT E M E N T
The raw genome and RNA sequencing data have been depos-