A chromosome‐scale assembly of the model desiccation tolerant grass Oropetium thomaeum

Abstract Oropetium thomaeum is an emerging model for desiccation tolerance and genome size evolution in grasses. A draft genome of Oropetium was recently sequenced, but the lack of a chromosome‐scale assembly has hindered comparative analyses and downstream functional genomics. Here, we reassembled Oropetium, and anchored the genome into 10 chromosomes using high‐throughput chromatin conformation capture (Hi‐C) based chromatin interactions. A combination of high‐resolution RNAseq data and homology‐based gene prediction identified thousands of new, conserved gene models that were absent from the V1 assembly. This includes thousands of new genes with high expression across a desiccation timecourse. Comparison between the Sorghum and Oropetium genomes revealed a surprising degree of chromosome‐level collinearity, and several chromosome pairs have near perfect synteny. Other chromosomes are collinear in the gene rich chromosome arms but have experienced pericentric translocations. Together, these resources will be useful for the grass‐comparative genomic community and further establish Oropetium as a model resurrection plant.


| INTRODUCTION
Desiccation tolerance evolved as an adaptation to extreme and prolonged drying, and resurrection plants are among the most resilient plants on the planet. The molecular basis of desiccation tolerance is still largely unknown, but a number of models have emerged to dissect the genetic control of this trait (Hoekstra, Golovina, & Buitink, 2001;Zhang & Bartels, 2018). The genomes of several model resurrection plants have been sequenced including Boea hygrometrica (Xiao et al., 2015), Oropetium thomaeum (VanBuren et al., 2015), Xerophyta viscosa (Costa et al., 2017), Lindernia brevidens , Selaginella lepidophylla (VanBuren, Wai, Ou, et al., 2018), and Selaginella tamariscina (Xu et al., 2018). To date, no chromosome scale assemblies are available for these species, limiting large-scale quantitative genetics and comparative genomics-based approaches. Many resurrection plants are polyploid or have prohibitively large genomes including those in the genera Boea, Xerophyta, Eragrostis, Sporobolus, and Craterostigma. This complexity complicates genome assembly and gene redundancy in the polyploid species hinders downstream functional genomics work.
Oropetium thomaeum (hereon referred to as Oropetium), is a diploid resurrection plant with the smallest genome among the grasses (245 Mb) (Bartels & Mattar, 2002). Oropetium plants are similar in size to Arabidopsis, but significantly smaller than the model grasses Setaria italica (Li & Brutnell, 2011) and Brachypodium distachyon (Brkljacic et al., 2011), with a short generation time of 4 months. Oropetium is in the Chloridoideae subfamily of grasses and is closely related to the orphan cereal crops tef (Eragrostis tef) and finger millet (Eleusine coracana). Desiccation tolerance evolved independently several times within Chloridoideae (Gaff, 1977(Gaff, , 1987Gaff & Latz, 1978) making it a useful system for studying convergent evolution. Together, these traits make Oropetium an attractive model for exploring the origin and molecular basis of desiccation tolerance. Oropetium was one of the first plants to be sequenced using the long reads of PacBio technology, and the assembly quality was comparable to early Sanger sequencing-based plant genomes such as rice and Arabidopsis (VanBuren et al., 2015). Despite the high contiguity of Oropetium V1, the assembly has 625 contigs and the Bio-Nano based genome map was unable to produce chromosome-scale scaffolds. Furthermore, the V1 annotation was based on limited transcript evidence, and a high proportion of conserved plant genes were missing (VanBuren et al., 2015). Here, we reassembled the Oropetium genome using a more refined algorithm and generated a chromosome-scale assembly using Hi-C based chromatin interactions.
The annotation quality was improved using high-resolution RNAseq data and protein homology, facilitating detailed comparative genomics with other grasses.

| Genome reassembly
The raw PacBio reads from the Oropetium V1 release (VanBuren et al., 2015) were reassembled with improved algorithms to better resolve highly complex and repetitive regions. PacBio data were error corrected and assembled using Canu (V1.4) (Koren et al., 2017) with the following modifications: minReadLength = 1,500, GenomeSize = 245 Mb, and minOverlapLength = 1,000. Other parameters were left as default. The resulting assembly graph was visualized in Bandage (Wick, Schultz, Zobel, & Holt, 2015). The assembly graph was free of heterozygosity related bubbles, but many nodes (contigs) were interconnected by a high copy number retrotransposon. The Canu based contigs (assembly V1.2) were first polished using Quiver (Chin et al., 2013) with the raw PacBio data and default parameters. Contigs were further polished with Pilon (V1.22) (Walker et al., 2014) using~120× coverage of paired-end 150 bp Illumina data. Quality-trimmed Illumina reads were aligned to the draft contigs using bowtie2 (V2.3.0) (Langmead & Salzberg, 2012) with default parameters. The overall alignment rate was 95.5%, which was slightly higher than alignment against the HGAP V1 assembly (94.5%). The following parameters for Pilon were modified: -flank 7, -K 49, and -mindepth 25. Other parameters were left as default. Pilon was run four times with an updated reference and realignment of Illumina data after each iteration. Indel corrections plateaued after the third iteration, suggesting polishing removed most residual assembly errors.

| Hi-C library construction analysis and genome anchoring
Oropetium descended from the original plants used for PacBio sequencing was collected for Hi-C library construction and RNAseq.
Oropetium is highly selfing with low heterozygosity, and we expect minimal differences to be introduced in the new version. Oropetium seeds are available upon request. Oropetium plants were maintained under day/night temperatures of 26 and 22°C, respectively, with a light intensity of 200 E m −2 s −1 and 16/8 hr photoperiod. Young leaf tissue was used for Hi-C library construction with the Proximo ™ Hi-C Plant kit (Phase Genomics) following the manufacture's protocol.
Briefly, 0.2 g of fresh, young leaf tissue was finely chopped and the chromatin was immediately crosslinked. The chromatin was fragmented and proximity ligated, followed by library construction. The final library was size selected for 300-600 bp and sequenced on the Illumina HiSeq 4000 under paired-end 150 bp mode. Adapters were trimmed and low-quality sequences were removed using Trimmomatic (V0.36) (Bolger, Lohse, & Usadel, 2014). Read pairs were aligned to the Oropetium contigs using bwa (V0.7.16) (Li, 2013) with strict parameters (-n 0) to prevent mismatches and non-specific alignments in duplicated and repetitive regions. SAM files from bwa were used as input in the Juicer pipeline, and PCR duplicates with the same genome coordinates were filtered prior to constructing the interaction based distance matrix. In total, 101 filtered read pairs were used as input for the Juicer and 3d-DNA Hi-C analysis and scaffolding pipelines (Dudchenko et al., 2017;Durand et al., 2016). Contig ordering, orientation, and chimera splitting were done using the 3d-DNA pipeline (Dudchenko et al., 2017) under default parameters. Contig misassemblies and scaffold misjoins were manually detected and corrected based on interaction densities from visualization in Juicebox. In total, six chimeric contigs were identified and split at the junction with closest interaction data.
The manually validated assembly was used as input to build the 10 scaffolds (chromosomes) using the finalize-output.sh script from 3d-DNA. Chromosomes and unanchored contigs were renamed by size, producing the V2 assembly.

| Genome annotation
The Oropetum V2 assembly was reannotated using the homologybased gene prediction program Gene Model Mapper (GeMoMa: V 1.5.2) (Keilwagen et al., 2016. GeMoMa uses protein homology and RNAseq evidence to predict gene models. Genome assemblies and gene annotation for the following 11 species were downloaded from Phytozome (V12) and used as homology-based evidence: Arabidopsis thaliana, Brachypodium distachyon, Glycine max, Oryza sativa, Panicum hallii, Populus trichocarpa, Prunus persica, Setaria italica, Solanum lycopersicum, Sorghum bicolor, and Theobroma cacao. Translated coding exons and proteins from the reference gene annotations and genome assemblies were extracted using the module Extractor function of GeMoMa (module Extractor: Ambiguity = AMBIGUOUS, r = true). RNAseq data from Oropetium desiccation and rehydration timecourses (VanBuren et al., 2017) were aligned to the V2 Oropetium genome using HISAT2 (Kim, Langmead, & Salzberg, 2015) with default parameters. The resulting BAM files were used to extract intron and exon boundaries using the module ERE (module ERE: s = FR_FIRST_STRAND, c = true). Translated coding exons from other species were aligned to the Oropetium genome using tblastn and transcripts were predicted based on each reference species independently using the extracted introns and coverage (module GeMoMa). Finally, the predictions based on the 11 reference species were combined to obtain a final prediction using the module GAF. Gene models containing transposases were filtered, resulting in a final annotation of 28,835 gene models. The annotation completeness was assessed using the plant specific Benchmarking Universal Single-Copy Ortholog (BUSCO) dataset (version 3.0.2, embryophyta_odb9) (Simão, Waterhouse, Ioannidis, Kriventseva, & Zdobnov, 2015). The following report was obtained from BUSCO: 98.9% overall, 95.4% single copy, 3.5% duplicated, 0.6% fragmented, and 0.5% missing. Gene model names from V1 were conserved where possible, and new gene models received new names.

| Expression analysis
Oropetium RNAseq data from desiccation and rehydration timecourses were reanalyzed using the updated gene model annotations (VanBuren et al., 2017). Four timepoints during dehydration (days 7, 14, 21, and 30), two during rehydration (24 and 48 hr), and one well-watered sample were analyzed. Based on principle component analysis, replicate two of the "well-watered" and "D21" samples were excluded from the analysis. Each other timepoint had three replicates. Gene expression was quantified on a transcript level using salmon (v 0.9.1) in quasi-mapping mode (Patro, Duggal, Love, Irizarry, & Kingsford, 2017). Default parameters were used with the internal GC bias correction in salmon. The R package tximport (v 1.2.0) was used to map transcript level quantifications to gene level counts (R Core Team, 2013;Soneson, Love, & Robinson, 2015). We conducted differential expression analysis with the remaining samples using the R package DESeq2 (v 1.14.1) set to default parameters [3,4].

| RESULTS AND DISCUSSION
The first version of the Oropetium genome (V1) was sequenced with high coverage PacBio data (~72×) followed by error correction and assembly using the hierarchical genome assembly process (HGAP) (VanBuren et al., 2015). We reassembled this PacBio data using the Canu assembler (Koren et al., 2017), which can more accurately assemble and phase complex repetitive regions. The resulting Canu based assembly (hereon referred to as V1.2) had fewer contigs than the V1 HGAP assembly, but had otherwise similar assembly metrics (Table 1). Draft contigs were polished using a two-step process to remove residual insertion/deletion (Indel) and single nucleotide errors. Contigs were first polished using the raw PacBio data with Quiver (Chin et al., 2013), followed by four rounds of reiterative polishing with Pilon (Walker et al., 2014) using high coverage Illumina paired-end data. The final V1.2 assembly contains 436 contigs with an N50 of 2.0 Mb and total assembly size of 236 Mb. This is six megabases smaller than the V1 assembly, with slightly lower contiguity. More intact LTR-RTs and centromere specific repeat arrays were identified in Oropetium V1.2 compared to V1, suggesting the Canu assembler resolved these repetitive elements more accurately.
The Oropetium V1.2 contigs were ordered and oriented into chromosome-scale pseudomolecules using Hi-C. Hi-C leverages long-range interactions across distal regions of chromosomes to order and orient contigs. This approach is similar to genetic mapbased anchoring, but with higher resolution. Illumina data generated from the Hi-C library were mapped to the V1.2 Oropetium genome using bwa (Li, 2013) and the proximity-based clustering matrix was generated using the Juicer and 3d-DNA pipelines (Dudchenko et   F I G U R E 3 Landscape of the Oropetium genome. Gypsy and Copia long terminal repeat retrotransposons (LTR-RT) and CDS density are plotted for the 10 Oropetium chromosomes. Features are plotted in sliding windows of 50 kb with 25 kb step size. The location of centromere specific tandem arrays is highlighted by red bars. The heat maps below each landscape show relative density with red indicating high density and blue indicating low density for each feature 48 hr post rehydration. Differentially expressed genes were identified based on comparisons of well-watered leaves with each dehydration or rehydration timepoint. In addition, each timepoint was compared with the timepoint immediately following it in the timecourse (i.e., day 7 dehydration vs. day 14). In total, 17,204 gene models from the V1 annotation had detectable expression (count > 0 in at least one sample) compared to 25,314 gene models in V2 (Figure 2b). Of the expressed genes, 9,149 V1 and 11,948 V2 gene models were classified as differentially expressed in at least one of the comparisons. Most newly annotated genes (8,110) have detectable expression in at least one of the seven timepoints, and the majority are expressed in all timepoints. In total, 2,799 new V2 gene models were differentially expressed, suggesting the newly annotated genes have important and previously uncharacterized roles in desiccation tolerance.
We used the chromosome-scale assembly of Oropetium to survey patterns of genome organization and evolution related to maintaining a small genome size. The proportion of LTR-RTs in Oropetium V1 and V2 is similar, though V2 has more intact ele-  Table 2). Array sizes are likely underestimated, as only 52% of centromeric arrays were anchored to chromosomes, and 23 unanchored contigs contain centromeric repeat arrays. Gene density is low in the pericentromeric regions, consistent with the rice, sorghum, maize, and Brachypodium genomes (Du et al., 2017;Initiative, Jiao et al., 2017;Paterson et al., 2009). Collectively, pericentromeric regions span 67.5 Mb or 29% of the genome, a much smaller proportion than Sorghum (62%; 460 Mb) (Paterson et al., 2009), but higher than rice (15%; 63 Mb) (Goff et al., 2002). The majority of intact LTRs (86%; 628) have an insertion time of less than one million years ago, with a steep drop off of insertion time after 0.4 MYA. This suggests that LTRs are highly active in Oropetium but rapidly fragmented and purged to maintain its small genome size.
Previous comparative genomic analyses supported a high degree of collinearity between Oropetium and other grass genomes, but the draft assembly prevented detailed chromosome-level comparisons.
To date, no chromosome scale assemblies are available for other Chloridoideae grasses, though draft genomes are available for the orphan grain crops tef (Eragrostis tef) (Cannarozzi et al., 2014) and finger millet (Eleusine coracana) (Hittalmani et al., 2017). We compared the V2 Oropetium chromosomes to the high-quality BTX 623 Sorghum genome (McCormick et al., 2018). Sorghum is in the Panicoideae subfamily of grasses which diverged from the ancestors of Chloridoideae~31 MYA (Cotton et al., 2015). Despite this diver- The V2 chromosome-scale assembly will serve as a reference for future population genomics work and positional cloning of desiccation related genes. Together, this highlights the need to improve and scaffold existing high-quality reference genomes.

ACKNOWLEDG MENTS
This work is supported by funding from the National Science Foundation (MCB-1817347 to R.V.).

AUTHOR CONTRI BUTIONS
R.V. designed research; R.V., C.M.W, J.K. and J.P. performed research and/or analyzed data; and R.V. wrote the paper. All authors reviewed the manuscript.

AVAILABILITY OF SUPPORTING DATA
The V2