Chromosome‐level Thlaspi arvense genome provides new tools for translational research and for a newly domesticated cash cover crop of the cooler climates

Summary Thlaspi arvense (field pennycress) is being domesticated as a winter annual oilseed crop capable of improving ecosystems and intensifying agricultural productivity without increasing land use. It is a selfing diploid with a short life cycle and is amenable to genetic manipulations, making it an accessible field‐based model species for genetics and epigenetics. The availability of a high‐quality reference genome is vital for understanding pennycress physiology and for clarifying its evolutionary history within the Brassicaceae. Here, we present a chromosome‐level genome assembly of var. MN106‐Ref with improved gene annotation and use it to investigate gene structure differences between two accessions (MN108 and Spring32‐10) that are highly amenable to genetic transformation. We describe non‐coding RNAs, pseudogenes and transposable elements, and highlight tissue‐specific expression and methylation patterns. Resequencing of forty wild accessions provided insights into genome‐wide genetic variation, and QTL regions were identified for a seedling colour phenotype. Altogether, these data will serve as a tool for pennycress improvement in general and for translational research across the Brassicaceae.


Introduction
Native to Eurasia, field pennycress (Thlaspi arvense L.) is a member of the Brassicaceae family and is closely related to the oilseed crop species rapeseed (Brassica rapa and Brassica napus L.), camelina (Camelina sativa L.) and the wild plant Arabidopsis thaliana (Beilstein et al., 2010;Warwick et al., 2002). It is an emerging oil feedstock species with the potential to improve sustainability of cold climate cropping systems through use as a cash cover crop (Boateng et al., 2010;Chopra et al., 2018;Sedbrook et al., 2014). Pennycress is extremely winter hardy (Warwick et al., 2002) and can be planted in traditional fallow periods following summer annuals such as wheat, maize or soya bean (Cubins et al., 2019;Johnson et al., 2015;Ott et al., 2019;Phippen and Phippen, 2012). By providing a protective living cover from the harvest of the previous summer annual crop through early spring, pennycress prevents soil erosion and nutrient loss, which in turn protects surface and below-ground water sources, suppresses early-season weed growth, and provides a food source for pollinators (Del Gatto et al., 2015;Johnson et al., 2015;Weyers et al., 2019Weyers et al., , 2021. The short life cycle allows for harvest in May or June in temperate regions, with reported seed yields ranging from 750 to 2400 kg/ha (Cubins et al., 2019;Moore et al., 2020). Following harvest, an additional crop of summer annuals can be grown in a double-crop system that provides increased total seed yields and beneficial ecosystem services (Johnson et al., 2015;Phippen and Phippen, 2012;Thomas et al., 2017). The pennycress seed contains an average of 30%-35% oil, and the fatty acid profile is conducive to producing biofuels (Fan et al., 2013;Moser, 2012;Moser et al., 2009). Seed oil also has the potential to be converted into an edible oil and protein source (Chopra et al., 2020b;Claver et al., 2017;McGinn et al., 2019).
Thlaspi arvense is a homozygous diploid species (2n = 2x = 14) (Mulligan, 1957) and is predominantly self-pollinating (Mulligan and Kevan, 1973), suggesting that breeding efforts could proceed with relative ease and speed. It is amenable to genetic transformation using the floral dip method (McGinn et al., 2019), and its diploid nature with many one-to-one gene correspondence with A. thaliana (Chopra et al., 2018) could provide an avenue for gene discovery followed by field-based phenotypic validation. Indeed, several agronomic and biochemical traits have already been identified in pennycress using this translational approach, including traits crucial for de novo domestication of T. arvense such as transparent testa phenotypes (Chopra et al., 2018), early flowering (Chopra et al., 2020b), reduced shatter (Chopra et al., 2020b) and seed oil composition traits (Chopra et al., 2020b;Esfahanian et al., 2021;Jarvis et al., 2021;McGinn et al., 2019). Field pennycress could thus serve as a de novodomesticated oilseed crop for the cooler climates of the world and at the same time as a new dicotyledonous model for functional genetics studies. Its amenability for translational research constitutes a clear advantage vis-a-vis A. thaliana. However, to establish T. arvense as a genetic model and a crop, it is important to develop genomic resources that will help explore the spectrum of genetic diversity, the extent and patterns of gene expression, genetic structure and untapped genetic potential for crop improvement.
Here, we describe a set of new resources developed for research and breeding communities, including a high-quality, chromosome-level genome assembly of T. arvense var. representing~97.5% of the estimated genome size of 539 Mbp. We provide robust annotations of both protein-coding and non-coding genes, including putative transfer RNA (tRNA), ribosomal RNA (rRNA) and small nucleolar RNA (snoRNA) predictions, alongside small RNA-producing loci, transposable element (TE) families and predicted pseudogenes. From transcriptome data based on a panel of eleven different tissues and life stages, we built a gene expression atlas. In combination with whole-genome DNA methylation profiles of both roots and shoots, this provides a basis for exploring gene regulatory and/or epigenetic mechanisms within pennycress. A comprehensive analysis of forty resequenced pennycress accessions highlights the nucleotide diversity in these collections, alongside gene variants and population structure. Finally, by means of modified bulked-segregant analysis (BSA), we identified quantitative trait loci (QTL) associated with seedling colour phenotype, exemplifying the usefulness of this resource. The genome and resequencing information presented in this study will increase the value of pennycress as a model and as tool for translational research and accelerate pennycress breeding through the discovery of genes affecting important agronomic traits.

An improved reference genome sequence
The genome of T. arvense var. MN106-Ref was assembled de novo from 476X (256 Gb) depth PacBio Sequel II continuous long reads (CLRs) (38 kb N50). The initial assembly attempts exceeded the genome size by~53% with respect to the range of 459-540 Mbp total size estimated from flow cytometry and k-mer analyses (Table S1). Reducing the duplicated fraction, polishing and scaffolding/rescaffolding using several approaches resulted in a final assembly of~526 Mbp, corresponding to~97.5% of the upper limit of the flow cytometry-based estimate and representing an improvement of~20% relative to the original assembly size. Scaffolding/rescaffolding of the genome assembly was achieved using Bionano optical, Hi-C contact, genetic linkage and comparative synteny maps. The final genome contains 964 scaffolds, with~83.6% of the total estimated size represented by seven large scaffolds, in agreement with the haploid chromosome number, demonstrating a vast improvement in overall contiguity and bringing the assembly to chromosome level. The coding space is 98.7% complete on the basis of conserved core eukaryotic single-copy genes (BUSCO), with 92.1% being single copy and 6.6% duplicated. Full descriptive statistics of the final version in comparison with T_arvense_v1 are given in Table 1; intermediary versions are summarized in Table S2.
The seven largest scaffolds are all characterized by high gene density towards both telomeres and a high density of repeats and TEs in the pericentromeric and centromeric regions ( Figure 1, Figure S1). While the protein-coding gene fraction of the genome is similar in size to other closely related Brassicaceae (Wang et al., 2011), the large repetitive fraction suggests an increased genome size driven by TE expansion (Beric et al., 2021). In addition, the spatial distribution of sRNA loci followed the gene density but was concentrated predominantly at the boundary between genes and TEs.
In addition to the duplicate-containing contigs, alignments of the raw CLR reads to the new genome revealed the presence of what appeared to be a small number of collapsed repeats in scaffolds 1, 3, 5 and 7, which were typically larger than 25 kbp and indicative of misassembly in these loci ( Figure S2). Further investigation revealed an overlap with tandem repeat clusters of 18S and 28S rRNA annotations at those loci on scaffolds 3 and 5, and a large supersatellite of 5S rRNA on scaffold 1. In addition, there were corresponding genes associated with organellar DNA at those loci on scaffolds 3 and 7, indicating either erroneous incorporation of plastome sequence during assembly or genuine nuclear integrations of plastid DNA (NUPTs) (Michalovova et al., 2013).

Comparative genomics
Exploiting information from the genome of Eutrema salsugineum , a closely related species (Franzke et al., 2011) with a much smaller genome (241 Mbp) but the same karyotype (n = 7), aided during rescaffolding (see methods; Figure S3) and confirmed synteny of the seven largest scaffolds in the two species ( Figure S4). There is a large-scale synteny between the two genomes, with the exception of some regions on scaffolds 2, 3, 6 and 7. This could be due to the low gene density observed in the T. arvense genome towards the centre of each chromosome and/or the high presence of dispersed repeats in those regions. Chromosome evolution in the Brassicaceae has been studied through chromosome painting techniques, and 24 chromosome blocks (A-X) have been defined from an ancestral karyotype of n = 8 (Murat et al., 2015;Schranz et al., 2006). We identified the 24 blocks in T. arvense based on gene homology and synteny between T. arvense and A. thaliana (Figure 2). While in general the distribution of the chromosomal blocks resembles that in the close relatives E. salsugineum and S. parvula, some blocks are rearranged in a small section at the end of the scaffold representing chromosome 1 and at the beginning of chromosome 6. The first case involves the transposition of a small part of block C in between A and B, while chromosome 6 has a possible inversion between the blocks O and W when compared to E. salsugineum and S. parvula. Overall, despite having an increase in genome size compared with E. salsugineum and S. parvula, T. arvense conserves all the ancestral Brassicaceae karyotype blocks. The synteny analysis also revealed intrachromosomal rearrangements, but no obvious interchromosomal rearrangements.

Transcriptome assembly
We sequenced total cDNA with strand-specific RNA-seq from eleven tissues, including rosette leaves, cauline leaves, inflorescences, open flowers, young green siliques, old green siliques, green seeds, mature seeds, seed pods, roots of 1-week-old seedlings and shoots of 1-week-old seedlings (Table S3). Reads from each tissue sample were aligned to the genome with unique mapping rates between 76% and 91%, with the exception of old green silique (19%), green seed (59%) and mature seed (12%). The majority of unmapped reads in each case were due to insufficient high-quality read lengths. We constructed independent tissue-specific transcriptome assemblies and combined them into a multi-sample de novo assembly, yielding 30 650 consensus transcripts. These were further refined by prioritizing isoforms supported by Iso-seq data, resulting in 22 124 high-quality consensus transcripts to inform gene models.

Protein-coding genes
In addition to the expression data, gene models were informed by protein homology using a combined database of Viridiplantae from UniProtKB/Swiss-Prot (Boutet et al., 2007) and selected Brassicaceae from RefSeq (Pruitt et al., 2012). Following initial training and annotation by ab initio gene predictors, proteincoding loci were further annotated with InterPro to provide PFAM domains, which were combined with a BLAST search to the UniProtKB/Swiss-Prot Viridiplantae database to infer gene ontology (GO) terms. In accordance with MAKER-P recommendations , the final set of 27 128 protein-coding loci was obtained by filtering out those with an annotation edit distance (AED) score of 1 unless they also contained a PFAM domain. Approximately 95% of loci had an AED score <0.5 ( Figure S5), demonstrating a high level of support with the available evidence, and 21 171 (~78%) were annotated with a PFAM domain. Analysis of gene orthologs and paralogs among related Brassicaceae confirmed the close relationship with E. salsugineum, with the protein-coding fraction occupying a genome space comparable to related species (Figure 3a). A total of 4433 gene duplication events were recorded with OrthoFinder, comparable to E. salsugineum (5108), but fewer than in B. rapa (11 513), for example.
The full descriptive statistics are given in Table 2, in comparison with the original T_arvense_v1 annotation (Dorn et al., 2015) lifted over to the new genome with Liftoff v1.5.2 (Shumate and Salzberg, 2020), where applicable. Gene feature distributions are comparable between T_arvense_v1 and the present assembly of MN106-Ref (hereafter referred to as T_arvense_v2; Figure S6). Unique genes that were successfully lifted over from the previous version were included as a separate fraction in the final annotation (source: T_arvense_v1), resulting in 32 010 annotated genes in total. Up to~95.2% completeness can be obtained by combining the full set of both the current and previous annotations according to a BUSCO evaluation of 2121 conserved, single-copy orthologs. The improved contiguity of the genome space allowed for the resolution of genes such as the tandem duplicated MYB29 and MYB76, which were concatenated in the previous version ( Figure 3b).

Non-coding loci
In addition to the protein-coding gene annotations, we annotated non-coding RNA (ncRNA) genes, pseudogenes, and TEs. Descriptive annotation statistics are summarized in Table 2. While many of these annotation features in T. arvense were similar to those found in other plant species, we observed several unique patterns, which we will describe in detail below. ncRNA annotations were inferred from sequence motifs (tRNA, rRNA, snoRNA) or from sequencing data (siRNA, miRNA). We predicted clusters of both 5S rRNA and tandem repeat units of 18S and 28S rRNA with RNAmmer (Lagesen et al., 2007), often in relative proximity to loci identified with Tandem Repeats Finder v4.09.1 (Benson, 1999) and putatively associated with centromeric repeat motifs (not shown). Of the largest seven scaffolds, only scaffolds 4 and 7 carried no such annotations. Notably, several large clusters of 5S rRNA genes were interspersed throughout the pericentromeric region of scaffold 1, whereas the remaining four scaffolds contained 18S and 28S rRNA gene annotations. Finally, we identified 243 homologs from 114 snoRNA families.

sRNA annotation
We identified 19 386 siRNA loci. More than 98% of these loci corresponded to heterochromatic 23-to 24-nt siRNA loci, with only 196 producing 20-to 22-nt siRNAs. The sRNA loci were expressed unevenly across tissues, as inferred from prediction with data from different tissues. Only 2938 loci were shared across all four tissues studied (rosette leaves, roots, inflorescences and pollen). Inflorescences were the major contributor with 6728 private loci. Despite these differences between tissues, we Altogether, sRNA loci accounted for~8 Mbp or~1.5% of the assembled genome. Of the seven largest scaffolds, where the majority of genes are located, the total coverage of siRNA loci ranged between 1.5% and 2% and the loci appeared to be preferentially concentrated at the boundary between TEs and the protein-coding gene fraction of the genome. To further explore this, we partitioned the seven largest scaffolds into gene-enriched and gene-depleted regions, based on a median of 14 genes per Mbp and a mean of 54.2 genes per Mbp. We defined geneenriched loci as those above and gene-depleted loci as those below the mean. At the chromosomal level, sRNA loci correlated with gene-enriched regions and were scarce in regions with high TE content. This trend is in contrast to that observed in A. thaliana (Hardcastle et al., 2018) but resembles what has been observed, for example, in maize (He et al., 2013) and tomato (Tomato Genome Consortium, 2012).
Phased secondary siRNAs (phasiRNAs) are a class of secondary sRNAs that, due to the way they are processed, produce a distinct periodical pattern of accumulation (Axtell, 2013b). In the T. arvense genome, we observed 139 loci with such phased patterns. In contrast to the general notion that phasiRNAs are typically 21 nt long (Lunardon et al., 2020), we found 24-nt siRNAs to be dominant in 133 of these loci.

MicroRNAs
MicroRNA (miRNA)-encoding genes were predicted using a combination of ShortStack and manual curation (see Methods). We identified 72 miRNA-producing loci, with 53 that were already known from other species, and 19 appeared to be species-specific. Most of the identified families were produced from only one or two loci, with miR156 and miR166 being produced by the most loci, with eight and five family members, respectively. A total of 21 out of 25 families in T. arvense are found in other rosids, and three (miR161, miR157 and miR165) only in other Brassicaceae. One family, miR817, is also present in rice. There is a strong preference for 5 0 -U at the start of both unique and conserved miRNAs ( Figure S8), in line with previous reports (Voinnet, 2009). The expression level of both conserved and novel miRNA families was compared between tissues, showing that the ten most highly expressed across all tissues are conserved families, whereas novel miRNA demonstrates a marginal tendency to be more lowly expressed or with potential for differential expression ( Figure S9).

sRNA loci
When we overlaid the sRNA loci with our annotated genomic features, most sRNAs localized to the intergenic space, but a substantial fraction, especially 20-to 22-nt sRNAs, were produced  Figure S10a). Helitrons make up only 1.5% of the genome space, yet more than 5% of sRNA biogenesis loci overlap with this type of TE. Most sRNA loci (93.0%) fell within 1.5 kbp of annotated genes or TEs (Figure S10b,c). As expected, 23-to 24-nt sRNAs were more frequently associated with TEs, whereas 20-to 22-nt sRNAs were more often produced by coding genes (Axtell, 2013a).

Pseudogenes
In accordance with the MAKER-P protocol, pseudogenes (Ψ) were predicted in intergenic DNA with the ShiuLab pseudogene pipeline (Zou et al., 2009). A total of 44 490 set II pseudogenes were annotated, exceeding those in A. thaliana (~3700) or rice (~7900) by one order of magnitude. We identified 35 818 pseudogenes overlapping with TEs, and 8672 pseudogenes that were either concentrated in intergenic space or more towards the protein-coding gene complement of the genome, and thus perhaps less likely to have arisen from retrotransposition. Approximately 59.2% of these contained neither a non-sense nor a frameshift mutation, indicating either (i) that the regulatory sequences of the pseudogenes were silenced first, (ii) a pseudoexon that may be linked to another non-functional exon, or (iii) a possible undiscovered gene.

Transposable elements
In total, we identified 423 251 TEs belonging to 10 superfamilies and covering~61% of the genome (Figure 3d). Retrotransposons (75% of all TEs are Gypsy elements; 10% Copia; 4% LINE) by far outnumbered DNA transposons (3% Helitrons; 1% hAT; 2% CACTA; 1% Pif-Harbinger; 2% MuLE). A detailed breakdown of repeats is shown in Table S4. As the most abundant retrotransposon superfamily, Gypsy elements accounted for 46% of the total genome space, which is consistent with a high abundance observed in the pericentromeric heterochromatin of E. salsugineum, where centromere expansion is thought to have been caused by Gypsy proliferation (Zhang et al., 2020). In addition, we identified 359 protein-coding genes located fully within TE bodies that could represent Pack-TYPE elements and contribute to gene shuffling (Catoni et al., 2018). Among these elements, 153 were intersecting with mutator-like elements, suggesting they correspond to Pack-MULE loci. TEs were located primarily in low gene density regions, while the fraction of TE-contained genes was randomly distributed.

Expression atlas
With cDNA sequences from 11 different tissues or developmental stages, we could annotate tissue-specific expression patterns. The complete expression atlas is provided in Data S1. We evaluated the relative extent of tissue-specific gene expression using the Tau (s) algorithm (Yanai et al., 2005), from the normalized trimmed mean of M-value (TMM) counts in all tissues (Robinson and Oshlack, 2010). To preclude potential biases caused by substantial differences in library size, we excluded low-coverage samples from mature seeds and old green siliques. In total, 4045 genes had high or even complete tissue specificity (s = 0.8-1.0), while 5938 genes had intermediate specificity (0.2-0.8) and 6107 had no or low specificity (0-0.2); the remaining genes were ignored due to missing data. The relative breakdown of each specificity fraction by tissue type is shown in Figure 4a, with 'roots', 'green seeds' and 'inflorescences' representing the tissues with the greatest proportion of high or complete specificity genes. The relative log 2 (TMM) expression values of the top 30 most highly expressed genes in each tissue, given a high or complete specificity score, are plotted in Figure 4b with respect to the overall mean expression per gene across all included tissues. These include, for example, genes with homology to EXTENSIN 2 (EXT2; A. thaliana) in 'roots', CRUCIFERIN (BnC1; B. napus) in 'green seeds', and PECTINESTERASE INHIBITOR 1 (PMEI1; A. thaliana) in 'inflorescences' and 'open flowers' (Data S2).

DNA methylation
Cytosine methylation (also commonly referred to as DNA methylation) is a prevalent epigenetic mark in plant genomes and is often associated with heterochromatin and transcriptional inactivation of TEs and promoters, but also with higher and more stable expression when present in gene bodies (Zhang et al., 2018). In plants, DNA methylation occurs in three cytosine contexts, CG, CHG and CHH (where H is any base but G), with the combined presence of CG, CHG and CHH methylation usually indicative of heterochromatin formation and TE silencing, while gene body methylation consists only of CG methylation (Bewick and Schmitz, 2017). In the light of the high TE density in T. arvense, we analysed genome-wide DNA methylation by whole-genome bisulphite sequencing (WGBS) in shoots and roots of 2-week-old seedlings. Genome-wide, 70% of cytosines were methylated in the CG context, 47% in the CHG context and 33% in the CHH context. In line with findings in other Brassicaceae, methylation at CG sites was consistently higher than at CHG and CHH ( Figure 1a; Figure S11). When we compared the WGBS data against the genome annotation, high levels of DNA methylation (mostly m CG) colocalized with regions of dispersed repeats and TEs in the centre of the chromosomes. Conversely, methylation was depleted in gene-rich regions (Figure 1a,b). In line with this, DNA methylation was consistently high along TEs, particularly in the CG context ( Figure 4c). In contrast to E. salsugineum (Bewick In contrast to TE and promoter methylation, gene body methylation (gbM) is generally associated with medium-to-high gene expression levels (Zhang et al., 2006;Zilberman et al., 2006). gbM occurs in~30% of protein-coding genes in A. thaliana, with DNA methylation increasing towards the 3 0 -end of the gene (Zhang et al., 2006). The T. arvense relative E. salsugineum lacks gbM Niederhuth et al., 2016). gbM was also largely absent in T. arvense (Figure 4d), suggesting that gbM was lost at the base of this clade.

Genetic variation in a pennycress collection
Knowledge of genetic diversity within wild populations is an essential process for improvement and domestication of new crop species. We analysed a geographically broad sample of forty accessions ( Figure S12) using whole-genome resequencing to characterize population structure and variation in germplasm available for breeding. We identified a total of 13 224 528 variants with QD value of ≥2000. Of these, 12 277 823 (92.8%) were SNPs, 426 115 (3.2%) were insertions, and 520 590 (3.9%) were deletions relative to the reference genome. Across all variants, 661 156 (2.9%) were in exons, with 340 132 synonymous, 314 075 nonsynonymous and 6949 non-sense changes. STRUCTURE analysis of both indel and SNP data sets resulted in optimal models of k = 3 populations ( Figure S13). Both data sets assigned the three lines of Armenian descent, which were highly distinct and had the largest genetic distance to the other accessions, to a single discrete population with limited to no gene flow to the other populations. These results are consistent with previous reports in pennycress  and were further supported by whole-genome dendrograms ( Figure 5a). We also calculated linkage disequilibrium (LD) among 2 518 379 genome-wide markers and chromosome-specific markers using TASSEL v5.2.75 (Bradbury et al., 2007) with a sliding window of 40 markers. The r-squared values were plotted against the physical distance with a LOESS curve fitted to the data to show LD decay ( Figure S14). Genome-wide, LD decayed to an r-squared value (r 2 ) of 0.2 over 6.2 kbp (Hill and Weir, 1988), which is comparable to LD decay reported in related Brassica species at r 2 = 0.3, including B. rapa (2.1 kbp)  and B. napus (12.1 kbp) (Lu et al., 2019).

Gene structure variation in pennycress accessions
The natural variation present in germplasm is an important source of alleles to facilitate breeding efforts and presents an opportunity to understand the evolution of gene families and adaptation within a species. To understand these in a more targeted approach, we sequenced on the PacBio Sequel platform the transcriptomes of two accessions, MN108 and Spring32-10, that are amenable to transformation and gene editing (McGinn et al., 2019), using RNA from leaves, roots, seeds, flowers and siliques. We constructed de novo reference transcriptomes using the Iso-seq3 pipeline, resulting in 25 296 and 26 571 accession-specific isoforms for MN108 and Spring32-10, respectively. These transcriptomes were then polished using the raw reads and processed through the SQANTI3 pipeline (Tardaguila et al., 2018) to characterize the genes and isoforms identified in each of the accessions. We identified 212 of 220 unique genes and 3780 of 3857 unique isoforms for MN108 and Spring32-10 respectively compared with the new reference. Transcripts mapping to the known reference denoted by 'Full Splice Match' (FSM) and 'Incomplete Splice Match' (ISM) accounted for 28.7% and 30.6% of all transcript models in MN108 and Spring32-10, respectively (Figure 5b, c). Transcripts of the antisense, intergenic and genic intron categories collectively accounted for a total of 12.0% (MN108) and 11.2% (Spring32-10). About~15% of all identified transcripts were novel isoforms when compared to the reference transcriptome for T_arvense_v2.

Mapping a pale seedling phenotype
From a segregating population with a high oleic pennycress (fae-1/rod1-1) background (Chopra et al., 2020b), we identified pale seedling lines (Figure 5d). This phenotype segregated in a Mendelian fashion. To determine the genetic control for this phenotype, we separately pooled genomic DNA from 20 wildtype and 20 pale plants. We processed sequence data obtained from each of these pools through the MutMap pipeline (Sugihara et al., 2020) and discovered a putative genomic interval (63.85-63.95 Mbp) on scaffold 6 linked to the pale phenotype. SnpEff (Cingolani et al., 2012) identified polymorphisms that might have deleterious effects on function of genes in this region (Table S5). The most obvious candidate is MEX1, encoding a maltose transporter located in the chloroplast, knockout of which causes a pale seedling phenotype in A. thaliana (Niittyl€ a et al., 2004).

Discussion
In this study, we report a high-quality reference genome assembly and annotation for T. arvense (var. MN106-Ref), a newly domesticated oilseed crop for the cooler climates of the world. The improved genome assembly, containing seven chromosomelevel scaffolds, revealed two main features: a landscape characterized by a large repetitive fraction populated with TEs and pseudogenic loci in pericentromeric regions, and a gene complement similar in size to other Brassicaceae and densely concentrated towards the telomeres (Figure 1). Previous annotations were enriched with additional gene models for proteincoding loci, and now include non-coding genes for tRNAs, rRNAs, snoRNAs, siRNAs and miRNAs, alongside predicted pseudogenes and TEs (Table 2). These newly improved assembly features will allow for efficient combining of traits and help accelerate future breeding as it would provide knowledge about the gene localization and the linkage of genes of interest. For example, the improved genome assembly has revealed that multiple domestication syndrome genes (ALKENYL HYDROXALKYL PRO-DUCING 2-like, TRANSPARENT TESTA 8, EARLY FLOWERING 6) ( Figure S1) are located on a single chromosome.
Improved genomic resources can facilitate general understanding of plant biology and evolutionary biology while aiding plant breeding and crop improvement (Scheben et al., 2016). For example, pennycress and Arabidopsis share many key features that made Arabidopsis the most widely studied model plant system (Meinke et al., 1998). The use of Arabidopsis for translational research and for identifying potential gene targets in T. arvense is possible and has been extensively validated (Chopra et al., 2018(Chopra et al., , 2020a(Chopra et al., , 2020bJarvis et al., 2021;McGinn et al., 2019). Previous studies have suggested that over a thousand unique genes in T. arvense are represented by multiple genes in Arabidopsis and vice versa. Our comparative genomics by way of synteny with E. salsugineum  revealed a high level of agreement, particularly between the protein-coding fraction of the genome, represented as conserved blocks in the largest seven scaffolds relative to the ancestral karyotype in Brassicaceae (Murat et al., 2015; Figure 2). The detailed description of gene synteny between T. arvense and other Brassicaceae provides insights into the evolutionary relevance of T. arvense within lineage II of Brassicaceae. In addition, the difference in genome size between T. arvense and other species, despite the reduced level of gene duplication and the 1:1 gene relationship, can be explained by the large repetitive fractions present throughout both the centromeric and pericentromeric regions. In the absence of whole-genome duplication events, these repetitive fractions indicate that the increased genome size may be a consequence of active TE expansion. This is therefore suggestive of a mechanism by which deleterious retrotransposon insertions must be mitigated in T. arvense. This could be explained by the high proportion of Gypsy retrotransposons in this species, usually located in heterochromatic regions, or by integration site selection (Sultana et al., 2017), or otherwise by silencing by small RNA activity and/or DNA methylation (Bucher et al., 2012;Sigman and Slotkin, 2016). Given the relatively high error rate of PacBio CLR reads (~10% before correction) with respect to circular consensus sequencing (CCS), the repetitive fraction would also help to explain the initial overestimation of the assembly size as a result of duplicated contigs. We also detected several loci with highly overrepresented read coverage indicative of repeat collapsing during the assembly process, often intersecting with 5S, 18S and 28S rRNA annotations. Such regions are difficult even for current long read technologies due to the large size of the tandem repeat units.
With the availability of improved genomic resources, increasing interest has turned towards understanding tissue-specific gene regulation to reduce pleiotropic effects upon direct targeting of genes during crop improvement. In this study, we have generated a resource using mRNA-seq, sRNA-seq and WGBS to gain insights into genes and their associated regulatory landscape. These data sets help elucidate the extent of tissue specificity and provide useful information for gene modification targets. For example, fatty acid desaturase 2 gene (FAD2; Ta12495 -T_arvense_v1) is involved in the oil biosynthesis pathway and is expressed in many different tissues analysed in this study (Data S1). FAD2 gene knockout should result in higher levels of oleic acid in the seed oil and provide an opportunity for pennycress oil to be used in food applications. It has been observed, however, that knockout mutants in pennycress display delayed growth and reduced seed yields in spring types (Jarvis et al., 2021), and reduced winter survival in the winter types , as a purported consequence of its broad expression profile. Similarly, genes such as AOP2-LIKE (Tar-vense_05380 -T_arvense_v2) have been targeted to reduce glucosinolates in pennycress seed meal for food and animal feed applications (Chopra et al., 2020b). However, AOP2-LIKE, too, is expressed in many tissues during development, which might explain why knockout plants with reduced glucosinolate content are reportedly more susceptible to insect herbivores such as flea beetles feeding on rosette leaves and root tissues (Marks et al., 2021). Our tissue-specific expression data suggest that, to overcome this challenge, one could alternatively target genes such as Glucosinolate Transporter 1 (GTR1; Tarvense_14683), which is expressed specifically in reproductive tissues (Data S1). This might achieve the desired reductions of seed glucosinolates while avoiding developmental defects. Such approaches have been effectively used in Arabidopsis and many Brassica species (Andersen and Halkier, 2014;Nour-Eldin et al., 2012).
Finally, the forty resequenced accessions described here provide a rich source of variants that reflect the genetic diversity and population structure of the species in the collection (Figure 5a). Further evaluations of transcriptome sequences showed ample variation in the transcripts from two separate lines -MN108 and Spring32-10that are highly amenable to transformation and highlighted the potential for developing pangenomes in the future. These genomic resources will facilitate genetic mapping studies in pennycress in both natural populations and mutant panels. We have identified genomic regions associated with a pale leaf mutant in pennycress seedlings using a modified BSA-Seq approach in this study (Figure 5d).
Over the last few years, significant efforts have been made towards the discovery of crucial traits and translational research in pennycress, centring on MN106-Ref and the gene space information generated by Dorn et al. (2013) and Dorn et al. (2015). In this study, we continued to generate genomic tools for this accession, with improved contiguity and high-quality annotations to make T. arvense var. MN106-Ref more accessible as a fieldbased model species for genetics and epigenetics studies and to provide tools for this new and extremely hardy winter annual cash cover crop. However, the assembly of additional accessions can only help to further enrich the resources available for the study of pennycress. In parallel to this study, a Chinese accession of T. arvense (YUN_Tarv_1.0) was assembled using Oxford Nanopore, Illumina HiSeq and Hi-C sequencing (Geng et al., 2021). This timely availability of an additional frame of reference opens the door to a pan-genomic approach in evolutionary research and allows for the better characterization of structural variants moving forward. Furthermore, the use of different sequencing technologies and assembly software provides an additional avenue to correct misassemblies and base calling errors in either case. The overall longer contigs assembled with PacBio CLR, for example, and the consideration of various genetic map data in addition to Hi-C provides a greater resolution of scaffolds particularly throughout the centromere and pericentromeric regions ( Figure S15). The reduced error rate of PacBio CCS (used for polishing) is also reflected in the overall k-mer content, which is measured with a two-order magnitude higher consensus quality over scaffolds representing chromosomes and~99% overall completeness for T_arvense_v2 (Tables S6-S8), indicative of high-quality, error-free sequences more appropriate for variant calling, for instance. Geng et al. (2021) also reported WGS analysis on forty Chinese accessions and reported an LD decay of 150 kbp at an r-squared value (r 2 ) of 0.6, which is considerably higher than the values determined on the forty accessions in this study, as well as those reported for related Brassica species (Lu et al., 2019;Wu et al., 2019). We believe the combination of resources will allow us to investigate the differences that might exist between accessions originating from different geographic locations around the world and help provide further insight into structural variations and evolutionary dynamics.
In conclusion, the T_arvense_v2 assembly offers new insights into the genome structure of this species and of lineage II of Brassicaceae more generally, and it provides new information and resources relevant for comparative genomic studies. The tools presented here provide a solid foundation for future studies in an alternative model species and an emerging crop.

Seeds for the reference genome development
Seeds from a small natural population of T. arvense L. were collected near Coates, MN by Dr. Wyse, and the accession number MN106 was assigned to this population. We propagated a single plant for ten generations from this population, and we refer to this line as MN106-Ref.
Sample collection, library preparation and DNA sequencing for assembly

PacBio CLR library
Plants were cultivated, sampled and prepared at the Max Planck Institute for Developmental Biology (T€ ubingen, Germany). Plant seeds were stratified in the dark at 4°C for 4-6 day prior to planting on soil. Samples were collected from young rosette leaves of T. arvense var. MN106-Ref seedlings, cultivated for 2 weeks under growth chamber conditions of 16-23°C, 65% relative humidity and a light/dark photoperiod of 16 h:8 h under 110-140 lmol/m 2 /s light. High molecular weight (HMW) DNA was obtained following nucleus isolation and DNA extraction with the Circulomics Nanobind Plant Nuclei Big DNA Kit according to the protocol described in Workman et al. (2018) and (Workman et al., 2019). A total of 11 extractions from 1.5-2 g frozen leaves each were processed in that way, yielding a pooled sample with a total of 12 lg of DNA by Qubit â 2.0 fluorometer (Thermo Fisher Scientific, Waltham, MA) estimation, and high DNA purity with a mean absorbance ratio of 1.81 at 260/280 nm absorbance and 2.00 at 260/230 nm absorbance, as measured by NanoDrop 2000/2000c spectrophotometer (Thermo Fisher Scientific, Waltham, MA). HMW DNA was sheared by one pass through a 26G needle using a 1-mL syringe, resulting in an 85-kb peak size sample as estimated by FEMTO Pulse Analyzer (Agilent Technologies, Santa Clara, CA). A large insert gDNA library for PacBio Sequel II CLR sequencing was prepared using the SMRTbell â Express Template Preparation Kit 2.0. The library was size-selected for >30 kb using BluePippin with a 0.75% agarose cassette (Sage Science) and loaded into one Sequel II SMRT cell at a 32 pM concentration. This yielded a genome-wide sequencing depth of approximately 476X over~6.9 million polymerase reads with a subread N50 of~38 kbp.

PacBio CCS library
MN106-Ref plants were grown in growth chambers at the University of Minnesota. Individual plants were grown to form large rosettes for isolating DNA. Approximately 25 g of tissue was harvested and submitted to Intact Genomics (Saint Louis, MO) for high molecular weight DNA extraction. This yielded a pooled sample with a total of 269 ng of DNA by Qubit â (Thermo Fisher Scientific, Waltham, MA) estimation, and high DNA purity with a mean absorbance ratio of 1.87 at 260/280 nm and 2.37 at 260/ 230 nm, as measured by Nanodrop spectrophotometer (Thermo Fisher Scientific, Waltham, MA). To further clean up the high molecular weight DNA, we used Salt:Chloroform Wash protocol recommended by PacBio. This yielded a total of 12.1 ng/uL of high-quality DNA for library preparation. A large insert gDNA library was prepared, and 15 kb High Pass Size Selection on Pippin HT was performed at the University of Minnesota Genomics Center (Minneapolis, MN). These libraries were sequenced on 4 SMRT cells using PacBio Sequel II (Pacific Biosciences, Menlo Park).

Bionano library
High molecular weight DNA was isolated from young leaves and nicking endonuclease -BspQI was chosen to label high-quality HMW DNA molecules. The nicked DNA molecules were then stained as previously described (Lam et al., 2012). The stained and labelled DNA samples were loaded onto the NanoChannel array (Bionano Genomics, San Diego, CA) and automatically imaged by the Irys system (Bionano Genomics, San Diego, CA).

Hi-C library
The MN106-Ref plant tissue used for PacBio CCS was submitted to Phase Genomics (San Diego, CA). The Hi-C library was prepared following the proximo Hi-C plant protocol (Phase Genomics, San Diego, CA), and the libraries were sequenced to 116X depth on an Illumina platform with the paired-end mode and read length of 150 bp.

Illumina PCR-free library
Libraries for PCR-free short read sequencing were prepared from MN106-Ref genomic DNA using the TruSeq DNA PCR-Free Low Throughput Library Prep Kit (Illumina, San Diego, CA) in combination with TruSeq DNA Single Indexes Set A (Illumina, San Diego, CA) according to the manufacturer's protocol. We prepared two libraries, with average insert sizes of 350 bp and 550 bp, respectively. Samples were sequenced to 125X depth (~66 Gb) on an Illumina HiSeq 2500 (Illumina, San Diego, CA) instrument with 125-bp paired-end reads.

Genome assembly and construction of chromosomelevel scaffolds
The initial assembly was performed using Canu v1.9  with default options, aside from cluster runtime configuration and the settings corOutCoverage=50, minReadLength=5000, minOverlapLength=4000, correctedErrorRate=0.04 and genomeSize=539m, which were selected based on the characteristics of the library. Canu performs consensus-based read correction and trimming, resulting in a curated set of reads that were taken forward for assembly ( Figure S16).
The resulting assembly overestimated the genome size by approximately 53% (Table S2), which we surmised was likely due to uncorrected sequencing errors in the remaining fraction of reads, in which Canu was able to assemble into independent, duplicated contigs. Analysis of single-copy orthologs from the Eudicotyledons odb10 database with BUSCO v3.0.2 (Simão et al., 2015) revealed a high completeness of 98.4% and a duplication level of 23.6% (Table S6). Subsequent alignment of the reads to the assembly using minimap2 v2.17 (Li, 2018) and purge_dups v1.0.1 (Guan et al., 2020) presented bimodal peaks in the read depth distribution, indicative of a large duplicated fraction within the assembly ( Figure S17). As efforts to collapse this duplicated fraction using assembly parameters were unsuccessful, and purge_dups is intended to correct duplication arising from heterozygosity (which does not apply in T. arvense), the fraction was reduced by manual curation instead. Contigs starting from the left-hand side of the read depth distribution were consecutively removed until reaching an approximation of the estimated genome size, with any contigs containing non-duplicated predicted BUSCO genes kept preferentially in favour of discarding the next contig with lower read depth in the series.
The deduplicated assembly from Canu was polished with the PacBio Sequel II HiFi CCS reads using two iterations of RACON v1.4.3 (Vaser et al., 2017), prior to repeat reassembly. Bionano maps were used to build de novo scaffolds using the polished assembly; hybrid scaffolds were generated using the de novo Bionano maps and the assembly (https://bionanogenomics.com/ support-page/data-analysis-documentation/). To further resolve repetitive regions and improve assembly contiguity, the bionanoscaffolded assembly was integrated into the HERA pipeline (Du and Liang, 2019). The Hi-C data were aligned with bwa-mem v0.7.17 (Li and Durbin, 2009), PCR duplicates were marked with picard tools v1.83 (http://broadinstitute.github.io/picard), and the quality was assessed with the hic_qc.py tool of Phase Genomics (https://github.com/phasegenomics/hic_qc). The assembly was then scaffolded with the Hi-C alignments using SALSA v2.2 (Ghurye et al., 2017) and subsequently polished with the PCR-free Illumina data using two iterations of PILON v1.23 (Walker et al., 2014). The final assembly was the result of a metaassembly with quickmerge v0.3 (Chakraborty et al., 2016), which combined the current assembly with an earlier draft version assembled using Canu 1.8  directly from the PacBio CCS reads and polished only with the Illumina PCR-free short-reads, following an almost identical workflow, in order to help address the possibility of misassembly arising from technical sources and improve overall contiguity. This resulting assembly was evaluated with BUSCO (Simão et al., 2015) and QUAST  (Gurevich et al., 2013). Intermediate assembly statistics are given in comparison with (i) immediately after Canu, and (ii) the final version after rescaffolding (Table S2).

Genome size estimation using flow cytometry and k-merbased approach
The nuclei of field pennycress line MN106-Ref, Arabidopsis thaliana, maize and tomato were stained with propidium iodide, and fluorescent signals were captured using a Becton-Dickinson FACSCanto flow cytometer (https://www.bdbiosciences.com/). DNA content for all four species that corresponded to G 0/1 nuclei is listed in Table S1. The genome size of Arabidopsis is 135 Mb, and therefore, the genome size of pennycress was calculated to be 501 AE 33 Mb. Using the Illumina HiSeq2500 platform, we obtained~1009 PCR-free reads, which were used for subsequent K-mer analysis using Jellyfish (Marc ßais and Kingsford, 2011). The 101-mer frequency distribution curve exhibited a peak at 22 kmer, and analysis showed that the total number of K-mers was 11 403 836 319. Using the formula of genome size = total K-mer number/peak depth, the genome size of this sequencing sample was estimated to be 518 356 196 bp. Similarly, the single-copy content of the genome was estimated to reach 79%. Using both methods of genome size estimation, we found the pennycress genome ranged from 459 to 540 Mb.

Development of genetic maps for rescaffolding
To improve the contiguity and correct misassemblies, we developed two genetic linkage maps using F 2 populations. The first linkage map was derived from a cross between a wild Minnesota accession 'MN106-Ref' and a genetically distant Armenian accession 'Ames32867'. The resulting F 1 plants were allowed to self-fertilize, and seeds from a single plant were collected and propagated to the F 2 generation. Approximately 500 mg fresh tissue was collected from 94 individuals in the F 2 population. The tissue was desiccated using silica beads and pulverized using a TissueLyser. DNA was isolated with the BioSprint DNA Plant Kit (Qiagen, Valencia, CA). The F 2 population along with the two parental genotypes was genotyped with genotyping by sequencing at the University of Minnesota Genomics Center (Minneapolis, MN). Each sample was digested with the BtgI_BgLII restriction enzyme combination, barcoded and sequenced on the Illumina NovaSeq S1 (single-end 101 bp) yielding 1 237 890 mean reads per sample. The raw reads were demultiplexed based on the barcode information and aligned to the most recent iteration of the pennycress genome using bwa. Sequence-aligned files were processed through samtools v1.9  and picard tools to sort the files and remove group identifiers. Variants were called using GATK HaplotypeCaller v3.3.0. SNPs identified among these 94 lines were used for the development of genetic maps. The second linkage map was derived from a cross between MN106-Ref and a mutant line '2019-M2-111'. To identify the variant alleles in 2019-M2-111, we performed whole-genome resequencing using paired-end reads on the Illumina Platform. SNPs were identified using a similar approach as described above. Sixtyseven SNP markers were designed using the biallelic information from resequence data. DNA was extracted from 48 samples from the mutant F 2 population using the Sigma-Aldrich ready extract method, allele-specific and flanking primers synthesized from IDT (Iowa, USA) for each of the alleles were mixed (Data S3), and genotyping was performed using the methods described in Chopra et al. (2020a).
A total of 35 436 SNPs were identified among the population used for the first linkage map, SNP sites were selected with nomissing data, QD > 1000, and the segregation of the markers was 1:2:1. A total of 743 high-quality SNPs were retained for further analysis. A genetic map for the population was constructed using JoinMap 5 (Stam, 1993). Only biallelic SNPs were used in the analysis, and genetic maps were constructed with regression mapping based on default parameters of recombination frequency of <0.4 with only the first two steps. The Kosambi mapping function was chosen for map distance estimation, and the Ripple function was deployed to confirm marker order within each of the seven linkage groups. A total of 319 markers were mapped to seven linkage groups (Data S4). Similarly, 67 markers were genotyped on 48 individuals from the second population of linkage and 52 markers were mapped to six linkage groups (Data S5). Both of these linkage maps were used for reordering and correcting the scaffolds as described below.

Rescaffolding
Initial exploration regarding gene and TE distributions and methylation patterns pointed to potential misassemblies in the assembled genome. Further investigation by way of synteny comparison with a closely related species, Eutrema salsugineum , revealed that several of these likely occurred during scaffolding as orientation errors. Some of these errors could also be supported in comparison with the recent assembly of a Chinese accession (YUN_Tarv1.0) of T. arvense. Consequently, we manually introduced breakpoints at selected loci in the assembled genome where they were supported by at least two sources of data from whole-genome alignments to YUN_Tarv1.0, synteny maps to E. salsugineum (derived from reciprocal best blast), genetic linkage maps (wild-derived and EMS mutation based) and Hi-C contact maps. These were crossexamined with minimap2 alignments of PacBio CLR reads to the genome, an overview of corresponding gene distributions produced by Liftoff v1.5.2 (Shumate and Salzberg, 2020) and the resulting synteny analysis to E. salsugineum. The resulting contigs were then rescaffolded with ALLMAPS v1.1.5 (Tang et al., 2015) to produce the final assembly, integrating both the synteny map and genetic map data and manually discounting contigs that were supported only by single markers. The final assembly statistics in comparison with previous intermediary stages are given in Table S2.

Genome alignments and synteny analysis
The genome alignments between the different versions of the T. arvense assembly to E. salsugineum were done using MUMmer we used MCScan in the JCVI utility library (https://github.com/ tanghaibao/jcvi; Tang et al., 2008). The ortholog relationships were obtained using the proteinic translation of the CDS and using the argument --cscore=0.99. To define the syntenic blocks and the corresponding genomic coordinates, we used the parameters --minspan=15 and --minsize=5. The genomic coordinates from the syntenic blocks were parsed to draw the syntenic relationships using Circos v0. 69-8 (Krzywinski et al., 2009).
To determine the different ancestral Brassicaceae chromosomal blocks (ABKs), we took the ortholog relationship between each gene in T. arvense and A. thaliana from the synteny analysis, and compared it with a gene list derived from Murat et al. (2015) where each ortholog gene of A. thaliana had an assigned ABK block (Murat et al., 2015).

Genome annotation
Tissue preparation for RNA sequencing Thlaspi arvense var. MN106-Ref seeds were surface-sterilized with chlorine gas for 1 h and stratified for 3 day at 4°C. For seedlingstage RNA extractions, seeds were plated on ½ MS medium supplemented with 1% plant agar and stratified for 3 day at 4°C. For all other tissue collections, plants were sown on soil and grown in a climate-controlled growth chamber in long-day conditions (16/8-h light/dark at 21°/16°C, light intensity 140 µE/m 2 *s, with 60% relative humidity); plants were watered twice per week. Two weeks after germination, plants growing on soil were vernalized at 4°C in the dark for 4 weeks, then moved back to the growth chamber. Samples were collected from 11 different tissues in three biological replicates (two in case of mature seeds); for each replicate, we pooled tissue from two individuals. Tissues included the following: one-week-old shoots (from plate culture), one-week-old roots (from plate culture), rosette leaves, cauline leaves, inflorescences, open flowers, young green siliques (about 0.5 9 0.5 cm), older green siliques (about 1 9 1 cm), seed pods, green seeds and mature seeds.

RNA extraction and sequencing
Total mRNA was extracted using the RNeasy Plant Kit (Qiagen, Valencia, CA) and treated with DNase I using the DNA-free Kit DNase Treatment and Removal Reagents (Ambion by Life Technologies, Carlsbad, CA), following the manufacturer's protocols. cDNA libraries were constructed using the NEBNext Ultra II Directional RNA Library Prep Kit (New England BioLabs, Ipswich, MA, USA Inc.) for Illumina following the manufacturer's protocol. Libraries were sequenced on a HiSeq 2500 instrument (Illumina, San Diego, CA) as 125-bp paired-end reads.

Transcriptome assembly
Following quality control and adapter clipping with cutadapt (Martin, 2011), biological replicates for each of eleven tissue types from Illumina mRNA-seq libraries were aligned independently using STAR v2.5.3a (Dobin et al., 2013), then merged according to tissue type, prior to assembly by a reference-based approach. Each assembly was performed using Ryuto v1.3m (Gatter and Stadler, 2019), and consensus reconstruction was then performed using TACO v0.7.3 (Niknafs et al., 2017) to merge tissuespecific transcriptome assemblies. PacBio Iso-seq libraries from MN106-Ref were refined, clustered and polished following the Iso-seq3 pipeline (https://github.com/PacificBiosciences/IsoSeq), prior to alignment with STARlong and isoform collapsing using the cDNA_Cupcake (https://github.com/Magdoll/cDNA_Cupcake) suite. The Iso-seq data were later leveraged together with the Illumina mRNA-seq data to prioritize convergent isoforms using custom in-house scripting.

Genome annotation
The final assembly was annotated using the MAKER-P v2.31.10  pipeline on the servers provided by the EpiDiverse project, at ecSeq Bioinformatics GmbH (Leipzig, Germany). Plant proteins were obtained from the Viridiplantae fraction of UniProtKB/Swiss-Prot and combined with RefSeq sequences derived from selected Brassicaceae: Arabidopsis thaliana, Brassica napus, Brassica rapa, Camelina sativa and Raphanus sativus. TEs were obtained from RepetDB (Amselem et al., 2019) for selected plant species: Arabidopsis lyrata, Arabidopsis thaliana, Arabis alpina, Brassica rapa, Capsella rubella and Schrenkiella parvula (Eutrema parvulum). Repeat library construction was carried out using RepeatModeler v1.0.11 (Smit and Hubley, 2008) following basic recommendations from MAKER-P . Putative gene fragments were filtered out following BLAST search to the combined Swiss-Prot + RefSeq protein plant database after exclusion of hits from RepetDB. The de novo library was combined with a manually curated library of plant sequences derived from repbase (Bao et al., 2015). Genome masking is performed with RepeatMasker v4.0.9 (Smit, 2004) as part of the MAKER-P pipeline. Proteincoding genes, non-coding RNAs and pseudogenes were annotated with the MAKER-P pipeline following two iterative rounds under default settings, using (i) transcript isoforms from Illumina mRNA-seq and PacBio Iso-seq data, (ii) protein homology evidence from the custom Swiss-Prot + RefSeq plant protein database and (iii) the repeat library and TE sequences for masking. The initial results were used to train gene models for ab initio predictors SNAP v2006-07-28 (Korf, 2004) and Augustus v3.3.3 (Stanke et al., 2006), which were fed back into the pipeline for the subsequent rounds. The final set of annotations was filtered based on Annotation Edit Distance (AED) < 1 except in cases with corresponding PFAM domains, as derived from InterProScan v5.45-80.0 (Jones et al., 2014). The tRNA annotation was performed with tRNAscan-SE v1.3.1 (Lowe and Eddy, 1997) and the rRNA annotation with RNAmmer v1.2 (Lagesen et al., 2007). The snoRNA homologs were derived using Infernal v1.1.4 (Nawrocki and Eddy, 2013) from plant snoRNA families described in Patra Bhattacharya et al. (2016). A small phylogeny based on gene orthologs and duplication events in comparison with selected Brassicaceae (A. lyrata, A. thaliana, B. rapa, S. parvula and E. salsugineum) was performed with OrthoFinder v2.5.2, and the resulting species tree is rooted using STRIDE (Emms and Kelly, 2017) and inferred from all genes using STAG (Emms and Kelly, 2018).

Transposable element annotation
Two de novo annotation tools, EDTA v1.7.0 (Ou et al., 2019) and RepeatModeler v2.0 (Flynn et al., 2020), were used to annotate TEs independently. For EDTA, the following parameters were used in addition to defaults: --species others, -step all, --sensitive 1, --anno 1, and -evaluate 1. For RepeatModeler2, the additional parameters were -engine ncbi and -LTRStruct. The outputs of both tools were evaluated by manual curation. First, we used tblastn to align each TE consensus with the transposase database obtained from repbase, and the retrotransposon domains (GAG, Pol, Env, etc.) were viewed one by one with dotter (Sonnhammer and Durbin, 1995). Sequences with multiple paralogs were mapped back to the genome and manually extended to determine the full-length boundary of each TE. A total of 107 fulllength, representative Copia and Gypsy families were successfully evaluated. The TE consensus from RepeatModeler2 was selected as the most accurate model based on full-length paralogs. RepeatMasker was then used to construct the GFF3-like file from the FASTA file from RepeatModeler2, with the optional settings: -e ncbi -q -no_is -norna -nolow -div 40cutoff 225. The perl script rmOutToGFF3.pl was used to generate the final GFF3 file.

sRNA plant material
Seeds were sterilized by overnight incubation at À80°C, followed by 4 h of bleach treatment at room temperature (seeds in open 2 mL tube in a desiccator containing a beaker with 40 mL chlorine-based bleach (<5%; DanKlorix, Colgate-Palmolive, New York, NY) and 1 mL HCl (32%; Carl Roth, Karlsruhe, Germany)). For rosette, inflorescence and pollen, seeds were stratified in the dark at 4°C for six days prior to planting on soil, then cultivated under growth chamber conditions of 16-23°C, 65% relative humidity and a light/dark photoperiod of 16 h:8 h under 110-140 lmol/m 2 /s light. Rosette leaves were harvested after two weeks of growth. For inflorescence and pollen, 6-week-old plants were vernalized for 4 weeks at 4°C in a light/dark photoperiod of 12 h:12 h under 110-140 lmol/m 2 /s light. Two weeks after bolting, inflorescence and pollen were collected. Pollen grains were collected by vortexing open flowers in 18% sucrose for 5 min followed by centrifugation at 3000g for 3 min in a swinging bucket rotor. For root samples, seeds were stratified for 6 days at 4°C in the dark on ½ MS media. Plants were grown in 3-4 mL ½ MS medium plates in long day (16 h) at 16°C. Root samples were collected 12-14 days after stratification.

sRNA extraction and library preparation
Total RNA was extracted by freezing collected samples with liquid nitrogen and grinding with a mortar and pestle with TRIzol reagent (Life Technologies, Carlsbad, CA). Then, total RNA (1 lg) was treated with DNase I (Thermo Fisher Scientific, Waltham, MA) and used for library preparation. Small RNA libraries were prepared as indicated by the TruSeq Small RNA Library Prep Kit (Illumina, San Diego, CA), using 1 lg of total RNA as input, as described by the TruSeq RNA sample prep V2 guide (Illumina, San Diego, CA). Size selection was performed using the BluePippin System (SAGE Science, Massachusetts). Single-end sequencing was performed on a HiSeq 3000 instrument (Illumina, San Diego, CA).

sRNA locus annotation
Raw FASTQ files were processed to remove the 3 0 -adapter and quality-controlled with trim_galore v0.6.6 (https://www. bioinformatics.babraham.ac.uk/projects/trim_galore/) using trim_ galore -q 30 --small_rna. Read quality was checked with FastQC v0.11.9 (https://www.bioinformatics.babraham.ac.uk/ projects/fastqc/). The reference annotation of sRNA loci was created following the steps indicated by Lunardon et al. (2020). In short, each library was aligned to the reference genome independently using ShortStack v3.8.5 (Axtell, 2013b), with default parameters, to identify clusters of sRNAs de novo with a minimum expression threshold of 2 reads per million (RPM). sRNA clusters from all libraries of the same tissue were intersected using BEDTools v2.26.0 multiIntersectBed (Quinlan and Hall, 2010) with default parameters, and only those loci present in at least three libraries were retained. For each tissue, sRNA clusters 25 nt apart were padded together with the bedtools merge -d option. sRNA loci whose expression was <0.5 RPM in all libraries of each tissue were also removed. Finally, sRNA loci for all different tissues were merged in a single file retaining tissue of origin information with bedtools merge -o distinct options. miRNAs predicted by the ShortStack tool were manually curated (Appendix S1) following the criteria of Axtell (2013b): maximum hairpin length of 300 nt; ≥75% of reads mapping to the hairpin must belong to the miRNA/miRNA* duplex; for the miRNA/miRNA* duplex, no internal loops allowed, two-nucleotide 3 0 overhangs, maximum five mismatched bases and only three of which are nucleotides in asymmetric bulges; and mature miRNA sequence should be between 20 and 24 nt.

Expression atlas
Gene expression was measured from the same tissue-specific STAR alignments taken prior to merging biological replicates for transcript assembly, excluding coverage outliers 'mature seed' and 'green old silique'. A total of 27 samples from 9 tissues were therefore considered for gene expression analysis. Raw counts were generated using subread featureCounts v2.0.1 (Liao et al., 2014) and subsequently normalized using the trimmed mean of M-values (TMM) (Robinson and Oshlack, 2010) derived from edgeR v3.34 . Averaged expression counts by group were taken for tissue specificity evaluation using the Tau (s) algorithm (Yanai et al., 2005), as implemented in the R package tispec v0.99.0 (https://rdrr.io/github/roonysgalbi/tispec/), which provides a measure of s in the range of 0 -1, where 0 is non/low specificity, and 1 indicates high/absolute specificity.

DNA methylation
We extracted genomic DNA from roots and shoots of 2-week-old seedlings grown on ½ MS medium with 0.8% agar and 0.1% DMSO. Seedlings were grown vertically in 16-h/8-h light/dark cycle; at the time of sampling, roots were separated from shoot tissue with a razor blade and the plant tissue was flash-frozen in liquid nitrogen. Genomic DNA was extracted from ground tissue using the DNeasy Plant Mini Kit (Qiagen, Hilden, Germany). Libraries for WGBS were prepared using the NEBNext Ultra II DNA Library Prep Kit (New England Biolabs). Adapter-ligated DNA was treated with sodium bisulphite using the EpiTect Plus Bisulfite Kit (Qiagen, Hilden, Germany) and amplified using the Kapa HiFi Uracil + ReadyMix (Roche, Basel, Switzerland) in 10 PCR cycles. WGBS libraries were sequenced on an Illumina HiSeq2500 instrument with 125-bp paired-end reads.

Population genomics
DNA from forty pennycress accessions was extracted from approximately 500 mg of leaf tissue pooled from five plants using a plant genomic DNA kit (Epoch Life Science). DNA was then subjected to whole-genome sequencing on an Illumina Novaseq sequencer (2 9 125 bp). Raw reads were then aligned to the new reference genome (T_arvense_v2) using bwa-mem (Li and Durbin, 2009). The aligned files were processed with Samtools and Picard tools, and variants were called using GATK HaplotypeCaller v3.3.0 (Ren et al., 2018). Variants were annotated using SnpEff 5.0e (Cingolani et al., 2012). Data sets for both Indel and SNP panels were trimmed based on LD prior to population genomic analysis using Plink v1.9 (Purcell et al., 2007) with the parameter --indep-pairwise 1000 5 0.5. Population structure for both SNP and indel data was then characterized using the admixture model and independent allele frequencies in STRUCTURE v2.3.4 (Pritchard et al., 2000). Dendrograms of both SNP and Indel data were generated under the UPGMA method using the R package poppr (Kamvar et al., 2014).
The forty accessions were planted in a three replication, randomized complete block design in a greenhouse maintained at 21/20°C and 16 hour days. Ten seeds per replicate were planted in 13.3-cm 2 pots in Sungrow propagation potting mix. Seedlings were thinned to one plant per pot after emergence. Winter annual accessions require vernalization to induce flowering, so all winter accessions were placed in a growth chamber maintained at 4°C with 16-h light for a period of 21 days about 4 weeks after emergence. Spring annual accessions were planted approximately five weeks after winter accessions. Data for days to flowering were collected on 34 accessions that germinated as the number of days that elapsed from the date of emergence to the appearance of the first flower. The vernalization requirement for winter accessions explains the large differences in mean number of days to flowering between spring and winter accessions. Additional phenotypes and data associated with these sequenced accessions are available in Data S6.

Structural variants using Iso-seq data
Single-molecule real-time (SMRT) isoform sequencing (Iso-seq) based on PacBio (Pacific Biosciences, Menlo Park, CA) generated reads was used to investigate unambiguous full-length isoforms for two pennycress wild accessions, MN108 and Spring32-10. Total RNA extraction was performed on the green seed, hypocotyl, seedling root and flower tissues from pennycress plants grown in a climate-controlled growth chamber maintained 21/20°C during 16-h:8-h day-night setting. Approximately 250 ng of total RNA was obtained and subjected to the Iso-seq Express Library Workflow (Pacific Biosciences, Menlo Park, CA). cDNA is synthesized from full-length mRNA with the NEBNext Single Cell/ Low Input RNA Prep Kit (New England Biolabs, Ipswich, MA) followed by PCR amplification. The amplified cDNA is converted into SMRTbell templates using the PacBio SMRTbell Express Template Prep Kit 2.0 for sequencing on the Sequel System. Sequencing was performed at the University of Minnesota Genomics Center Facility (Minneapolis, MN).
The polished high-quality FASTA file obtained from Iso-seq3 was aligned to pennycress version 2 (T_arvense_v2) with min-imap2 (Li, 2018). The resulting SAM file was sorted and collapsed using the cDNA_Cupcake package to obtain an input GFF file such that each transcript has exactly one alignment and at most one ORF prediction. Sqanti3_qc.py, part of the SQANTI3 package (Tardaguila et al., 2018), was deployed on the resulting GFF file along with the reference genome in the FASTA format and a GTF annotation file. This returned a reference corrected transcriptome, transcript-level and junction-level files with structural and quality descriptors, and a QC graphical report. Among the splice junction sites, SQANTI3 defines canonical junctions such as AT-AC, GC-AG and GT-AG, whereas all others are classified as non-canonical splice junctions.

Linkage disequilibrium analysis
Linkage disequilibrium (LD) among genome-wide markers and chromosome-specific markers was calculated with TASSEL v5.2.75 (Bradbury et al., 2007) with a sliding window size of 40 markers with 100 734 460 total comparisons. The r-squared values obtained via the linkage disequilibrium function in TASSEL were plotted against the physical distance with a LOESS curve fitted to the data to show LD decay ( Figure S14).

Bulked-segregation sequencing and MutMap analysis
Bulked-segregant analysis (BSA) (Michelmore et al., 1991) coupled with whole-genome sequencing (BSA-Seq) was performed to locate genomic region harbouring the gene responsible for the pale mutant phenotype in pennycress (Figure 5d). Two pools were created with one pool containing leaf tissue from 20 individual pale mutants and the other pool consisting of wild-type individuals that did not exhibit the pale phenotype. DNA was extracted from fresh pennycress leaves using the DNeasy Plant Mini Kit (Qiagen, Valencia, CA). Both pools were sequenced on an Illumina HiSeq 2000 instrument using 2 9 125 base-paired reads at the University of Minnesota Genomics Center (Minneapolis, MN). The reads were analysed using the MutMap pipeline (Sugihara et al., 2020), and the QTL region was surveyed for candidate genes.

Conflicts of interest
The authors declare potential competing interests as intellectual property applications have been submitted on some of the genes discussed in this study.

Author contributions
RC, AN, KF and CB conceived the study. RC and AN led the genome assembly and evaluation, assisted by IRA and PCB. IRA performed the comparative genomics analysis of synteny during genome rescaffolding and in the final evaluation. AN led the genome annotation and performed analysis for protein-coding genes, non-coding genes (tRNA, rRNA, snoRNA) and pseudogenes. ACG performed small RNA library sequencing, annotation and analysis, supervised by DW. PZ and ACG performed the transposable element annotation, supervised by DW and MM. AN performed the gene expression analysis and evaluation of tissue specificity. CB and KJ provided PCR-free libraries. RC performed k-mer analysis for genome estimation. KF and RC provided the CCS libraries, which were prepared by the UMGC. CB and IRA provided the DNA methylation libraries and analysis. RC, ZT, MDM and KF developed linkage mapping populations, designed primers, performed genotyping and built genetic maps. KF, RC and KD generated resources for Hi-C, Bionano and resequencing of accessions. KF and ZT phenotyped resequenced accessions. RC performed SNP analysis of resequenced datasets. ZT performed the linkage disequilibrium decay analysis. AB performed population genomics. RC and BJ prepared samples for Iso-seq libraries. RC and ZT performed gene structure variation analysis. RC and MDM performed bulk-segregant analysis. The PacBio CLR library was prepared and sequenced by PCB and AN under the guidance of CL. DR prepared and sequenced mRNA-seq libraries. RC, AN, CB, ACG, IRA and ZT wrote the manuscript. All authors reviewed and approved the manuscript.

Data availability statement
The assembly and all NGS-based raw data are deposited in the ENA Sequence Read Archive Repository (www.ebi.ac.uk/ena/) under study accession number PRJEB46635. The summary of data provided by each institute and corresponding application is described in Table S10.

Supporting information
Additional supporting information may be found online in the Supporting Information section at the end of the article. Figure S1 Karyotype plot of the seven largest scaffolds representing chromosomes in T. arvense MN106-Ref (T_arvense_v2), alongside a concatenation of all minor scaffolds. Figure S2 Integrative Genome Viewer (IGV) snapshot of PacBio read coverage (top track) over the largest seven scaffolds of the genome, including distributions of genes (middle track) and transposable elements (bottom track). Figure S3 Sequence dot plots showing the largest seven scaffolds of the closely-related species E. salsugineum and their equivalent in T. arvense var. , comparing the difference both (a) before and (b) after rescaffolding. Figure S4 Synteny analysis between the largest seven scaffolds of T. arvense var.  and (a) their equivalent in the closely-related species E. salsugineum (Es), and (b) A. thaliana (At). Figure S5 The cumulative distribution of annotation edit distance (AED) scores from the final set of protein-coding loci, denoting that~95% of annotated genes are supported with a score ≤ 0.5 overall. Figure S6 An overview of annotated genomic feature distributions in comparison to T_arvense_v1 for (a) gene lengths, (b) CDS lengths, (c) per gene exon number, and (d) intron lengths. Figure S7 Small RNA (sRNA) annotation in the T_arvense_v2 genome assembly. Figure S8 Predicted miRNAs in the T_arvense_v2 genome assembly. Figure S9 Relative expression level of novel and conserved miRNA families between tissue types. Figure S10 sRNA types and their association with different genomic features. Figure S11 Methylation rate frequency distribution by sequence context in shoot and root tissues. Figure S12 Map showing original sampling sites of pennycress accessions used for resequencing analysis in this study. Figure S13 Structure plot showing inferred population membership for SNP data (top) and Indel data (bottom) at k = 3 for the resequenced accessions. Figure S14 Genome-wide linkage disequilibrium decay plotted against physical distance for MN106-Ref (T_arvense_v2) at an rsquared value of 0.2 and chromosome level LD decay described in the right. Linkage disequilibrium (LD) was calculated using 2 518 379 genome-wide markers with a sliding window of 40 markers. Figure S15 Synteny between T_arvense_v2 (x-axis) and YUN_-Tarv_1.0 (y-axis). Figure S16 Read length distribution of trimmed PacBio Sequel II HiFi CLR reads taken forward for assembly with Canu v1.9. Figure S17 Distribution of PacBio Sequel II HiFi CLR read mapping depth frequency over assembled contigs, with bimodal peaks due to contig regions with lower depth than the average indicating that they are duplicated. Table S1 Estimation of the genome size of T. arvense using flow cytometry with Arabidopsis thaliana, tomato (Solanum lycopersicum), and maize (Zea mays) as references. Table S2 Full descriptive statistics for intermediate versions of the assembly starting with correction, trimming and initial assembly of PacBio reads (Canu), further polishing and scaffolding using optical maps and contact maps (Bionano + HiC), and the final version following manual curation and rescaffolding with the help of genetic linkage and synteny maps (ALLMAPS).   Table S6 BUSCO statistics on (a) initial assembly, immediately after CANU, and (b) final assembly. Both are derived from orthologs to the Eudicotyledons odb10 database. Table S7 Merqury k-mer (k = 21) analysis of Illumina HiSeq reads sequenced from the accession in YUN_Tarv_1.0, showing greater QV scores in T_arvense_v2 for the equivalent top 7 scaffolds based on k-mers found uniquely in each assembly and those shared with the read set. Table S8 Merqury k-mer (k = 21) analysis of Illumina HiSeq reads (PCR-free) sequenced from the accession MN106-Ref, showing greater QV scores in T_arvense_v2 for the equivalent top 7 scaffolds based on k-mers found uniquely in each assembly and those shared with the read set.

Table S9
Merqury k-mer (k = 21) analysis of each total assembly showing relative completeness of k-mers present in each read set from Illumina HiSeq. Table S10 Summary of data provided by each institute and corresponding application. Appendix S1 Manual curation of predicted miRNAs. Data S1 Normalized read counts for the genes expressed in each of the tissues analysed (See excel file). Tau values are incorporated in each of the genes to highlight the specificity. Data S2 Top 30 most-expressed genes in each tissue, relative to the mean across all tissues, from the subset of genes with a high/ absolute tau specificity score. Data S3 Location of SNPs and the primers used in the genotyping of EMS-based population for development of linkage map. Data S4 Genetic map developed using an F2 population derived from MN106 and Ames32867. Data S5 Genetic map developed using an F2 population derived from MN106 and 2019-M2-111. Data S6 Phenotypes, total reads, and coverage associated with the accessions used for GWAS.