A chromosome-scale assembly of allotetraploid Brassica juncea (AABB) elucidates comparative architecture of the A and B genomes

Brassica juncea (AABB; genome size ∼920 Mb), commonly referred to as mustard, is a natural allopolyploid of two diploid species – B. rapa (AA) and B. nigra (BB). We report a highly contiguous genome assembly of an oleiferous type of B. juncea variety Varuna, an archetypical Indian gene pool line of mustard, with ∼100x PacBio single-molecule real-time (SMRT) reads providing contigs with an N50 value of >5Mb. Assembled contigs were corrected and scaffolded with BioNano optical mapping. Three different linkage maps containing a large number of GBS markers were developed and used to anchor scaffolds/contigs to the 18 linkage groups of B. juncea. The resulting chromosome-scale assembly is a significant improvement over the previous draft assembly of B. juncea Tumida, a vegetable type of mustard. The assembled genome was characterized for transposons, centromeric repeats, gene content, and gene block associations. Both A and B genomes contain highly fragmented gene block arrangements. In comparison to the A genome, the B genome contains a significantly higher content of LTR/Gypsy retrotransposons, distinct centromeric repeats and a large number of B. nigra specific gene clusters that break the gene collinearity between the A and the B genomes. The genome assembly reported here will provide a fillip to the breeding work on oleiferous types of mustard that are grown extensively in the dry land areas of South Asia and elsewhere.


Introduction
Genus Brassica contains some of the most important oilseed and vegetable crops that are grown worldwide. Nagaharu The long-read third generation technologies such as SMRT sequencing by PacBio and Nanopore sequencing combined with long-range scaffolding technologies have significantly improved genome assemblies providing higher levels of contiguity and more extensive coverage of repeat sequences, transposable elements, centromeric and telomeric regions 6,7 .
Recent genome sequencing efforts have relied exclusively on high coverage SMRT sequencing, both for the de-novo genome assemblies and for improving earlier assembled draft genomes that used the second-generation technologies [8][9][10][11][12][13] . Amongst the Brassica species -assembly of B. rapa Chiifu has been improved by additional data from ~57x SMRT sequencing, BioNano optical mapping, and Hi-C reads 14 . Highly contiguous de-novo assemblies have been generated for two new lines of B. rapa and B. oleracea by Nanopore long-read sequencing and optical mapping 15 .
We report here reference genome assembly of an oleiferous type of B. junceavariety Varuna, an archetypical line of the Indian gene pool of mustard, with ~100x SMRT coverage using PacBio Sequel chemistry and long-range scaffolding with BioNano optical mapping. To facilitate and validate the assembly of the two constituent genomes -A and B of B. juncea, a draft genome of an Indian gene pool line of B. nigra (variety Sangam) was assembled using Illumina shotgun sequencing and gap filling with Oxford Nanopore long-read sequences. Sequence assemblies were further validated with three new genetic maps containing a large number of GBS markers.
The assembled genome of B. juncea variety Varuna, consisting of 58 scaffolds and nine contigs assigned to 18 pseudochromosomes, is a significant improvement over the previous assembly of B. juncea variety Tumida 5 -particularly, for the B genome. The Tumida assembly (Version 1.1) consisted of 10,581 scaffolds with N50 of ~1.5Mb; in the Varuna assembly N50 of the scaffolds is ~33. 6 Mb. Further, the centromeric and telomeric regions were identified in most of the 18 pseudochromosomes. We compare the architecture of the A, B and C genomes and highlight both similar and unique features of the B genome vis-à-vis the A and C genomes.

Genome assembly
To assemble the genome of B. juncea (AABB, genome size ~922Mb 5 ), three different experimental activities were initiated concurrently. In the first activity, high molecular weight DNA isolated from B. juncea variety Varuna was subjected to SMRT sequencing on the PacBio RSII platform. A total of 9,735,857 reads with an N50 value of ~15.5 kb were obtained − providing ~100x coverage of the genome (Supplementary Table 1). The reads were assembled into 1,253 contigs with a N50 value of ~5.7 Mb using Canu assembler (Table   1). B. juncea genome was also sequenced on an Illumina HiSeq platform to obtain ~40x coverage of the genome (Supplementary Table 2). Around 98.5% of the short-reads could be mapped to the SMRT sequencing based contigs, indicating that the long-read sequences had covered most of the genomic regions. In the second activity, a doubled haploid (DH) line of B. nigra (BB) variety Sangam, genome size ~591 Mb 5 , was sequenced using Illumina HiSeq platform. Around 110x coverage was obtained by sequencing paired-end (PE) libraries of three different insert sizes (Supplementary Table 3a). An additional 2.8 Gb sequence data were obtained from three different mate-pair (MP) libraries (Supplementary Table 3b). The paired-end reads were assembled to obtain 57,941contigs with an N50 value of ~34.6 kb.
These contigs were scaffolded using ~14x Oxford Nanopore long-read sequence data (Supplementary Table 3c found to belong to the B genome. The remaining contigs were checked against the draft genome sequence of B. rapa 2 , and 505 contigs were found to be specific to the A genome; 57 contigs could not be specified to either of the two genomes. Around eighteen mis-assemblies were noticed − these had mostly arisen due to the coalescence of highly conserved syntenic regions. These mis-assemblies were confirmed by the high-density genetic map of VH population. Scaffolding was carried out by hierarchical mapping of the 1,253 SMRT contigs on three different optical maps. Two maps were developed with NLRS based labeling using BssSI and BspQI enzymes and one with DLS labeling using DLE-1 enzyme (Supplementary Table 8; Supplementary Fig. 5). Corrections were made at each mapping stage, and corrected sequences were mapped again on the consensus maps. A total of 45 contigs were identified with mis-assemblies. These included the eighteen mis-assemblies that were identified earlier.
These were curated by breaking the mis-assembled regions and aligning the edited contigs again to the consensus maps. The final assembly, after corrections and multiple rounds of scaffolding, was constituted of 58 scaffolds and 714 unscaffolded contigs/fragments. The total size of the assembled scaffolds was ~847.8 Mb with an N50 value of ~33.6 Mb ( Table   1). The unscaffolded contigs/fragments constituted ~36.4 Mb of the genome with an N50 value of ~55.6 kb.
The VH genetic map was used to assign the scaffolds and contigs to 18 linkage groups of B. juncea and also to assess the quality of the assembly. Fifty-one of the 58 scaffolds and nine of the 714 contigs could be assigned to the 18 pseudochromosomeswhich were designated as BjuVA01 -BjuVA10, BjuVB01 -BjuVB08 17 (Figure 1; Supplementary Table 9). Three of the pseudochromosomes were covered with only one scaffold, four with two scaffolds, six with three and the rest with four or more scaffolds and a few contigs -the maximum number being ten for BjuA01. GBS markers, 2,947 in number that mapped on the VH population, were used to compare their position on the genetic map with their physical location on the pseudochromosomes. A very high correlation was observed between the marker order on the genetic map and the physical position of the GBS tags on the pseudochromosomes ( Supplementary Fig. 6). To further validate the assembly, we used the TuV population genetic map, which contained 8,517 GBS markers assigned to the 18 LGs of B. juncea. The order of the dispersed GBS markers on the TuV genetic map was invariably found to be correlated with the order of their sequences on the pseudochromosomes ( Supplementary Fig. 7). Based on the genetic marker data, two more scaffolds could be assigned to the pseudochromosomes (Supplementary Table 9). At the end of all the steps, from the overall genome assembly consisting of 58 scaffolds and 714 contigs/fragments -54 scaffolds and nine contigs could be assigned to the 18 pseudochromosomes; four scaffolds and 705 contigs remained unassigned ( Table 1). The size of the assigned sequences was calculated to be around 841.  Table 10). The coverage achieved for the B genome in the present study was significantly higher -~507.7 Mb, as compared to the earlier reported coverage of

Genome annotation
B. juncea Varuna genome assembly was annotated for three components -transposable elements, centromeres, and genes. Transposable elements (TEs) were identified by structure and similarity using a de-novo prediction approach (Methods for details). This exercise generated a database of 1,590 consensus repeat sequences that were merged with the  Table 14).  Table 15). The predicted genes were validated by comparing these with the non-redundant proteins in the UniProt plant database. This analysis could validate around 93% of the predicted genes. Predicted genes were also confirmed by the Isoseq transcriptome analysis and short-read RNA-seq data generated for B. juncea in some other studies 17,19,20 (Supplementary File 1). A total of 82,008 genes were found to express in one or the other study -38,232 (~79%), 43,776 (~76%) in the A and B genome, respectively.
We also carried out RNA-seq of B. nigra Sangam using RNA isolated from the seedling and  Table 15). This dataset also contains information on the expression status of each of the predicted gene and its closest ortholog in B. rapa Chiifu and the model species At based on protein similarity. Genes were marked based on 24 ancestral genomic blocks (A-X) defined for the At genome 21 (Fig. 1, Supplementary Table 15). We carried out ortholog tagging of the B. juncea Varuna genes with the previously assembled B. rapa Chiifu (V1.5) and B. juncea Tumida (V1.1 and V1.5) assemblies; many mis-assembled regions in the previous assemblies could be identified ( Supplementary Fig. 11 genome-specific genes were identified (Blastp E-value <1e -05 , coverage >80%). No orthologs of these genes were identified in At. A total of 2,937 A genome-specific genes and 12,262 of B genome-specific genes were observed to express in the RNA-seq data of B. juncea and B. nigra ( Supplementary Fig. 12, 13). We checked whether the genome-specific genes were dispersed or present in clusters. A total of 152 gene clusters with ≥ten species-specific genes were identified in the B genome, whereas only 38 such clusters were observed in the A genome.   Supplementary Fig 16). The split between the Arabidopsis and the Brassiceae progenitor  Table 17) and divergence analysis (Supplementary Table 18, Supplementary Fig 15), we assigned all the gene blocks identified in the A and B genomes (Fig 1, Supplementary Table 15) to the three-constituent paleo-genomes LF A , MF1 A , MF2 A and LF B , LF1 B , MF2 B (Fig. 2). As a contiguous genome assembly of the C genome has been reported recently 15 -we also assigned the gene blocks in the C genome to the three constituent paleogenomes LF C , MF1 C , and MF2 C (Fig. 2). The A and C genome homoeologs, besides being less divergent, also show extensive similarity in the gene block arrangements as compared to the A and B genome homoeologs (Fig 3).  Table 19). More contiguous genome assemblies have shown the A, B, and C genomes to be even more fragmented than reported earlier 30 Table   21).

The ancestral intra-paleogenome contiguous gene block associations -A-B, F-G, G-H, T-U, and W-X have been reported to be absent in the A and C genomes of B. rapa, B.
oleracea and B. napus 30 . The new assemblies of the A, B and C genomes show the presence of these block associations in one or the other paleogenomes (Fig. 2, Supplementary Table 19 and 21). Loss of these associations might have occurred pre-or post-b event. Intrapaleogenomes non-contiguous gene block associations are more interesting for understanding the structure of the paleogenomes pre-b event. The previously reported association V-K-L-W-Q-X as such does not exist in any of the paleogenomes, only V-K-L association is common to all the three paleogenomes (Supplementary Table 19

Discussion
We have reported a highly contiguous genome assembly of B. juncea variety Varuna, an oleiferous type belonging to the Indian gene pool of mustard, using long-read SMRT sequencing and optical mapping. The B. juncea assembly reported here has given contig N50 value of >5Mb, comparable to the other recent SMRT based genome assemblies 9-14 . We found recursive optical mapping with three different labeling reactions to be highly useful for correcting mis-assemblies. The final assembly showed a strong correlation with genetic marker-based maps, which were exclusively developed for this study. The assembly is a marked improvement over the previous assembly of B. juncea variety Tumida -particularly, for the B genome.  [38][39][40] , oil and seed meal quality [41][42] , and disease resistance 43 . Several QTL have been mapped in the Indian and East European gene pool lines [38][39][40][41][42][43][44] . The Chinese vegetable types like Tumida constitute another divergent gene pool 5 that contains potentially useful traits like serrated leaves, basal branching 45 , and stem strength.
The highly contiguous genome sequence assembly of B. juncea will allow fine mapping in the QTL regions, identification of the candidate genes and understanding molecular mechanisms underlying some of the critical quantitative traits.

Plant Material
Brassica juncea (AABB) variety Varuna, an oleiferous type belonging to the Indian gene pool of mustard, is one of the most extensively grown mega variety released by the public system in India 33  Albacore software (https://github.com/Albacore).

Assembly of B. juncea and B. nigra genome
Sequencing of B. juncea libraries on PacBio Sequel platform generated > 9 million long reads; these were processed by Canu assembler 47 (V1.4) -"minReadLength" and "minOverlapLength" set at 1000bp "rawErrorRate" set at 0.3 and "correctedErrorRate" at 0.045. Completeness of the assembly was checked by mapping sequences obtained from the Illumina PE libraries on the assembled SMRT based contigs using BWA mem 48 (default parameters).
Sequences generated from the three PE libraries of B. nigra were assembled into contigs with MaSuRcA software 49 . Illumina PE reads were mapped on the raw Nanopore reads using BWA mem; 1,442,476 error corrected Nanopore reads were generated using Pilon program 50 . Scaffolding of the MaSuRcA based contigs with the error corrected Nanopore sequence data was carried out with SSPACE-LongRead.pl script 51 which generated 29,808 scaffolds with an N50 value of 19,834. Further, three rounds of scaffolding were carried out using the 2 kb, 6 kb and 10 kb MP library sequence datasets using SSPACE-STANDARD-3.0.pl script 51 .

BioNano assembly and validation
BioNano optical maps were developed with proprietary kits and protocols provided by RefAligner to identify all molecule overlaps, and three different consensus maps were constructed. All fragments were then mapped back to the consensus maps which were recursively refined and extended. The BioNano IrysSolve module 'HybridScaffold' was used to perform the hybrid assembly between PacBio-SMRT contigs and BioNano-assembled genome maps.

Genetic mapping
Marker-based genetic maps were developed from two

Centromere region specific repeats prediction
Potential centromeric regions were identified from the correlation plots between GBS markers on the genetic maps and physical position of their respective tags on the pseudochromosomes. Python-based scripts were developed to identify kmers of 100bp length with >20 times occurrence in the predicted centromeric regions. The identified kmers were assembled using DNAstar (https://www.dnastar.com/) to get consensus contigs. The assembled contigs were further analysed for their presence in the genome assembly by Blast search analysis, and centromere-specific sequences were selected for both-A and B genomes separately. For ortholog identification, a two-step gene identification pipeline was implemented.

Syntenic block construction and determination of gene fractionation patterns
In the first step, protein sequences from A and B genomes of B. juncea were compared to B.
oleracea (C genome) and At using Orthofinder software 22 . All-against-all sequence comparison was carried out using Diamond software 70 . Orthologous groups were determined using the B genome specific genes which did not cluster with the genes of other analyzed species.

Phylogenetic analysis of gene families
1,482 genes of At, that were present as single copy gene in At (representing all the ancestral gene blocks; A-X) and three orthologs each in the A, B and C genomes were selected for phylogenetic analysis. The amino acid sequences of orthologs were aligned using MUSCLE (v3.8.31) 71 . Poorly aligned regions were trimmed using GBLOCKS 72 (v0.91) and subsequently with PAL2NAL script 73 . Concatenation of the sequences and conversion to Phylip format was performed using in-house developed Perl scripts. Norwik trees were development and Ka/Ks, and omega values were obtained using PAML package 26 .
Divergence time was calculated assuming a mutation rate of 1.5x10 -8 substitutions per synonymous site per year 74 .

Significance of variation between mean Ks values for orthologue divergence (AT-A,
AT-B and AT-C genomes), paralog divergence (within A, B and C genomes) and homoeolog divergence (between A, B and C genomes) were analysed statistically by oneway ANOVA and Tukey's post hoc test; p<0.05 was considered to be statistically significant.