Long-read sequencing reveals widespread intragenic structural variants in a recent allopolyploid crop plant

Genome structural variation (SV) contributes strongly to trait variation in eukaryotic species and may have an even higher functional significance than single nucleotide polymorphism (SNP). In recent years there have been a number of studies associating large, chromosomal scale SV ranging from hundreds of kilobases all the way up to a few megabases to key agronomic traits in plant genomes. However, there have been little or no efforts towards cataloging small (30 to 10,000 bp) to mid-scale (10,000 bp to 30,000 bp) SV and their impact on evolution and adaptation related traits in plants. This might be attributed to complex and highly-duplicated nature of plant genomes, which makes them difficult to assess using high-throughput genome screening methods. Here we describe how long-read sequencing technologies can overcome this problem, revealing a surprisingly high level of widespread, small to mid-scale SV in a major allopolyploid crop species, Brassica napus. We found that up to 10% of all genes were affected by small to mid-scale SV events. Nearly half of these SV events ranged between 100 bp to 1000 bp, which makes them challenging to detect using short read Illumina sequencing. Examples demonstrating the contribution of such SV towards eco-geographical adaptation and disease resistance in oilseed rape suggest that revisiting complex plant genomes using medium-coverage, long-read sequencing might reveal unexpected levels of functional gene variation, with major implications for trait regulation and crop improvement.


Summary 23
Genome structural variation (SV) contributes strongly to trait variation in eukaryotic species 24 and may have an even higher functional significance than single nucleotide polymorphism 25 (SNP). In recent years there have been a number of studies associating large, chromosomal 26 scale SV ranging from hundreds of kilobases all the way up to a few megabases to key 27 agronomic traits in plant genomes. However, there have been little or no efforts towards 28 cataloging small (30 to 10,000 bp) to mid-scale (10,000 bp to 30,000 bp) SV and their impact 29 on evolution and adaptation related traits in plants. This might be attributed to complex and 30 highly-duplicated nature of plant genomes, which makes them difficult to assess using high-31 throughput genome screening methods. Here we describe how long-read sequencing 32 technologies can overcome this problem, revealing a surprisingly high level of widespread, 33 small to mid-scale SV in a major allopolyploid crop species, Brassica napus. We found that 34 up to 10% of all genes were affected by small to mid-scale SV events. Nearly half of these 35 SV events ranged between 100 bp to 1000 bp, which makes them challenging to detect using 36 short read Illumina sequencing. Examples demonstrating the contribution of such SV towards 37 attributable to the longer read lengths for these two genotypes (N50 = 27,139 bp and 28,916 112 bp, respectively) ( Figure 1A). The largest SV event (34,848 bp) was also detected in the 113 spring-type accession N99, suggesting that read length plays a critical role in the ability to 114 detect large and complex SV events. Around half of all detected, high-confidence SV events 115 (46.8 to 53.2 % across the 12 genotypes) ranged in size from ~100-1000 bp (Supplementary 116 Table S2). These small SV represent a novel genetic diversity resource that was previously 117 unnoticed due to the insufficient resolution of high-throughput genotyping platforms such as 118 SNP genotyping arrays and a very high false-positive rates (up to 89%) of short-read 119 sequencing data (Mahmoud et al., 2019;Sedlazeck et al., 2018). 120

Subgenomic differences in SV frequency 122
Comparison of subgenomic SV frequency revealed significantly higher numbers of small-to 123 mid-scale SV per megabase in the B. napus A subgenome than the C subgenome in all twelve 124 analysed genotypes ( Figure 6 A and B, Supplementary Table S4). This reflects a 125 corresponding subgenomic bias also observed for large-scale SV in B. napus (Samans et al. 126 2017), this could also be attributable to repeated introgressions from the A genome of B. rapa 127 during the breeding history of B. napus (Lu et al., 2019). Samans et al. (2017) table S6). This indicates that a different molecular mechanism may be responsible for the 132 generation of large and small to mid-scale SV events in the rapeseed genome. Unexpectedly, 133 we found that between 5% (Express 617) and 10% (No2127) of all genes detected in the 134 twelve accessions were affected by small to mid-scale SV events. This represents a 135 previously completely unknown extent of functional gene modification as a result of post-136 polyploidisation genome restructuring. It also underlines the massive selection potential 137 arising from intergenomic disruption during the act of allopolyploidisation (Nicolas et al., 138 2007;Nicolas et al., 2008;Szadkowski et al., 2010), and the great significance of post-139 polyploidisation intergenomic restructuring for polyploid crop evolution (Samans et al., 140 2017). 141 142 Small to mid-scale SV underlining eco-geographical differentiation in B. napus 143 As expected, strong SV differentiation from the winter-type oilseed reference genotype 144 Darmor-bzh was found in the divergent semi-winter and spring ecotypes, and in genetically 145 distant synthetic B. napus accessions R53 and No2127 ( Figure 2). Unexpectedly, however, 146 the winter-type accessions Express 617, Tapidor and Quinta also showed high levels of SV 147 compared to Darmor-bzh, despite a related breeding history and partially shared pedigree 148 (e.g. Express 617). According to (Lu et al., 2019), who used whole-genome resequencing 149 data to investigate the species origin and evolution of B. napus, spring and semi-winter types arose only very recently (<500 years) from winter-types. Our data concur with this 151 assumption, with fewer genes carrying SV in winter-type accessions (1072) than in spring 152 (1170) or semi-winter (3663) ecotypes ( Figure 1C). Furthermore, we also detected small to 153 mid-scale SV within each ecotype, for example 1272-1887 genes carrying unique SV events 154 were found among the four semi-winter accessions ( Figure 1D). The unexpectedly high 155 structural gene diversification both between and within ecotypes suggests that de novo 156 generation of small to mid-scale SV may also be ongoing in recent breeding history. Overall, 157 4590 of the called intragenic SV were common among the four B. napus forms, indicating 158 putative SV events specific to Darmor-bzh. These could possibly be attributed to errors in the 159 Darmor-bzh reference assembly, however the similar number of unique intragenic SV 160 detected only in semi-winter types (3663) suggests that this frequency is not unexpected in 161 the context of the other results. Repeating the analysis with the concatenated pseudo-162 reference from B. rapa plus B. oleracea gave comparable results (6248 common among all 163 sequenced B. napus forms, 2919 unique to semi-winter ecotypes). 164 To evaluate the influence of SV on eco-geographical adaptation and potential species 165 diversification, we constructed a maximum likelihood (ML) tree for the 12 B. napus lines 166 based solely on SV detected using long read sequencing data. The resulting tree ( Figure 1B) 167 comprised 3 divergent clades representing 3 ecotypes of B. napus (winter, semi-winter and 168 spring). In contrast to genetic clustering based on genome-wide SNP data, which reveals high 169 sequence diversification between synthetic and natural B. napus (Bus et al., 2011), the two 170 synthetic accessions R53 and No2127 did not fall into separate clades. Instead, the winter-171 type R53 clustered closest together with the natural winter-type accessions and the spring-172 type No2127 clustered with the natural spring-type accessions. This suggests that small to 173 mid-size SV events originating during or immediately after allopolyploidisation might 174 rapidly confer ecogeographical adaptation. Although hundreds to thousands of genes carrying 175 unique SV events were detected in each individual accession, the intriguing observation that 176 their cumulative clustering reflects ecogeographical adaptation forms suggests a possible key 177 role of SV in rapid functional adaptation. Overall, the distribution and frequency of SV 178 events in all investigated accessions suggest that small to mid-scale SV may be a major, 179 previously unknown source of functional genetic variation in B. napus. 180 Unfortunately, a catalogued and validated "truth set" of genomic SV is not yet established for 181 B. napus or other complex plant genomes. This makes it crucial to validate SV predicted 182 from long reads using independent validation methods. On the other hand, manual 183 verification of thousands of SV events (for example using PCR) is not realistic. To obtain 7 first insight into the validity of the SV called using our pipeline, we selected relevant, 185 potentially functional examples representing possible functional mutations in flowering-time 186 and disease resistance-related genes. We validated the detected SV events using different 187 independent methods in a total of 4 B. napus genotypes including two springs, one winter and 188 a synthetic. and other small to mid-scale forms of SV could not be determined from previous, short-read 198 resequencing data . Using long-read data, we found that 44 of 178 199 flowering-time pathway genes, including numerous key regulatory genes, contain one or  flanking sequences, is shown in Figure 3B. The same insertion was undetectable using only 205 the short read sequence-capture data of Schiessl et al. (2017). In two out of three spring 206 accessions, N99 and PAK85912, we detected a 2.8 kbp insertion in a B. napus orthologue of 207 the key vernalisation regulator Flowering Locus C (BnFLC.A02, BnaA02g00370D), a variant 208 previously reported by Chen et al. (2018) to be causal for early flowering. 209 In a second case study, we analysed SV events in key vernalisation genes that differentiate 210 between the vernalisation-dependent and vernalisation-independent B. napus accessions in 211 our panel. A number of interesting, putative functional variants were detected. For example, 212 we detected a 288 bp deletion ( Figure 5) in all the spring and semi-winter accessions (except 213 for ZS11) in BnFT.A02 (BnaA02g12130D). This FT ortholog on chromosome A02 has been 214 reported to be significantly associated with flowering-time variation in a worldwide 215 collection of rapeseed accessions . BnFT.A02 was also found to be 216 differentially expressed among winter, spring and semi-winter type B. napus by Wu et al. 8 (2019), therefore we scanned for SV in the putative promoter region for this gene. We

Implications of long-read sequencing technologies for discovery of functional diversity 248
Of nine additional SV events we evaluated using PCR, all showed the expected PCR products 249 corresponding to the deletions or insertions predicted by the long-read SV calling. These 250 results underline the apparent effectiveness of long sequence reads for accurately detecting 251 and anchoring insertions/deletions in a broad size range from under 100 bp up to multiple 252 kbp. In contrast, Illumina short reads from regions corresponding to insertions not present in 253 available reference genomes remain un-aligned in alignment-based resequencing approaches, 254 meaning that their genomic localization using short-read data can be achieved only by whole 255 genome de novo assembly. Our results in B. napus showed that de novo SV events appear to 256 occur at an unexpectedly high rate. Hence, it remains unclear how many high-quality 257 reference genomes will be necessary to construct a representative pangenome that captures 258 the majority of the genome-wide functional SV landscape. 259 This study provides one of the very first insights into genome-wide, gene scale SV linked to  (Supplementary table S8) gives us high confidence that SV predicted using medium-coverage 267 long-read data with our calling strategy are genuine. This provides a relatively cost-effective 268 method to assay larger germplasm collections without ascertainment bias. 269 The occurrence of SV events in a size range corresponding to intragenic rearrangements 270 (~100-1000 nt) has been ignored in most crop species in the past, due to the limited 271 resolution of short-read resequencing. Although presence-absence calling from genome-wide 272 SNP array data has been successful in isolated cases in establishing QTL associations (e.g. or homoeologous exchanges (e.g. Grandke et al. 2016) are therefore likely to ignore 277 potentially functional SV events. Reduced costs, considerably improved read accuracy and 278 significantly increased average read lengths today make long-read sequencing technologies a 279 viable option not only for accurate assemblies of complex plant genomes (Belser et al., 280 2018), but increasingly also for genome-wide resequencing. Our results suggest that simple 281 reference-based resequencing and alignment with long reads can uncover a new dimension of 282 genetic and genomic diversity associated with important traits in crop plants. Particularly in 283 polyploid plants (Schiessl et al., 2019), this may lead to discovery of previously unknown 284 levels of functional diversity of major interest for breeding and crop adaptation. 285 286

DNA isolation for Oxford Nanopore Technology (ONT) sequencing 292
High molecular weight DNA was isolated using DNA isolation protocol modified from 293 Mayjonade et al. (2016). Young leaves were harvested from rapeseed plants at 4-6 leaf stage 294 and flash frozen using liquid nitrogen. Frozen leaf material was ground to fine powder using 295 a mortar and pestle and transferred to 15 ml Falcon tube. 4-5 ml of pre-heated lysis buffer 296 (1% w/v PVP40, 1% w/v PVP10, 500 mM NaCl, 100mM TRIS pH8, 50 mM EDTA, 1.25% 297 w/v SDS, 1% (w/v) Na 2 S 2 O 5 , 5mM C 4 H 10 O 2 S 2 , 1 % v/v Triton X-100) was added in order to 298 disrupt the cell wall. The lysate was incubated for 30 minutes at 37°C in a thermomixer. 0.3 299 volumes of 5M Potassium Acetate was added to the lysate and spun at 8000g for 12 minutes 300 at 4°C to precipitates sodium dodecyl sulfate (SDS) and SDS-bound proteins in order to 301 obtain clean DNA. Finally, magnetic beads were used to recover cleaned DNA. 302 303

Library preparation for ONT sequencing 304
Between 1-3ug of DNA was used to prepare the sequencing library, using the ligation 305 sequencing kit SQK-LSK108 or SQK-LSK109 according to the manufacturer's 306 recommendations. Genomic DNA was subjected to end repair followed by a bead cleanup. 307 Sequencing adaptors were then ligated to the end-repaired DNA. Finally, the adaptor ligated 308 DNA was once again subjected to bead cleaning. DNA was finally loaded onto an Oxford 309 Nanopore MinION flow cell for sequencing. 310

Alignment and SV calling for ONT data 318
Raw fast5 files obtained by the MinION device were base-called using ONT provided base-319 caller, Albacore. Raw, uncorrected reads from various flow cells were combined into single 320 fastq file for each genotype. This fastq file was used to align the Nanopore reads to the 321 publically available B. napus reference genome assembly Darmor-bzh v4.1 (Chalhoub et al., 322 2014) and also to a concatenated pseudo-reference assembly comprising the B. rapa and B. 323 oleracea reference assemblies recently published by Belser et al. (2018), using NGMLR 324 version 0.2.7 (Sedlazeck et al., 2018) with default settings except for "-x ont" flag, 325 representing parameter presets for ONT. NGMLR produced an un-sorted SAM file as an 326 output, which was converted to a sorted BAM file using Samtools version 1.9 (Li et al., 327 2009). Genomic variants were called using Sniffles version 1.0.10 (Sedlazeck et al., 2018) 328 using the preset parameters. 329 330

Alignment and SV calling for PacBio data 331
Since 8 PacBio libraries contained nearly 70-80 Gbp of sequencing data, we randomly 332 selected 50 Gbp of data for further analysis in order to obtain quantitatively comparable data 333 to the Nanopore sequencing. This 50 Gbp of data was then aligned as per section 1.4.1 to the 334 publicly available B. napus reference and also to the concatenated pseudo-reference 335 assembly, using NGMLR version 0.2.7 with default settings. NGMLR produced an un-sorted 336 SAM file as an output, which was converted to a sorted BAM file using Samtools version 337 1.9. Genomic variants were called using Sniffles (version 1.0.10) using the preset parameters. 338 339

Quality filtering of the predicted SV events for both ONT and PacBio datasets 340
We performed a very stringent quality filtering on the sniffles predicted SV events. Since the 341 study was focused on small scale insertions or deletions, we removed all predicted 342 translocations and duplications. Furthermore, it is nearly impossible to validate the 343 authenticity of such SV events, as many may represent mis-positioning of genomic fragments 344 in the reference assembly, we only considered SV scored as "PASS" by Sniffles and ignored 345 those scored as "UNRESOLVED". Sniffles reports SVs with both within-alignment (AL) and 12 split-read (SR) information. AL-type SV are usually small indels that can be spanned within a 347 single alignment, whereas large or complex events lead to SR alignments (Sedlazeck et al., 348 2018). To ensure only the high confidence SV were selected, all SV which were not 349 supported by a "within-alignment: AL" flag were discarded. This might lead to an under-350 estimation and bias in the size distribution of the detectable SV. However, at this point of 351 time the accuracy of publically available genome from B. napus is not high enough to 352 distinguish large and complex SV events from assembly errors. 353 354

Calculation of overlap between SV events and the gene models 355
Quality filtered SV events were overlapped with the gene models from Darmor-bzh and also 356 to the combined B. rapa and B. oleracea reference assemblies using Bedtools intersect 357 (Quinlan and Hall, 2010) using the default parameters. In order to calculate the genome wide 358 frequency of SV events, we also overlapped the quality filtered SV with a bed file containing 359 1 Mbp windows for the entire genome assembly. The intersect file between the SV events 360 and 1 Mbp windows for the entire genome assembly was then used for plotting the SV 361 distribution along 19 B. napus chromosomes, using Circos (Krzywinski et al., 2009). 362 Statistics including length and distribution of quality-filtered SV from the 12 genotypes were 363 calculated with SURVIVOR (Jeffares et al., 2017) and plotted with ggplot2 (Wickham, 364 2016). 365 366

Construction of a Maximum Likelihood (ML) tree 367
SV events predicted for each of the 12 genotypes were merged into a single variant calling 368 file (vcf). This combined vcf was then used to force call all the SV events across all 12 369 genotypes using Sniffles, resulting in a multi-sample vcf. The multi-sample vcf was then 370 converted into PHYLIP format using an in house bash script and used as an input for IQ-371 TREE version 1.6.12 (Nguyen et al., 2015). The best-fit substitution model for the data was 372 determined by IQ-TREE ModelFinder (Kalyaanamoorthy et al., 2017)

Selection of SV events for PCR validation 377
We looked at two different agronomically interesting traits in order to prioritize the predicted 378 SV events. Firstly, we analyzed the SV events that might contribute to Verticillium 379 longisporum (VL) resistance, using a bi-parental double-haploid population derived from a 13 cross between our sequencing panel genotypes Express 617 and R53. Two QTL were defined 381 for VL resistance on chromosome C01 and C05 by Obermeier et al. (2013). We mainly 382 focused on C05 QTL, as this was described to be the major genetic control for VL resistance. 383 The genetic map used for identifying C05 QTL was based on SSR (Simple Sequence 384 Repeats) and AFLP (Amplified Fragment Length Polymorphism) markers. Therefore, in 385 order to localize the physical position of the QTL on chromosome C05, we anchored the 386 flanking SSR markers (BRMS030_210 and Na12C01_160) to the Darmor-bzh version 4.1 387 assembly and identified a 4.3 Mbp (6,329,426 bp to 10,659,726 bp) region containing 606 388 genes. 37 and 45 out of the 606 genes were found to contain SV in the form of insertions or 389 deletions in Express 617 and R53 respectively. 17 genes were found to be common among 390 both the genotypes, so were dropped from the prioritized gene set. We further prioritized the 391 candidate genes, if they were annotated as defense response or phenolpropanoid pathway 392 genes. Secondly, we analyzed the SV located within the genes described to be involved in 393 flowering time pathway in B. napus as described by Schiessl et al. (2017). Top prioritized SV 394 were then visualized in IGV viewer (Robinson et al., 2017) and selected for PCR validation. 395 Authorship 400 HSC, HTL and RJS conceived the study. HSC, STNA and IAPP generated the Oxford 401 Nanopore long-read sequence data. JS, KL and LG contributed PacBio long-read sequence 402 data. SVS contributed Illumina sequence capture data. HSC, STNA and HTL conducted the 403 experiments and analysed the data. IG, CO, RJS and HTL provided ideas and suggestions for 404 data analysis. HSC and RS drafted the manuscript. 405