Chromosome‐level reference genome for European flat oyster (Ostrea edulis L.)

Abstract The European flat oyster (Ostrea edulis L.) is a bivalve naturally distributed across Europe, which was an integral part of human diets for centuries, until anthropogenic activities and disease outbreaks severely reduced wild populations. Despite a growing interest in genetic applications to support population management and aquaculture, a reference genome for this species is lacking to date. Here, we report a chromosome‐level assembly and annotation for the European Flat oyster genome, generated using Oxford Nanopore, Illumina, Dovetail OmniC™ proximity ligation and RNA sequencing. A contig assembly (N50: 2.38 Mb) was scaffolded into the expected karyotype of 10 pseudochromosomes. The final assembly is 935.13 Mb, with a scaffold‐N50 of 95.56 Mb, with a predicted repeat landscape dominated by unclassified elements specific to O. edulis. The assembly was verified for accuracy and completeness using multiple approaches, including a novel linkage map built with ddRAD‐Seq technology, comprising 4016 SNPs from four full‐sib families (eight parents and 163 F1 offspring). Annotation of the genome integrating multitissue transcriptome data, comparative protein evidence and ab‐initio gene prediction identified 35,699 protein‐coding genes. Chromosome‐level synteny was demonstrated against multiple high‐quality bivalve genome assemblies, including an O. edulis genome generated independently for a French O. edulis individual. Comparative genomics was used to characterize gene family expansions during Ostrea evolution that potentially facilitated adaptation. This new reference genome for European flat oyster will enable high‐resolution genomics in support of conservation and aquaculture initiatives, and improves our understanding of bivalve genome evolution.


| INTRODUC TI ON
The European flat oyster Ostrea edulis (Linnaeus, 1758) is a bivalve mollusc within Ostreidae ('true oysters'). This species is a native of Europe, naturally distributed from 65 degrees North in Norway to 30 degrees North in Morocco, along the North-Eastern Atlantic, and also the entire Mediterranean basin (Thorngren et al., 2019).
Introductions of O. edulis in the 19th and 20th centuries for aquaculture resulted in the establishment of natural beds in many regions across the world, including North America, New Zealand, Australia and Japan (Bromley et al., 2016). O. edulis can reach sizes exceeding 20 cm and has a life span of up to 20 years (Bayne, 2017). This species is a protandrous hermaphrodite that can change sex within a spawning season, and unlike the more widely cultured Pacific oyster Crassostrea gigas, brood their larvae in the inhalant chamber for several days before release (Suquet et al., 2018). O. edulis exhibits extensive physiological plasticity across its range, for example, the temperature at which spawning occurs (11-25°C degrees) and the duration of the spawning period (from 1 to 2 months to year-round; Bromley, 2015;Bromley et al., 2016).
Ostrea edulis has been an integral part of human diets in Europe for centuries, with evidence for its collection and consumption since at least Roman times. Furthermore, it is thought >700 million oysters were consumed in London alone during 1864 (Pogoda, 2019).
However, overfishing and anthropogenic activities have driven the collapse of O. edulis stocks throughout its natural range Pogoda, 2019). The past 40 years have witnessed a further decline in production, with a peak of 32,995 tonnes in 1961 dropping by >90% to 3120 tonnes by 2016 (FAO, 2020). Human impacts are widely cited as the primary reason for this decline, including habitat destruction, overexploitation, the introduction of non-native species competing for O. edulis habitats (Grizel & Héral, 1991;Vera et al., 2019), and the emergence/spread of diseases associated with translocations (Bromley et al., 2016). Key parasites associated with flat oyster population declines include the protist Marteilia refringens and the haplosporidian protozoan parasite Bonamia ostreae, which causes bonamiosis, for which no effective control methods exist (Sas et al., 2020). Large-scale restoration efforts exemplified by the Native Oyster Restoration Alliance (NORA; https://norae urope.eu/) are targeting the re-stocking of O. edulis at high densities and developing sustainable populations. However, these efforts are strongly hampered by parasitic diseases, especially bonamiosis (Engelsma et al., 2010;Pogoda et al., 2019). While using animals from Bonamiafree regions offers a potential short-term solution for restoration and aquaculture efforts, understanding the genetic basis for natural parasite resistance (Sas et al., 2020) will enable selective breeding to enhance Bonamia resistance and permanently reduce disease incidence in farmed and wild populations.
Several studies have applied genetic and genomic tools to study O. edulis in the absence of a reference genome assembly. Such work has been strongly targeted towards understanding bonamiosis, either by identifying candidate quantitative trait loci (QTL) and genetic outliers linked to Bonamia resistance (Harrang et al., 2015;Lallias et al., 2009;Vera et al., 2019) or by studying gene expression responses to Bonamia infection (Pardo et al., 2016;Ronza et al., 2018). SNP genotyping arrays with low (Lapègue et al., 2014) and medium (Gutierrez et al., 2017) density have also been developed for genetics applications. The lack of a high-quality reference genome in O. edulis, however, contrasts with the situation in the commercially valuable Pacific oyster C. gigas Qi et al., 2021) and is a current limitation for the research community. An annotated genome for O. edulis will enable genetics research in many directions supporting conservation and aquaculture, revealing the physical location of genetic variation with respect to genes and genomic features and offering an essential foundation for functional genomics.
A reference genome will also support our understanding of O. edulis evolution and environmental adaptation, through comparisons with other bivalve species.
Bivalve genome assembly has classically been hampered by genetic complexities including high heterozygosity and repeat content (Davison & Neiman, 2021), along with the challenge of extracting pure high molecular weight DNA (Adema, 2021). However, recent advances in long-read sequencing technologies have enabled highquality genome sequences to be generated for multiple bivalves, including C. gigas Qi et al., 2021), the scallop Pecten maximus (Kenny et al., 2020) and hard clam Mercenaria mercenaria Song et al., 2021). Here, we integrated multiple sequencing technologies to assemble and annotate a highly contiguous chromosome-level genome assembly for an O. edulis individual from the UK, which was confirmed for accuracy by comparison to a novel linkage map for O. edulis, and high-quality genome assemblies for several bivalve species. Comparative genomics inclusive of diverse bivalve species allowed us to define gene copy expansions in the Ostrea lineage. The high-quality reference genome reported here, and an independent O. edulis assembly reported for an individual from a distinct European population in the same issue of this journal by Boutet et al. (2022) (see commentary by Bean et al., 2022), will support ongoing conservation and aquaculture initiatives for the European flat oyster, while improving our comparative understanding of genome evolution and adaptation in the Ostrea lineage.

| Sampling and sequencing
A single unsexed adult O. edulis individual sourced from Whitstable (England, UK) through a commercial supplier (Simply Oysters) was used for all DNA and RNA sequencing performed in this study, as described below. The oyster was depurated in clean seawater for at least 42 h before sampling. Samples of gill, mantle, heart, white muscle, striated muscle, digestive gland, labial palp and gonad were flash frozen using liquid nitrogen and stored at −80°C. High molecular weight DNA was extracted from gill using a cetyltrimethylammonium bromide (CTAB) based extraction method and used to generate short and long-read sequencing libraries. DNA purity was confirmed using a Nanodrop 1000 (Thermo Fisher Scientific). DNA integrity was initially assessed using a Tapestation 4200 (Agilent Technologies). The DNA was purified using Ampure beads (Beckman Coulter™), sheared to a length of ~35 kb using a Megaruptor® (Diagenode) and size selected in the 7-50 kb range on a Bluepippin system (Sage Science) with a 0.75% cassette. The resulting DNA was sequenced on four PromethION flow cells (FLO-PRO002), with base calling performed using Guppy version 3.2.6 + afc8e14. Short-read libraries with an insert size of 350 bp were generated using the same DNA with an Illumina TruSeq DNA library kit, prior to sequencing on an Illumina NovaSeq 6000 by Novogene Ltd (UK) with a paired-end 150 bp configuration.
An Omni-C™ library was generated from gill tissue by Dovetail Genomics (Santa Cruz, USA) and sequenced on an Illumina HiSeq X with a paired-end 150 bp configuration.
For RNA-Seq library generation, total RNA was extracted for the eight tissues using a Trizol-based method, before DNAase treatment. RNA integrity was assessed using agarose gel electrophoresis and Bioanalyszer 2100 (Agilent). RNA purity was confirmed via a Nanodrop 1000 system. Illumina TruSeq mRNA libraries were prepared for each sample and sequenced on an Illumina NovaSeq 6000 with a paired-end 150 bp configuration by Novogene Ltd (UK).

| Genome assembly and scaffolding
Genome size and heterozygosity were estimated using a k-mer approach. The Illumina data were quality assessed using FastQC v0.11.8 (Andrews, 2010), trimmed using TrimGalore 0.4.5 (Krueger, 2015; quality score > 30, minimum length > 40 bp) and processed through Meryl v1.3 (Rhie et al., 2020) to generate a k-mer count database (k = 20), which was used to generate a k-mer histogram. The histogram data were used as an input to Genomescope 2.0 (Ranallo-Benavidez et al., 2020) to estimate genome size and heterozygosity.
The polished contig assembly was scaffolded by Dovetail Genomics using HiRise (Putnam et al., 2016) with the Omni-C™ proximity ligation sequencing data used to orient and link the contigs using 3D contact information. The top 10 super-scaffolds with the HiRise assembly were > 40 Mb and matched the expected O. edulis karyotype (n = 10; Horváth et al., 2013;Leitao et al., 2002;Thiriot-Quiévreux, 1984; Figure 1a). The next two largest scaffolds (scaffolds 11 and 12, respective sizes: 13.5 and 9.4 Mb) were not assigned to one of the 10 super-scaffolds despite their large size, which led us to hypothesize these regions belonged to the 10 superscaffolds, yet had not been scaffolded by HiRise. In support of this hypothesis, visualization of the 3D contact information using Juicebox (Durand, Robinson, et al., 2016) revealed 3D contacts between HiRise scaffold 11 and scaffold 6 and between HiRise scaffold 12 and scaffold 1 ( Figure S1). To confirm these interactions, we repeated contig scaffolding with the Omni-C™ data using Juicer (default parameters; , and the resultant assembly was aligned and compared with the HiRise assembly using QUAST (Gurevich et al., 2013). Visualization of QUAST alignments in Icarus (Mikheenko et al., 2016) confirmed the locations of scaffolds 11 and 12 within super-scaffolds 6 and 1, respectively ( Figure S1).
Manual integration of these scaffolds in the HiRise assembly was performed using Scaffolder (Barton & Barton, 2012). Following this work, super-scaffold 6 became the second largest super-scaffold and was therefore renamed to be super-scaffold 2, and this annotation is used hereafter. The resulting scaffolds were polished for one round using Pilon v1.24, leading to the final assembly used in all downstream work (OE_Roslin_V1).

| Genome quality evaluation
OE_Roslin_V1 was screened for the presence of DNA contamination from other taxa using Blobtools v1.1.1 (Laetsch & Blaxter, 2017b) and for misassembly errors using Inspector v1.0.2 . Structural errors identified in the genome were corrected using the Inspector-correct.py step. The raw nanopore reads were mapped back to the OE_Roslin_V1 assembly using minimap2 (Li, 2018;parameter -ax map-ont) to check for assembly completeness. The genome assembly was compared with a novel linkage map to confirm the accuracy of scaffolding using the chromatin proximity Omni-C™ data (see later section). Assembly quality and efficiency of haplotig purging were evaluated by generating a copy number spectrum plot (tracking the multiplicity of each k-mer in the read set, revealing the number of times it is found in the genome assembly) using Merqury v1.3 (Rhie et al., 2020). Gene completeness was evaluated against a set of 5295 benchmark molluscan orthologous genes (mollusca_odb10) using BUSCO v4.1.4 (Simão et al., 2015). We mapped paired-end Ilumina data from the same individual to the finished genome assembly using the minimap2 (Li, 2018; parameter -ax sr). SAMtools (Danecek et al., 2021) was used to extract mean mapping depth values across the entire genome at 100 kb intervals. GC content across the genome was retrieved using BEDTools v2.29.2 (Quinlan & Hall, 2010) at an interval of 500 kb. The mean mapping depth and GC content data were plotted as a circos plot using the package Circlize 0.4.14 (Gu et al., 2014).

| Genome annotation
De novo repeat prediction was carried out using RepeatModeler v2.0.2 (Flynn et al., 2020). RepeatMasker v4.1.1 (Smit et al., 2015) was used for repeat masking with two databases: (i) RepBase-20,170,127 (Jurka et al., 2005) for Pacific oyster (set using parameters '-s Crassostrea gigas -e ncbi') and (ii) the de novo repeat database generated by RepeatModeler. Gene model prediction was carried out on the repeat masked assembly using Funannotate v1.8.7 (Palmer, 2017) after using the Funannotate clean module. Following this, the RNAseq reads were aligned to the genome using minimap2 v2.21-r1071 (Li, 2018). Protein sequences for C. gigas and C. virginica from the UniProt database were aligned using Diamond v2.0.9 (Buchfink et al., 2021) and the resultant BAM files utilized for gene model prediction. PASA v2.4.1 (Haas et al., 2003) was then used to predict an initial set of high-quality gene models, which were used to train and run Augustus v3.3.32 (Stanke et al., 2006), SNAP (Korf, 2004) F I G U R E 1 OE_Roslin_V1 assembly quality evaluation. (a) Omni-C contact map highlighting the top 10 super-scaffolds generated by HiRise. The contact map was visualized using Juicebox (Durand, Robinson, et al., 2016). (b) Merqury k-mer copy number spectrum plot for the curated genome assembly. Nearly half of the single-copy k-mers (black region) were missing from the heterozygous peak, indicating efficient purging of haplotigs from the final assembly. K-mers missing from the assembly (black region in the homozygous peak) indicates bases present in the Illumina data missing from the assembly. (c) BUSCO scores for the final scaffolded OE_Roslin_V1 assembly (mollusca_odb10 database). (d) Circos map highlighting the concordance between the 10 super-scaffolds (RL1 to RL10) and linkage groups (LG1 to LG10). Blue dotted squares within super-scaffolds 1 and 2 highlight the manual scaffolding performed on the basis of 3D contact information in the Omni-C data ( Figure S1). and GlimmerHMM v3.0.4 (Majoros et al., 2004). 40,283 high-quality gene models were automatically extracted from the ab-initio predictions before passing all the data to EVidenceModeler v1.1.1 (Haas et al., 2008) for a final round of gene model prediction. Gene models <50 aa in length (n = 2), spanning gaps (n = 2) and transposable elements (n = 5330) were filtered by Funannotate before the retained gene models underwent UTR prediction using PASA.
Functional annotation was performed using the annotate step within Funannotate. Interproscan (Jones et al., 2014) was used to annotate predicted gene products against the following databases: Pfam

| Additional validation of manually incorporated scaffolds
As mentioned above, two scaffolds were manually incorporated into the HiRise assembly (also see Results). To confirm the validity of these scaffolds beyond the quality assessments described above, we confirmed the genes present in these regions were: (i) of oyster origin and (ii) showed bioactivity comparable to other regions along the same chromosomes. Firstly, we retrieved the coding sequence of all genes predicted within the manually incorporated and remaining regions of super-scaffolds 1 and 2, which were subjected to BLASTn (Altschul et al., 1997) searches against the Pacific oyster genome (NCBI accession: GCA_902806645.1) and an independent Flat oyster genome . The BLASTn cut-off was <1e-20 with the remaining parameters default. Secondly, RNA-Seq data from the heart, striated muscle and gonad were mapped to the genome assembly using STAR (Dobin et al., 2013) with default parameters. Mean RNA-Seq mapping depth for all gene models along super-scaffolds 1 and 2 was retrieved using SAMtools. Graphs comparing statistics between the manually incorporated and remaining regions of super-scaffolds 1 and 2 were generated using ggplot2 (Wickham, 2016).

| Linkage map construction
Four oyster full-sibling families (n = 171 individuals representing 8 parents and 163 F1 offspring) were used to build a novel linkage map for O. edulis. The families were produced in the Porscave hatchery (Lampaul-Plouarzel, Brittany, France). DNA was extracted from the parents and the offspring using a standard phenol-chloroformisoamyl alcohol (PCI; 25:24:1, v/v) protocol. After two washes with PCI, DNA was precipitated overnight with absolute ethanol at −20°C, centrifuged, washed with 70% ethanol, dried and suspended in PCRgrade water. All DNA samples were run in a 1% agarose 1X TBE gel and quantified using a Qubit fluorometer (Thermo Fisher Scientific) with a high-sensitivity dsDNA quantification kit (Invitrogen) according to the manufacturer's instructions. Double-digest RAD-seq (ddRADSeq) libraries were produced for every sample following Brelsford et al. (2016)  Further quality control (QC) filters were applied to the genotype data in Plink v1.9 (Chang et al., 2015). Markers and individuals with excess missing data (>10%) were discarded. Principal component analysis revealed that seven individuals were separated from their family cluster ( Figure S2). Upon closer inspection, their high levels of Mendelian errors (>100 errors) suggested they had been mislabelled and were therefore removed from the dataset. After QCfiltering, 15,373 SNPs genotyped across 8 parents and 163 offspring were available for the construction of a linkage map using Lep-Map2 and Lep-Map3 (Rastas, 2017;Rastas et al., 2016). Genotype data were converted to genotype likelihoods (posteriors) using the linkage2post script in Lep-Map2. Missing or erroneous parental genotypes were imputed using the ParentCall2 module. SNP markers informative for both parents were assigned to linkage groups (LGs) using the SeperateChromosomes2 algorithm in Lep-Map3 with lodLimit = 11 and distortionLod = 1. Unassigned SNPs were added to the preliminary map using the JoinSingles2All module with lod-Limit = 8, lodDifference = 2 and distortionLod = 1. The ordering of markers within LGs was conducted using the OrderMarkers2 module after filtering markers based on segregation distortion (dataTolerance = 0.01). For each LG, the relative ordering of SNP markers was iterated ten times, and the configuration with the highest likelihood was selected to represent a sex-averaged map for O. edulis. One large gap (>10 cm) was identified and manually removed from the distal end of LG 10.

| Synteny and gene family expansion analyses
Gene level synteny was compared between OE_Roslin_V1 and genome assemblies for a range of bivalve species using an orthogroup-based approach. A list of putative one-to-one orthologues between O. edulis and assemblies for C. gigas (NCBI accession: GCF_902806645.1; Peñaloza et al., 2021), C. virginica (GCF_002022765.2) and P. maximus (GCF_902652985.1; Kenny et al., 2020) were generated using Orthofinder v.2.3.11 (Emms & Kelly, 2019). An independent O. edulis genome assembly generated by Boutet et al. (2022;NCBI bioproject: PRJNA772088) was also included. The genomic coordinates of each gene in the one-to-one orthologue list for any two species under comparison were extracted and circos plots generated using the package Circlize 0.4.14 (Gu et al., 2014).
We inferred gene family expansions in O. edulis building on a published strategy . The start-point was all predicted proteins from the genome assemblies of 16 bivalve species, inclusive of protein sequences predicted in the OE_Roslin_V1 genome (Table S1). The longest isoforms for each protein were retained using AGAT v0.4.4 (Dainat, 2020). These sequences were used to generate orthogroups in Orthofinder v.2.3.11 (Emms & Kelly, 2019). FastTree (Price et al., 2010) was used to infer gene trees per orthogroup, which were compared against the rooted species tree by Orthofinder to infer duplications/losses using a duplicationloss-coalescent model (Emms & Kelly, 2019). Kinfin v1.0 (Laetsch & Blaxter, 2017a) was used to identify orthogroups that showed evidence for gene expansion in O. edulis compared with other bivalves  Functional annotations were summarized based on their counts across all the expanded orthogroups. Protein sequence alignments from selected orthogroups were retrieved and maximumlikelihood phylogenetic trees were generated using IQTREE v1.6.8 (Nguyen et al., 2015) using the best fitting substitution model (Kalyaanamoorthy et al., 2017) and running the ultrafast bootstrapping (Minh et al., 2013) for 1000 iterations to generate branch support value. The trees were then visualized using iTOL online server (Letunic & Bork, 2021).

| Contig assembly and quality evaluation
PromethION sequencing yielded 20,061,494 reads summing to 143.42 Gb of basecalled data with N50 length of 9297 bp ( Figure S3) and mean length of 7149 bp, which was used for contig assembly.
Assuming a haploid genome size of 1.14 Gb following past flow cytometry work involving n = 20 flat oysters sampled from Galicia in Spain (Rodríguez-Juíz et al., 1996), ~120x long-read sequencing depth was achieved, including 26× with reads >15 kb. Around 281 million Illumina short reads (~72× sequencing depth) were used for genome polishing. Around 57.6 million Illumina reads were generated by sequencing the Omni-C™ library, which were used for genome scaffolding. RNA-Seq generated ~50 million Illumina reads per tissue for genome annotation. K-mer-based estimation predicted the O. edulis genome to be 881 Mb, with a repeat content of 437 Mb (i.e. 49.8% of genome) and a heterozygosity rate of 1.02% ( Figure S4).
The Flye assemblies OE_F1, OE_F2 and OE_F3 were 976.2, 1027.5 and 964.2 Mb, respectively. Purging for haplotigs resulted in the removal of 2-3% of data across each assembly (Table S2). The purged Flye assemblies had contig N50 values of 0.43, 0.39 and 0.34 Mb, respectively (Table S2). Thus, OE_F1, which used a minimum overlap of 10,000 bp to generate a contig, had the highest contiguity. The wtdbg2 contig assembly OE-RB1 was 829.1 Mb after purging and had an N50 value of 0.67 Mb (Table S2). All four contig assemblies had a high BUSCO completeness score (~90% complete) compared with the mollusca_odb10 database (Table S2). The final merged and haplotig purged contig assembly OE_contig_v1 was 934.9 Mb with a contig N50 of 2.38 Mb. Two rounds of genome polishing resulted in minor changes to contiguity but increased BUSCO completeness from 89% to 95.2% (Table S2), indicative of a strong positive effect on sequence accuracy.

| O. edulis chromosome-level genome assembly
Scaffolding using HiRise and Juicer led to assemblies of 935.08 and 936.34 Mb with N50 values of 94.05 and 82.94 Mb, respectively (Table S3). As the HiRise assembly was markedly more contiguous, it was taken forward as the basis for the final reference genome.
Based on two lines of 3D contact evidence within the Omni-C data (see Methods), two large scaffolds in the HiRise assembly (scaffolds 11 and 12) were manually inserted into the super-scaffolds of the HiRise assembly. Specifically, scaffold 12 was inserted into superscaffold 1 (at insertion point 65.4 Mb) and scaffold 11 was inserted at the start of super-scaffold 6 ( Figure S1). As noted in the methods, at this stage, super-scaffold 6 was renamed super-scaffold 2 as a product of it becoming the second largest scaffold in the HiRise assembly, maintaining the convention of naming scaffolds according to size (Table S4).  Horváth et al., 2013;Leitao et al., 2002;Thiriot-Quiévreux, 1984). The remaining 59.3 Mb of OE_Roslin_V1 comprises 1353 unplaced scaffolds. The final assembly size matches closely to the k-mer-based genome-size estimate and is slightly larger than other genome assemblies within Ostreidae, which could be due to lineage-specific repeat expansion (see later section).

The final assembly including the two manual corrections (OE_
Detecting and correcting structural errors arising during genome assembly is critical in achieving a high-quality reference genome . Evaluation of the assembly for structural errors identified 1126 (663 expansions, 387 collapses, 76 inversions) putative structural errors when benchmarked against the raw nanopore reads, which were corrected. Assembly screening revealed little contamination from other taxa ( Figure S5). We observed a 97.09% mapping rate of nanopore reads back to the assembly, further demonstrating the accuracy and completeness of the reference genome. A K-mer copy number histogram revealed that haplotig purging was very efficient (Figure 1b). We identified 4865 (91.9%) complete single-copy BUSCO genes and 131 (2.5%) complete duplicated BUSCO genes in the final assembly (Figure 1c).

| Linkage map and assembly validation
The de novo variant calling pipeline called 24,522 SNPs across the ddRAD-Seq dataset. After stringent filtering (see Methods), the finished genetic map contained 4016 SNPs anchored to the ten expected LGs ( Figure S6). We observed an overall high collinearity between these LGs and the OE_Roslin_V1 genome assembly pseudochromosomes (Figure 1d, Figure S7) confirming the accuracy of the scaffolding performed using the Omni-C data, including at the two manual joins we performed within the scaffold_1 and scaffold_2 of the OE_Roslin_V1 assembly (Figure 1d; Figure S7). We observed a potential inversion between LG1 and super-scaffold 1, which was unrelated to the manually scaffolded region ( Figure S7). However, on closer inspection, the Hi-C data were ambiguous in this region ( Figure 1a), with the opposite orientation of this region within the assembly being impossible to exclude, which would then match LG1.  . Note, that this work identified a very similar proportion of repeats (55.1%) using the same bioinformatic pipeline but not all could be confidently annotated.

|
Gene model prediction identified 35,699 coding genes in the masked genome (Table 2) Functional annotation of the predicted proteins resulted in the annotation of 23,109 gene models with EggNOG hits and provided 17,504 gene models with a GO annotation (Table 2). A range of annotate features are plotted along the genome in Figure 2c.

| Additional validation of manually incorporated scaffolds
To confirm the validity of the manually scaffolded regions in superscaffolds 1 and 2, we sought to concretely demonstrate that they belonged to the flat oyster genome. We firstly performed BLASTn (Altschul et al., 1997) searches for all coding genes predicted in these regions against C. gigas  and an independent O. edulis assembly , and compared the results (c) Circos plot highlighting annotated features across the ten super-scaffolds (window size 0.5 Mb except track-v, which is 0.1 Mb). Tracks as follows: i: 10 super-scaffolds OE-1 to OE-10; ii: GC percentage (33-38%), with red and green bars indicating GC >36.5% and <34.5%, respectively; iii: genic content (sum of annotated gene models) expressed as percentage of total window size, regions with <20% genic content are coloured blue, while 20 to 40% are coloured grey and >40% are coloured red; iv: gene density (0-80); v: mean Illumina sequencing depth, with values <45 and >150 shown as red points; vi: classified repeats expressed as percentage of total window size (0-35%); vii: novel unclassified repeat elements expressed as percentage of total window size (0-35%).

F I G U R E 2
with the remaining regions of super-scaffolds 1 and 2 (summarized in Table S5; raw data in Table S6). The proportion and percentage identity of BLAST hits to both oyster genomes were highly comparable for both regions along super-scaffolds 1 and 2. Secondly, RNA-Seq reads (pooled from heart, striated muscle and gonad) mapped with variable depth to approximately 40% of the predicted genes within the manually incorporated regions of super-scaffold 1 and 2 ( Figure S8). The RNA-Seq mapping rate and depth were lower in the manually incorporated regions than in the remaining parts of superscaffolds 1 and 2 ( Figure S8).

| Synteny analysis with other bivalve genomes
Synteny plots of 1-to-1 orthologue gene locations revealed con- For instance, super-scaffold 8 in OE_Roslin_V1, which shares synteny across the length of C. gigas chromosome 4, shares synteny with two major blocks on C. virginica chromosomes 5 and 6 ( Figure 3b).
The synteny relationship between OE_Roslin_V1 and the extensively rearranged P. maximus genome was consistent with that reported between C. gigas and P. maximus . We observed genome-wide synteny between OE_Roslin_V1 and an independently generated assembly for O. edulis , although there were a small number of chromosomal regions where synteny was broken (Figure 3d).

| Gene families expanded during Ostrea evolution
Gene duplication is associated with adaptation during evolution (Ohno, 1970), including in bivalves (Phuangphong et al., 2021;Regan et al., 2021). To gain insights into how gene duplication influenced Ostrea evolution, we identified gene family expansions in OE_Roslin_ V1 by comparison to 15 additional bivalve genomes. 712 gene families showed evidence of expansion (Table S7;   proposed to be involved in the chromosomal organization (Aravind & Koonin, 2000).

| DISCUSS ION
The high-quality, publicly available genome assembly we have generated and annotated for O. edulis serves as a novel reference for Additional resources of value to the research community have been produced and made publicly available, including multi-organ RNA-Seq data, which we used to support gene model prediction and confirm genome assembly quality but in future can be used to explore patterns of tissue gene expression. In terms of assembly quality, the contig N50 we achieved is among the highest of all bivalve assemblies publicly available. This demonstrates the utility of our choice to merge different contig assemblies using Quickmerge (Chakraborty et al., 2016), which has been shown elsewhere to be effective for generating high-quality assemblies in molluscs  and other taxa (e.g. Chen, Nie, et al., 2021;Li et al., 2021;Mathers et al., 2021). Genome-wide sequence accuracy was further evidenced by the high mapping rate of nanopore reads back to the assembly, and the limited number of structural errors in the genome, which was lower than reported for the recent C. gigas reference genome . BUSCO scores for our final O. edulis assembly are in the range of high-quality molluscan genome assemblies published to date (e.g. Sun et al., 2021), indicating an excellent level of gene representation.
Interestingly, our k-mer-based genome-size estimate (881 Mb), which matched closely with our final assembly length (876 Mb), was only ~77% of the 1.14 Gb genome size previously estimated by flow cytometry in a population of Spanish flat oysters (Rodríguez-Juíz et al., 1996). Similar observations have been made for other bivalve genomes, including C. gigas (e.g. Peñaloza et al., 2021). The discrepancy between this past flow cytometry assessment and our own sequencing-based estimates could be partly explained by population differences in genome size, considering the plasticity of genome content within bivalve species (Gerdol et al., 2020). However, this discrepancy cannot be easily explained by an under-representation of repeats in our assembly, considering that >97% of the raw nanopore reads mapped back to the final assembly. Underestimation of genome size can also arise due to high heterozygosity (Liu et al., 2020).
Our heterozygosity rate estimate of 1.02% for O. edulis was within the range reported for other bivalves, including 1.3% in C. gigas (Zhang et al., 2012) and 1.04% in scallop (Patinopecten yessoensis; Wang et al., 2017). This is interesting, as these previous estimates were made using individuals selected for reduced heterozygosity via inbreeding (Zhang et al., 2012) or by using a selfing family (Wang et al., 2017), implying a possible loss of genetic diversity in the O. edulis population we used for sequencing (e.g. a historic bottleneck). By contrast, an outbred C. gigas individual recently sequenced showed a much higher heterozygosity rate estimate of 3.2% .
With regards to our in-house genome annotation, the average gene length obtained (7411 bp; Figure 2c)  These greatly increased gene length statistics demonstrate that our in-house annotation strategy was inefficient in predicting gene models compared with the NCBI RefSeq method, leading to more fragmented and/or partially predicted gene models. However, the annotation we generated still has global utility, considering that we observe extensive 1-to-1 orthologue mapping compared with other genome assemblies (Figure 3), and were able to perform valid comparative genomic analyses both here (i.e. Figures 4 and 5) and in studies that have used our annotation to date (see later paragraph).
However, the higher quality Refseq annotation for our assembly supersedes our in-house annotation and will further enhance future genetics and comparative genomic investigations exploiting our genome as a reference. In the longer term, we anticipate that bivalve genomes will benefit from greatly improved functional annotations that extend beyond gene model prediction, incorporating functional assays defined by the FAANG initiative to identify chromatin state modifications, regulatory elements, noncoding RNAs and isoform diversity .
Our cross-species synteny analysis revealed few major chromosomal reorganizations in the flat oyster genome, consistent with previous reports describing the nearly conserved karyotype across all oysters (Guo et al., 2018). Furthermore, conserved synteny and chromosomal architecture against an independently assembled flat oyster genome assembly , coupled with the general high congruency of the assembled super-scaffolds with linkage groups, further confirmed the global quality of our assembly. Expansions to gene families involved in stress responses during bivalve evolution may reflect adaptation to a filter-feeding sessile lifestyle in a hostile environment (Guo et al., 2018;Hu et al., 2022;Regan et al., 2021). Past work has revealed expansions in gene families encoding heat shock proteins, and families involved in apoptosis inhibition and innate immunity, including C-type lectins and C1q complement domain-containing proteins. The gene family expansions reported here mirror these adaptation strategies, with enrichment in functional annotations for pathogen recognition and inflammatory response, e.g. C-type lectins, complement and immunoglobulin domains. The comparative genomic resources provided here can support future evolutionary analyses of gene families and should prove useful when interpreting the fine mapping of genetic variation around flat oyster genes, for instance, those identified in QTL regions.
Future applications of the O. edulis reference genome reported here, and for an independent genome assembly described for a French O. edulis individual in an accompanying article  will address challenges relating to flat oyster conservation and sustainable aquaculture production (see commentary by Bean et al., 2022). These genomes provide researchers with new tools that empower genetic approaches addressing the ubiquitous threat posed by Bonamia via a range of technologies Potts et al., 2021). In this regard, the genome reported here is proving useful already, with a recent study revealing that SNP markers previously associated with Bonamia resistance (Vera et al., 2019) are located in high linkage-disequilibrium across a large region of superscaffold 8, which contains many candidate immune genes (Sambade et al., 2022). Another recent study has mapped variants genotyped with an existing medium density SNP array (Gutierrez et al., 2017) against our new O. edulis genome, identifying QTLs underpinning variation in growth traits on super-scaffold 4 .
Via its public release with all accompanying raw data, we anticipate a rapid uptake of our genome by the research community and envisage the next steps for the field to include broader surveys of genome-wide diversity covering a global representation of populations . This new phase of genome-enabled biology is like to uncover many secrets on the genetic and functional basis for adaptation and disease resilience in this iconic oyster species.

CO N FLI C T O F I NTE R E S T
The authors declare no conflicts of interest.

DATA AVA I L A B I L I T Y S TAT E M E N T
The genome assembly generated in this study along with all raw sequencing data used in assembly and annotation (Oxford Nanopore