Annual transcriptome of a key zooplankton species, the copepod Calanus finmarchicus

Abstract The copepod Calanus finmarchicus (Crustacea, Copepoda) is a key zooplanktonic species with a crucial position in the North Atlantic food web and significant contributor to ocean carbon flux. Like many other high latitude animals, it has evolved a programmed arrested development called diapause to cope with long periods of limited food supply, while growth and reproduction are timed to take advantage of seasonal peaks in primary production. However, anthropogenic warming is inducing changes in the expected timing of phytoplankton blooms, suggesting phenological mismatches with negative consequences for the N. Atlantic ecosystem. While diapause mechanisms are mainly studied in terrestrial arthropods, specifically on laboratory model species, such as the fruit fly Drosophila, the molecular investigations of annual rhythms in wild marine species remain fragmentary. Here we performed a rigorous year‐long monthly sampling campaign of C. finmarchicus in a Scottish Loch (UK; 56.45°N, 5.18°W) to generate an annual transcriptome. The mRNA of 36 samples (monthly triplicate of 25 individuals) have been deeply sequenced with an average depth of 137 ± 4 million reads (mean ± SE) per sample, aligned to the reference transcriptome, and filtered. We detail the quality assessment of the datasets and provide a high‐quality resource for the investigation of wild annual transcriptomic rhythms (35,357 components) in a key diapausing zooplanktonic species.


| INTRODUC TI ON
The calanoid copepod Calanus finmarchicus (Crustacea, Copepoda) is a key ecological species in the North Atlantic that often dominates zooplankton biomass between ~40° and 80°N (Helaouët & Beaugrand, 2007). Primarily herbivorous, this species has a central position in the food web as it converts sugars from micro-algae (phytoplankton) into energy-rich lipids that sustain higher consumers including marine fish larvae, seabirds, and whales (Beaugrand et al., 2003;Berge et al., 2012;Prokopchuk & Sentyabov, 2006). Its high abundance and primary consumer position also makes it an important contributor to ocean carbon flux (Archibald et al., 2019). Despite this, copepods are among important non-model invertebrates for which genomic resources are still limited, principally due to their large genomes (Bron et al., 2011;Choquet et al., 2019). However, the de novo transcriptome of C. finmarchicus (Lenz et al., 2014) represents the most useful resource to date for molecular investigations in this ecologically important species.
Seasonal adaptations have evolved in zooplankton to deal with annual biotic and abiotic cycles, such as long periods of limited food resources (Baumgartner & Tarrant, 2017;Daase et al., 2021;Lenz & Roncalli, 2019). Copepods within the Calanidae and Eucalanidae families synchronize reproduction and growth with periods of high food availability, during which they accumulate large energetic reserves in sunlit surface waters (Baumgartner & Tarrant, 2017). This follows a migration of copepods into deeper water where they enter a prolonged phase of inactivity referred to as diapause, characterized by a non-feeding state with reduced metabolism lasting 8-10 months. The energy reserves accumulated during the spring production phase sustain the animals during periods of low food supply and are crucial to fuel moulting to adult stages and reproduction in the following year. In C. finmarchicus, the seasonal cycle begins prior to, and during, the spring bloom, when the juvenile preadult copepodid stage CV emerge from diapause, molt into adults, mate and spawn eggs (Baumgartner & Tarrant, 2017;Niehoff et al., 2002). The resulting offspring then develop through several nauplii and juvenile copepodid stages. The energy resource provided by the spring bloom is optimized by large lipid accumulations, specifically by the last copepodid stage CV. At this stage, CVs either directly migrate into deeper water where they enter diapause or molt into adults to produce another generation that enters diapause at a later time (Häfker et al., 2018;Johnson et al., 2008;Tarrant et al., 2014Tarrant et al., , 2016. The number of generations and timing of diapause vary according to populations and geographic locations (Häfker et al., 2018;Melle et al., 2014).
While the seasonal timing of the copepod life cycle is crucial for its survival and reproduction, broad geographic temperature increases caused by global climate change alter the seasonal dynamics of biotic and abiotic environmental cycles (Helm et al., 2013;Kharouba & Wolkovich, 2020). In particular, shifts in seasonal timing of phytoplankton blooms have been observed (Friedland et al., 2018).
Studying the seasonal life cycle of C. finmarchicus in its natural habitat, the open ocean, is extremely challenging. Populations are migrating vertically via diel vertical migrations (DVM) and horizontally via advection in ocean currents, and different copepod species frequently co-occur. However, an isolated population of C. finmarchicus can be found on the west coast of Scotland in Loch Etive. In the upper basin, including Bonawe deep (~150 m), C. finmarchicus dominates the mesoplankton community (Häfker et al., 2018;Hill, 2009;Mauchline, 1987), as less than 4% of CV and CVI stage individuals have been identified as Calanus helgolandicus (Hill, 2009).
Indeed, the Loch Etive Calanus species composition was analyzed in 2016 genetically, and no C. helgolandicus was detected (Choquet et al., 2017). It has been suggested that C. helgolandicus is intolerant to low salinities, such as found in Loch Etive (Helaouët & Beaugrand, 2007;Hill, 2009). Primary production in this loch is comparatively low (Wood et al., 1973), and terrestrial organic matter run-off probably provides a supplementary food source for C. finmarchicus . Häfker et al. (2018) have described a peak of phytoplankton production in May, followed by a sharp decrease until August, and very low Chl a concentration from November to March. Previous studies agreed on one single synchronized cohort of C. finmarchicus each year (Clark et al., 2012(Clark et al., , 2013Hill, 2009;Mauchline, 1987), which would start to descend in preparation for diapause (100-150 m depth) in July, while Häfker et al. (2018) argued for a possible second generation descending to diapause depth in October. During their active phase, C. finmarchicus perform DVM, with a daytime depth ~70 m (Häfker et al., 2018). Finally, the water column at Bonawe deep is highly stratified. Salinity and temperature in the top 50 m are variable seasonally due to the fluctuations in terrestrial freshwater run-off and tidal influence. Below 50 m, the water column is mixed with temperature and salinity extremely stable over the year (Häfker et al., 2018;Hill, 2009).
In this study, we took advantage of Loch Etive to perform a complete annual transcriptome of the copepod C. finmarchicus. The sampling campaign was conducted at Bonawe deep (UK; 56.45°N, 5.18°W) between 50 and 140 m depth, under extremely stable oceanographic conditions, but where both active and diapausing copepods are found. As copepods use multiple environmental cycles to synchronize their biological rhythms, from gene expression to behavior, the annual sampling was made at comparable phases of daily and lunar cycles (tidal and monthly cycles) (Häfker et al., 2017;Ibáñez-Tejero et al., 2018;Last et al., 2016;Payton et al., 2021;Schmitt et al., 2011). The genetic analysis focused on the copepodite pre-adult stage CV, which experiences all seasons and has a key role both in the active surface feeding and deep-water diapause states.
The 36 mRNA samples (3 replicates of 25 individuals per month), analyzed with a high sequencing depth, yielded a total of 4.9 billion reads, with an average of 137 ± 4 million reads (mean ± SE) per sample. In addition, we provide a quality assessment and the alignment to the reference transcriptome (Lenz et al., 2014), allowing the monthly quantification of 35,357 unique "components" in the C. finmarchicus transcriptome and the first investigation of annual transcriptomic rhythms in a key zooplanktonic species. Finally, existing annotations have been combined for further transcriptomic data exploration.

| Sampling design
Sampling design and analysis strategy are presented in Figure 1, Table 1, and Table S1.
Sampling was conducted over a year, with monthly samples from January to December 2019 (between the 1st and the 12th of each month), resulting in 12 time points. Sampling covered the winter (January-March), spring (April-June), summer (July-September), and autumn (October-December) seasons. Sampling was performed between 10:00 am and 12:15 pm, while local solar local noon is at ~12:30 pm, in order to be at a comparable phase of the daily rhythm (all times noted in UTC) (detailed in Table S2). Furthermore, sampling was performed on waxing moons (between New Moon and First Quarter Moon) and within 2:46 h of high tide, to provide comparable phases of lunar cycles (Table S2). However, while all samples were taken around high tide, some were during the flood tide and others during the ebb tide (Table S2). Copepods were collected from the RV Calanus (Scottish Association for Marine Science, SAMS) using a WP2-net equipped with an opening/closing mechanism (200 μm mesh size/50 μm cod end) hauled vertically through the water column at 0.5 ms from 140 m to 50 m depth. Samples were immediately (within 5 min after retrieval) fixed as bulk samples in RNAlater stabilization solution (Ambion). A 12 h period of incubation at 2-4˚C was allowed to soak the samples thoroughly with the stabilization solution before they were transferred to −80˚C for further transport and storage. Duplicate net sampling, with identical methodology, was carried out at the same time for species identification samples. These were fixed in 4% buffered formalin solution.

| Site description
Sampling was conducted in Bonawe deep, the deepest point (~150 m) of Loch Etive, a sea loch on the western coast of Scotland, UK (56.45°N, 5.18°W). Two sills with a depth of ≈7 m and ≈13 m at Connel and Taynuilt villages respectively limit the exchange with the open ocean (Häfker et al., 2018). The deeper waters of the sampling station typically cycle from normoxic to hypoxic and due to the sills, overturning of these water masses happens irregularly every ≈18 months when spring tides in spring/autumn co-occur with low precipitation (Edwards & Edelsten, 1977). Monthly CTD (conductivity, temperature, and depth) measurements were performed and archived by the British Oceanographic Data Centre (BODC) (Last & Dumont, 2021). As previously shown (Häfker et al., 2018), large seasonal changes in temperature and salinity occur above 50 m which is highly stratified, while only small seasonal changes occur in deeper waters (mixed), where net sampling was conducted. The average F I G U R E 1 Overview of the experimental workflow used to generate the transcriptomic data output. Sample details are available in Tables S1 and S2

Protocol 6
Filtering and normalization Protocol 7 Annotation temperature and salinity from 140 m to 50 m depth over the year were 11.4 ± 0.3°C and 27.8 ± 0.1 psu (mean ± SE) respectively (detailed in Table S2).

| Species identification
Formalin solution was removed from the monthly community samples by draining them through a 100 micron nylon gauze and then rinsing the sample with tap water. Animals were transferred to a round-bottomed flask and diluted to 300 ml with tap water. After careful sample mixing, sub-samples were taken with a 5 ml stempel pipette and transferred to a Bogorov tray. The first 200 Calanus were confirmed morphologically to corroborate species.

| Copepod sorting
Molecular analysis was performed on the CV copepodite stage of C. finmarchicus, the main overwintering stage that represents the bulk of the population over the year. Copepods were sorted at 2˚C under a stereo microscope for stage (CV) using morphological characteristics. For each time point, 3 replicates of 25 C. finmarchicus CV were sorted ("Jan.1," "Jan.2," "Jan.3"; to "Dec.1," "Dec.2," "Dec.3") ( Figure 1). The choice to pool 25 individuals was made to: 1) get the sufficient amount of RNA required for RNA sequencing and 2) increase the number of individuals thereby increasing the representativeness of the population (75 copepods per time point, 900 copepods in total) and to decrease the effect of individual variability.

| RNA extraction
Each replicate was transferred to a 2 ml Precellys ® homogenization tube (Bertin Instruments, France), containing a mix of 1.4 mm and 2.8 mm ceramic beads and homogenized in 600 µl of TRIzol ® reagent (ThermoFisher Scientific, USA) with a Precellys ® 24 Tissue Homogenizer (Bertin Instruments, France), using 2 × 15 s of homogenization at 1400 x g with a 10 s. break between. For RNA extraction, a Phenol/Chloroform-based single-step extraction in combination with a spin column based on solid phase extraction (Direct-zol™ RNA MiniPrep Kit, Zymo Research, USA) was used. Genomic DNA was removed by DNase I digestion on column as part of the RNA extraction kit and total RNA was eluted in ultra-pure water. Samples were sent to GeT-PlaGe core facility in dried-ice for RNA sequencing.

| Filtering
The quantification matrix revealed one outlier component (comp92_ c0_seq1; corresponding to the unique transcript GAXK01170082.1). and "Jul.2" (Table S3). For these reasons, and because of the clear increase in replicate correlations after filtering this outlier (see in Section 3), this component was removed. Additionally, the matrix was filtered with edgeR (Robinson et al., 2010) and only "comps" with more than 1 CPM (Count Per Million) in at least one sample were kept, providing a matrix of 35,357 unique "comps" and an average of 99 ± 3 mapped reads (mean ± SE) per sample (

| Annotation
An annotation matrix was generated by combining two existing annotation works: Bioproject PRJNA628886 (Payton et al., 2020) and Bioproject PRJNA236528 (Lenz et al., 2014) (Table 3) Table 3. Joining the two annotation works led to 22,643 components over 35,357 with at least one annotation (Table 3). GO (Gene Ontology) annotation was done using Bioproject PRJNA628886 (Payton et al., 2020), in which InterProScan (Jones et al., 2014) (Hughes et al., 2017;Koike et al., 2012;Li et al., 2015;Payton et al., 2021), is appropriate considering the high sequencing depth of the sample with the smaller library size (66.7 million reads after filtering, Table S3) (Li et al., 2015). This was performed with package R metaseqR

| Code availability
Parameters to software tools involved are described in the following paragraph.

| Species identification results
In all months no C. helgolandicus, only C. finmarchicus, were found between 140 and 50 m depth, where molecular sampling was performed. This result supports the assumption of extremely low risk of congener contamination.

| Extraction and RNA integrity
RNA extraction procedures were performed with randomization of samples to ensure reliable and unbiased data production. RNA purity was assessed by OD measurements with a NanoDrop 8000 spectrophotom- ribosomal ratio in this species, sample quality was evaluated on the electropherogram (Asai et al., 2015;DeLeo et al., 2018). No degradation in the inter region was observed. Total RNA samples were stored at −80°C.

| Raw reads assessment and quantification overview
All samples passed the FastQC (Andrews, 2010) "base quality control." No relevant contamination hit was found after the alignment against E. coli, Yeast, and PhiX. Mapping of the sequenced reads against the 96,090 unique "comps" of the reference transcriptome (Lenz et al., 2014) yielded an overall alignment of 102 ± 3 million reads (mean ± SE), that is, 74.5 ± 0.3% (mean ± SE) of total reads (Table 2, Table S3). This result is in accordance with the previously observed result on the de novo transcriptome of C. finmarchicus study (Lenz et al., 2014), where the alignment rate against unique comps was of 75%.

| Effect of filtering and normalization on counts distribution
Samples counts distribution before and after removing the outlier component (comp92_c0_seq1; corresponding to the unique transcript GAXK01170082.1) clearly highlighted the overrepresentation of the latter in samples "Mar.1" and "Jul.3" (Figure 2a vs. Figure 2b). Moreover, filtering components expressed below 1 CPM in all samples clearly increased counts distribution (Figure 2b vs. Figure 2c). Finally, the down-sampling normalization aligned the distribution of counts across samples, without observable impacts on the proportion of very low and very high expressed components (Figure 2c vs. Figure 2d) implying that this is a satisfactory approach.

| Proportion of components expressed above the threshold of 1 CPM
Of the 96,090 components of the reference transcriptome (Lenz et al., 2014), 36.8% (35,357) were expressed above the threshold of 1 CPM (Table 2). This result corroborates previously observed results on the C. finmarchicus transcriptome (Lenz et al., 2014;Payton et al., 2020). Thus, a large proportion of the whole contigs (63.2%) exhibited an extremely low level of expression, representing only 0.39 ± 0.01% (mean ± SE) of total aligned reads per sample (Table S3).

| CON CLUS ION
This year-round transcriptome dataset is of general interest for the understanding of annual oscillations of gene expression, specifically in organisms which perform diapause. Our study presents a significant contribution in providing the first in situ annual transcriptomic resource of a key zooplanktonic species, the copepod C. finmarchicus. This is significant since seasonal transcriptomic investigations in wild marine organisms are rare (Clark et al., 2013;Häfker et al., 2018;Lenz, Lieberman, et al., 2021;Roncalli et al., 2018Roncalli et al., , 2020Roncalli et al., , 2021Semmouri et al., 2020;Tarrant et al., 2014). The proposed dataset is of interest and of high quality due to the unique and rigorous temporal resolution of sampling spanning the entire year and thereby facilitating rhythmic analysis (Thaben & Westermark, 2014). Furthermore, there is a high depth of the RNA sequencing with resulting high-quality sequencing data.
Indeed, the average sequencing depth of 137 ± 4 million reads (mean ±SE) per sample optimizes the investigation of gene expression in a species with a large genome (Bron et al., 2011;Choquet et al., 2019).
We provide the raw reads, the quantification matrix after filtering, both before and after the normalization, and finally the combined annotation of the 35,357 components. Ultimately, this dataset will, among others, allow new insights into the understanding of mechanisms of animal diapause (Baumgartner & Tarrant, 2017;Hand et al., 2016), (circ)annual rhythms (Goto, 2013;Helm & Lincoln, 2017;Helm et al., 2013), and the consequences of climate-driven environmental changes (Kharouba & Wolkovich, 2020).

CO N FLI C T O F I NTE R E S T
We have no conflict of interest.  which includes all BioSamples used for the study (detailed in Table   S1). We also provide the following in Figshare