Investigating population‐scale allelic differential expression in wild populations of Oithona similis (Cyclopoida, Claus, 1866)

Abstract Acclimation allowed by variation in gene or allele expression in natural populations is increasingly understood as a decisive mechanism, as much as adaptation, for species evolution. However, for small eukaryotic organisms, as species from zooplankton, classical methods face numerous challenges. Here, we propose the concept of allelic differential expression at the population‐scale (psADE) to investigate the variation in allele expression in natural populations. We developed a novel approach to detect psADE based on metagenomic and metatranscriptomic data from environmental samples. This approach was applied on the widespread marine copepod, Oithona similis, by combining samples collected during the Tara Oceans expedition (2009–2013) and de novo transcriptome assemblies. Among a total of 25,768 single nucleotide variants (SNVs) of O. similis, 572 (2.2%) were affected by psADE in at least one population (FDR < 0.05). The distribution of SNVs under psADE in different populations is significantly shaped by population genomic differentiation (Pearson r = 0.87, p = 5.6 × 10−30), supporting a partial genetic control of psADE. Moreover, a significant amount of SNVs (0.6%) were under both selection and psADE (p < .05), supporting the hypothesis that natural selection and psADE tends to impact common loci. Population‐scale allelic differential expression offers new insights into the gene regulation control in populations and its link with natural selection.

In the present study, we proposed to measure the population-scale allelic differential expression (psADE) of genes. psADE depends on the difference between alleles abundance at the genomic and transcriptomic level. At population scale, it aggregates differential expression between the two alleles at smaller scales ( Figure 1). To measure psADE on small organisms, it would require the sequencing of several individuals at genomic and transcriptomic levels separately. An alternative approach could be to take advantage of the recent advances in metagenomic and metatranscriptomic sequencing of environmental samples, which offer a direct populational insight. In this context, polymorphic sites of a single species have to be extracted, allowing to evaluate whether the population-scale relative expression of an allele deviates from its genomic frequency.
Copepods, and particularly species belonging to the Oithona genus, are small crustaceans forming the most abundant metazoan on Earth (Gallienne, 2001;Humes, 1994;Kiørboe, 2011). This abundance, reflecting strong adaptive capacities to environmental fluctuations, together with large hypothetic effective population size (Peijnenburg & Goetze, 2013;Riginos, Crandall, Liggins, Bongaerts, & Treml, 2016;Madoui et al. 2017;Arif et al. 2018) make this species suitable for population genomics analyses. In addition, they play an ecological key role in biogeochemical cycles and in the marine trophic food chain (Wassmann et al. 2006). In this study, we propose to detect psADE by focusing on the widespread epipelagic copepod, Oithona similis (Cyclopoida, Claus, 1866). We used environmental samples collected by the Tara Oceans expedition (Karsenti et al. 2011;Pesant et al. 2015) during its Arctic phase (2013) for which both metagenomic and metatranscriptomic data are available.
Arctic Seas is an area where O. similis is known to be highly abundant (Blachowiak-Samolyk, Kwasniewski, Hop, & Falk-Petersen, 2008;Castellani et al., 2016;Dvoretsky, 2012;Zamora-Terol, Nielsen, & Saiz, 2013). First, variants of O. similis were extracted and the genetic structure between the populations was studied. Then, we detected loci under psADE, under selection and under both psADE and selection. From these results, we tried to decipher the potential links between psADE, genomic differentiation and natural selection.
Lastly, we investigated the molecular functions and biological processes of candidate loci under psADE and selection.

| Variant calling using Tara Oceans metagenomic and metatranscriptomic data
We used metagenomic and metatranscriptomic reads generated from samples of the size fraction 20-180 µm collected in seven Tara F I G U R E 1 Allelic differential expression at population-scale. (a) Different scales of allele-specific expression detection for a heterozygous gene, from population to cellular levels. For a heterozygous genotype, ADE is understood as the difference in expression between two alleles of a single gene, opposed to strict biallelic expression. For clarity, the example of ADE presented here is monoallelic expression Oceans stations (TARA_155,158,178,206,208,209,and 210l Figure 2a) according to protocols fully described by Alberti et al. (2017) (Table S1). Because of the lack of a reference genome, the reference-free variant caller DiscoSNP++ (Uricaru et al. 2014;Peterlongo et al. 2017) was used to extract SNVs simultaneously from raw metagenomic and metatranscriptomic reads and was run using parameter -b 1. Only SNVs corresponding to biallelic loci with a minimum of 4x of depth of coverage in all stations were initially selected. Then, to capture loci belonging to Oithona similis, SNVs were clustered based on their loci co-abundance across samples using density-based clustering algorithm implemented in the R package dbscan (Ester, Kriegel, Sander, & Xu, 1996;Ram, Jalal, Jalal, & Kumar, 2010) and ran with parameters epsilon = 10 and minPts = 10. , with G A , G B the metagenomic read counts supporting the reference and alternative alleles respectively and T A , T B the metatranscriptomic read counts supporting the reference and alternative alleles, respectively.
Biallelic loci were then filtered based on their metagenomic coverage. For each sample, the median and standard deviation σ of the distribution of metagenomic coverage of all biallelic loci were estimated. Biallelic loci must be characterized by a metagenomic coverage between a limit of median ± 2σ, with a minimum and maximum of 5× and 150× coverage in each sample to avoid low covered and multicopy genomic regions. To keep out rare alleles and potential calling errors, only variants characterized by a BAF comprised between 0.9 and 0.1, and a BARE between 0.95 and 0.05 in at least one population were chosen for the final dataset resulting in 25,768 biallelic loci (Table S5).
To ensure that these loci belong to O. similis, the global Fstatistics (or Wright's fixation index (Wright, 1951;Lewontin & Krakauer 1973) over the seven populations was computed as fol- , with p and σ 2 being the mean allele frequency and the related variance, and its distribution was tested for unimodality via a Hartigans' dip test (Hartigan & Hartigan, 1985). Moreover, LK statistics (Lewontin & Krakauer 1973)

| Population-scale ADE detection using metagenomic and metatranscriptomic data
In each population, we first selected heterozygous loci variants (BAF ≠ {0,1}). We tested the correlation between BAF and BARE and modeled their relationship by a linear regression. Then, we computed D = BAF-BARE and estimated the distribution parameters µ and σ 2 by fitting a normal distribution via fitdist function from fitdistrplus R package (Delignette-Muller & Dutang, 2015).
We then tested the psADE of each variant using a two-sided Fisher's exact test on a 2 × 2 table containing the read counts G A , G B , T A , and T B . Given the large number of tests, we applied the Benjamini and Hochberg p-value correction (Benjamini & Hochberg, 1995) to control the False Discovery Rate (FDR). This generated seven sets of candidate loci under psADE, one set for each population.

| Noise detection in population-scale ADE using simulated data
To account for noise originating from potential sampling bias during sequencing, simulations were performed by generating sets of variants: (a) We modeled the distributions of the genomic depth of coverage of the loci (i.e., the sum of G A and G B ) from each of the seven samples separately (Table S3) by a negative binomial (NB) distribution (Robinson & Smyth, 2007) and estimated seven µ and θ (the NB mean and shape parameters); (b) The relationship among the seven samples between the observed µ and θ by a linear regression ( Figure S4), allowing us to estimate a shape parameter θ for any given mean µ; (c) A-allele frequencies followed a U-shaped distribution, approximated by a beta distribution of shape parameters α and β; (d) The expression level (i.e., the sum of T A and T B ) was modeled by a gamma distribution of shape and rate parameters a and b. These estimations were performed with fitdist function of fitdistrplus R package (Delignette-Muller & Dutang, 2015) and are presented in Figure S3.
To simulate the genomic A-allele frequency for one biallelic loci in a given population, we extracted one random deviates from its beta distribution of shape parameters α, β estimated previously for each population in (c). The A-allele frequency was multiplied by the estimated µ to obtain an expected genomic read count of allele A, G A . In the same way, G B was obtained by the difference between µ n and G A . Simulated genomic read counts for allele A and B were then obtained by generating a random value from negative binomial distributions of parameters µ A = G A and µ B = G B respectively, and the corresponding size parameters θ A and θ B estimated using the linear regression between θ and µ as described above in (b).
A locus expression level was then computed by generating random deviates following a gamma distribution of shape a and rate b estimated previously for each population in (d). Under the null hypothesis, the allele abundance at genomic and transcriptomic level is the same (see Table S3), the expected transcriptomic read count of allele A, T A , was generated by multiplying the genomic A-allele frequency previously computed and the locus expression level. In the same way, T B was obtained by the difference between the locus level expression and T A . Finally, simulated transcriptomic read count T A and T B were obtained by generated random values from a Poisson distribution of parameter λ A = T A and λ B = T B , respectively. All simulations were performed using lm, rbeta, rnbinom, rgamma, and rpois R functions.
To formally test for psADE due to noise (i.e., under the null hypothesis), seven sets of 50,000 loci were simulated using the seven sets of parameters learnt for each sample. Fisher's exact test was performed only on heterozygous loci with a non-null expression level, and simulated variants with a q-value < 0.05 (Benjamini-Hochberg correction) were considered as noisy psADE, as described above.

| Estimation of genomic differentiation and detection of variants under selection
Pairwise-F ST was computed as follows, Hochberg p-value < .05 were considered under selection.

| Modeling psADE with population differentiation
The seven sets of candidate loci under psADE were crossed to identify variants under psADE in several populations (named "shared psADEs") and all nonempty, nonoverlapping crossings between populations were represented by an upset plot.
To test whether populations characterized by weak genetic differentiation tend to share more loci under psADE than genetically distant populations, we modeled the number of shared psADEs between populations by genomic differentiation using a nonlinear model: y = a e -bx + c, with y being the number of shared ADE and x the genomic differentiation. The latter was estimated by computing the median F ST of each nonempty crossing set of populations as described above, with i being here the considered set of populations. To find the starting values, the model was linearized as follows, log(y-c 0 ) ≈ log(a) + bx, with c 0 = min(y)*0.5 and a and b parameters were estimated via the lm R function. The nonlinear model was then applied, and least squares estimates were used via the nls R function.
Pearson's correlation between the fitted and empirical values was then computed via the cor.test R function.

| psADE and link with natural selection
To From the Large Bay of Toulon samples, O. similis individuals were isolated under the stereomicroscope (Nishida, 1985;Rose, 1933). We selected two different development stages: four copepodites (juveniles) and four adult males. Each individual was transferred separately and crushed, with a tissue grinder (Axygen) into a 1.5 ml tube (Eppendorf).
Total mRNAs were extracted using the 'RNA isolation' protocol from NucleoSpin RNA XS kit (Macherey-Nagel) and quantified on a Qubit 2.0 with a RNA HS Assay kit (Invitrogen) and on a Bioanalyzer 2100 with a RNA 6000 Pico Assay kit (Agilent). cDNA was constructed using the SMARTer-Seq v4 Ultra low Input RNA kit (ClonTech). The libraries were built using the NEBNext Ultra II kit for paired-end sequencing with an Illumina HiSeq2500. After adaptors trimming, only reads with a mean Phred score > 20 were kept.

| Variant functional annotation
The variant functional annotation was conducted in two steps.

| Gene enrichment analysis
To identify putative biological function or processes associated to the variants, a domain-based analysis was conducted. The Pfam annotation of the transcripts carrying variants categorized as psADE and selection was used and compared to Pfam annotation of the total sets of variants with a hypergeometric test for enrichment.
A significant excess was declared for a q-value < 0.05 (Benjamini-Hochberg correction). To complete the domain-based analysis, the functional annotations obtained from the homology searches against the nr were manually curated.

| Extracting polar Oithona similis variants from environmental samples
From metagenomic and metatranscriptomic raw data of seven sampling stations (Figure 2a), we identified 102,258 variants using a reference-free approach. Among them, 25,768 expressed O. similis variants were retrieved after filtering. To ensure that the variants be-  Figure S6d).
Finally, the LK distribution over all the loci followed the expected chisquared distribution (Figure 2d), showing that most of the loci follow the neutral evolution model, as expected in a single species.

| Detection of population-scale ADE
As expected, most of the loci presented a strong correlation between B-allele frequency and B-allele relative expression (Figure 3a, Table S3). Thus, we observed the difference between BAF and BARE, which followed a Gaussian distribution centered on 0 in each population ( Figure 3a, Table S3). The number of SNVs tested for psADE varied between 13,454 and 22,578 for TARA_210 and 206 respectively. We found a significant amount of variants under psADE in each population under a Fisher's exact test (Figure 3c, Table S4).
Potential noise due to sample bias during sequencing was estimated p-values compared with the simulated ones, resulting in a proportion of true-positives psADEs varying from 70% to 100% (Table S4).
Overall, we found 572 variants under psADE, including 513 population-specific psADEs, and 59 psADEs shared by several populations ( Figure 3c). Remarkably, 29 psADEs out of the 59 were present only in the populations from TARA_158, 206, and 208 that correspond to the genetically closest populations, leading us to compare the relationship between sharing psADEs and population differentiation. By comparing the number of shared psADEs in the different sets of populations to their genomic differentiation, we found a negative trend between the two (with a strong negative exponential slope estimate), illustrated by a significant correlation between nonlinear fitted and empirical values (0.87, p-value 5.6 × 10 −30 , Figure 3d). This modeling shows that genetically close populations tend to share more variants under psADE.

| Loci under population-level ADE and selection in Arctic populations
The set of variants was tested for selection using pcadapt. The

| Functional analysis of transcripts under population-scale ADE and selection
The full dataset of variants was positioned on the eight transcriptomes to extract putative functional information, with a total of 25,048 variants (97% of total) successfully mapped on 16,272 transcripts. First, SNPEff was used to estimate the localization of variants inside the transcripts, and an enrichment was estimated for all categories, and for three sets of variants categorized respectively as under selection, psADE and both ( Figure S9). Overall, the two first sets showed a significant excess of variants in 3' UTR. Plus, among the variants under psADE, one was categorized as a "stop gained" and one as a "stop retained". However, variants under psADE and selection did not show any excess of specific effect.
Among the 84 loci identified under psADE and selection, 80 were located on O. similis transcripts (Table S6). Amid these transcripts, 64 (76%) were linked to at least one Pfam domain (61 different domains) and 59 (70%) to a functional annotation from the nr database. From this total of 61 Pfam domains, 23 presented a significant excess compared with domains present in the global set of transcripts, corresponding to 21 transcripts ( Figure S10). On the latter, three transcripts were involved in nervous system features: omega-amidase NIT2, vang-like 2B protein, and 5-oxoprolinase.

| Genomic and transcriptomic variation data belong to a single Oithona similis lineage
Because genomes of small animals like copepods are difficult to reconstruct, we used DiscoSNP++, a reference-free variant caller to extract variants from metagenomic data, that already showed its accuracy on Tara Oceans metagenomic data (Arif et al. 2018).
Global populations of O. similis are known to be composed of cryptic lineages across oceanic basins . It is also known that this species in highly abundant among other copepods in Arctic Seas (Blachowiak-Samolyk et al., 2008;Castellani et al., 2016;Dvoretsky, 2012;Zamora-Terol et al., 2013). Thus, the assessment that the extracted variants from the seven samples used in our study belongs to the same O. similis cryptic lineage was a prerequisite for further analyses. Three different analyses support this assumption. First, the distribution of depth of coverage in each of the seven samples followed the expected negative binomial distribution (Supplementary Figure S3). Indeed, the possibility to observe these patterns in the presence of different species would require them to be equally co-abundant, which is unlikely.
Thus, this covariation of the depth of coverage of these variants supports the single species genome origin. Secondly, the high proportion of variants (97%) mapped on the Mediterranean O. similis transcriptomes, another cryptic lineage , showed that the variant clustering method was efficient to re-

| Oithona similis populations are weakly differentiated within the Arctic Seas
We observed that the seven populations examined showed low genomic differentiation, despite the large distances separating them, which was illustrated by a nonsignificant Mantel test for isolationby-distance ( Figure S7). Höring, Cornils, Auel, Bode, & Held, 2017;Peijnenburg & Goetze, 2013). Weak genetic structure in the polar region was highlighted for other major Arctic copepods like Calanus glacialis (Weydmann, Coelho, Serrão, Burzyński, & Pearson, 2016) and Pseudocalanus species (Aarbakke, Bucklin, Halsband, & Norrbin, 2014). The absence of structure was explained by ancient diminutions of effective population size due to past glaciations (Aarbakke et al., 2014;Bucklin & Wiebe, 1998;Edmands, 2001), or high dispersal and connectivity between the present-day populations due to marine currents (Weydmann et al., 2016). Using Lagrangian travel time or dispersal probabilities could help to estimate how much marine currents explain this observed genomic differentiation.

| Population-scale ADE in O. similis populations and its link with differentiation and selection
We were able to detect variants under psADE in the seven populations. First, allele frequency and relative expression are strongly correlated in the data, showing as expected that the more an allele is observed at the genomic level, the more this allele is expressed.
Simulations performed showed that although this sequencing bias noise is present in our data, it does not significantly affect psADE detection. Among the variants under psADE, a large part was population-specific and a minority was under psADE in several populations. From the latter, we showed that closely related populations tended to share more variants under psADE than other more differ- with this psADE due to a low abundance and high expression. In this population, the allele B sees its frequency decreasing because another allele appears in this population. However, since the latter is under-expressed, it could mean that it is a deleterious mutation, and strong regulatory elements or molecular mechanisms repress its expression, or that even a small expression enables a higher fitness for individuals carrying it, or that the allele favored by psADE is the one enabling higher fitness, leading to fixation in other populations.
Ultimately, although determining how psADE and selection interact remains beyond the scope of this study, we can hypothesize from these observations that the action of the two mechanisms on a locus can be (a) independent, psADE and selection acting separately, (b) sequential, with psADE acting before, while or after selection occurs.
The process of acclimation through gene expression and the link with genetic variation and adaptation have been studied widely in several organisms (Fay & Wittkopp, 2008;Signor & Nuzhdin, 2018;Williams, Chan, Cowley, & Little, 2007). In a first study in human, a link has been established between gene expression and selection, affecting particular genes and phenotypes, looking at cis-acting SNPs (Fraser, 2013 Further investigations including replicates, more populations, and the production of a genome and genotypes would help to confirm our results, disentangle the different causes of psADE, and question the link between psADE and selection. Among the transcripts under psADE and selection, three are of specific interests, as they are also involved in nervous system. Omegaamidase NIT2 is an enzyme that produces α-ketoglutaramate, a precursor of glutamate and GABA. The 5-oxoprolinase produces glutamate from 5-oxoproline. Finally, vang-like protein 2B is involved in the formation of ommatidies in Drosophila (Leung et al. 2016). Other studies focusing on axon myelination in calanoid species illustrate how nervous system can play an important role in copepod evolution (Lenz, 2012;Weatherby, Davis, Hartline, & Lenz, 2000). From our results, more functional analyses would allow a better characterization of these genes, but it reveals the potential evolutionary importance of nervous system in copepods.

| CON CLUS ION
Gene expression variation is thought to play a crucial role in acclimation and adaptive history of natural populations. Herein, we integrated metagenomic and metatranscriptomic data to detect ADE at the population level in populations of copepods. Then, we demonstrated the link between psADE and population differentiation on one hand and with natural selection on the other hand, by providing a quantitative observation of this phenomenon and its impact on specific biological features of copepods. In the future, we will try to expand these observations to other organisms and question the nature of the link between psADE and natural selection. Bourgois and the Tara schooner and its captain and crew. Tara

ACK N OWLED G M ENTS
Oceans would not exist without continuous support from 23 institutes (oceans.tara-expeditions.org). We acknowledge the support of Vincent Segura and Leila Tirichine for fruitful scientific discussions and support on the analyses and manuscript. This is contribution number 106 from Tara Oceans.

CO N FLI C T O F I NTE R E S T
The authors declare no competing interests.

DATA AVA I L A B I L I T Y S TAT E M E N T
All data are available at ENA (European Nucleotide Archive); see Table S1.