SNP discovery and genetic structure in blue mussel species using low coverage sequencing and a medium density 60 K SNP‐array

Abstract Blue mussels from the genus Mytilus are an abundant component of the benthic community, found in the high latitude habitats. These foundation species are relevant to the aquaculture industry, with over 2 million tonnes produced globally each year. Mussels withstand a wide range of environmental conditions and species from the Mytilus edulis complex readily hybridize in regions where their distributions overlap. Significant effort has been made to investigate the consequences of environmental stress on mussel physiology, reproductive isolation, and local adaptation. Yet our understanding on the genomic mechanisms underlying such processes remains limited. In this study, we developed a multi species medium‐density 60 K SNP‐array including four species of the Mytilus genus. SNPs included in the platform were called from 138 mussels from 23 globally distributed mussel populations, sequenced using a whole‐genome low coverage approach. The array contains polymorphic SNPs which capture the genetic diversity present in mussel populations thriving across a gradient of environmental conditions (~59 K SNPs) and a set of published and validated SNPs informative for species identification and for diagnosis of transmissible cancer (610 SNPs). The array will allow the consistent genotyping of individuals, facilitating the investigation of ecological and evolutionary processes in these taxa. The applications of this array extend to shellfish aquaculture, contributing to the optimization of this industry via genomic selection of blue mussels, parentage assignment, inbreeding assessment and traceability. Further applications such as genome wide association studies (GWAS) for key production traits and those related to environmental resilience are especially relevant to safeguard aquaculture production under climate change.


| BACKG ROU N D
Blue mussels from the genus Mytilus are an abundant component of the benthos, found in high latitude habitats (Gosling, 2015). These foundation species can aggregate in high densities, forming extensive beds or reefs, which provide a number of important ecosystem services (e.g. providing spatial structure, undertaking nutrient cycling, and forming an important food source) (van der Schatte Olivier et al., 2020). Additionally, mussels play an important economic role, as both a fishery and aquaculture species, accounting for approximately 12%, or ~2 million tonnes, of global mollusc production (Subasinghe, 2017). Most landings (>90%) derive from aquaculture (Avdelas et al., 2021), with farmed bivalves identified as one of the most sustainable sources of animal protein (Hilborn et al., 2018).
From a nutritional perspective, mussels contain high levels of omega-3 polyunsaturated fatty acids and essential amino acids, which in the human diet have significant health benefits (Carboni et al., 2019).
Commercial blue mussel production in Europe relies almost exclusively on collection of naturally-settled spat (i.e. settled juveniles) (Kamermans et al., 2013). Several environmental factors (e.g. water temperature, salinity, food availability, and local currents) influence the reproductive cycle, in addition to triggering spawning events and determining larval dispersal patterns.
Species from the M. edulis species complex (M. edulis, M. trossulus, and M. galloprovincalis), the main blue mussel species prevailing in the northern hemisphere, and their southern counterpart M. chilensis, are the Mytilus species with the highest economic value in terms of aquaculture production. They are also the species that, wherever their geographic range overlaps, readily hybridize. This includes regions occurring in the south west of the United Kingdom (Gardner, 1996;Hilbish et al., 2002;Vendrami et al., 2020), the European coast of the north east Atlantic (Bierne et al., 2002;Bierne, Borsa, et al., 2003;Fraïsse et al., 2016;Simon et al., 2021), north west Atlantic (Koehn et al., 1984;Rawson et al., 2001;Toro et al., 2004), the Baltic sea (Riginos & Cunningham, 2005;Stuckas et al., 2017;Väinölä & Hvilsom, 1991), subarctic and arctic regions (Mathiesen et al., 2017), the north east Pacific (Rawson et al., 1999;Saarman & Pogson, 2015), south and east Pacific (Larraín et al., 2019;Popovic et al., 2020). Nonetheless, in the Northern hemisphere M. californianus and M. coruscus also coexist with the previously mentioned taxa, but do not seem to readily hybridize. In the Southern hemisphere M. chilensis coexists with other distinct lineages of blue mussels including M. platensis, M. aoteanus from New Zealand and M. planulatus from Australia, together with the recently introduced Northern hemisphere M. galloprovincialis (Oyarzún et al., 2021;Popovic et al., 2020;Zbawicka et al., 2018). Broad-scale population structure and population dynamics in this species complex is therefore predominantly shaped by interactions between oceanography and the biology of each species. Both pre-and post-settlement selection drive geographical and ecological segmentation, and contribute to determining species distribution and in shaping hybrid zones (Bierne et al., 2002;Knöbel et al., 2021;Koehn et al., 1980). Moreover, the success of mussel aquaculture is tightly coupled to the environment, across all stages of production, with environmental change also influencing key performance traits including growth, survival, and susceptibility to disease . Therefore, further research is needed to fully understand the genetic barriers that determine distribution of these taxa, as well as the genomic mechanisms driving such processes across different environments.
Selective breeding has been highlighted as a key tool to facilitate sustainable intensification of bivalve aquaculture, allowing the development of specialized breeding lines resilient to environmental and pathogenic challenges (FAO, 2016;Potts et al., 2021). Selection has benefited the production of many cultured aquatic taxa (Gjedrem & Rye, 2018), including the most important marine bivalve molluscs such as mussels and oysters (Hollenbeck & Johnston, 2018). Methods of selection vary from mass selection, for example breeding from the fastest growers in a population, to methods using genetic markers spread across the genome, known as genomic selection (GS). GS is particularly powerful as it can improve accuracy of selective breeding, contributing to highly targeted results, even in the case of polygenic traits, whilst allowing for full control of the genetic relationships of the offspring Meuwissen et al., 2001). Nonetheless, GS approaches in marine bivalves are still in their infancy, and their efficacy within different species require significant investigation to be fully elucidated. The development of a SNP-array for blue mussels will therefore contribute to the investigation of whether GS is a viable option to enhance aquaculture production in these taxa.
To utilize GS, genome-wide markers are required. In most organisms, single nucleotide polymorphisms (SNPs) are the most common form of genetic variation and, therefore, the marker of choice for GS. Other forms of structural genetic variation, including large insertions and deletions, are known to be prevalent in blue mussels (Gerdol et al., 2020), and therefore may play an important role determining phenotypic diversity in this group. However, SNPs occur in genes/regulatory regions, as well as noncoding regions and still provide relevant information on phenotype and trait heritability. Whilst these markers may be contributing to for key production traits and those related to environmental resilience are especially relevant to safeguard aquaculture production under climate change.

K E Y W O R D S
aquaculture, mussels, Mytilus, population genomics, SNP, SNP chip the expressed phenotype, using a high density of markers can assure that a majority are in linkage disequilibrium (LD) with causal mutations. Advances in sequencing and computational technologies have facilitated lowering costs of SNP discovery and enable the generation of large quantities of sequencing data and high throughput screening of SNPs. Consequently, these genetic markers have been widely applied for the development of further genomic resources.
SNP-arrays use a probe-based approach to generate high-quality genotype data whilst requiring less investment in sample preparation. Computational analysis of genotype data generated by arrays is less demanding than techniques such as genotyping-by-sequencing that provide a similar amount of data. Furthermore, SNP-arrays enable genotyping in multiple pre-defined loci, guaranteeing reproducibility of analysis (Robledo et al., 2018). Such information contributes to our understanding of genomic processes underpinning research in evolutionary genomics, population genetics, conservation and ecology (Allendorf et al., 2010;Andrews et al., 2016). Therefore, SNP-arrays have valuable applications for the study of both wild and farmed populations and have been successfully applied across multiple taxa (Houston et al., 2014;Kranis et al., 2013;Michelizzi et al., 2011;Stoffel et al., 2012). Arrays are currently available for several farmed aquatic species including the Pacific (Crassostrea gigas) and European flat oysters (Ostrea edulis) (Gutierrez et al., 2017;Lapègue et al., 2014;Qi et al., 2017), Atlantic salmon (Salmo salar) (Houston et al., 2014), Rainbow trout (Onchorynchus mykiss) Palti et al., 2015) and Nile Tilapia (Oreochromis niloticus) (Joshi et al., 2018;Peñaloza et al., 2020), offering a rapid, accessible and cost-effective approach for medium-and high-density genotyping.
In this study, we generate a SNP database for the Mytilus edulis complex (M. edulis, M. galloprovincialis and M. trossulus), together with the southern hemisphere, Mytilus chilensis, given the high economic relevance of these species to the mussel industry, as well as the ecological significance of this group given their propensity to hybridize.
A subset of this database comprising 60 K SNPs was then used to develop a medium density SNP-array on the Affymetrix ThermoFisher platform. Polymorphic markers were discovered using low-coverage whole-genome sequencing (lcWGS) data of 23 globally distributed blue-mussel populations, including multiple pure and hybrid genotypes. This tool will facilitate the investigation of population genetic processes such as local adaptation and reproductive isolation in this species complex, and increase the potential of selective breeding in mussel aquaculture.
Coverage of each of the contigs was calculated as the mean cover- The final PCA and admixture analysis were ran using SNPs that were genotyped across all populations which were identified using BCFtools v1.9 flag isec (Li, 2011). The results from the population structure and admixture analyses were further applied to determine individual-level species composition used in the development of the SNP-array.

| Initial quality check, alignment, variant calling and filtering for array development
SNP filtering for the SNP-array genotyping platform followed the initial filtering steps as described in the section above, but different subsequent steps were applied for filtering: first we removed markers with overall minor allele frequency <0.01, subsequently we kept only markers with monomorphic flanking regions (35 bp in each direction) and, we excluded those which were present in <50% of individuals within each of the 23 populations.

| SNP selection for axiom blue mussel array
The set of SNPs with putative monomorphic flanking probes (71meter bp) was submitted to Affymetrix ThermoFisher for in silico evaluation. During the quality control (QC) process, each submitted SNP received a design score (p-convert value) for each 35 bp probe flanking the variant. Based on p-convert value, probes were classified as 'Recommended', 'Neutral', 'Not recommended', or 'Not possible'. For subsequent analysis, only variants that scored as 'Recommended' for both flanking probes were included. TA B L E 1 Sampling geographical coordinates and number of sequenced individuals in each of the populations used to generate WGS sequencing data in this study.

Samples
Sampling location  (Figure 2), and markers were classified as either genotyped in all species, three species, two species or species-exclusive ( Figure 3), with those shared on any level among two or more groups being included on the platform (~54 K). To reach the remaining target of 60 K SNPs, markers unique for each species complex (10 K, 2.5 K SNPs from each species) were selected, which were randomly filtered using Plink v1.9 flag -thin-count (Purcell et al., 2007). In addition, a set of previously published informative SNPs associated with species identification, transmissible cancer, phenotypic sex and population structure in this species complex were provided by collaborators ( Table 2).

| SNP summary statistics
One hundred and twenty-seven samples from the original discovery population, from which DNA was available from genotyping following the WGS, were genotyped using the Blue Mussel array platform.
For each of the markers in the platform, three intensity genotyping clusters were generated using the Axiom Analysis Suite software (v 5.1.1 Thermo Fisher Scientific) to segregate alleles from a single locus.
The standard protocol for marker quality check (QC) using the AxaS software includes selecting thresholds for the following filters: marker call rate (CR), average sample CR and DishQC (DQC). DQC is a quality control metric that measures the amount of overlap between two homozygous peaks created by a subset of non-polymorphic probes. In this study, we did not identify a subset of non-polymorphic probes shared by all the genotyped subspecies, most likely due to not identifying polymorphisms present in the flaking regions of the markers during initial filtering, rendering DQC inappropriate as a sample QC metric. For this reason, we used marker CR as a direct sample QC for the array validation, and five different runs were performed in the software using the following approach: firstly, combining all discovery population samples (run-ALL) and subsequently For each of the runs, we analyzed the final number of SNPs called with a marker CR equal or above 90% to 95%. Finally, we accounted for polymorphism among markers within and between species.

| Assessment of array performance and applications
To assess the array performance, we compared genotype information within the four species generated by two different methods.
The first dataset, obtained from the Axiom Analysis Software, consisted of genotype information generated by the 60 K Blue mussel We explored marker frequency and data missingness in both datasets using Plink (v1.9) (Purcell et al., 2007). We then compared called SNP genotypes in each of the individual species. For this, in each individual, we excluded SNPs with missing genotypes. This was done in both genotype datasets separately (lcWGS and SNP-array).
We then assessed the percentage of genotypes that were equivalently called for the SNPs retained in both datasets. Results are presented as (mean ± standard error) per species.

| Population structure and introgression of Mytilus sp generated from lcWGS data
The structure of the 23 Mytilus spp populations used to generate the 60 K SNP-array was assessed by PCA using the 4367 SNPs which intersected among all populations (Figure 2 and Figure S1). The    ( Table 2.
The final composition of the panel of SNPs is presented in Table 3.

| SNP summary statistics
In this study, SNPs were retained for genotyping based on marker CR and not sample DQC. In all marker CR filtering scenarios, the number of usable SNPs generated was higher when analysing individual species groups than when analysing all discovery population samples in combination (  The majority of markers passing a CR threshold ≥95% were polymorphic within species (Table 5)

| DISCUSS ION
In this study, we developed the first high throughput genotyping assay for the four main cultured species in the Mytilus genus, a

Probes (n) Probes (n) Probes (n) Probes (n) Probes (n)
CR (%) multi-species medium density 60 K SNP-array. SNPs were selected from 138 individual mussels, originating from 23 wild, globally- Here we chose to apply this alternative filtering method as samples genotyped in the 60 K blue mussel array consistently received low scores with the DQC parameter (ST1) and were removed from the analysis by the software, indicating the existence of polymorphisms present in the DQC 71 bp flanking probes. This is likely a consequence of the very high polymorphism present in mussel genomes (Romiguier et al., 2014), coupled with the chosen sampling approach of 23 globally distributed mussel populations including four species.
Whilst low coverage sequencing allowed the discovery of a large number of SNPs within the samples, it also resulted in patchy data.
Choosing to sequence individuals with low coverage might have led to a high missingness, which in turn resulted in a high proportion of the genetic variation being missed during the SNP selection process.
This may have also resulted in the removal of accurate data on the presence of SNPs in flanking regions.
In this study, we analyzed the performance of SNPs across different sample sets analyzed in the AxaS software. A higher number of markers were kept when genotyping samples from each species individually (run-GALLO, run-EDULIS, run-TROS, run-CHIL) than when genotyping all samples combined (run-ALL). Clusters generated by the Axiom Analysis Software classify the genotypes as either homozygous for one of the two alleles in the probe, or heterozygous.
As a consequence of the lcWGS approach chosen to sequence individuals in this study, hemizygous loci, which are missing one of the alleles at a specific position of a marker included on the array, would be incorrectly called by the platform as a missing genotype. Whilst this is a limitation of the genotyping platform, the above mentioned genomic variations are commonly expected to occur in the mussel genome (Gerdol et al., 2020) and might contribute to the high frequency of low quality scores of markers observed in the Blue mussel array. Grouping samples of similar genetic background for analysis reduced the number of genotypes being classified ambiguously (i.e. "Other" and "OTV"), due to distortion of clear genotype clustering that sometimes arose with multispecies sample sets. We appreciate it is not always possible to previously infer the genetic background of samples that will be analyzed using the array. However, our re- Polymorphism within each of the mussel species in the array was higher in comparison to other multi-species arrays, such as the ~49.9 K European Pine tree multispecies array, in which approximately one quarter of the converted SNPs was polymorphic in each species in the array (Perry et al., 2020). Nonetheless, in our study, only 960 polymorphic markers were shared among all four mussel species. These observations reassure that although markers are shared between either two, three or four mussel species constituting the array, this marker set is appropriate for investigating population structure and species divergence. Blue mussels, as is the case for other marine bivalve mollusc species, are known to have a highly complex and polymorphic genome (Calcino et al., 2021;Gerdol et al., 2020). Although a high proportion of markers is lost following the QC filtering, a sufficient number (~20 K) is retained for each of the four species when running QC for batches of samples with similar genetic background. Approximately a quarter of these markers are unique per species. These values reassure that the diversity existing within and among species can be captured with the Blue mussel 60 K array. The addition of 691 markers, which have been previously confirmed as species-diagnostic and relevant for sex determination and identification of cancer-positive samples (cited in Table 2), further guarantees the applicability of the genomic tool generated in this study as highly valuable for fine scale investigation of genetic structure in these taxa.
The percentage of genotypes called equivalently using both approaches was higher for M. galloprovincialis than for the other tested species. Such discrepancies might result from variance in the flanking probes, due to divergence in the genomes of these sister species, interfering in the produced signal or in the physical binding of the genetic material. Furthermore, it is very likely that two additional factors: (i) the higher number of M. galloprovincialis individual in the founding and validation populations and (ii) the alignment to a Mediterranean M. galloprovincialis genome assembly, may also be contributing to a biased comparison of genotypes in the array. Besides, the high level of missing data across the lcWGS data suggests that this sequencing method was not the optimal approach for SNP selection for this species complex.
Especially when taking into consideration the high hemizygous fraction of the mussel genome, the low coverage approach most likely contributed to incorrectly call individuals with a missing locus as a homozygous with a missing loci. Furthermore, these hemizygous regions are highly associated with the dispensable portion of the mussel genome, which, in turn, suffer massive presence and absence variation among individuals. Using a non-annotated genome assembled at a contig level did not allow the identification of these specific regions, or the selection of SNPs located in core regions. In future, when working with these highly polymorphic species, a different approach, either pool sequencing larger numbers of individuals from each population (Kranis et al., 2013), or whole genome resequencing larger numbers of individuals from each population at higher coverage, will likely yield more useful SNPs to add to the next version of this array platform. In addition, newer versions of the mussel genome are now available, assembled to a higher level. Using these genomes as references would allow the selection of SNPs associated with functional genomic regions, consequently relevant for segregating species/populations more clearly.  (Araneda et al., 2016;Fraïsse et al., 2016;Gardner, 1994;Hilbish et al., 2002;Koehn, 1991;Michalek et al., 2016;Saarman & Pogson, 2015;Väinölä & Hvilsom, 1991;Vendrami et al., 2020;Zbawicka et al., 2018). genotypes, recent evidence has suggested that 'dock mussels' (i.e. mussels inhabiting port environments) in Europe have a similar admixture pattern between M. galloprovincialis and M. edulis (Simon et al., 2020), and CH populations might be an additional example of such mussels in the north east Pacific. However, further research is required to understand the nature of such admixture patterns in these North West American populations. The Baltic population from Finland (FIN) shows a more advanced introgression pattern between two of the K clusters, likely between M. edulis and M. trossulus. While populations from Kiel and Ahrenshoop (GK, GA) in the contact zone show a mixing of introgressed M. edulis and hybrid genotypes, which is in agreement with previous literature (Stuckas et al., 2017). Along the northern Pacific coast, M. galloprovincialis is present in sheltered waters, contrasting to its preference in north east Atlantic populations where this species predominantly populates the exposed rocky tidal environments along the coast (Bierne, Borsa, et al., 2003;Hilbish et al., 2002), with the exception of commercial ports (Simon et al., 2020). This observation is potentially consistent with an inverted coupling between local adaption genes and intrinsic species barriers as previously suggested for the inverted genetic-environment relationship observed with M. edulis and M. trossulus in the eastern and western Atlantic (Bierne et al., 2011). Finally, mussel populations in Exmouth, southern England, which have previously been described as pure M. edulis (Hilbish et al., 2002), are shown to be introgressed with M. galloprovincialis, an observation supported by a recent RAD sequencing study (Vendrami et al., 2020). This can be explained by a better efficiency to detect introgression across a semi-permeable barrier to gene flow with an increased number of markers distributed along a larger portion of the genome, as discussed by Fraïsse et al. (2016).

| CON CLUS I ON S AND FUTURE PER S PEC TIVE S
We have developed the first medium density multi species blue mussel SNP panel, subsequently validating its performance on 127 individuals from 23 globally-distributed populations. The blue mus- can be applied for breeding purposes (i.e. parentage assignment, inbreeding level assessment and species/product identification and provenance), contributing to the understanding of genetic architecture of traits of interest, and leading to the optimization of blue mussel aquaculture via genomic selection. Equally, this tool can be used to deepen our understanding of population genetic processes in these taxa. Shedding light on such relevant phenomena can contribute to the development of conservation guidelines and effective management strategies of wild mussel populations as well as those exploited for aquaculture purposes.

ACK N OWLED G M ENTS
We would like to first acknowledge the extensive list of colleagues, collaborators and friends who have provided mussel samples for inclusion in this study. This includes Prof. Anne Todgham and Ms. For the purpose of open access, the author has applied a CC BY public copyright licence to any Author Accepted Manuscript version arising from this submission. Finally, we thank the two anonymous reviewers for the valuable comments that contributed to the final version of the manuscript.

CO N FLI C T O F I NTE R E S T S TATE M E NT
The authors declare no conflict of interest for this article.

DATA AVA I L A B I L I T Y S TAT E M E N T
Raw sequence reads from the blue mussel samples used for SNP discovery have been deposited in NCBI's Sequence Read Archive (SRA, https://www.ncbi.nlm.nih.gov/sra) under accession number PRJNA932792. The Blue mussel SNP-array is available for from ThermoFisher (array 551,406 Axiom MytVv1 384S384 Layout, Email: bioinformaticsservices@thermofisher.com).