Fine‐scale population structure of the northern hard clam (Mercenaria mercenaria) revealed by genome‐wide SNP markers

Abstract Aquaculture is growing rapidly worldwide, and sustainability is dependent on an understanding of current genetic variation and levels of connectivity among populations. Genetic data are essential to mitigate the genetic and ecological impacts of aquaculture on wild populations and guard against unintended human‐induced loss of intraspecific diversity in aquacultured lines. Impacts of disregarding genetics can include loss of diversity within and between populations and disruption of local adaptation patterns, which can lead to a decrease in fitness. The northern hard clam, Mercenaria mercenaria (Linnaeus, 1758), is an economically valuable aquaculture species along the North American Atlantic and Gulf coasts. Hard clams have a pelagic larval phase that allows for dispersal, but the level of genetic connectivity among geographic areas is not well understood. To better inform the establishment of site‐appropriate aquaculture brood stocks, this study used DArTseq™ genotyping by sequencing to characterize the genetic stock structure of wild clams sampled along the east coast of North America and document genetic diversity within populations. Samples were collected from 15 locations from Prince Edward Island, Canada, to South Carolina, USA. Stringent data filtering resulted in 4960 single nucleotide polymorphisms from 448 individuals. Five genetic breaks separating six genetically distinct populations were identified: Canada, Maine, Massachusetts, Mid‐Atlantic, Chesapeake Bay, and the Carolinas (F ST 0.003–0.046; p < 0.0001). This is the first study to assess population genetic structure of this economically important hard clam along a large portion of its native range with high‐resolution genomic markers, enabling identification of previously unrecognized population structure. Results of this study not only broaden insight into the factors shaping the current distribution of M. mercenaria but also reveal the genetic population dynamics of a species with a long pelagic larval dispersal period along the North American Atlantic and Gulf coasts.


| INTRODUC TI ON
Aquaculture is growing rapidly worldwide, and sustainability is dependent on understanding current genetic variation and levels of connectivity among populations. Genetic data are essential to mitigate the genetic and ecological impacts of aquaculture on wild populations and guard against unintended human-induced loss of intraspecific diversity in aquacultured lines. When wild and captive animals interbreed, the impacts on the wild populations can include loss of genetic diversity within and between populations and disruption of patterns of local adaptation via outbreeding depression leading to a decrease in fitness (Glover et al., 2017;Laikre et al., 2010;Sylvester et al., 2019;Waples et al., 2016). For aquacultured animals, an understanding of the wild population structure and levels of genetic diversity is critical to selecting animals with desired traits and supplementing brood stocks for efficient production and preventing inbreeding depression from unintended loss of genetic diversity in the domestication process (Li et al., 2017;Mather & de Bruyn, 2003;Miller et al., 2006;Purcell et al., 2015). Given these needs for genetic information and the decreased cost of next-generation sequencing, it is "inexcusable not to know" the genetic makeup of lines and families used for aquaculture and restoration whenever possible to mitigate adverse effects (Ryman et al., 1995).
Bivalve mollusks have life history strategies that are unique compared to other marine species, making molecular studies of their population structure more challenging. These traits include a high inbreeding load, segregation distortion (Plough, 2016), and higher levels of polymorphism than most other animals (Curole & Hedgecock, 2005;Hedgecock et al., 2005;Reece et al., 2004).
Single nucleotide polymorphisms (SNPs) are estimated to occur every 30-50 base pairs throughout the genomes of mollusks (Curole & Hedgecock, 2005;Garvin et al., 2010;Hedgecock et al., 2005;Reece et al., 2004) compared to one every 116 bp in channel catfish (Sun et al., 2014) and one every 137 bp in the European hake (Milano et al., 2011). These high levels of polymorphism have made using more traditional primer-targeted markers like microsatellites challenging for genetic studies of mollusks due to the high incidence of null alleles caused by variation in primer-binding sites.
Advancements in high-throughput genotyping by sequencing (GBS) and bioinformatics have enabled the rapid identification of polymorphic loci without reliance on specific sets of primers, making it possible to identify thousands of SNPs that can be screened to develop genetic resources for aquacultured bivalve species (Davey et al., 2011;Elshire et al., 2011).
The northern hard clam or northern quahog, Mercenaria mercenaria (Linnaeus, 1758), hereafter hard clam, is a marine bivalve inhabiting soft substrates of mud, sand, and shell fragments in coastal lagoons and estuaries from the Gulf of St. Lawrence, Canada, through the northern Gulf of Mexico (Mackenzie et al., 2002).
The hard clam has been the target of a wild harvest fishery along portions of its native range and has become an economically important aquaculture species in the Mid-Atlantic and southeastern USA (Whetstone et al., 2005). Globally, the value of aquacultured hard clam production was estimated to be $78 million in 2017 (FAO, 2018). Virginia is the largest producer of aquacultured hard clams in the USA, with farm sales estimated at $58 million in 2021 (Virginia Marine Resources Commission fisheries statistics), making it the most economically valuable aquaculture species in the US Mid-Atlantic region.
The life history of the hard clam is similar to that of other marine bivalves, having a pelagic larval stage, a benthic adult stage, high fecundity, and a relatively short time to maturity (Carriker, 2001;Eversole, 2001), characteristics that can shape genetic population connectivity. Hard clams can grow to a maximum shell length (anterior-posterior parallel to the hinge) of about 120 mm and, using sclerochronological analysis, have been aged up to 106 years (Ridgway et al., 2011;Roegner & Mann, 1990). They are protandrous consecutive hermaphrodites with separate sexes and no sexual dimorphism, establishing their definitive sex by either transitioning to female or remaining male (Eversole, 2001;Loosanoff, 1936;Roegner & Mann, 1990). Hard clams have high fecundity and broadcast spawn from spring through fall depending on latitude, with timing and the length of the spawning season affected by temperature and food availability (Eversole, 2001;Roegner & Mann, 1990). The pelagic larval duration (PLD) is temperature dependent, lasting 18-24 days in 18°C water and 7-14 days in 30°C water, until the clams reach the proper size and development for metamorphosis (Mackenzie et al., 2002). Juveniles and adults are considered to have a sedentary life in the benthos due to their limited post-metamorphosis mobility, so the interactions among PLD, hydrodynamics, geography, and other physical factors are thought to shape genetic connectivity (Cowen & Sponaugle, 2009).
The population genetics of wild hard clams have been investigated in two previous molecular studies. Dillon and Manzi (1992) examined isozymes in clams at the northern limit of their range and resolved three regional populations: (1) Prince Edward Island (PI), Canada, (2) Maine, and (3) Massachusetts, USA. Baker et al. (2008) used the mitochondrial cytochrome oxidase subunit I (COI) gene to examine population structure from PI, Canada, to Cedar Key, Florida, with a sampling focus within Florida. They found three genetically distinct populations separated by a strong genetic break at Cape Hatteras, North Carolina, and a weaker genetic break at Cape Cod, Massachusetts (Baker et al., 2008). Although isozymes and allozymes are well suited for population studies and the mitochondrial COI gene is useful for understanding the phylogeographic history of a species (Allendorf, 2017), both types of markers have limited capabilities for fine-scale resolution of population structure.
The objective of this study was to determine the population structure and genetic diversity of wild hard clams along a large portion of their native range using thousands of SNPs identified through genotyping by sequencing (GBS). The fine-scale population structure delineated by this study allows a better understanding of the mechanisms involved in maintaining population structure for marine bivalves along the North American east coast. Additionally, it provides baseline measures of standing genetic variation in wild populations that may be valuable for monitoring and quantifying changes in genetic diversity and effective population size in the face of warming coastal waters and proliferation of genetically focused aquaculture stocks.

| Sample collection and DNA isolation
Hard clam samples (n = 452 total) were collected from wild popu-

| Next-generation sequencing
To identify SNP loci, isolated DNA was sent to Diversity Arrays Technology (DArT Pty Ltd) at concentrations between 50 and 100 ng/μL for high-throughput GBS using DArTseq™ Kilian et al., 2012;Sansaloni et al., 2011). Approximately 30 individuals from each of the 15 sampling locations were chosen for DArTseq™ to avoid ascertainment bias and to accurately capture the genetic variation at each sampling location. The DArTseq™ method includes a restriction enzyme digestion step to reduce genome complexity followed by hybridization to microarrays and massively parallel sequencing followed by SNP identification (Sansaloni et al., 2011).

| Data filtering
Sequenced SNP data were filtered to minimize genotyping errors and missing data using the R software (R Core Team, 2018) package dartR v1.3.4   (Table S1). Loci with less than 98% reproducibility were removed . To address the proportion of missing data, call rate was set at <98% for loci and at <95% for individuals. Data were first filtered for call rate by locus to retain as many individuals as possible in the analysis. Data were TA B L E 1 List of hard clam collection sites with number of individual clams that were collected from each location and genetic diversity statistics by sampling location; expected heterozygosity (H e ), observed heterozygosity (H o ), and inbreeding coefficient (G IS ).

Location State
Loc. ID

Avg. height
Year ( also filtered to remove loci based on Hamming distance (<20%) (Hamming, 1950;Pilcher et al., 2008). When more than one SNP was found within the same 69 base pair sequence fragment (i.e., a secondary SNP), one was retained at random to ensure that loci were independent. Loci with a minor allele frequency (MAF) of less than 1% were removed and data were also filtered for read depth with thresholds set to remove loci with less than 5× coverage depth or more than 75× coverage to minimize the likelihood of errors in the data.
After SNP data were filtered in dartR, additional filtering was applied to identify loci that deviated from the expectations of Hardy-Weinberg equilibrium (HWE) in multiple sampling locations, likely because of genotyping errors and other technical artifacts, and to compare individual levels of average heterozygosity across loci to look for evidence of cross-contamination in the R package radiator v1.1.4 (Gosselin et al., 2020). Loci that did not conform to the expectations of HWE in at least four sampling locations using a p-value threshold of 0.01 were removed from the dataset.
The default parameters were used for OutFLANK. For pcadapt, the default threshold of minor allele frequency (MAF) was used, and the false discovery rate (FDR) method was implemented, with loci having q-values less than the FDR of α set to 10% considered outliers. The outliers identified by both OutFLANK and pcadapt were compared and the subset of loci identified by both methods was retained.
Therefore, along with the "full dataset," two additional datasets were created; the "outlier dataset" included only the loci identified using both methods, and the "neutral dataset," which excluded those loci included in the outlier dataset. All three datasets were analyzed as described below. BLAST searches (Altschul et al., 1990) were conducted for the outlier dataset using MegaBlast on the NCBI website to optimize for highly similar sequences.

| Data analysis: Genetic diversity and population structure based on all datasets
The filtered datasets were used to estimate levels of genetic diversity within collections from different geographic locations and the level of divergence among collections. Observed heterozygosity (H o ) and expected heterozygosity (gene diversity, H e ) were calculated in the R package adegenet v2.1.2 (Jombart, 2008;Jombart & Ahmed, 2011), and the coefficient of inbreeding (G IS , Nei, 1987) was calculated in Genodive v3.03 (Meirmans, 2020). The level of genetic differentiation among sample collections was assessed using unbiased F-statistics (F ST , Weir & Cockerham, 1984). Pairwise significance was based on 10,000 iterations of the data.
Several alternative methods with differing underlying assumptions were used to delineate population structure. A principal component analysis (PCA) was used to visually explore the population structure and the level of genetic differentiation among clams within and between each sampling location using adegenet, which calculates Euclidean genetic distances between samples (Jombart et al., 2009).
Samples were grouped into genetically distinct populations using a Bayesian model-based clustering method implemented in Structure v2.3.4 (Pritchard et al., 2000). Cluster analysis was conducted on the complete filtered dataset and, due to the uniqueness of the Prince Edward Island samples, on a subset of the data excluding the Prince Edward Island samples (hereafter the "no PI" subset) to resolve finerscale (more nested) population structure obscured by their inclusion (Janes et al., 2017). Both datasets were run with the admixture model and sampling locations were used as prior information, the latter of which can help with identifying structure when levels of divergence are lower (Hubisz et al., 2009). Allele frequencies were allowed to be correlated between locations to assist in identification of closely related populations (Falush et al., 2003). Each dataset was run with 10 replicates per K value, K being the inferred number of ancestral lineages: K = 1-7 for the full dataset and K = 3-6 for the no PI subset. Initially, the range of K values tested for each dataset corresponded to the number of sampling locations included. The most supported K values from those runs informed the final range of K values tested. Each replicate had a burn-in of 500,000 Markov Chain Monte Carlo (MCMC) iterations followed by 500,000 MCMC iterations. Results were compiled in Structure Harvester (Earl & von-Holdt, 2012) to identify the most well-supported K value based on the Evanno method (Evanno et al., 2005). Replicates were combined and visualized using the R package pophelper v2.3.0 (Francis, 2017).
A discriminant analysis of principal components (DAPC) (Jombart et al., 2010), which does not assume HWE, was also used to assess the most likely number of clusters comprised by the data in adegenet for both of the datasets described above. The optimal number of PCs retained for de novo cluster identification and DAPC analysis was determined using the function xvalDapc. The function find. clusters was first run to assess increasing numbers of clusters (K = 1 to K = 15) to identify de novo the optimal number of clusters based on the Bayesian information criterion (BIC) value, and individuals were assigned to these clusters with the DAPC.
To further address partitioning of sampling locations into different genetic populations, analysis of molecular variance (AMOVA, Excoffier et al., 1992;Michalakis & Excoffier, 1996) (Table 2). Significance was based on 10,000 iterations of the data.

| Data analysis: Diversity and divergence of genetic populations
Once population structure had been delineated, the levels of genetic diversity within populations and the level of divergence among populations including H o , H e , G IS , and F ST were calculated as above.
Pairwise significance of F ST values was based on 10,000 iterations of the data. The effective population size (N e ) of each genetic population was determined to assess the impacts of genetic drift and inbreeding on the populations using NeEstimator v2.1 (Do et al., 2014).
The linkage disequilibrium (LD) model was used with random mating and assessed with 95% confidence intervals of three critical values (allele frequencies); 0.01, 0.02, and 0.05.
To determine if the data were consistent with a pattern of isolation by distance (IBD), a Mantel test (Mantel, 1967) of the correlation between genetic and geographic distance matrices was conducted in adegenet. IBD can be confounded by population structure and spatial autocorrelation (Meirmans, 2012), so IBD was assessed in the Mid-Atlantic region between the genetic breaks previously identified at Cape Cod, MA, and Cape Hatteras, NC (i.e., for samples collected between New York and Virginia) (Baker et al., 2008). Nei's genetic distance (Nei, 1972(Nei, , 1978 between populations was calculated in adegenet. Geographic distance was measured in Google Maps with the shortest overwater distance between locations from best-known coordinates of sampling locations. The significance of the correlation was assessed with 50,000 permutations of the data.

| Filtering and construction of datasets
DArTseq™ of the complete sample set (n = 452, Pairwise F ST values ranged from 0 to 0.047 (Table 3) Figure S1). DAPC analysis of all loci indicated that the optimal number of PCs to retain was 50, and the BIC indicated that the optimal number of clusters was K = 5 ( Figure 2c, Figure S2). In the DAPC that included all sampling locations, the groupings identified were the same as those identified using STRUCTURE (Figure 2b,c). Exclusion of PI from analysis did not result in increased resolution ( Figure S3). Hierarchical

TA B L E 3 Pairwise
AMOVAs indicated a significant difference among sampling locations within groups (F SC ) until six groups were tested ( Table 2). The optimal grouping was consistent with the groupings delineated by the individual-based clustering methods (F CT = 0.013, p < 0.0001) (    Figure S4).

| Genetic structure and divergence-Outlier loci
BLAST searches on the 95 outlier loci found 29 significant matches, 15 of which were uncharacterized transcripts, however, proteins related to those involved in stress response, immune function, protein processing, and regulation, such as heat shock, zinc finger-like, complement C3, and calmodulin proteins, were identified (Table S2).
A PCA of the dataset comprised of the outlier loci showed 10.96% variance explained by PC 1, 8.02% by PC 2, and 5.75% by PC 3 (Figure 3a). The clustering pattern was comparable to that observed in the PCA that included all loci, with more distinct separation and a higher percentage of the variance explained. DAPC analysis of outlier loci indicated that 70 PCs were the most optimal to retain for analysis and K = 8 was the most optimal number of clusters based on comparison of BIC values ( Figure S5). When sampling locations were grouped based on similarity of the proportions of different clusters, they were comparable to the results for the total dataset except that a more distinct break was observed between samples from the ocean side of Virginia (WP) and those located within the Chesapeake Bay (JR, MB, and PS) (Figure 3b).

| Population structure
This study employed a high-resolution set of SNP markers to describe the population structure of the hard clam, Mercenaria mercenaria, across a large portion of its native range. Samples from 15 locations from PI to SC resulted in delineation of 6 genetically distinct populations based on analyses with 4960 SNP loci. The five genetic breaks identified by the whole dataset were also recovered by the outlier dataset, increasing the resolution of previous studies based on fewer genetic markers and sampling locations (Baker et al., 2008;Dillon & Manzi, 1992).
It is common for species that occupy bays or estuaries to have abrupt genetic discontinuities (Miller et al., 2006), likely due to shifting ocean currents and landmasses acting as barriers to gene flow.
The geographic distribution of the hard clam populations identified by this study largely corresponds to the marine biogeographic provinces that shape the western North Atlantic (Briggs & Bowen, 2012).
The hard clam populations that were resolved along the US east coast in this study were separated by genetic barriers identified between (1) Maine and Massachusetts, (2) Cape Cod, (3) between the bayside and oceanside of the Delmarva Peninsula, and (4) Cape Hatteras. The observed genetic discontinuities were most pronounced when comparing samples across Cape Hatteras and Cape Cod-two barriers known to disrupt gene flow in other marine species (Mach et al., 2011). These capes define the three biogeographic provinces that are characterized by their temperature, salinity, and seasonality: the Acadian, the Virginian, and the Carolinian (Hale, 2010;Mach et al., 2011). The Acadian province north of Cape Cod has water that is cold and stable year-round, influenced by the Labrador Current and the Gulf of Maine. The Virginian province has higher, seasonally variable water temperatures, and to the south of Cape Hatteras, the Carolinian province is characterized by the warmer-temperate waters that are carried north by the Gulf Stream. In some cases, outlier loci are suspected to be adaptive under divergent natural selection and resolve finer-scale and unique population structure as compared to neutral markers (Batista et al., 2016;Hargrove et al., 2015). However, observed differences could also be an artifact of demographic processes including allele surfing during range expansions, which can result in a rapid drift of some alleles at the leading edge of the expansion (reviewed in Hoban et al., 2016;Travis et al., 2007). In the current study, there was an order-ofmagnitude difference in divergence values between neutral and outlier loci, but the patterns of structure between the two datasets were concordant in that both recovered six major genetic populations. The congruence between the two datasets and the life history characteristics of the hard clam suggests that neutral processes are more likely responsible for the observed differences between the two datasets.

| Fine-scale genetic structure
To address genetic structure in the Mid-Atlantic region, which lacked representation in previous studies (e.g., Baker et al., 2008), can survive in salinities as low as 12.5 (Castagna & Chanley, 1973;Castagna & Kraeuter, 1981;Mulholland, 1984). Hard clams originating from the bayside and the oceanside of the Virginian Eastern Shore also show regional differences in growth, with those in the estuary growing more slowly. A meta-analysis based on the mitochondrial COI gene that included the data for M. mercenaria from Dillon and Manzi (1987) and Baker et al. (2008) (Altman et al., 2013).
In marine systems, outlier loci have been shown to better differentiate fine-scale genomic patterns and provide greater resolution for population divergence than neutral loci when population structure is weak (Hohenlohe et al., 2018). Analysis with the outlier dataset showed a comparatively small but significant difference (F ST values) between samples from the two Massachusetts locations, as well as between samples from locations in the Mid-Atlantic that were not observed with the dataset that included all loci (Table S3) including between Wachapreague (WP) and several other locations (RB, GB, and AC), and between OH and GB. This subtle difference between the two datasets could be attributed to the outlier loci being adaptive, as discussed above. BLAST searches indicated that some outlier loci identified in this study had matches to proteins involved in stress response, immune function, protein processing, and regulation, indicating that they might be involved in responses to environmental stressors. However, the additional structure suggested by the outlier dataset could also be caused by random stochastic processes including human movement of clam stocks and sweepstakes reproduction. Additional studies, including comparative transcriptomic studies and population genetic studies that include environmental data, will be needed to definitively resolve whether selection is playing a role in the maintenance of population structure in the hard clam.

| Isolation by distance
For sampling locations between Cape Cod and Cape Hatteras, including the Chesapeake Bay and Mid-Atlantic populations, there was a significant correlation between genetic distance and geographic distance. This pattern of IBD could indicate that the sampling locations are connected through larval dispersal, likely from north to south via the Virginia coastal current (Wares, 2002). While the hard clam has a long pelagic larval phase that could theoretically facilitate connectivity between these locations, the finding of a pattern of IBD in this region, as opposed to a panmictic population, indicates a shorter realized dispersal and connectivity among hard clams in close geographic proximity. Patterns of IBD have been identified in other bivalves in the region, including C. virginica within the Chesapeake Bay (Rose et al., 2006). The relative genetic isolation of species within western North Atlantic coastal bays has also been noted for bay scallops (Bert et al., 2011;Wilbur et al., 2005), and as a constraint to recolonization of seagrasses following catastrophic loss (Orth et al., 2020).

| Levels of genetic divergence and genetic diversity
The pairwise F ST levels among hard clam populations in this study were low overall but revealed statistically significant levels of divergence, a pattern that has been observed in other invertebrates with a pelagic larval phase whose population structure was examined using genotyping by sequencing. were similar to those found in the Eastern oyster, higher than those found in the American lobster or giant Californian sea cucumber, and much lower than those found with the black-lip pearl oyster (Benestan et al., 2015;Bernatchez et al., 2019;Lal et al., 2017;Xuereb et al., 2018). inbreeding values that appear as observed heterozygote deficiencies are common in marine bivalves (Bernatchez et al., 2019;Lal et al., 2017;Mallet et al., 1985;Reece et al., 2004). In highly fecund marine species, heterozygote deficiencies are often attributed to a large variance in reproductive success among individuals from different age classes, resulting in population substructure among age classes (Bernatchez et al., 2019;Hedgecock & Pudovkin, 2011). This is a possible contributor to the heterozygote deficiency and positive inbreeding observed in this study that cannot be addressed due to the lack of age data for individual clams. This pattern of higher expected than observed heterozygosity was also seen in wild hard clams in Florida using seven microsatellite markers (Hargrove et al., 2015). Although the current study found positive inbreeding values in all locations, inbreeding levels were generally lower than what has been reported in other bivalve studies using SNPs

| CON CLUS ION
This study was the first to use GBS to assess the population structure and genetic diversity of hard clams along a large portion of their native range. Six genetically distinct populations of hard clams were identified: Canada, Maine, Massachusetts, Mid-Atlantic, Chesapeake Bay, and the Carolinas. Results support the findings of previous genetic studies, which not only identified breaks at Cape Cod and Cape Hatteras (Baker et al., 2008;Dillon & Manzi, 1992) but also identified additional barriers to gene flow between Canada and Maine, and between Maine and Massachusetts. This study also documented fine-scale geographic differentiation between bayside and oceanside Virginia populations, as well as a pattern of isolation by distance in the Mid-Atlantic region. With the identification of 4960 novel SNPs markers across all sampling areas, this study creates a genetic baseline for monitoring natural hard clam populations along a large portion of their native range.
These markers also provide a baseline for future research on this economically important species to support fisheries and aquaculture. Highly informative marker subsets can be identified for assay panels that can be used for many applications including stock assignment of individuals from unknown populations. These panels also could be used with aquacultured clams to assess heterogeneity and genetic ancestry, track parentage, monitor genetic diversity, and identify desirable brood-stock source populations. In addition, how genotypes and genetic populations correlate with environmental data can be examined. Physiological experiments paired with transcriptomic analysis could provide a better understanding of differences in salinity, hypoxia, and temperature tolerances among hard clam genetic populations. Identification of genes associated with traits of interest for selective breeding will benefit the aquaculture industry.

ACK N OWLED G M ENTS
We thank several people who contributed to our sample collections

CO N FLI C T O F I NTER E S T S TATEM ENT
The authors have no conflict of interest to declare.

DATA AVA I L A B I L I T Y S TAT E M E N T
The data that support the findings of this study are openly available in the Dryad Digital Repository at https://datad ryad.org, reference number DOI:10.5061/dryad.kprr4 xh9f.