Can we rely on selected genetic markers for population identification? Evidence from coastal Atlantic cod

Abstract The use of genetic markers under putative selection in population studies carries the potential for erroneous identification of populations and misassignment of individuals to population of origin. Selected markers are nevertheless attractive, especially in marine organisms that are characterized by weak population structure at neutral loci. Highly fecund species may tolerate the cost of strong selective mortality during early life stages, potentially leading to a shift in offspring genotypes away from the parental proportions. In Atlantic cod, recent genetic studies have uncovered different genotype clusters apparently representing phenotypically cryptic populations that coexist in coastal waters. Here, we tested if a high‐graded SNP panel specifically designed to classify individual cod to population of origin may be unreliable because of natural selection acting on the SNPs or their linked background. Temporal samples of cod were collected from two fjords, starting at the earliest life stage (pelagic eggs) and carried on until late autumn (bottom‐settled juveniles), covering the period during summer of high natural mortality. Despite the potential for selective mortality during the study period, we found no evidence for selection, as both cod types occurred throughout the season, already in the earliest egg samples, and there was no evidence for a shift during the season in the proportions of one or the other type. We conclude that high‐graded marker panels under putative natural selection represent a valid and useful tool for identifying biological population structure in this highly fecund species and presumably in others.


| INTRODUC TI ON
In order to increase statistical power to resolve weak population genetic structure, a select panel of loci with higher than average level of genetic differentiation is often employed Banks, Eichert, & Olsen, 2003;Henriques et al., 2018;Johansen et al., 2018;Jorde, Kleiven, et al., 2018;Larson, Seeb, Pascal, Templin, & Seeb, 2014;Nielsen et al., 2012;Russello, Kirk, Frazer, & Askey, 2012). Such a high-graded panel is likely to include loci under divergent selection, raising concerns over their reliability as a tool for inferring demographic population structure (Luikart, England, Tallmon, Jordan, & Taberlet, 2003;Nielsen, Hansen, & Meldrop, 2006). Selected loci may nevertheless be excellent tools for the more restricted purpose of discriminating populations (Bekkevold et al., 2015;Lamichhaneya et al., 2012;Milano et al., 2014;Teacher, André, Jonsson, & Merilä, 2013) and for assigning individuals to population of origin (Banks et al., 2003;Freamo, O'Reilly, Berg, Lien, & Boulding, 2011;Helyar et al., 2011;Kavakiotis, Samaras, Triantafyllidis, & Vlahavas, 2017;Nielsen et al., 2012;Wilkinson et al., 2011). Challenges arise when selection on the markers is strong enough for environmental differences to override population demography on allele frequency dynamics. Individuals and genotypes sampled after an episode of selective mortality may poorly represent the parental generation and could lead to false impressions of population structuring. Such a scenario is illustrated in Figure 1, depicting the outcome of hypothetical selective mortality on genotype composition following transport of juveniles to different nursery areas. Upon sampling and genetic screenings of samples from the nursery areas, the results indicate genetically distinct groups that may be mistaken for separate biological populations, which they are not. While strong selection acting on a single or small number of marker loci is unlikely to have a great overall effect on a large panel of markers, the situation is different when using a small set specifically chosen for their high levels of divergence. This could be a problem especially when the true population structure is weak, absent or even moderate, as selection may generate patterns of genetic structure that trace environmental drivers rather than population processes (Lamichhaneya et al., 2012;Nielsen et al., 2006).
Strong selection in the form of non-random survival of genotypes is not unreasonable in organisms that combine extremely high fecundity with widespread dispersal of offspring into a diverse range of environments. High fecundity implies a high reproductive F I G U R E 1 Hypothetical scenario of a breeding population distributing juveniles (e.g., seeds or larvae) to two nursery areas that differ in environmental conditions and thus in selective mortality. Selection is assumed to favor individuals that are homozygote in three particular loci (identified as orange dots) and in one nursery area (area 2) but not in the other (area 1). Below is a Structure plot of samples from the two hypothetical nursery areas. See Supporting Information for details N u r s e r y a r e a 1 ( u n s e l e c t e d ) N u r s e r y a r e a 2 ( s e l e c t e d )
To maintain population size, this excess must be balanced by high mortality, usually at early life stages. Thus, there is a potential for selective mortality in the offspring and the tiny fraction of individuals that survive and end up being sampled for genetic analyses may then poorly represent the parent population. While most mortality is likely to be unrelated to the individual's genotype and thus non-selective, even when, say, 99.9% of deaths are unrelated to genotype, there remains a reproductive excess on the order of 1,000 to cover the cost of natural selection if the excess was a million to begin with.
Many highly fecund species also have a highly dispersive early life stage (e.g., seed plants [Nathan & Muller-Landau, 2000], marine invertebrates [Grantham, Eckert, & Shanks, 2003], and fishes [Cowen & Sponaugle, 2009]), and offspring may end up in environments their parents were not adapted to. Temporal fluctuations in environmental conditions could also contribute to create a mismatch between parental adaptation and optimal offspring genotypes, creating an option for selective mortality in offspring.
The use of high-graded markers is particularly attractive for marine organisms because population structure is typically weak within oceans (Hauser & Carvalho, 2008;Waples, 1998;Ward, Woodwark, & Skibinski, 1994). However, many marine species represent precisely the pattern of high fecundity and widespread dispersal followed by massive juvenile mortality that could cause problems for some genetic markers to provide reliable information on biological population structure and for correctly assigning individuals to population of origin. Here, we explore these issues empirically, using a panel of 27 SNP markers that were specifically developed for assigning Atlantic cod (Gadus morhua) along the south coast of Norway to population of origin, that is, to putative "North Sea" or "fjord" populations (Jorde, Kleiven, et al., 2018;Knutsen et al., 2018). We tested the hypothesis that such F I G U R E 2 (a) Map of study area with sample locations for eggs and juveniles. Blue dots indicate position of reference samples in the North Sea (NSn and NSs) and within three fjords (KRS: Kristiansand; LI: Lillesand; RI: Risør). Black arrows indicate the dominant ocean currents (simplified from Danielssen et al., 1997). Insets: details of sampled fjords with sample locations (numbered yellow dots): (b) Topdalsfjord; (c) Tvedestrandsfjord assignments were driven by selective mortality during the early life stages by monitoring genotype composition in eggs and juveniles over the time period (early spring to autumn) with highest natural mortality. The potential for selection on polymorphic loci in this highly fecund species lies in the extensive drift of pelagic eggs and larvae with ocean current and in the potentially contrasting environments where they settle and grow up. The alternative hypothesis is that genetic clustering and assignments of coastal cod is not unduly affected by ongoing selection on the SNP markers.

| The study species and experimental setting
The Skagerrak is an extension of the North Sea, situated between Denmark, Sweden, and southern Norway, bordering Kattegat ( Figure 2). Spawning of Atlantic cod occurs in the North Sea, in the Kattegat, and in Skagerrak coastal waters during early spring (February to early April). The Atlantic cod is a highly fecund species, the female producing approximately half a million eggs per kg body weight (Kjesbu, 1989;May, 1967;Oosthuizen & Daan, 1974).
Spawning products (eggs and larvae) are pelagic and subject to transport with ocean currents (Munk, Larsson, Danielsen, & Moksness, 1995), which in the Skagerrak form a counter-clockwise path from the North Sea along the Skagerrak coast ( Figure 2). Thus, spawning products from the North Sea can and do reach the Skagerrak coast (Knutsen et al., 2004;Spies et al., 2018;Stenseth et al., 2006), and cod from the outer coastal areas in the Skagerrak appears to be genetically similar to or identical with North Sea cod (André et al., 2016;Barth et al., 2017;Jorde, Kleiven, et al., 2018;Knutsen et al., 2011;Sodeland et al, 2016). Eggs hatch after three to four weeks (von Westernhagen, 1970) and the larvae remain pelagic until early summer when they descend to the bottom and are referred to as 0-group. Mortality rates during early life stages of cod have been estimated to approximately 10.9% per day at the early larval stage, declining to 2.2% per day for larger larvae, and considerably lower than this for post-settled 0-group cod (Sundby, Bjørke, Soldal, & Olsen, 1989).
Genetic studies of 0-group and older cod along the Norwegian Skagerrak coast have found genetic differences mainly between inner fjords and outer skerries (Knutsen et al., 2011;Øresland & André, 2008). This spatial pattern of genetic variability has been attributed to the existence in the Skagerrak of genetically distinct forms or putative ecotypes of cod (Barth et al., 2017), co-occurring in coastal waters Sodeland et al., 2016). Based on a panel of >9,000 SNPs, Jorde, Kleiven, et al. (2018) developed a small panel of 27 SNPs for cost-efficient assignment of coastal cod from Skagerrak into two ecotypes, referred to as "fjord cod" and "North Sea cod", respectively. The panel was developed by ranking loci according to levels of genetic divergence (Nei's G ST ) in their study area, which broadly overlapped the present one, while avoiding closely linked (composite linkage disequilibrium, CLD > 0.5) loci.
Thus, the 27 SNP panel represents a high-graded subset of genetic markers specifically developed to provide high levels of divergence among cod in the present study area.

| Study areas
The present study areas include two nearby fjords on the Norwegian Skagerrak coast, the Topdalsfjord and Tvedestrandsfjord ( patterns in this fjord indicate that pelagic eggs and larvae on average tend to experience an inward transport by estuarine circulations and thus become retained within the inner fjord basins (Ciannelli et al., 2010;Knutsen et al., 2007). Tvedestrandsfjord has recently been protected as a marine protected area (MPA), including a no-take zone, and fishing mortality during the present study is expected to be negligible.

| Sampling
Cod eggs were sampled during the spawning season from February to late March 2015, once in Topdalsfjord and five times in Tvedestrandsfjord. Six sampling sites or "stations" were arranged in the form of transects from the innermost to the outer part of the fjords (Figure 2b,c). Eggs were sampled with a WP2 planktonic net (Fraser, 1968) with 60 cm diameter and 500 µm mesh size. The net was hauled vertically from 30 m depth to the surface at a speed of 0.5 m/s. Eggs were identified and determined to species according to size and pigmentation (Hiemstra, 1962). Cod eggs were considered to be 1.2 mm to 1.5 mm in diameter (Thompson & Riley, 1981).
Eggs were stored in 96% ethanol at −22°C until DNA extraction.
Sampling of young-of-the-year juveniles (0-group) was done first in early summer (June), then once again later in autumn (September and October) in both fjords, using a standardized protocol for the an- Sampling was done at five different locations within the inner parts of the fjord (approximately stations 1 through 4: Figure 2b) over three days of fishing. Sampled cod were sacrificed, measured, and sexed by visual examination of gonads. A piece of the dorsal fin was saved for genetic analysis and was stored in 96% ethanol at −22°C until DNA extraction.

| Reference samples
As genetic references for cod in the study area we used two previously sampled and genotyped sets of individuals from the Norwegian Skagerrak coast and from the North Sea, respectively (Jorde, Kleiven, et al., 2018)

| Genotyping
A total of 333 cod eggs, 100 young-of-the-year juvenile cod, and 52 adult cod were genotyped for the present study (Table 1). Genotyping of the 27 SNPs was carried out on a Sequenom MassARRAY platform at the Centre for Integrative Genetics, Norway (https://cigene. no). We dismissed individuals with 10 or more missing genotypes as having poor DNA quality, resulting in 76 individuals (70 eggs, 6 juveniles, 0 spawners) being removed from further analyses, which were based on the remaining 409 individuals (Table 1). We consistently got genotypes only from 25 of the 27 SNPs, with two SNPs (ss1712301578 and ss1712299621: www.ncbi.nlm.nih.gov/SNP/) often failing, and all statistical analyses were therefore limited to 25 SNPs.

| Statistical analyses
Correlations of alleles within individuals relative to the sample (F IS ) and among samples relative to the total (F ST ) were calculated according to Weir and Cockerham (1984), separately for each SNP and as averages over loci, using the Genepop software (v. 4.2.1: Rousset, 2008 Note. For each sample are given date of sampling, life stage sampled, sample sizes (n 1 = total number of genotyped individuals; n 2 = number of those that were successfully genotyped, i.e., with <10 genotypes missing), average F IS over 25 loci (NA = not calculated due to low sample size; asterisks indicate significance at the 5% level with Genepop probability test), and numbers assigned by Geneclass2 to the "North Sea" (NS) and "fjord" types. χ 2 refers to the contingency chi-square test for homogeneity of proportions assigned to the two types at different sample times and life stages.
TA B L E 1 The target samples from the Topdalsfjord and Tvedestrandsfjord test. Individuals were clustered on the basis of their multilocus genotypes using Structure (v. 2.3.4: Pritchard, Stephens, & Donnelly, 2000) with the correlated allele frequencies model (Falush, Stephens, & Pritchard, 2003). For each predefined number (K = 1 to 5) of clusters, Structure was run with 1 million MCMC iterations following 1 million burnins. The distribution of ln prob(data|K) was evaluated for assessing the most likely number K. Individual Q-values (i.e., the estimated membership coefficients for each individual) were plotted graphically with Distruct (Rosenberg, 2004). Geneclass2 (v.2.0.g: Piry et al., 2004) was used to assign individuals to the aforementioned two reference samples, employing the Bayesian method of Rannala and Mountain (1997).
We used individual cluster memberships, as assigned by  (Table 1), resulting in three temporal samples from this fjord.
In addition, a sample of adult spawners was available for comparison from the inner part of the Topdalsfjord. For Tvedestrandsfjord, heterogeneity in proportions of the two genotype clusters was tested in five temporal replicate samples for eggs (February 20 to March 24) and two temporal replicates for juveniles (June and October), for a total of seven temporal samples (Table 1).
To test for difference among temporal samples in proportions of individuals assigned to each genetic cluster, we used standard chi-square heterogeneity tests and regression analyses. We chose logistic regression with Geneclass2 score as response variable and date of sampling and position of sampling site in the fjord as explanatory variables. The model is logistic because score is a binary variable (1 = individual belong to the North Sea cluster, 0 = individual belong to the fjord cluster) and we used regression because the two explanatory variables are ordinal, and regression is then statistically more powerful than alternative approaches that ignore this information (Agresti, 2013, p. 87). The first explanatory variable was day of sampling, counted as the number of days after the first sampling date, and was taken to represent the time of exposure to the fjord environment. Clearly, this is not exactly so, as eggs may have been spawned at different dates, but these differences should be relatively minor (a few weeks) considering the total time-span of the study (eight months). The second explanatory variable was sampling position in the fjord (Figure 2: 1 = inner part of fjord, 6 = outer part), which was assumed to represent any of a number of environmental gradients running from the inner to the outer part of the fjords.
These gradients could reflect differences in temperature, salinity, oxygen level, prey availability and species composition, parasite prevalence, and so on (cf. Schulze, 2006) that might induce selective mortality on genotypes. The two fjords were analyzed separately, and spawning fish (Topdalsfjord) were not included in the regression analysis, which was based on the following logistic model: where the response variable (s) is the Geneclass2 score and explanatory variables (x and y) are sampling date and station number, respectively, and i index individuals. The model parameters (b and c) were estimated and tested for significance with the glm function in the R statistical environment (R Core Team, 2016).

| RE SULTS
A total of 409 individuals, representing adults, eggs and juveniles, were genotyped successfully, in the sense that >15 SNPs produced a valid genotype (i.e., <10 SNPs failed). Eggs typically had more missing genotypes than did juveniles and adults, and the number of missing genotypes was greater for eggs with low DNA concentration (Supplementary Information Figure S1). The few eggs that were obtained at the first sampling event, on February 20 in Tvedestrandsfjord, all had very low DNA concentration, presumably reflecting recent spawning (Espeland & Sannaes, 2018). The distribution of egg DNA concentration, and hence age distribution, in Tvedestrandsfjord, was much wider already at the next sampling event a week later (February 27), and by early March tended to be wider than seen in Topdalsfjord at the same date (cf. Supplementary Information Figure S1).
Most SNPs displayed a deficiency of heterozygotes in the pooled sample (n = 409), with positive F IS estimates at 21 out of 25 SNPs (Figure 3). For ten of the SNPs deviation from Hardy-Weinberg (HW) genotype proportions were significant at the 5% level in Tvedestrandsfjord, while three SNPs deviated significantly in Topdalsfjord, two of them in common between fjords. Deficiencies of heterozygotes were also evident from positive average F IS estimates in seven out of ten temporal samples from within fjords, two of the ten samples reaching significance at the 5% level (Table 1).
The deviations from HW within loci appeared to be linked to the locus' level of genetic diversity in this geographic region, as singlelocus F IS estimates correlated significantly with levels of divergence (F ST ) between the North Sea and fjord reference samples (r = 0.578, p = 0.0017: Figure 3). The average F ST over the 25 SNPs was 0.174 between the fjord and North Sea reference samples and ranged among SNPs from 0.059 to 0.414.

| Number of clusters
Results from Structure software were consistent with the existence of two genetic clusters or populations of cod in the samples, with a (1) maximum Ln Prob(data|K) for K = 2 (Table 2)

| D ISCUSS I ON
Strong selection acting on standing genetic variation could in principle lead to different clusters of genotypes, predominating in different environments, that could be mistaken for genetically differentiated biological populations (cf. Figure 1). If selective survival of members from a common gene pool was responsible for generating genetic clusters of Atlantic cod in Skagerrak coastal waters, the shift in genotypic composition would be expected to take place during a period of strong natural mortality. Given the very high mortality characterizing early life stages in this broadcast spawner, we expected genetic shifts to occur sometime during our first (egg stage) and last (bottom-settled juvenile fish) sampling times.
In Topdalsfjord, we found no evidence for the predicted genetic changes and members of both clusters were presented in apparently constant proportions during all life stages, including the adult spawners that presumably gave rise to the present offspring cohort. Moreover, the fjord type was the by far most numerous type at all sample times. We therefore reject the hypothesis of selective mortality as an explanation for the observed genetic clusters in this   Of the two putative ecotypes, the North Sea type is the only one thus far observed in the North Sea proper (cf. Figure 4c, NS reference sample) and its presence also within fjords may represent drift of pelagic eggs or larvae from the North Sea cod population to the Skagerrak coast (Knutsen et al., 2004;Stenseth et al., 2006). Local spawning of this type on the coast cannot be excluded, however, and nearly 10% (5 out of 52: Table 1) of the adult and presumably mature cod in Topdalsfjord were of this type. We do not know if these individuals actually spawned inside the fjord or represent strayers from other areas, but local spawning of this type could explain why we found apparently very young egg also of the "North Sea" type within fjords (cf. Supplementary Information Figure S1). The drift time from North Sea spawning grounds into the (inner) Skagerrak has been estimated to at least 10 days (Munk et al., 1995).
Since the fjord genetic cluster dominates the inner fjord samples it likely represents a unique lineage of cod. There is evidence that this lineage may be related to the western Baltic cod stock (Barth et al., 2017). Whatever its origin, this type must be largely reproductively isolated from North Sea cod in order to maintain its genetic characteristics where the two types coexist. Apart from the putative indications for selective removal of North Sea cod from within Tvedestrandsfjord, the circumstances allowing co-occurrence of two types of cod in coastal Skagerrak remain unknown. Similar phenomena of coexisting types have been described for coastal and migratory cod along northern Norway (Johansen et al., 2018;Kirubakaran et al., 2016;Sarvas & Fevolden, 2005;Westgaard & Fevolden, 2007), Iceland (Halldórsdóttir & Árnason, 2015), Greenland (Therkildsen et al., 2013), and Canada , and thus appear to be common for this species. Phenotypically cryptic, coexisting lineages or ecotypes may be common also in other species but may be under-reported because their detection requires either highly informative markers or extensive sampling to detect the often weak statistical signals of heterozygote deficiency and admixture linkage disequilibrium (Jorde, Andersson, Ryman, & Laikre, 2018).
A number of studies have explored population genetic differentiation patterns between panels of putative neutral and selected loci and found largely consistent, yet more pronounced differentiation and/or differentiation at finer geographic scales for selected loci (Bekkevold et al., 2015;Larson et al., 2014;Milano et al., 2014).
This consistency may be interpreted in support of the notion that selected markers loci represent a valid, and highly informative, tool for population studies in species with low levels of neutral structure.
On the other hand, there is little evidence that gene loci generally follow a clear dichotomy into purely neutral and selected classes, and different statistical tools used for discriminating among such locus classes often yield conflicting results (Lotterhos & Whitlock, 2014;Narum & Hess, 2011). The present study does not rely on comparisons of spatial differentiation patterns among putative distinct classes of loci as a means of assessing their reliability as population markers. Instead, our aim was to test the hypothesis that F I G U R E 5 Effects of time (number of days after first sample date) and position in fjord (sample station number) on proportion of individual eggs and juveniles that were scored (Geneclass2) to the North Sea type (vertical axes). The shaded plane represents the effects predicted by the model (Equation 1, dots represent data for single samples scaled in proportion to sample size, and other graphical elements are visual aids. Parameter estimates and test statistics are given in Table 3 S a m p l i n g s t a t i o n Tvedestrandsfjord observed differentiation in a high-graded SNP marker panel might be attributed to recurrent, strong selection.
Despite the high potential for selective shifts of high-graded SNPs in a species as fecund as the Atlantic cod, we reject this hypothesis.
This does not imply that selection on these SNPs or on their linked genomic background is not occurring, but the magnitude of selective mortality during a single season is clearly too small to be detected in the present experimental setting, and also too small to affect the statistical assignment of individuals to population of origin. Hence, this selected SNP panel may be considered valid and highly useful markers for certain population studies, including detection of population subdivisions and assignment of individuals to population of origin. By implication, high-graded panels should be useful for addressing similar questions also in other areas and for other species, the great majority of which have lower fecundity than the cod and less potential for rapid selective shifts. Of course, due considerations need to be made to the scientific question at hand when employing such a panel.

ACK N OWLED G M ENTS
This work was supported by the Research Council of Norway (project #21610/O10), the European Regional Development Fund (Interreg IVa, "MarGen" project), and regional research funding from Aust and Vest-Agder county. Further funding was provided by The Norwegian Ministry of Fishery and Coastal Affairs. We thank Hanne Sannaes and Ida Mellerud for technical assistance, and two anonymous reviewers for providing useful critics of an earlier version of this paper.

CO N FLI C T S O F I NTE R E S T
The authors declare no conflict of interests.

AUTH O R CO NTR I B UTI O N S
This work represents part of AES's Master thesis. The study was designed by HK, AES, PEJ, SEH and MS. AES did the field work, DNA extraction, initial data analyses and wrote the first draft. MS designed the SNP panel. SEH assisted in egg sampling and beach seine hauls. PEJ did the statistical analyses and wrote the present version with assistance from all authors.

DATA ACCE SS I B I LIT Y
SNP genotypes and metadata are available at Dryad https://doi.