Population structure of chum salmon and selection on the markers collected for stock identification

Abstract Genetic stock identification (GSI) is a major management tool of Pacific salmon (Oncorhynchus Spp.) that has provided rich genetic baseline data of allozymes, microsatellites, and single‐nucleotide polymorphisms (SNPs) across the Pacific Rim. Here, we analyzed published data sets for adult chum salmon (Oncorhynchus keta), namely 10 microsatellites, 53 SNPs, and a mitochondrial DNA locus (mtDNA3, control region, and NADH‐3 combined) in samples from 495 locations in the same distribution range (n = 61,813). TreeMix analysis of the microsatellite loci identified the greatest convergence toward Japanese/Korean populations and suggested two admixture events from Japan/Korea to Russia and the Alaskan Peninsula. The SNPs had been purposively collected from rapidly evolving genes to increase the power of GSI. The largest expected heterozygosity was observed in Japanese/Korean populations for microsatellites, whereas it was largest in Western Alaskan populations for SNPs, reflecting the SNP discovery process. A regression of SNP population structures on those of microsatellites indicated the selection of the SNP loci according to deviations from the predicted structures. Specifically, we matched the sampling locations of the SNPs with those of the microsatellites and performed regression analyses of SNP allele frequencies on a 2‐dimensional scaling (MDS) of matched locations obtained from microsatellite pairwise F ST values. The MDS first axis indicated a latitudinal cline in American and Russian populations, whereas the second axis showed differentiation of Japanese/Korean populations. The top five outlier SNPs included mtDNA3, U502241 (unknown), GnRH373, ras1362, and TCP178, which were identified by principal component analysis. We summarized the functions of 53 nuclear genes surrounding SNPs and the mtDNA3 locus by referring to a gene database system and propose how they may influence the fitness of chum salmon.


| INTRODUC TI ON
Chum salmon (Oncorhynchus keta) has a wide distribution range from the Arctic to California in North America and from Siberia to Japan and Korea in Asia (Salo, 1991). The ability to adapt to the varied environments of spawning rivers in the past and at present may have expanded its distribution range. Chum salmon also comprise the world's largest hatchery release program (Amoroso et al., 2017;Flagg, 2015;Kitada, 2018;Naish et al., 2007), and more than 3 billion hatchery-reared chum salmon are released annually into the North Pacific. Therefore, chum salmon is one of the best species to study the structure and history of populations, which could be underpinned by environmental adaptation and anthropogenic selection pressures.
The genetic population structure of chum salmon has traditionally been studied in the context of genetic stock identification (GSI). GSI has been a major management tool of Pacific salmon (Oncorhynchus Spp.) since the early 1980s (Beacham, Candy, Wallace, et al., 2009;Beacham, Sato, et al., 2008;Beacham, Varnavskaya, et al., 2008;Beacham, Winther, et al., 2008;Beacham et al., 2020;Grant et al., 1980;Larson et al., 2014;Shaklee et al., 1999;Utter, 1991), and it has contributed to understanding high-seas and coastal migration patterns (Myers et al., 2007;Seeb et al., 2004). In Pacific salmon, samples from mixed-stock fisheries and forensic studies have been analyzed to provide optimal resolution of proportions of mixed stocks at a reasonable cost (Beacham et al., 2020). GSI studies have provided genetic baseline data for salmon populations across the Pacific Rim, and these data have contributed to studies into population structure, mixed-stock fisheries, and genetic interactions between hatchery and wild salmon (Waples et al., 2020).
Genetic markers for GSI have progressed from allozymes to microsatellites and single-nucleotide polymorphisms (SNPs) (Beacham et al., 2020;Bernatchez et al., 2017). Allozyme loci often have a small number of alleles. To improve the power of GSI resolution for the high gene flow salmonids, microsatellites were developed because the number of alleles is generally much larger than that of allozymes, and much more information can be included. However, standardizing hundreds of microsatellite alleles across sampling points in different countries is difficult (Seeb et al., 2011). To avoid the standardization problem, genotyping of microsatellites of salmon species was generally performed by a single laboratory Beacham, Candy, Wallace, et al., 2009;Beacham, Sato, et al., 2008;Beacham, Varnavskaya, et al., 2008;Seeb et al., 2011).
In contrast, calibrating SNP genotyping is more straightforward because genotype data can be stored in a unified format and can be accessed by different laboratories on different continents (Waples et al., 2020).
Populations of chum salmon have been widely surveyed for genetic variability and show large allele frequency differences in three studies (Elfstrom et al., 2007;Smith, Baker, et al., 2005;. The markers selected for SNP typing were originally identified as rapidly evolving genes (Elfstrom et al., 2007;Seeb et al., 2011) that also showed positive selection in humans and chimpanzees (Nielsen et al., 2005). They included genes associated with fatty acid synthesis, testis-specific expression, olfactory receptors, immune responses, and cell growth and differentiation (Elfstrom et al., 2007;Smith, Baker, et al., 2005;. The population structure determined using the SNPs selected for the GSI was influenced not only by natural selection on the genes but also by the SNP discovery process. Specifically, the three studies were focused on Western Alaska, which was the area of the authors' interest (Seeb et al., 2011). As a result, the SNP allelic richness and heterozygosity are high in Alaskan populations.
The use of neutral and adaptive markers in various combinations can be useful in establishing optimal management strategies (Funk et al., 2012). Population structures inferred using neutral markers reflect gene flow and genetic drift (Waples & Gaggiotti, 2006), which affect within and among population variations and can lead to adaptive divergence in the genome (Funk et al., 2012). To integrate adaptive markers into the definition of conservation units, Funk and colleagues proposed a framework of comparing population structures inferred from putatively neutral and adaptive loci.
The inclusion of information for loci putatively under selection can help to understand mechanisms of local adaptation and is useful for conservation and management of the species (Moore et al., 2014).
Here, we analyzed the published data sets of microsatellites and the SNPs genotyped for chum salmon GSI. First, we inferred the chum salmon population structure and its demographic history using the microsatellite data in a distribution range. Then, we matched the sampling locations of the SNP genotyping studies with those of the microsatellite data. By regressing the SNP population structure on the microsatellite population structure, we estimated the selection on the SNPs as deviations from the predicted structure.
Through the data screening, we found several publicly available data sets that covered the distribution range of chum salmon, including Japan. These data sets comprised microsatellite allele frequencies , nuclear SNP genotypes with a combined mtDNA locus (Seeb et al., 2011), isozyme allele frequencies (Seeb et al., 1995;Winans et al., 1994), and mtDNA control regions (Sato et al., 2004). In this study, we analyzed the microsatellite ) and SNP (Seeb et al., 2011) data sets. All Japanese samples were from hatcheries and weirs in hatchery-enhanced rivers and were therefore hatchery-reared fish and/or hatchery descendants.

| Microsatellite and SNP data sets analyzed in this study
Microsatellite allele frequencies of chum salmon were retrieved from the Molecular Genetics Lab Online Data, Fisheries, and Oceans Canada website (https://www.pac.dfo-mpo.gc.ca/scien ce/facil ities -insta llati ons/pbs-sbp/mgl-lgm/data-donne es/index -eng.html, accessed 01 August 2020). The data consisted of the allele frequencies at 14 loci from chum salmon populations at 381 localities in a distribution range (n = 51,355) (Table S1; Figure S1) . These data were used to infer the population structure and demographic history of chum salmon. SNP genotypes were retrieved from the Dryad data repository (Seeb et al., 2011). The data consisted of 58 SNPs collected from 114 locations in a distribution range (Table S2; Figure S2) (Seeb et al., 2011). We excluded four loci, Oke_U401-220, Oke_GHII-2943, Oke_IL8r-272, and Oke_U507-87, to avoid pseudo-replication following the original study. A total of 53 nuclear SNPs and a combined mtDNA locus (mtDNA3) were included in our analysis (n = 10,458).

| Accounting for ascertainment bias in the data sets
Significant deviations from Hardy-Weinberg equilibrium at four loci (Oke3, Ots103, One111, and OtsG68) from a 14-microsatellite locus panel were found in Japanese, Russian, and representative North American populations (Beacham, Sato, et al., 2008;Beacham, Varnavskaya, et al., 2008). These deviations were likely due to ascertainment bias (Seeb et al., 2011), and therefore, we excluded these four loci and used the remaining 10 loci in our analysis.
For the SNP data, the highest levels of allelic richness detected in parts of the Alaskan region may be affected by ascertainment bias (Seeb et al., 2011). Allele/site frequency spectra have been used to examine potential ascertainment bias in SNPs (Nielsen et al., 2004). We calculated allele frequency spectra in six geographical areas according to the major lineage used in the original study (Seeb et al., 2011), namely Western Alaska, Yukon/ Kuskokwim upper, Alaskan Peninsula, Southeast Alaska (SEA)/ British Columbia (BC)/Washington (WA), Russia, and Japan/ Korea.

| Inferring population structure and demographic history
We used microsatellite allele frequencies  to infer the population structure of chum salmon. In previous coalescent simulations, we found that the bias-corrected G ST moment estimator (Nei & Chesser, 1983) performed better than other F ST estimators when estimating pairwise F ST values (Kitada et al., 2017). Therefore, we used G ST (NC83) to estimate pairwise F ST between populations. Because only allele frequency data could be accessed, we used the "read.frequency" function and computed pairwise F ST values averaged over loci using the "GstNC" function in the R package FinePop v1.5.1 on CRAN.
A multidimensional scaling (MDS) analysis was performed on the pairwise F ST distance matrix using the "cmdscale" function in R. We used the cumulative contribution ratio up to the kth axis (j = 1, …, k, …, K) as the explained variation measure, which was computed in the R function as C k = ∑ k j=1 j ∕ ∑ K j=1 j , where λ j is the eigenvalue and λ j = 0 if λ j < 0. Following the original study , populations were classified into eight geographical areas: Japan, Korea, Russia, Alaskan Peninsula, Western Alaska, Yukon/Tanana/Upper Alaska, SEA/BC, and WA (Table S1).
A neighbor-joining (NJ) tree (Saitou & Nei, 1987) was constructed using the pairwise F ST distance matrix using the "nj" function in the R package "ape." To infer demographic history with admixture, we applied TreeMix (Pickrell &Pritchard, 2012) to the microsatellite allele frequencies ). The analyses were performed using six regional genetic groups, where Japan/Korea and SEA/BC/WA were combined to reduce the number of parameters to be estimated under the limited number of loci. On the basis of the allele frequencies and the lengths (numbers of repeats), we computed the mean and variance in length at each microsatellite locus and used them to run TreeMix v1.13. We tested up to five migration events and found the best tree using the Akaike Information Criterion (AIC; Akaike, 1973); AIC = −2LogL + 2 × s, where L is the maximum composite likelihood and s is the number of parameters to be estimated, which, in this study, was three per migration event, namely, migration edge, branch length, and migration weight. For example, for the model with two migration events, the number of parameters was six. TreeMix maximizes composite likelihood and the 2 × log likelihood ratio is expected to follow an approximate χ 2 distribution with three degrees of freedom. Because TreeMix calculates composite likelihoods, the penalty in AIC needs to account for the correlation in the likelihoods. To avoid unconscious bias toward selection of parameter-rich models, we kept all the candidate models except those where the AIC values were decisively different.

| Gene flow and genetic diversity
We recorded approximate longitudes and latitudes of the sampling sites according to the names of rivers and/or areas and maps from the original studies using Google Maps. Sampling locations were plotted on the map using the "sf" package in R. Sampling points with pairwise F ST values lower than a given threshold were connected by yellow lines to visualize the population connectivity. Under the as- where N e is the effective population size and m is the average migration rate between populations (Slatkin, 1987), a threshold of F ST = 0.01 corresponds to 4N e m ≈ 99 migrants per generation (see, Waples & Gaggiotti, 2006;Whitlock & Mccauley, 1999).
We computed expected heterozygosity (H e ) values for the data sets. For microsatellites, we computed H e for each population at each locus based on the allele frequencies. We then averaged H e values over loci. For the 53 nuclear SNPs, we converted the original genotype data to genepop format (Rousset, 2008) and read the data using the "read.GENEPOP" function in the R package

| Identifying highly differentiated SNPs beyond the neutral population structure
To compare the genetic differentiation of SNP allele frequencies with the population structure inferred using the microsatellite markers, we analyzed 10 microsatellite loci and 53 nuclear SNPs loci plus the combined mtDNA3 locus. The mtDNA3 locus had five alleles, and the major allele (second one) was used for the meta-analysis.
The functions of all 54 analyzed gene loci were confirmed by referring to the GeneCards database system (https://www.genec ards. org/) and published literature. These explanatory variables are population-specific (J = 114). The regression analysis was performed for each marker (i = 1, …, 54), and 54 sets of regression coefficients ( (i) 0 , (i) 1 , (i) 2 ) were obtained. The pvalues for the coefficients were adjusted for the multiple comparison (q-values) by the method of Benjamini and Hochberg (1995) using the "p.adjust" function in R. Given the population structure predicted from the neutral microsatellites, the outlier loci from the set of 53 nuclear SNPs and mtDNA3 locus that deviated largely from the predicted population structure are likely under natural selection and/or anthropogenic selection.
To identify genes that best characterize geographical areas, we conducted a principal components analysis (PCA) for the allele frequencies of the 53 nuclear SNPs and mtDNA3 locus using the "prcomp" function in R. Finally, we visualized the geographical distributions of the identified highly differentiated genes using boxplots and geographical maps of the allele frequencies and H e values.
Sampling locations were mapped with a color gradient of allele frequencies for population j rendered as RGB (p j0 , 0, 1− p j0 ), where p j0 = (p j −min p)/(max p−min p). This conversion represents the standardized allele frequency values at the sampling point, with a continuous color gradient ranging from blue to red for the lowest and highest allele frequencies, respectively.

| Population structure and demographic history of chum salmon
Pairwise The mds1 to mds4 explained 60% variation in the pairwise F ST distant matrix ( Figure S4). In the unrooted NJ tree generated from this data set ( Figure S5a), five large regional population groupings were

| Ascertainment bias in the SNP data
The allele frequency spectra ( Figure S7) showed a similar pattern in all areas, except in Western Alaska, where much smaller occurrence was found for the lowest allele frequencies in Western Alaska.

| Gene flow and genetic diversity
Using

| Differentiated SNPs beyond the neutral population structure
We summarized 53 nuclear SNPs and the mtDNA3 locus with functions (if known), locus names, and raw data locus names used in our analysis (Table S3). Five of the outlier SNPs had previously been identified as potential candidates for selection (Seeb et al., 2011). We found 31 SNPs in functional genes, and 23 SNPs were unknown.
The results of our regression analysis are summarized in

| D ISCUSS I ON
The inferred population structure based on microsatellites was consistent with the one obtained by the TreeMix analysis Our regression analysis identified the top five outlier gene loci (Table 1, Figures 3 and 4) that were significantly differentiated beyond the neutral population structure in Japanese/Korean populations, namely, mtDNA3, U502241, GnRH373, ras1362, and TCP178.
The scatter plot of mds1 and mds2 indicated that most of the SNPs were differentiated beyond the neutral population structure.
Selection can be induced by natural and anthropogenic pressures.
The mds1 described the north-south cline of population expansion ( Figure 1). The maximum regression coefficient of mds1 was 15.5 for RFC2 (Table 1). In contrast, the top five outlier gene loci had mds2 coefficients >18.4, which suggested that the allele frequencies of the top five gene loci were the result of anthropogenic selection pressures in Japanese/Korean populations.
mtDNA3 (a combined locus of the control region and NADH-3, Smith, Baker, et al., 2005) was the most differentiated gene locus (Figures 3 and 4). mtDNA encodes some of the proteins in the oxidative phosphorylation enzymatic complex and plays a key role in aerobic ATP production, by contributing to the ability to respond to endurance exercise (reviewed by Stefàno et al., 2019). The mtDNA control region functions in dNTP metabolism (Nicholls & Minczuk, 2014) and oxygen consumption (Kong et al., 2020). Subunits  (Seeb et al., 2011).
d Five outlier loci found in the original study (Seeb et al., 2011).

TA B L E 1 (Continued)
of the NADH dehydrogenase complex encoded by mtDNA are involved in the pathway from NADH to oxygen (Rivera et al., 1998). A significant reduction in mtDNA3 allele frequencies might reduce the efficiency of aerobic exercise ability and endurance, energy metabolism, and oxygen consumption. These results are consistent with the lower swimming endurance of Japanese hatchery-born chum salmon fry measured in a stamina measurement experimental device in the Japanese oldest Chitose Hatchery, Hokkaido (Kobayashi & Ohkuma, 1983). Wild chum salmon fry had a 1.4-fold higher swimming ability (56.6 ± 11.1 cm/s) than hatchery-reared fry (41.4 ± 12.3) (t = 2.45, df = 8.7, p = .038) in the early 1980s.  (Seeb et al., 2011) found this phenomenon and detected U502241 as an outlier SNP. No function has been associated with the chum U502241 locus so far (Elfstrom et al., 2007), but the Atlantic salmon immunoglobulin IgH locus B genomic sequence was the most likely match for U502241 (Seeb et al., 2011). Immunoglobulins initiate immune responses (Gene Cards).
The ras gene has a central role in cell growth (Rotchell et al., 2001).
These outliers related to reproduction and growth are key parameters in hatchery production. In chum salmon, GnRH is involved in gonadal maturation during the early and final phases of upstream migration (Kudo et al., 1996). GnRH also improved stream odor discrimination in adult fish (Ueda, 2019), and expression of GnRH increased in the brain during homing migration (Ueda et al., 2016).    Figure S8). In the Japanese populations, the largest microsatellite H e values may have been influenced by the transplantation history in Japanese river populations (Beacham, Sato, et al., 2008), whereas the smallest SNP H e values may be a consequence of artificial selection in the hatcheries. The H e values in Japanese/Korean and Western Alaskan populations were similar for U502241 and ras1362, whereas the H e values for GnRH373 were smaller in the Japanese/Korean populations than in the Western Alaskan populations as shown in Figure 4. In contrast, the H e values in Japanese/ Korean populations were significantly larger than those in Western Alaskan populations for mtDNA3 and TCP178, suggesting increases in heterozygosity. These results suggested genotype-specific effects on fitness.
The analyses in this study relied on limited data sets. Clearly, further genomic studies are needed to obtain more precise views of the demographic history, and environmental and anthropogenic selection of chum salmon. Population genomics studies can also provide finer resolution of GSI and will be useful for management and conservation of chum salmon.

ACK N OWLED G M ENTS
This research was made possible by the SNP genotypes of chum salmon throughout the distribution range maintained by Lisa W.
Seeb and her colleagues, and by the North Pacific baseline microsatellite allele frequencies maintained by Terry D. Beacham and his colleagues at Fisheries and Oceans Canada. We also appreciate researchers and staff for their substantial efforts to obtain baseline samples. We thank Allen Moore, the editor, for reviewing the manuscript, and the associate editor and the reviewers for constructive comments, which improved the paper substantially. This study was supported by the Japan Society for the Promotion of Science Grants-in-Aid for Scientific Research KAKENHI (nos. 18K0578116 to SK and 19H04070 to HK).

CO N FLI C T O F I NTE R E S T
None declared.

DATA AVA I L A B I L I T Y S TAT E M E N T
The authors affirm that all data necessary for confirming the conclusions of this article are present within the article, figures, and supporting information.