Canadian polar bear population structure using genome‐wide markers

Abstract Predicting the consequences of environmental changes, including human‐mediated climate change on species, requires that we quantify range‐wide patterns of genetic diversity and identify the ecological, environmental, and historical factors that have contributed to it. Here, we generate baseline data on polar bear population structure across most Canadian subpopulations (n = 358) using 13,488 genome‐wide single nucleotide polymorphisms (SNPs) identified with double‐digest restriction site‐associated DNA sequencing (ddRAD). Our ddRAD dataset showed three genetic clusters in the sampled Canadian range, congruent with previous studies based on microsatellites across the same regions; however, due to a lack of sampling in Norwegian Bay, we were unable to confirm the existence of a unique cluster in that subpopulation. These data on the genetic structure of polar bears using SNPs provide a detailed baseline against which future shifts in population structure can be assessed, and opportunities to develop new noninvasive tools for monitoring polar bears across their range.

. Recent climate projections suggest that the Arctic could experience ice-free summers as soon as 2030 (Overland & Wang, 2013), leading to the question of what consequences an ice-free Arctic will have on the population structure of ice-adapted species such as the polar bear.
Our current understanding of polar bear (Ursus maritimus) population structure has emerged over time through population genetic studies based on mitochondrial sequences and microsatellite genotypic data (Malenfant, Davis, Cullingham, & Coltman, 2016;Paetkau et al., 1999;Paetkau, Calvert, Stirling, & Strobeck, 1995;Peacock et al., 2015). A recently developed SNP chip for polar bears (Malenfant, Coltman, & Davis, 2015), used for a preliminary population structure analysis, revealed four genetic clusters in the Canadian Arctic; however, other than this no large-scale study has yet been published using genome-wide markers, although there are focused studies at more local levels (Malenfant, Davis, Richardson, Lunn, & Coltman, 2018;Viengkone et al., 2016). In general, polar bears show a pattern of isolation-by-distance at both population and individual levels (Campagna et al., 2013;Paetkau et al., 1999) within these distinct genetic clusters in the Canadian Arctic Paetkau et al., 1999).
With ongoing rapid sea ice loss and environmental change in the Arctic (Comiso, 2002(Comiso, , 2012Howell, Duguay, & Markus, 2009;Howell, Tivy, Yackel, & McCourt, 2008;Rothrock, Yu, & Maykut, 1999), there is the potential for rapid changes in polar bear population structure (Hamilton & Derocher, 2019;Laidre et al., 2015;Vongraven & Peacock, 2012). Responses of polar bears to climate change are not likely to be uniform across their range (Rode et al., 2014). For example, some telemetry studies show increased frequency of long-distance swimming in response to unsuitable ice coverage for travel and hunting (Durner et al., 2011;Pagano, Durner, Amstrup, Simac, & York, 2012). In contrast, a recent study based on telemetry and genetic data identified range size contractions in Baffin Bay, suggesting increasing physical and genetic isolation of this subpopulation .
Here, we aim to further develop our understanding of polar bear population structure in Canada using a new set of SNP markers developed through double-digest restriction site-associated DNA (ddRAD) sequencing. We compare our results based on analyses of the ddRAD dataset to previous studies of polar bear population structure.

| Sample collection and DNA extraction
The pan-Arctic geographic range of polar bears is divided into 19 subpopulations or "management units" (Durner, Laidre, & York, 2018), 13 of which are wholly or partially within Canada (Figure 1).
Our work draws upon a set of archived muscle tissue samples (dry or frozen in ethanol) accumulated by the Nunavut and Northwest Territories governments from polar bears harvested by Inuit hunters, in line with annual harvest regulations. Most of these tissue samples have not been used in previous studies and represent an independent sample from which to draw inferences. We sought a balanced number of samples from each subpopulation, while ensuring that they were collected in as close to the same period of time as possible. We used samples from 12 of the 13 subpopulations that occur in Canada. For ten subpopulations, we had at least 11 sampled individuals (range: 11-59), with only five samples and one sample from Southern Beaufort Sea and Norwegian Bay, respectively, and no samples from the Kane Basin subpopulation (Table 1). While this sampling limitation is important, we note that these latter two subpopulations are estimated to have very few individuals (KB = 357, NW = 203-see Hamilton & Derocher, 2019). To minimize sampling confounds, we selected samples spanning a minimum breadth of years, so the mean sampling dates ranged from 2008 (Gulf of Boothia) to 2014 (Davis Strait), with the exception of our single Norwegian Bay sample that was collected in 2004.
Genomic DNA was extracted from these samples using a modified salt extraction protocol (Aljanabi & Martinez, 1997), with the addition of RNase A (Thermo Fisher Scientific) following the lysis step. DNA quality was assessed by running it on 1.5% agarose gels stained with RedSafe TM Nucleic Acid Staining Solution (iNtRON Biotechnology). Only extractions with high molecular weight DNA were used for library preparation. Extracts were quantified using a NanoDrop ND_1000 Spectrophotometer (NanoDrop Technologies Inc.).

| Double-digest restriction site-associated DNA sequencing
We chose to use ddRAD to discover and genotype a new panel of SNP loci, rather than to use the existing SNP chip. The chip has 3,411 F I G U R E 1 Map of samples used for analyses. Outlined regions are the subpopulations that are wholly or partially in Canada; abbreviations follow Table 1. Colored points correspond to the sampling location and genetic cluster that the individual has majority assignment to, based on the SNP dataset and STRUCTURE analysis (pink = Polar Basin, orange = Arctic Archipelago, green = Hudson Complex). Individuals with membership of <0.7 to a cluster are represented as black dots putatively neutral SNPs developed based on RAD sequencing of 38 individuals (Malenfant et al., 2015); electing to use ddRAD allowed us to target more loci while possibly reducing ascertainment bias.
We constructed ddRAD libraries for 528 individuals following the Peterson, Weber, Kay, Fisher and Hoekstra (2012) protocol, with the addition of adapters that were modified to have degenerate base regions (Schweyen, Rozenberg, & Leese, 2014;Vendrami et al., 2017) to allow for removal of PCR duplicates using bioinformatic tools.
We used 1,000 ng of total genomic DNA for each individual, and Genomics (SickKids Hospital). Technical replicates were included in each library to assess genotyping error rates.

| Sequence assembly, variant detection, and SNP filtering
After assessing the quality of each sequencing run, the data were processed as follows. First, read duplicates were removed using a custom script developed by E. Jensen (https://github.com/Eljen sen/ Parse DBR_ddRAD). Briefly, the script removes duplicates when the degenerate region in the P2 index matches between reads to ensure PCR duplicates are removed. Once this is done, libraries were demultiplexed using the "process_radtags" tool included as part of Stacks v2.2 (Catchen, Hohenlohe, Bassham, Amores, & Cresko, 2013).
Because the barcodes are part of the sequence in both reads, "pro-cess_radtags" was run using the "inline_inline" mode. Additionally, the "clean" functionality was activated to remove reads with uncalled bases, and the "rescue" functionality was used to rescue barcodes and RAD-tags. aligner ), excluding reads with a minimum quality score of <30. Alignments were sorted, indexed, and read pairs were fixed using tools from the SAMtools v1.9 suite ).
Finally, the alignments were used to call SNPs. To reduce the computation time, we used an initial representative group of 327 individuals from the first set of libraries sequenced for variant detection using BCFTOOLS v1.9 (Danecek et al., 2011) and then used the loci detected as variants for targeted genotype calling with the remaining individuals. First, a read pileup was performed for the 327 samples using the "bcftools mpileup" tool set to ignore indels, with a maximum depth of 1,000, and recalculated individual base alignment qualities ("redo-BAQ"). Once the full pileup was produced, actual variants were called "bcftools call" set to the multiallelic-caller mode and keeping all possible alternative alleles. All detected variant sites (n = 411,630) were used in a second round of "bcftools mpileup" for all individuals.
Filtering of the resulting vcf files was done in two rounds using VCFTOOLS v0.1.15 (Danecek et al., 2011). For the first round, all individuals were included and a minimum read depth of 6× and genotype quality score of 18 were required. Loci were filtered out if they were not present in at least 50% of individuals, had more than two alleles, had a mean depth of coverage greater than two standard deviations above the mean depth, or had a minor allele count less than 3. Loci were thinned to retain one site per 10,000 bp. Following this, the amount of missing data was assessed for each individual, and those with >60% missing were removed. Using the retained individuals, SNP filtering was repeated but starting again with all variants TA B L E 1 Diversity metrics of each of the surveyed Canadian polar bear subpopulations based on the single nucleotide polymorphisms dataset (13,488 loci) included. This time, loci were filtered out if they were not present in at least 70% of individuals, while maintaining the thresholds of remaining filters. Following this, we again assessed individual missing data and removed individuals with >50% missing data. Genotyping error rates were calculated by evaluating the number of mismatched genotypes between technical replicates (n = 16).

| Population genetic analyses
Standard measures of genetic diversity, including observed and expected heterozygosity and G IS, were calculated for each subpopulation using GENODIVE V 2.0b27 (Meirmans & Van Tienderen, 2004). Weir and Cockerham's (1984) fixation index was calculated between all subpopulation pairs using the hierfstat package (Goudet, 2005) in the R statistical package, version 3.6.0 (R Development Core Team, 2010).
Evidence for population substructure was assessed using multiple methods. We used Bayesian clustering analysis, as implemented in STRUCTURE 2.3.4 (Pritchard, Stephens, & Donnelly, 2000). We evaluated from 1 to 10 clusters (K), with ten iterations of each, using a run length of 300,000 Markov chain Monte Carlo replicates after a burn-in period of 100,000 and correlated allele frequencies under an admixture model with alpha set to 0.5. The most likely number of clusters was determined by plotting the log probability of the data (ln Pr(X|K)) across the range of K values tested and selecting the K value where the value of ln Pr(X|K) plateaued, as suggested in the STRUCTURE manual, and by calculating the deltaK statistic (Evanno, Regnaut, & Goudet, 2005) in STRUCTURE HARVESTER (Earl & vonHoldt, 2011). The 10 iterations were averaged using CLUMPP (Jakobsson & Rosenberg, 2007) to produce a single q-matrix. A clustering analysis based on maximum likelihood implemented in ADMIXTURE (Alexander, Novembre, & Lange, 2009) was also used, with the optimal value of K determined using a cross-validation procedure. We used default values and 10-fold cross-validation. We also used the model-free discriminant analysis of principal components (DAPC; Jombart, Devillard, & Balloux, 2010)

implemented in
Adegenet (Jombart, 2008) in R. The best-fit value of K was selected using the find.clusters function and Bayesian information criterion (BIC). The chosen value of K was based on the minimum number of clusters after which the BIC decreased by a negligible amount.
We used analysis of molecular variance (AMOVA, implemented in GENODIVE) to evaluate the proportions of total genetic variation that were contained within and among subpopulations, and within and among the genetic clusters suggested by population structure analyses.
To evaluate whether there are geographic regions where the strength of isolation-by-distance is higher-than-average or lower-than-average, we used EEMS (Petkova, Novembre, & Stephens, 2015), a method of visualizing variation in effective migration rates inferred from genetic dissimilarities. EEMS is based on a stepping-stone model, where migration occurs between neighboring demes modeled in a dense regular grid (where the number of demes equals the number of spaces on the grid). The expected genetic dissimilarity between two individuals is calculated over all possible migration histories and routes between their deme on the grid, and migration rates for edges in the grid are adjusted so that the genetic differences expected under the model best match the observed differences in the data. These migration rates are interpolated across the grid to produce the "estimated effective migration surface." In geographic areas with no samples, estimates cannot be calculated, and thus, in these regions, they are driven by the prior: no heterogeneity in migration rates (Petkova et al., 2015). In our dataset, this is particularly relevant in the geographic areas on the edges of our sampling distribution and in undersampled areas. Inputs required for this analysis are an outline of the area to be modeled, the geographic locations of samples, and a matrix of observed differentiation values.
We generated this matrix based on the average pairwise differences between individuals using bed2diffs_v2, which imputes missing data based on the average genotype for that locus. We ran EEMS initially using the default values for proposal variances and other parameters and changed the values in subsequent runs until the frequency of accepting proposals of each type was between 10% and 40%, as suggested in the software documentation. Using the optimized set of parameters, we ran five independent chains for 10,000,000 MCMC iterations, with a burn-in of 2,000,000, and thinning every 9,999 iterations. We assessed convergence of these chains and plotted the combined EEMS results using the R package rEEMSplots. The resulting contour plot was overlaid with a map of Canada using the R packages rworldmap and rworldxtra (South, 2011).

| Assignment accuracy
We estimated assignment accuracies across polar bear subpopulations using principal component analyses and Monte Carlo crossvalidation procedures implemented in the AssignPOP package (Chen et al., 2018) in R. We investigated the assignment accuracy results of a predictive model built using a support vector machine (model = svm) classification, based on training sets using the most informative 75% of loci and a randomly sampled 75% of individuals.
The rate of assignment was tested using the 25% of individuals that has been left out of training, and then averaged across 30 iterations.
Assignment tests were only performed for the nine subpopulations with a sample size greater than 10.
Similar patterns were resolved across the three population cluster analysis methods. The ln Pr(X|K) for STRUCTURE plateaued around K = 3 ( Figure S1a), with the highest level of deltaK at K = 2 ( Figure S1b), the lowest cross-validation error scores in ADMIXTURE were for K = 3 (Table S3), and for the find.clusters DAPC analysis, the lowest BIC score was K = 2 ( Figure S2), with K = 3 only having a slightly higher value. Over all analyses, the groupings at K = 3 correspond to geographic areas that we hereafter refer to as the "Hudson Complex," "Arctic Archipelago," and "Polar Basin" (Figure 2). Barplots showing K = 2, 4, and 5 can be found in Figure S3.
Most individuals show a high assignment to one genetic cluster.
Only 34 individuals in our sample of 358 have a membership Q-value of less than 0.7 to a cluster (depicted as black dots on the map in The EEMS analysis clearly showed regions where isolation-by-distance is stronger than the average rate ( Figure 3). Results are consistent with previous analyses, identifying less migration than expected based on the null expectation of isolation-by-distance (IBD) in areas separating the three genetic clusters. Many of the regions identified as having less migration (i.e., higher-than-average IBD) appear to correspond with areas with more contiguous terrestrial habitats, while at least some areas of high migration lie along shoreline or marine areas with islands (e.g., King William Island, or between NW Baffin Island and Somerset Island), although some areas identified as having high migration lie in obvious sampling gaps.
The results from the assignment tests (Table 1) showed that some subpopulations are genetically distinguishable, with self-assignment rates the highest in the Northern Beaufort Sea (NB, 0.93).

| D ISCUSS I ON
In this study, we used thousands of genome-wide SNP markers to assess population structure across the range of polar bears in Canada. Our results clearly show differentiation within Canada, and the membership patterns and geographic ranges of the "Hudson Complex," "Arctic Archipelago," and "Polar Basin" genetic groups match closely with the results of previous studies using microsatellite loci Paetkau et al., 1999;Peacock et al., 2015) and a preliminary analysis using ~3,000 SNPs and 78 individuals (Malenfant et al., 2015). The previous studies included the global range of polar bears, and found that the genetic cluster of the Beaufort Sea continues around the arctic coast of Asia and Europe to the eastern side of Greenland Paetkau et al., 1999;Peacock et al., 2015), which is why we have chosen to refer to this cluster in our study as the Polar Basin. A fourth genetic cluster has been identified by previous authors (Malenfant et al., 2015Paetkau et al., 1999) coincident with the Norwegian Bay subpopulation, but due to our very small sample size (n = 1), we cannot confirm its existence.  Table 1 F I G U R E 3 The EEMS contour plot of effective migration rates. The scale is log10(migration), relative to the overall migration rate across the modeled area. Points are the sampling locations of the polar bears, colored following Figure 1. Areas at the dark yellow end of the spectrum exhibit on average higher-than-average isolation-by-distance In the EEMS analysis, areas with reduced migration generally corresponded to landmasses, such as Baffin Island (Figure 3 Levels of genetic diversity, as measured by observed and expected heterozygosity, do not vary markedly across the subpopulations (Table 1), which is consistent with previous studies Paetkau et al., 1999;Peacock et al., 2015). Our initial goal was to mitigate ascertainment bias by genotyping a balanced sample of individuals from each subpopulation, but ultimately this was not possible due to issues with poor DNA quality, and limited sampling in subpopulations where less harvesting occurs. Harvesting is intentionally skewed toward male bears, and our sample reflects that, with 62% of samples on average within a subpopulation and 65% of samples overall being male. Male-skewed sex ratio in our sample should not impact conclusions regarding population structure. Taylor et al. (2001) found no differences in distances moved between sexes in six more northern Arctic polar bear subpopulations. Further, Campagna et al. (2013)

| CON CLUS IONS
Within the Canadian range of polar bears, subpopulation structure is present and has been consistently recovered across datasets and genetic markers. There is congruence between our results based on thousands of genome-wide SNPs and previous studies using 16-21 microsatellite or ~3,000 SNP loci. Our samples have not been included in any previous studies (except for three which were also used by Peacock et al., 2015) and provide independent corroboration of major genetic patterns. Thus, between mutually exclusive datasets of individuals and genetic markers when compared to previous studies, we have reconstructed the same understanding of population differentiation and genetic diversity patterns. We do have some sampling gaps in southern extent of Hudson Bay, southern Davis Strait, and northern Quebec, with the former two populations reported to exhibit modest genetic differentiation (Crompton, Obbard, Petersen, & Wilson, 2008;Malenfant et al., 2016;Peacock et al., 2015). Regardless, our now firm baseline understanding of population structure will be critical to our ability to accurately assess the effects of climate change and sea ice loss on connectivity, and measure the subsequent impacts on demography and evolutionary dynamics in polar bears. From these data, a subpanel of SNP markers can be selected for use in future studies to monitor changes in population structure from both high-and low-quality DNA sources, including tissue samples from aerial biopsy sampling and the annual harvest, or noninvasively collected hair from snags, or scat collected on the land.

ACK N OWLED G M ENTS
We acknowledge that this research would not have been possible without the samples collected by individuals in the many communities of Nunavut and Inuvialuit Settlement Region and thank those involved. We thank J. Ware, C. Mutch, F. Piugattuk, and M. Harte for sample preparation at the polar bear harvest laboratory in Nunavut. We thank A. Siew and Z. Sun for help with laboratory work. K. Flock was indispensable in facilitating acquisition of permits and agreements, and overseeing management of this large collaborative project. We thank L. Waits for advice in designing the study. We acknowledge J. H. Gálvez at the Canadian Centre for Computational Genomics for his contribution to the bioinformatic analysis, and G. Bourque for support. We would like to thank the members of the BEARWATCH team for their support and collaboration.

CO N FLI C T O F I NTE R E S T
None declared.

DATA AVA I L A B I L I T Y S TAT E M E N T
The ddRAD sequence data are available as individual BAM files through the NCBI Sequence Read Archive as BioProject PRJNA602523. The filtered vcf file used for analyses is available on Dryad at https://doi.org/10.5061/dryad.gb5mk kwkw.