Genomic inbreeding and population structure of northern pike (Esox lucius) in Xinjiang, China

Abstract Northern pike (Esox lucius) was widely distributed in the high latitudes of the northern hemisphere. In China, northern pike was originally distributed only in the upper reaches of the Irtysh River in Xinjiang and has appeared in many water bodies outside the Irtysh River Basin in Northern Xinjiang. A total of four populations were collected from north to south in Xinjiang, including Irtysh River (RIR), Ulungu Lake (LUL), a small lake nearby Ulungu River (LJD), and Bosten Lake (LBO). We estimated population genomic parameters, performed gene flow analysis, and estimated the effective population size of each population. The proportion of individuals with high inbreeding coefficient (F ≥ 0.0625) accounted for 36.4% (44/121) of all sequenced individuals, approximately 4.5% (1/22) in LUL, 25.9% (7/27) in LBO, 42.9% (18/42) in RIR, and 60% (18/30) in LJD. RIR had the highest mean of genomic relatedness (coancestry coefficient = 0.025 ± 0.040, IBD = 0.036 ± 0.078). Gene flow results showed that the population spreading was from RIR into two branches, one was LBO, and the other continued to split into LUL and LJD, and migration signal from LBO to LUL was detected. Our results suggested that the extinction risk of northern pike was very low in Xinjiang of China, and the controlled capture fishery of northern pike could be developed reasonably.

In China, northern pike was originally distributed only in the upper reaches of the Irtysh River in the Xinjiang Uygur Autonomous Region (referred to as "Xinjiang") (Huo et al., 2009). Xinjiang was located in the inland region of Central Asia where water resources were scarce, covering an area of 1.66 million square kilometers, accounting for about 1/6 of land area of China. Before the 1960s in China, the output of northern pike in the Irtysh River Basin accounted for about 20% of the total weight of the fish capture, and the annual output could reach 120 tons (Ren, 2002). In recent years, due to the drop-off of water level in the Irtysh River, the deterioration of breeding conditions, and illegal overfishing, the pike population in the Irtysh River has declined sharply, and the capture output F I G U R E 1 Wild fish of northern pike (Esox lucius) photographed by Peixian Luan with permission F I G U R E 2 Four water systems involved in sample collection in this study. A total of four populations were collected from north to south in Xinjiang, China. Irtysh River population (RIR, 47°42'N, 86°50'W), Ulungu Lake population (LUL,47°18'N,87°02'W), the population of a small lake nearby Ulungu River (LJD, 46°41'N, 87°57'W), Bosten lake population (LBO, 41°50'N, 86°36'W). The blue circle represents a natural population. The red circle represents a current population in 2006 was about 6% of the peak in the 1960s (Huo et al., 2009). At present, commercial fisheries have been completely banned in the Irtysh River Basin.
Similar to other Central Asian regions, due to lack of water resource, many water conservancies projects have been built to divert water from natural rivers to urban and rural areas in Xinjiang (Li & Su, 2008). These projects provide opportunities for fish to spread to new habitats, and northern pike has spread to the Ulungu Lake and its surrounding small rivers, lakes, and swamps outside the Irtysh River Basin through these channels, too. According to the monitoring of our institute, northern pike has appeared in the water system outside the Irtysh River since the 1980s (Jiang & Huo, 2014).
Meanwhile, pike game fishing and consumption is an increasingly popular pastime in Xinjiang. With the continuous arrival of tourists, the consumption demand for northern pike is increasing (Yang et al., 2018). It becomes increasingly difficult to control the range and distance of illegally and disorderly spreading. As a result, northern pike has appeared in many water bodies outside the Irtysh River Basin in Northern Xinjiang and Bosten Lake in Southern Xinjiang with human activities (Guo et al., 2005). However, the number, distribution, and source of northern pike outside the Irtysh River Basin remain unknown.
The present study aimed to estimate population genomic parameters such as the genetic diversity, population structure, and connectivity of the various populations and to perform gene flow analysis among four population of northern pike captured from four habitats, including the Irtysh River, Ulungu Lake, a small lake nearby Ulungu River, and Bosten Lake. The methods mentioned above were performed to understand the genomic inbreeding and population structure of northern pike, help us trace the spread pattern better, and inform the management and conservation of northern pike fishery resources in Xinjiang, China.

| Water systems involved in sample collection
In this study, a total of four water systems were involved in sample collection of northern pike ( Figure 1) from north to south in Xinjiang, including Irtysh River, Ulungu Lake, a small lake nearby Ulungu River, and Bosten Lake. These four water systems are abbreviated as RIR (Irtysh River), LUL (Ulungu Lake), LJD (a small lake nearby Ulungu River has no official name and near a small village called Jidiele), and LBO (Bosten Lake). The spatial distribution of the four water systems was shown on the map (Figure 2). Northern and Southern Xinjiang are separated by the Tianshan Mountains, and there is no waterway that crosses the Tianshan Mountains from Northern to Southern

Xinjiang.
The Irtysh River is the only river that flows into the Arctic Ocean in China, and the Chinese section is about 546 kilometers of a total length of 4,248 kilometers (Jiang & Huo, 2014;Ren, 2002). Ulungu Lake is composed of two connected freshwater lakes, the Jili Lake with an area of 165 square kilometers and the Brento Sea with an area of 730 square kilometers, and the average depth of Ulungu lake is 8 meters (Jiang & Huo, 2014). Bosten Lake is the largest inland freshwater lake in Xinjiang, which is located in Southern Xinjiang and on the northeastern edge of the Tarim Basin (Guo et al., 2005).
Bosten Lake is nearly 1,200 square kilometers, and the average depth of the lake is 9 m.

| Animal ethics statement
The wild fish capture involved in this study had obtained a scien-

| Sampling populations
The wild fish were captured with standard gillnetting procedures.
The standard gillnet was a sinking gillnet, 30 meters long and 3 meters high. The net had 3 mesh panels, each of which was 10 meters long. The mesh sizes were 10 cm, 12 cm, and 14 cm. RIR population (42 individuals) were captured from the mainstream of Irtysh River

| Genomic DNA extraction and SLAF sequencing
All fin samples were digested overnight with proteinase K, and genomic DNA was extracted using standard phenol-chloroform protocol (Psifidi et al., 2015). A NanoDrop 2000C spectrophotometer (Thermo Scientific) was used to detect the quality of DNA, and highquality genomic DNA was used to construct the specific locus amplified fragment (SLAF) library (Sun et al., 2013). For SLAF sequencing, the extracted genomic DNA was digested with the restriction enzymes RsaI and HaeIII. The gel extraction kit (Qiagen) was used to isolate DNA fragments with a size of 550-600 bp (with indexes and adaptors

| SNP calling
The algorithm of BWA-MEM in the BWA software (Li & Durbin, 2010) was used to compare the SLAF-seq reads with the reference genome of Esox lucius at NCBI (Genome accession number: GCA_011004845.1, released by Vertebrate Genomes Project on March 2020), with the settings (1 for score for a sequence match, 4 for penalty for a mismatch, default for other settings).
The SelectVariants and MarkDuplicates tool of GATK (Mckenna et al., 2010) and the mpileup utility of SAMtools

| Genomic diversity and differentiation
The average observed heterozygosity (Ho) and the average expected heterozygosity (He) were estimated within population by using the R package "diveRsity" (Keenan et al., 2013). The Ho and He values of all 14,124 SNPs were calculated one by one within population, and then, the average value was taken as the Ho and He values of the population. The nucleotide diversity (Pi) was measured by SNPs in 10Mb windows across the entire genome using VCFtools (Danecek et al., 2011), and 2Mb steps were used to specify the step of size between windows. The average Pi of each population was calculated by mean of Pi value of all windows. Scheffe test was applied for the multiple comparisons of the means of the four populations using the R software (R Development Core Team, 2019). Pairwise F ST values (Weir & Cockerham, 1984) were assessed via SNPs in a 10 Mb sliding window with a 2 Mb step size across the entire genome using VCFtools (Danecek et al., 2011), and the significance test of the paired F ST was performed using permutation test (10,000 permutations).

| Genomic relatedness and inbreeding
In this study, two parameters, including genomic coancestry coefficient and identity by descent (IBD), were used to measure the genomic relatedness between any two individuals. The genomic coancestry coefficient between two individuals was estimated based on the half off-diagonal value in the genomic additive relationship, which were obtained using a mixed model based on the quantitative genetics model (Falconer & Mackay, 1996;. The formulation of genomic additive relationship was obtained by the genomic covariance, which can be expressed as following equation: where a represented a total of additive effect, α was a vector of additive effects of SNP markers, M α was model matrix of α, G a was genomic additive relationship matrix, V a was covariance matrix of genotypes between two individuals, and m was the number of SNP markers. G a was calculated using a definition of genomic correlation proposed by Da et al. (2014), which assumed equal SNP variances across SNP markers and was a within-SNP standardization of each SNP using its own SNP variance. The genomic coancestry coefficient between two individuals was estimated based on the half off-diagonal value in G a , being the Definitions VI, implemented by GCORRMX program in the GVCBLUP software using genomic information . A hidden Markov model method was implemented for estimating genomic IBD between two individuals, assuming independence between the loci, and the standard method and default settings suggested in the PLINK documentation were used in the present study (Purcell et al., 2007). The individual's genomic inbreeding coefficient (F) was estimated by observed and expected autosomal homozygous genotype counts for each individual via PLINK software (Purcell et al., 2007). Multiple comparisons were also implemented by using Scheffe test for population means of genomic relatedness and inbreeding coefficients.

| Population stratification and genetic structure
The genetic relationship matrix was extracted with the default parameters by principal component analysis (PCA) method using PLINK software (Purcell et al., 2007), and the first three feature eigenvectors were retained to create a two-dimensional plot. The genetic structure of the 121 individuals was analyzed with the genome-wide SNP markers via ADMIXTURE software (Alexander et al., 2009). Allele frequencies in the ancestral populations derived from multi-locus genotype frequencies, setting a number of clusters (K) ranging from 1 to 10 with 10,000 iterations, and ADMIXTURE's cross-validation with 5-fold were used to choose the correct value for K.

| Gene flow and effective population size
Gaussian approximation was used to find genetic drift from genomewide allele frequency data and then to infer gene flows of historical population splits and mixtures for the four populations of northern pike by using Treemix software (Pickrell & Pritchard, 2012a, 2012b. Population history was modeled from zero to four migration events (0 ≤ m ≤ 4), and the block size was set to 100 and calculated the proportion of variance correlated between populations explained by each model. The tree was visualized using the R script embedded in Treemix.
The effective population size (Ne) can be inferred by the linkage disequilibrium model, which requires accurate estimation of linkage disequilibrium (LD) statistics from sequencing data. For two loci, LD is typically given by the covariance or correlation of alleles co-occurring on a haplotype. Allele A has frequency p in the population (allele a has frequency 1−p), and B has frequency q (b has frequency 1−q). The covariance is denoted D, An extended approach for unbiased estimators for a large set of two-locus statistics including D 2 and 2 D was used to accurately estimate LD and calculate effective population size by Moments software (Ragsdale & Gravel, 2019).
At the same time, a bias-corrected version of the LD method was also used to infer the Ne, implemented with NeEstimator software (Do et al., 2014;Waples & Do, 2008). The 95% confidence intervals for point estimates were computed using 200 resampled bootstrap replicates in Moments and the parametric method in NeEstimator. Physical linkage was also taken into account when calculating Ne. In order to ensure that the locus used for LD analysis was located on a separate chromosome, so as not to underestimate Ne, an approach proposed by Waples et al. (2016) was applied to correct this bias, and the formulation was expressed as:

| Genetic diversity analysis for the populations
The value of Ho within each sampling population ranged from 0.299 to 0.323 (Table 1). The highest diversity was observed in LUL population, and the lowest diversity was found in LJD population. The results of multiple comparisons showed that there were significant differences in Ho of the four populations (p < .05). The value of He ranged from 0.314 to 0.317, and the highest diversity was observed in LUL and LJD populations, and the lowest diversity was found in RIR population (p < .05). The ranking of Pi results was similar to that of He, but the differences between populations were not significant (p > .05) in multiple comparisons; the detailed information is shown in Table 1.

| Genomic inbreeding level estimates for the individuals
The average genomic inbreeding coefficients (F) estimated from genome-wide SNP data were 0.083 for LJD, 0.063 for RIR, 0.039 for LBO, and 0.012 for LUL, respectively (Table 2). Among them, the inbreeding level of the LJD population and the RIR population was very high, and the average value was higher than the value of the first-level cousin mating (F ≥ 0.0625). The value of inbreeding coefficient for the LUL population was F = 0.012 ± 0.064, which was significantly lower than that of the other three populations in multiple comparisons (p < .05) ( Table 2)  Notes: Within a column with no common lowercase superscript (a, b, c, and d) differs significantly in multiple comparisons (p < .05). Sampling populations: RIR (Irtysh River), LUL (Ulungu Lake), LJD (a small lake nearby Ulungu River), LBO (Bosten Lake).

TA B L E 2
Average genomic inbreeding coefficients, pairwise relatedness by population F I G U R E 3 Genomic inbreeding coefficients of the 121 sequenced individuals of four populations. The genomic inbreeding coefficient (F) of each individual was estimated by observed and expected autosomal homozygous genotype counts for each individual via PLINK software (Purcell et al., 2007). Sampling populations: RIR (Irtysh River), LUL (Ulungu Lake), LJD (a small lake nearby Ulungu River), LBO (Bosten Lake) coancestry coefficient analysis (Figure 4b), except that no negative values appeared (Table 3).

| Population stratification and genetic structure analysis
The pairwise F ST between any two of the four populations ranged from 0.008 to 0.051, showing low genetic differentiation (Table 4).
Among them, the F ST between RIR and LBO was of the largest value (F ST = 0.051). Population stratification was detected by PCA method.
The 2D graphs for the first three PCs of these 121 individuals showed that the four populations were stratified, and only LJD and LUL showed a little bit overlapped. Among them, the RIR population had the highest genomic diversity ( Figure 5).

| Gene flow and spread history analysis
Gene flow results showed that the population splitting was from the RIR population into two branches, one was the LBO population, and the other continued to split into the LUL and LJD populations  Figure 7). When Treemix sets the migration from 0 to 4, the topology of the tree was the same, and only one migration signal from LBO to LUL appeared for all of migration from 1 to 4. Regarding gene flow, our results support migration signal from the LBO to LUL population (m = 1).

| Effective population size estimates
We used two different software (Moments and NeEstimator) to es-

| D ISCUSS I ON
Northern pike is widely distributed and have strong adaptability to various ecological types of habitats (Nordahl et al., 2019). However, it is interesting that there is a low genetic diversity among the populations (Bekkevold et al., 2015;Wennerström et al., 2016). The commonly used genetic diversity parameters consistently show low genetic diversity in northern pike, whether based on the results of low-density markers (such as isozyme, mitochondrial DNA, microsatellite), or high-density SNP markers (Pierce, 2012;Sunde et al., 2020). Our results showed that the Ho value of these pike populations in Xinjiang ranged from 0.299 to 0.323, which was higher than North American (0.110) and Siberian pike populations (0.140) (Senanan & Kapuscinski, 2000), which was comparable to Sweden pike populations (0.208-0.289) (Sunde et al., 2020), but it was obviously lower than the average Ho of 0.460 for freshwater fish in previous reports (Dewoody & Avise, 2000). In some water-scarce areas of Xinjiang, northern pike was particularly vulnerable to loss genetic diversity. Meanwhile, freshwater fish often suffer from habitat loss or fragmentation, small population mating, and illegal overfishing (Closs et al., 2015). Moreover, as the top predator, northern pike usually was the largest and most successful spawner in a particular aquatic ecosystem in Xinjiang, and the offspring were also most likely to survive (Pedreschi et al., 2014;Pierce, 2012). As long as a few individuals successfully expand their habitats, they can establish  The genomic coancestry coefficients and IBD of these four populations were all low, but coefficients of RIR and LBO were significantly higher than LUL and LJD. The low value of genomic coancestry coefficient and IBD for inter-population between RIR and other three populations, along with PCA and genetic structure results visually displayed, RIR population was relatively distant from the other three populations. This structure was a bit confusing because it did not match the field observation records of population spread. In this respect, we performed demographic history analysis using Fastsimcoal software (Excoffier et al., 2013), and it did not seem to give the possibility of another source of the populations.

RIR
The timing of the possible historical events did not match the fisheries observation data (results shown in supplementary materials).
One possible explanation was that the high level of inbreeding had caused rapid genetic drift in RIR population, leading to major changes in the genetic structure, while the other three populations still maintained similar allele frequencies when the population was established. This result suggested that the genetic structure of the other three populations may be closer to the original population of the Irtysh River than the current RIR population.
Besides, the ADMIXTURE cross-validation result showed that K = 2. When K = 2, researchers should pay attention to the substructures in the two branches (Janes et al., 2017). In order to prevent the RIR from being extremely different from the other three populations, the substructures of the three populations were compressed to form one group. We removed RIR and analyzed the genetic structure of the other three populations of LUL, LJD, and LBO. The results showed that no genetic structuring in or across these three populations (the results shown in the supplementary material).
In this study, there were some negative values in the genomic inbreeding coefficient and coancestry coefficient. In classical quantitative genetics, the estimated values of these two parameters would not appear negative. However, when using genomic information to estimate two genetic parameters above, sometimes negative values were generated. Usually, these small negative estimates were considered as small values close to zero or "unrelated" (Garbe et al., 2016).
The reasons for these negative values could be attributed to the following two points. On the one hand, in terms of definition, a negative genomic inbreeding coefficient value should not be interpreted as a probability as in classical genetics, but it should be interpreted as a correlation, so the F-estimation equation still applied (Powell et al., 2010). From the perspective of gamete IBD and identity by state (IBS), the average value of inbreeding coefficient between alleles in the current population was defined as 0, and the inbreeding coefficient between some pairs of gametes could be negative (Powell et al., 2010). One the other hand, the bias of the estimates might be caused by insufficient density and uniformity of SNP markers dataset obtained by simplified genome sequencing method such as SLAF-seq (Ackerman et al., 2017;Sun et al., 2013;Zhou et al., 2016).
LUL had the largest effective population, which might be partially attributed to the diversion channels that allow individuals of RIR and surrounding small lakes to have the opportunity to migrate into LUL population. At the same time, gene flow results showed some individuals from LBO were also brought into LUL population.  Ehrenfeld, 2010;Simberloff et al., 2013). In the conservation biology approach, invasive northern pike should be cleared from the lake ecosystems. However, given the fact that northern pike has been widely distributed and large effective population size in Xinjiang's natural water systems, our policy recommendation to the local fishery department is to rationally develop capture fisheries of northern pike under the condition of comprehensive consideration of commercial and ecological factors.

| CON CLUS IONS
In summary, the evidence we obtained from this study suggests that the current wild populations of northern pike in Xinjiang should be originated from the Irtysh River. At present, a high level of inbreeding was found for the northern pike population in the Irtysh River, and the genomic relatedness among individuals within the population was also very close. Meanwhile, the northern pike in Xinjiang has established stable populations in many waters of Xinjiang outside the original habitat. In terms of conservation genetics, the extinction risk of northern pike was very low in Xinjiang of China. Under the conditions of evaluating the amount of fishery resources and the genetic structure of the populations, the controlled capture fishery of northern pike could be developed reasonably.

ACK N OWLED G M ENTS
This research was funded by a special project on agricultural financial funding "Survey of fishery resources and environment in key waters in Northwest China" from the Ministry of Agriculture and Rural

Affairs of China, and the Central Public-interest Scientific Institution
Basal Research Fund (2019XT0605).

CO N FLI C T O F I NTE R E S T
None declared.

DATA AVA I L A B I L I T Y S TAT E M E N T
All the raw sequence data were submitted to BIG database (