Strong spatial genetic structure in a Baltic Sea herbivore due to recent range expansion, multiple bottlenecks and low connectivity

In the Baltic Sea, recent range expansions following the opening of the Danish straits have resulted in a low-diversity ecosystem, both among and within species. However, relatively little is known about population genetic patterns within the basin, except for in a few commercially caught species and some primary producers thought to be ecosystem engineers. Here, we investigate the population genetic structure of the ecologically important crustacean Idotea balthica throughout the Baltic Sea using an array of 33,774 genome-wide SNP markers derived from 2b-RAD sequencing. We also generate a biophysical connectivity matrix, with which we compare the genomic data. We find strong population structure on small scales across the Baltic Sea, and that genomic patterns in most cases closely match biophysical connectivity, suggesting that current patterns are important for dispersal of this species. We also find a strong signal of multiple bottlenecks during the initial range expansion, in the form of reduced heterozygosity along the historical expansion front. The lack of gene flow among sampling sites in the Baltic Sea environmental gradient potentiates local adaptation, while at the same time also increasing genetic drift in low-diversity areas.


Introduction
How species manage to colonize a novel environment remains somewhat of a conundrum in evolutionary biology (Bock et al. 2015). When we observe nature, examples are plentiful of organisms which have been able to successfully colonize and adapt to environments which are very different from their native ranges (e.g. Hill et al. 2011;Cao et al. 2016;Barker et al. The Baltic Sea is a multi-basin marginal sea, with rather strong oceanographic barriers separating the different basins (Leppäranta & Myrberg 2009), and there is also limited oceanographic connectivity between Scandinavia and the mainland European coast (Moksnes et al. 2014) potentially preventing dispersal of coastal organisms. Thus, we might expect to see signs of reduced gene flow among populations in the Baltic. However, studies of gene flow and population structure are scarce in the Baltic Sea. The studies made to date focus mainly on a few commercially caught fish species (Wennerström et al. 2017), where patterns differ substantially. Some fish, such as perch and herring seem to be structured geographically (Olsson et al. 2011;Teacher et al. 2013), as are species with distinct spawning areas such as cod and salmon (Berg et al. 2015;Poćwierz-Kotus et al. 2015). However, in other fish, such as three-spined sticklebacks and turbot this is not the case (except in adaptive loci) (Nielsen et al. 2004;Guo et al. 2015), indicating that population genetic patterns are strongly determined by life history. In addition, a few studies have been made on ecosystem engineering species, mainly the brown algae Fucus vesiculosus and Fucus radicans (Pereyra et al. 2013;Ardehed et al. 2016) where population structure has been observed. However, in Baltic Sea Fucus species, high incidence of asexual reproduction in the Bothnian Sea confuses the overall pattern of genetic structure (Johannesson et al. 2011). These few studies highlight examples of interesting genetic patterns along a colonization front across an environmental gradient such as the Baltic Sea, yet says little about the generality of these patterns. In order to get a better understanding of population genetic mechanisms involved in colonization and range shifts, it is necessary to also study other types of organisms with different life history traits.
Idotea is a genus of isopod crustaceans common in the coastal marine environment. Three Idotea species have been able to colonize the low salinity conditions of the Baltic Sea in the 8 balthica, I. granulosa and I. chelipes) (Leidenberger et al. 2012). Idotea are generalist grazers on different algae and sea grasses, but also have the uncommon ability of being able to survive and grow on a diet of brown algae of the genus Fucus (Bell & Sotka 2012). This makes these isopods particularly ecologically important in the Baltic Sea, where the depauperate ecosystem is based on Fucus algae as habitat engineers in large parts of the coastal zone, and where Idotea is the only grazer on Fucus. Indeed, in some areas in the Baltic, densities of I. balthica can rise to astonishing numbers and in some cases this can cause a total loss of Fucus in an area to grazing (Gunnarsson & Berglund 2012). The three isopod species have colonized the Baltic to different extents, with I. balthica going furthest into low-salinity areas in the Bothnian Sea and the Gulf of Finland, followed by I. chelipes and I.granulosa, respectively (Leidenberger et al. 2012), which suggests differences in adaptive capability to low salinity conditions. A recent phylogenetic study of the Idotea species in the Baltic suggested that all three species have independently colonized the brackish-water habitat (Panova et al. 2016), which might suggest an adaptive pre-disposition.
However, the observed differences in colonization might also be due to limitations to gene flow across oceanographic barriers rather than environmental selection. Idotea species brood their young, and thus have no larval dispersal phase, which likely make dispersal across the Baltic Sea rare. On the other hand, they are strong swimmers, and they have the potential to raft long distances with free-floating algae (Thiel & Gutow 2005), which could break down connectivity barriers. Investigations of within-species (neutral) population genomic patterns of the isopods in the Baltic Sea might shed light on the respective roles of connectivity barriers and colonization history in shaping the observed distribution of the species, yet no such studies have been done to date.
Here, we perform the first population genomic study on I. balthica, and the first study on population genomic patterns across the salinity gradient in the Baltic Sea for a crustacean. We collected isopods from 32 locations spanning from the North Sea to the innermost areas of their distribution range in the Bothnian Sea and the Gulf of Finland. Using 2b-RAD sequencing (Wang et al. 2012) we identify ca. 35 000 SNP markers randomly distributed across their genome, which we use to estimate population differentiation, barriers to dispersal, and areas of high-and low diversity. We also estimate a connectivity matrix using a biophysical model based on the oceanographic circulation, which we then compare to the genomic data. We hypothesize that the life history traits, together with the recent colonization of the Baltic Sea have resulted in strong population fragmentation along the Baltic Sea coast.
We predict that the population genetic patterns will follow an isolation-by-distance pattern, where oceanographic connectivity will largely explain the genetic patterns. Further, we also predict to see a signal of sequential bottlenecks during the recent range expansion in the population genomic data, in the form of reduced genetic variability further into the Baltic Sea. All bioinformatic analyses were run on the University of Gothenburg computer cluster 'Albiorix' (http://albiorix.bioenv.gu.se/). All commands used in the analyses can be found here: https://github.com/DeWitP/BONUS_BAMBI_IDOTEA . An unpublished draft genome assembly for I. balthica was used as a reference for mapping the 2b-RAD data, using bowtie2 (Langmead & Salzberg 2012)(Information for the genome project can be found at https://github.com/The-Bioinformatics-Group/Idotea_genome_project). Then, the standard GATK pipeline (McKenna et al. 2010) was followed for SNP calling (using the UnifiedGenotyper), including InDel realignment, quality score recalibration (BQSR) and variant score quality recalibration (VQSR), using sites identically genotyped across all technical replicates as a "True" training set for the machine learning algorithm. Poorly genotyped individuals and sites genotyped at < 80 % of all individuals were filtered out, as well as highly (>75 %) heterozygous sites. The dataset was pruned in order to keep only one SNP site/locus. Finally, technical replicates were discarded from further analysis.

Summary statistics
Mean pairwise FST, as well as population differentiation p-values using Fisher's exact probability test, were calculated for all population pairs using GENEPOP (Raymond & Rousset 1995;Rousset 2008). Pairwise FST were plotted on metric MDS-plots using R. June-August and December to February, respectively. Summer and winter salinity were found to be strongly correlated, and thus winter salinity was removed as a factor. Also, the asymmetrical distance values "d" (Jost's D), "gst" (Nei's GST) and "Nm" (Alcala et al.

Principal Components Analysis
An identity-by-state (IBS) distance (1 -IBS) matrix was calculated from the full 33,774 SNP dataset, using plink 1.9 (Purcell et al. 2007). A hierarchical cluster dendrogram was produced using the hclust function in R, after which a Canonical Analysis of Principal coordinates (CAP) was performed using the Vegan R package. This method is similar to a PCA, but it rotates the eigenvector axes in such a way as to maximize the among-site differences. Also, a test of the significance of collecting site as a factor was tested through an ANCOVA within the Adonis R package.
In addition, a global FST outlier analysis was performed using Bayescan v 2.1 (Foll & Gaggiotti 2008). IBS-matrices were calculated separately for outliers and non-outliers, and CAP analyses were performed on the two datasets, in order to determine if outliers had disproportionate or distortive effects on the genetic structure analyses. As outliers and nonoutliers showed similar patterns, outliers were not discarded in downstream analyses.

Geographically explicit population inference
Genotype likelihoods for all SNP sites were extracted from the vcf files using bcftools, which were input to NGSadmix (Skotte et al. 2013) for population cluster analysis. NGSadmix uses a probabilistic framework to infer ancestry. K of 2 -32 was used, with 10 replicates of each.
The output q-matrices of NGSadmix were combined for each K and plotted as barplots using CLUMPAK online (http://clumpak.tau.ac.il/index.html). Cross-validation scores were examined for each K to infer the most probable number of population clusters. In addition, separate NGSadmix analyses were run for the Skagerrak/Kattegat/Danish Straits area (K=2-8) and the North part of the Swedish Baltic Sea coast and Finland (K=2-12), as sampling sites in these two areas were non-significantly separated in the CAP analysis described above.
In order to extrapolate ancestry coefficients on geographic scales, the R package TESS (Caye et al. 2016) was used. We chose K = 12 for plotting, guided by cross-validation scores. A map of the Baltic Sea was downloaded from https://maps.ngdc.noaa.gov/viewers/wcs-client/, from which a grid was generated with a constraint on elevation from 1 m to -50 m. Finally, ancestry coefficients for the 12 different clusters were geographically imputed and plotted on the map in different colors.
In addition, variability in genetic diversity and effective dispersal were modelled and extrapolated geographically using the software package EEMS (Petkova et al. 2015). This software uses an isolation-by-distance model with stepping stones ("demes") as a null model, and outputs deviations from the model, which can be inferred to be barriers to (or corridors of) dispersal. A full-rank distance matrix was generated by imputing missing genotype values with observed mean genotype at each SNP site, as implemented by the "bed2diffs_v2" tool distributed with the EEMS software. Geographic coordinates from 86 vertices specifying a polygon enclosing the Baltic Sea were extracted from Google Earth (REF), after which the EEMS software was run in three separate mcmc chains, each with default lengths (2 M iterations, burn-in 1 M iterations, writing every 10 K iterations to file), using 300 demes. The output was examined for convergence, and plotted on a map of Europe using the rEEMSplots R package (distributed with EEMS).

Genetic vs biophysical connectivity
The correlation between genetic differentiation and multigenerational connectivity estimated from the biophysical model was tested with Mantel tests (Mantel 1967). We used four metrics of genetic differentiation: pairwise FST, and calculated with divMigrate the relative migration using GST, relative migration using D and relative migration using nm. Correlations with both untransformed and logarithmically transformed (log10(x+1e-50)) connectivity matrices (64 generations) were tested. Because pairwise FST data are symmetric we extracted symmetric connectivity matrices by using either the minimum connectivity (minimum of i to j and j to i), the maximum connectivity or the mean connectivity. For the asymmetric genetic metrics (from divMigrate) we used the full connectivity matrices. Since most scripts of the Mantel test only accepts symmetric matrices, a script for tests of asymmetric matrices was written using Matlab (v 2018a, Mathworks Inc).

Genotyping
The 2b-RAD libraries generated µ=8.64 M reads ± sd 3.72 Mreads, and mapping rates (not including multiple hits) were µ = 20.23 % ± sd 2.26 % (Supplementary data). Using 25,329 loci which had identical genotypes across all replicate pairs as a "true" training set for the VQSR, we estimated the true transition/transversion ratio (Ti/Tv) to be 1.348. By examining different truth-sensitivity tranches, we could observe a sharp drop in Ti/Tv at above 95 % truth sensitivity, so we used 95 % as the cut-off tranche for SNP filtering. After filtering for highly heterozygous (likely lumped paralogous) loci, and thinning the dataset to one SNP/ RAD locus, we were left with 57,641 SNPs, and after filtering out poorly genotyped sites, the final genotype dataset consisted of 33,774 SNP sites, genotyped at a minimum of 80 % of 595 individuals (Supplementary data), sampled at the 32 different locations (Figure 1).

Summary statistics and population differentiation
Pairwise FST values are summarized in Table 1, above

PCA
The hierarchical clustering ( Supplementary Fig 1) indicated that some pairs of individuals were closely related to each other. In all cases, the pairs consisted of two individuals from the The barrier analysis, which is a way to project the multigeneration connectivity matrix in a geographic dimension, also identified regions with high internal connectivity within the Baltic Sea, which were separated by barriers with high resistance to dispersal (Figure 9). The number of regions varied according to the selected threshold of allowed mean dispersal among regions, and our model with the lowest cut-off value ( Figure 9D) most closely matches our genetic data, in that it identifies the separation between the Swedish Bothnian Sea population and others. However, this cut-off value also separates northern from southern Gulf of Finland populations, which we do not see in the genetic data; here the barriers in Figure 9C better represent the genetic data. The barrier analysis averages connectivity values to and from sites, and does thus not well represent instances of highly asymmetric gene flow, which could explain this discrepancy.
The biophysical model (mean connectivity) is strongly correlated with pairwise genetic distance (FST) (Mantel p < 1e-5, R= -0.45). Asymmetrical distance values provide a slightly lower fit (Nm R = 0.41, GST R = 0.41, D R = 0.39) but are still strongly significantly correlated to the biophysical connectivity (p < 1e-5 for all three). Interestingly, a simple measure of geographic distance (measured along the shortest coastline connecting two sites) is also significantly correlated to the pairwise FST matrix (R = 0.41, p< 1e-5).

Discussion
The population genomic data produced here paints a clear picture of a highly structured metapopulation of I. balthica in the Baltic Sea. In general, gene flow even between neighboring sites seems to be low, sharply contrasting with patterns in fish species such as sticklebacks (DeFaveri et al. 2013). Further, the genetic patterns match the biophysical connectivity model closely, indicating that limitations to dispersal are the most important factor affecting genetic structure in this species. As this species broods its young, there is no pelagic larval dispersal.  Figure 9). The barrier analysis may be viewed as the sequential breaking up into wellconnected regions as the threshold of allowed inter-regional dispersal is continuously relaxed (Nilsson Jacobi et al. 2012). Some dispersal barriers are consistent across a range of dispersal thresholds (e.g. between western Estonia and the Gulf of Finland), whereas other barriers are more labile (e.g. the Swedish east coast). In general, the barrier analysis well predicted the population structure although additional factors, not considered in the biophysical model, e.g.
wave exposure and habitat distribution may further modify the resistance to gene flow. There are a few exceptions to the correlation between the biophysical model and the genetic data.
One of the most striking genetic patterns is that the Swedish Bothnian sea populations (DJU and KUG) seems to be strongly isolated from all other sites, while the biophysical model indicates that dispersal can occur along the Swedish coast except in the most stringent setting of the barrier analysis ( Figure 9D). This is obvious both from pairwise FST values and from the low genetic diversity in these areas. However, when also the possibility for asymmetric gene flow is considered (heat-map in Figure 8), the biophysical model shows that DJU and KUG are almost isolated from incoming gene flow, although they are expected to act as sources to other locations (e.g. in eastern Finland). It seems that the Åland archipelago and the current patterns in this area provide an effective barrier to westward gene flow, instead redirecting dispersal from Sweden to the Finnish archipelago sea, and then northward along the Finnish Bothnian Sea coast, rather than along the Swedish Bothnian Sea coast. Another striking feature is that while the biophysical model indicates potential dispersal from Finland to Sweden in the northernmost part of the Bothnian Sea (the "Kvarken" area), and that I. balthica is quite common on the Finnish side (SOD/HAL), we do not find any isopods on the Swedish side. In fact, our Kuggören (KUG) sampling site is the most northern point of the species distribution on the Swedish coast. There is plenty of available habitat (Fucus belts) also to the north of this site, although not continuous all along the coast, and a recent niche modelling study (Leidenberger et al. 2015) also showed that the northern Bothnian Sea coast should be inhabitable to the isopods. However, perhaps the environmental conditions (low salinity/high ice cover), fragmentation of Fucus belts, and the lack of available habitat in the Kvarken area restrict dispersal from Finland to Sweden, while the prevailing southerly current along the coast effectively limits the isopods' ability to disperse northward along the Swedish coast. Although these animals are tolerant to low salinity conditions, there is a limit beyond which the animals cannot survive (Leidenberger et al. 2012;Rugiu et al. 2018;Kotta et al. 2019). Before this point, however, reproduction might be affected due to low fertilization success (Vuorinen et al. 2015).
The Finnish Bothnian sea coast, on the other hand, is one large mixed population, with gene flow mainly occurring from the south towards the north. As prevailing currents move northward along the Finnish coast, this is perhaps not so surprising. It might be speculated that the northernmost edge of the Finnish population might have low reproductive output, which would explain the lack of dispersal to the Swedish side, while at the same time maintaining high population density due to constant input from more southerly areas. The strong genetic separation of the Swedish and Finnish Bothnian Sea coasts can be contrasted to genetic patterns observed in Fucus, where F. radicans is well-mixed throughout the region, whereas F. vesiculosus is subdivided both among countries and also among sampling sites within both countries (Pereyra et al. 2013). This indicates that different processes are new knowledge about population genetic patterns will help support management efforts in the future. A recent study (Sandström et al. 2019) found that a major reason for local managers not using genetic diversity in day to day operations, is that they find this type of information difficult to interpret. We thus here have tried to provide easy-to-interpret maps of genetic diversity (Figure 7) and population subdivision (Figure 6), which we hope will be useful for managers, e.g. in defining management units. One important such would be the Swedish Bothnian Sea coast, which hosts an isolated population of small size. Establishment of MPAs in that area would probably be very beneficial to the coastal animals there. Updated management advice will be continually posted on the BAMBI project website at: https://bambi.gu.se/baltgene.
In this study, we identify 487 global FST outliers. Population structure in outliers mirror those in putatively neutral loci, indicating that the range expansion and natural selection processes have acted in concert. However, it might be possible to use the independent colonization events into the Bothnian Sea and the Gulf of Finland, in order to identify loci involved in adaptation to low salinity. An interesting avenue for further research would thus be to perform association tests of allele frequencies at individual loci with environmental parameters (primarily salinity), while controlling for the neutral population structure. In addition, further experiments linking genotype and phenotype would be necessary in order to gain more conclusive evidence concerning genes involved in salinity adaptation in the Baltic Sea.
Although we currently do not possess knowledge of the function of outlier loci, it is possible that future genomic information, when available, could help with annotation. Identification of alleles involved in local adaptation will be relevant for predicting range shifts in the face of ongoing climate change. A significant warming and decrease in salinity is expected for the Baltic Sea until the end of this century. Considering the apparent poor dispersal ability of I.
balthica it may be difficult to track the receding salinity gradient by locally adapted genotypes resulting in wide local extinctions. Consequently, it will be important to monitor the northern distribution limit of this important grazer in the future as the environment continues to change, and conservation efforts might need to be focused on facilitating dispersal southward along the salinity gradient, especially along the Swedish Bothnian Sea and northern Baltic Sea coasts.
This work resulted from the BONUS BAMBI project was supported by BONUS (Art 185), funded jointly by the EU and the Swedish research council FORMAS. We would like to thank the Swedish National Genomic Infrastructure (NGI) and the SNP&SEQ Technology platform at Uppsala University for sequencing. All bioinformatics analyses were run on the Albiorix            Pop   SYL TJA ESP VEJ HER FAL BON KIE  NEU SAS KIV  OTT BOR VAS NYN OST DJU KUG HAL SOD STO SAL BJO RAU SKR HAN HEL LET PAK SAR PUL PAN  Nind  20  20  20  20  20  9  16  20  20  20  20  19  20  20  20  20  19  12  20  20  20  20  20  21  19  19  19  16  20  10  20  16  SYL  20