Impact of demography on linked selection in two outcrossing Brassicaceae species

Abstract Genetic diversity is shaped by mutation, genetic drift, gene flow, recombination, and selection. The dynamics and interactions of these forces shape genetic diversity across different parts of the genome, between populations and species. Here, we have studied the effects of linked selection on nucleotide diversity in outcrossing populations of two Brassicaceae species, Arabidopsis lyrata and Capsella grandiflora, with contrasting demographic history. In agreement with previous estimates, we found evidence for a modest population size expansion thousands of generations ago, as well as efficient purifying selection in C. grandiflora. In contrast, the A. lyrata population exhibited evidence for very recent strong population size decline and weaker efficacy of purifying selection. Using multiple regression analyses with recombination rate and other genomic covariates as explanatory variables, we can explain 47% of the variance in neutral diversity in the C. grandiflora population, while in the A. lyrata population, only 11% of the variance was explained by the model. Recombination rate had a significant positive effect on neutral diversity in both species, suggesting that selection at linked sites has an effect on patterns of neutral variation. In line with this finding, we also found reduced neutral diversity in the vicinity of genes in the C. grandiflora population. However, in A. lyrata no such reduction in diversity was evident, a finding that is consistent with expectations of the impact of a recent bottleneck on patterns of neutral diversity near genes. This study thus empirically demonstrates how differences in demographic history modulate the impact of selection at linked sites in natural populations.


| INTRODUC TI ON
The relative roles of selection and genetic drift in shaping genome-wide variation are of great interest in evolutionary genetics. The genetic variation carried by a species is produced by mutation, but in finite populations some neutral variants are lost by chance while others may increase in frequency through genetic drift. The strength of genetic drift is determined by the effective size of the population, N e (Fisher, 1930;Wright, 1931).
In contrast, negative selection decreases and positive selection increases the frequency of the selected variant deterministically reducing variation at the site under selection. Due to linkage disequilibrium, the associated reduction in variation is not limited to the site under selection itself but extends to nearby sites. The effect of selection at linked sites ("linked selection") can be a result of either selection against deleterious mutations, known as background selection (Charlesworth, Morgan, & Charlesworth, 1993), or selection favoring advantageous mutations, that is genetic hitchhiking (Maynard Smith & Haigh, 1974). Since a large proportion of new mutations occurring in functional regions have fitness effects , linked selection is likely of wide importance for determining levels of polymorphism within populations (Charlesworth, 1994;Charlesworth, Charlesworth, & Morgan, 1995;Kim & Stephan, 2002). Linked selection has also been suggested to account for the narrow range of diversity across species despite potentially large differences in census population sizes (Corbett-Detig, Hartl, & Sackton, 2015;Gillespie, 2001). However, the impact of linked selection does not seem to be sufficient to fully explain variation in diversity levels among species (Coop, 2016). Nevertheless, understanding the impact of linked selection, especially in the form of background selection, is important for inference of adaptation, since genomic regions experiencing different magnitudes of background selection are expected to harbor different baseline levels of diversity (Burri et al., 2015;Comeron, 2014Comeron, , 2017Elyashiv et al., 2016;Huber, DeGiorgio, Hellmann, & Nielsen, 2016). Furthermore, background selection causes distortion of gene genealogies such that site frequency spectra may be expected to exhibit an excess of rare alleles (Nordborg, Charlesworth, & Charlesworth, 1996;Zeng & Charlesworth, 2011). If not properly accounted for, linked selection may thus give rise to a false signal of population size expansion in demographic inference (Ewing & Jensen, 2016) and result in incorrect model selection (Schrider, Shanku, & Kern, 2016). As linked selection can affect a large proportion of the genome (Pouyet, Aeschbacher, Thiéry, & Excoffier, 2018;Woerner, Veeramah, Watkins, & Hammer, 2018), characterizing its effects is of great importance.
Since the extent over which selection at linked sites can affect polymorphism depends on the amount of linkage disequilibrium, linked selection causes a positive correlation between diversity and recombination rate (Hudson & Kaplan, 1995;Nordborg et al., 1996). In regions with high recombination rates, the effect of selection at linked sites is less pronounced (Charlesworth et al., 1993;Kaplan, Hudson, & Langley, 1989;Thomson, 1977), a pattern observed in multiple species (reviewed by Cutter & Payseur, 2013;Slotte, 2014). The first empirical evidence supporting an important effect of linked selection on patterns of genetic diversity was in Drosophila melanogaster, where Begun and Aquadro (1992) found a correlation between diversity and recombination rate. However, recombination may have a mutagenic effect itself that can lead to such a correlation even in the absence of linked selection. This effect can be tested by taking into account mutation rate variation along the genome, for example by using synonymous divergence as a proxy (Begun & Aquadro, 1992;Hellmann, Ebersberger, Ptak, Pääbo, & Przeworski, 2003). Other factors may also impact the diversity distribution along the genome and the effect of linked selection. For example, the impact of linked selection may depend on variation in the density of functional sites, which may further be correlated with recombination. If recombination rate and gene density are correlated (Flowers et al., 2011;Slotte, 2014), the relationship between recombination rate and diversity may be more complex.
Furthermore, a difference in the pattern and magnitude of linked selection is expected in populations with different demographic histories, as for instance in humans and chimpanzees (Pfeifer & Jensen, 2016). At equilibrium, the rate of random loss of variation due to genetic drift is lower in populations with a large N e , and hence, they can carry more variation (Wright, 1931). In addition, selection efficacy depends on the effective population size and selection is only efficient relative to drift if |N e s| > 1, where s is the selection coefficient.
Thus, variants with a small fitness effect behave neutrally in small populations (Ohta, 1973). Similarly, at equilibrium, the strength of background selection scales with N e , with stronger linked selection in large populations (Nordborg et al., 1996). However, the magnitude of linked selection may vary depending on the combination of population parameters such as selection coefficient and N e (Nam et al., 2017). Furthermore, if population size fluctuates over time, alleles of different age may experience different magnitude of linked selection, as was found in a study that contrasted diversity in maize and teosinte (Beissinger et al., 2016). Thus, under nonequilibrium demographic models, diversity patterns may differ greatly from predictions based on classical models assuming equilibrium (Torres, Stetter, Hernandez, & Ross-Ibarra, 2019). An improved understanding of the interplay between demographic history and linked selection is thus of broad importance.
To empirically investigate genome-wide patterns of linked selection in populations with contrasting nonequilibrium demographic histories, we studied genome-wide diversity patterns in two herbaceous outcrossing Brassicaceae species, Arabidopsis lyrata and Capsella grandiflora. We chose these species because they have similar mating systems (both are self-incompatible outcrossers) and genome structure, and because previous research suggests considerable differences in the demographic histories of some populations. Indeed, the recent colonization history of A. lyrata especially in northern Europe has likely been associated with strong population bottlenecks or effective population size reduction (Mattila, Tyrmi, Pyhäjärvi, & Savolainen, 2017;Pyhäjärvi, Aalto, & Savolainen, 2012;Ross-Ibarra et al., 2008;Savolainen & Kuittinen, 2011), which is in contrast to the large and relatively stable population of C. grandiflora (Douglas et al., 2015;Slotte, Foxe, Hazzouri, & Wright, 2010;Slotte et al., 2013;St. Onge, Källman, Slotte, Lascoux, & Palmé, 2011).
Hence, this species pair offers an opportunity to investigate the interplay between demographic history and linked selection. Here, we first revisit the demographic analysis, investigate patterns of genome-wide variation, and quantify the magnitude of linked selection using whole-genome resequencing data from natural populations.
Our primary assumption was that linked selection is stronger in the population with a larger long-term effective population size. We also discuss our results in the light of expected effects of linked selection after population size change.
We chose to use the C. rubella genome as it is considerably less fragmented than currently available C. grandiflora assemblies, and because C. rubella and C. grandiflora are very closely related (split time estimated to <200 kya; Koenig et al., 2019;Slotte et al., 2013; approximately 84% of polymorphisms in C. rubella are shared with C. grandiflora; Foxe et al., 2009). Our sequence processing and mapping approach followed a pipeline developed in Mattila et al. (2017), with slight modifications. Briefly, sequence data were first processed with Trimmomatic 0.32 (Bolger, Lohse, & Usadel, 2014) in pairedend mode with the TrueSeq2 paired-end adapters and the following options-ILLUMINACLIP/TruSeq2-PE.fa:5:20:10 -LEADING 10 -TRAILING 10 -SLIDINGWINDOW 10:20 -MINLEN 50. Both read pairs were required to pass the trimming for further analysis.
Reads were then mapped to reference genomes using bwamem (Li, 2013;Li & Durbin, 2009) with options -M -r 1. Duplicates were marked, read groups added, and alignment statistics calculated with PICARD TOOLS v. 1.113 (http://broad insti tute.github.io/ picard). Overlapping read pairs were trimmed using BamUtil v. 1.0.13 (https ://genome.sph.umich.edu/wiki/BamUtil) using the option unmapped. Indels were marked and realigned with Genome Analysis Toolkit v. 3.2.2 (De Pristo et al., 2011). We utilized the pipeline tool STAPLER (Tyrmi, 2018) for automatic creation and management of scripts for running processes on a computer cluster. The proportion of mapped reads was very similar for both C. grandiflora and A. lyrata (approximately 90%). Mapping statistics and additional summary per individual data are available at supporting information S2.

| Quantifying the patterns of polymorphism
Site frequency spectra (SFS) for both species were calculated using ANGSD v. 0.913-56-g5b7889d (Korneliussen, Albrechtsen, & Nielsen, 2014;Nielsen, Korneliussen, Albrechtsen, Li, & Wang, 2012). The estimated SFS were used as priors for estimating nucleotide diversity (π) (Nei & Li, 1979) and Watterson's theta (Watterson, 1975)  For diversity calculations all repeat annotated regions, sites with fixed heterozygosity (allowing 20% of missing data) ±200 base pairs and conserved noncoding regions (Haudry et al., 2013) were masked in both species. Additionally, regions aligning against Arabidopsis thaliana organelle genomes or regions with discordant sequence between the NCBI and Ensembl reference genomes were excluded from A. lyrata data. Species alignments were obtained from Ensembl Compara for A. lyrata.
The following filtering settings were used for the diversity calculation in ANGSD: -remove_bads -unique_only -minMapQ 30 -minQ 20 -only_proper_pairs 1 -trim 0. For both species, mean depth plus 3 * SD was used as individual maximum depth. We further required 80% of data presence ensuring that the analyses are not based on single or a few individuals. This filters out poorly covered regions some of which may contain structural or indel variants. Such regions may suffer from allele dropout biasing diversity estimates. Doing so we potentially exclude some fast-evolving regions but we were not, however, able to cover the variation in these regions with the current dataset.
Moreover, this filter should not bias comparisons between species.

| Demographic inference
While previous studies have investigated demographic history in both A. lyrata and C. grandiflora, earlier studies have not always used sets of sites that are robust to the impact of linked selection. We therefore opted to revisit this question and conduct demographic inference using our data. To do so, we used the diffusion approximation method implemented in the Python library ∂a∂i (Gutenkunst, Hernandez, Williamson, & Bustamante, 2009). Five different demographic models covering a variety of possible single population demographic histories were considered: constant, two-epoch, three-epoch, growth, and bottlegrowth ( Figure S1). For each model, ∂a∂i was run with 500 starting points followed by a grid search around the best likelihood values. The best model was chosen using the Akaike information criterion (AIC). For demographic inference, we used site frequency spectra obtained for intergenic sites at least 2,000 bp from genes (using BEDtools slop) and within the upper quartile of recombination rates, as measured in 50 kb windows. Such genomic regions have previously been suggested to be least affected by linked selection and thus useful for demographic inference (Coop, 2016). These sites are denoted by intergenic high-rec hereafter. After filtering, all eight chromosomes were represented in the C. grandiflora dataset, but in the A. lyrata dataset windows only from chromosomes 1, 2, 3, 5, and 7 were included. Unfolded SFS and 100 bootstrap SFS were estimated with ANGSD as described above. We opted for using the folded SFS for demographic inference to avoid potential biases due to ancestral state misinference. Confidence intervals for demographic parameter estimates were calculated from analyses of 100 bootstrap replicates.

| Divergence
Divergence at 4-fold and 0-fold sites (d 4 and d 0 ) was estimated from a whole-genome alignment of C. rubella-A. lyrata (Steige et al., 2017). Window-based divergence (50 kb) was calculated by dividing the sum of divergent sites by the total number of sites at which divergence could be reliably assessed within each window, using the BEDtools map option.

| Recombination rate estimates
The recombination map for A. lyrata was obtained from Hämälä, Mattila, Leinonen, Kuittinen, and Savolainen (2017). For C. grandiflora, we used the genetic map from Slotte, Hazzouri, Stern, Andolfatto, and Wright (2012) which is based on an interspecific F2 mapping population between C. grandiflora and C. rubella. As the F2 population did not exhibit signs of genetic incompatibilities, and as the two species are very closely related (split time < 200 kya; Koenig et al., 2019;Slotte et al., 2013), we expect this genetic map to provide a reasonable estimate of broad recombination rate variation in C. grandiflora. Maps were thinned to include only markers at least 10 kb apart by dropping all markers for which distance to the previous included marker was less than the given threshold, starting from the beginning of the chromosome. In addition, ascending order of marker physical positions in relation to genetic positions was required. Hence, a large inverted region in A. lyrata chromosome 7 was excluded (approximately 4.3 Mb sized region). Recombination rates per base pair (Figures S2 and S3) were estimated from the recombination map using R (R Core Team, 2016) function smooth spline interpolation with smoothing parameter 0.7.

| Codon usage bias
We quantified codon usage bias as the effective number of codons, estimated using a method accounting for background base composition (Novembre, 2002). Base composition corrected effective number of codons (N c ') was calculated as implemented in the program ENCprime (https ://github.com/jnove mbre/ENCprime).

| Transposable element content
Transposable element content was estimated using RepeatMasker (Chen, 2004) using an Arabidopsis TE database for both species.
From the raw repeat annotation, simple repeats were excluded to limit the analysis to transposable element repeats only.

| Distribution of fitness effects and estimate of positive selection
We estimated the distribution of fitness effects of new mutations and the proportion of substitutions fixed by positive selection at 0-fold degenerate sites using DFE-alpha v.2.15 (Eyre-Walker & Keightley, 2009;. Mutations at these sites are amino acid changing and may thus have an effect on fitness. For these analyses, we considered 4-fold degenerate sites as neutral. Since demographic changes may have an effect on DFE estimation, we estimated DFE taking into account the demographic history estimated from the neutral site category as implemented in DFE-alpha . The population-based selection coefficients (N e s) were categorized in three groups: nearly neutral (N e s < 1), intermediate (1 < N e s < 10), and strongly deleterious (N e s > 10). Divergence at 4-fold and 0-fold sites was calculated as described above and used to estimate the proportion of positively selected 0-fold degenerate fixations (α), and ω a , a measure of the proportion of positively selected 0-fold fixations relative to neutral divergence (Gossmann et al., 2010). SFS and 100 bootstrap replicates for both site categories were estimated with ANGSD as described above. The run with the highest likelihood out of five independent runs was retained. This procedure was repeated for the 100 bootstrap SFS to get 95% confidence intervals for the estimated parameters.

| Factors explaining neutral diversity
In order to assess the impact of linked selection on the level of neutral diversity, we fit a model with diversity at 4-fold degenerate sites (π 4 ) as response variable and recombination rate and the number of functional sites, measured as the number of base pair annotated as part of an exon (exonic bp), as explanatory variables using the "glm" function in R. To account for mutation rate and other factors possibly correlated with diversity, we included divergence at 4-fold degenerate sites (d 4 ), codon usage bias (N c '), and transposable element (TE)% in the models as additional explanatory variables. All variables were calculated in 50 kb nonoverlapping windows. The additional covariates were included in the model since recombination rate has also been shown to correlate with transposable element (TE) density (Bartolomé, Maside, & Charlesworth, 2002;Rizzon, Marais, Gouy, & Biémont, 2002;Tenaillon, Hollister, & Gaut, 2010)  To allow reasonably robust divergence estimates, windows with <500 sites to estimate π 4 (callable sites) and d 4 (aligned sites) were excluded. In addition, we excluded windows with d 4 > 0.3 and recombination rate > 9 × 10 -6 cM/bp (single outlier peak in Capsella) which might indicate alignment or genotyping errors, respectively.
A model selection procedure was applied to evaluate the importance of each explanatory variable in the model using stepAIC from R package MASS (Venables & Ripley, 2002) with Bayesian information criterion (BIC). All variables were centered and scaled prior to analysis. The effect of multicollinearity of the explanatory variables was evaluated with variance inflation factor (vif) analysis using the vif function of the R car package (Fox & Weisberg, 2011) with √vif < 2 was required for each variable. Divergence at 0-fold degenerate sites (d 0 ) was excluded from the model due to high variance inflation in A. lyrata. A single outlier in the A. lyrata dataset with exceptionally high diversity (which may indicate paralogous mapping) was removed.

| Diversity around genes
Diversity was first calculated in 1 kb windows for intergenic sites.
The distance to the nearest gene of each window was calculated with BEDTools closest option. The recombination maps were used to convert the physical distances into genetic distance (cM). The genome-wide ratio of per window and global intergenic diversity (π/ mean π) was estimated with the smooth.spline R function from the observed window-based data, and 95% confidence regions were generated using 100 bootstrap datasets. These distributions were used to calculate significance in each distance category.

| A. lyrata and C. grandiflora differ in levels and distribution of diversity
After quality filtering and region masking, we retained 60.5 million sites in A. lyrata and 61.8 million sites in C. grandiflora for analyses of polymorphism levels. Of these sites, 1.3 million were variable in the A. lyrata population and 4.3 million in the C. grandiflora population. Without quality filtering, there were (the masking settings were not changed) 73.4 million sites in A. lyrata, 2.1 million of which were variable, and 70 million sites in C. grandiflora, 5.2 million of which were variable. The largest difference after filtering was observed at intergenic sites (Table 1).
The average nucleotide diversity at 4-fold degenerate sites (π 4 ) was higher for the C. grandiflora population than for the A. lyrata population (0.0224 vs. 0.0101, respectively; Table 2, Wilcoxon rank sum test, W = 839,980, p-value < 0.001). Higher π in C. grandiflora than in A. lyrata was also observed at the other site categories. In A. lyrata, the highest nucleotide diversity (π) was observed at 4-fold degenerate sites while in C. grandiflora intergenic sites with additional filtering (intergenic high-rec sites that were distant from genes, in regions with high recombination rates) had the highest π (Table 2). Watterson's θ estimates were slightly higher than π estimates in the C. grandiflora dataset while in A. lyrata the opposite pattern was observed. The ratio of 0-fold to 4-fold degenerate nucleotide diversity (π 0 /π 4 ) was 0.2187 for the C. grandiflora population while in the A. lyrata population it was 0.2846 (Table 2), which may indicate stronger purifying TA B L E 1 The number of sites analyzed with (F) and without filtering (NF) for 0-fold, 4-fold, intergenic (i), and intergenic high-rec sites selection in the C. grandiflora population or faster recovery of π 0 after a population bottleneck in the A. lyrata population (Comeron, 2017).
The two datasets also differed with respect to their site frequency spectra. Specifically, our A. lyrata population had a higher proportion of SNPs at intermediate frequency in comparison with C. grandiflora (Figure 1), likely reflecting differences in the demographic history of the study populations. In addition to differences due to demographic history, a higher proportion of low frequency variants at sites affected by background selection is expected (Zeng & Charlesworth, 2011). Hence, the highest proportion of low frequency variants were expected at 0-fold sites and lowest at intergenic sites. However, both species had the highest proportion of singletons at 0-fold degenerate sites and the lowest at 4-fold degenerate sites, while the frequency of singletons was intermediate at intergenic sites. Moreover, the frequency of doubletons was highest at the intergenic sites (Figure 1).

| Demographic histories
We next assessed the demographic history of these two populations using SFS estimated from the intergenic high-rec sites. The demographic inference suggests that the best model for our A. lyrata population was a three-epoch model with strong effective population size decrease from 626,000 (340,000-983,000, 95% C.I.) to 195,000 (126,000-275,000, 95% C.I.). We estimated that the bottleneck length was approximately 520,000 (7,000-818,000, 95% C.I.; Figure 2; Table S1) generations. The population size shrunk further to 6,000 (3,000-27,000, 95% C.I.) approximately 1,000 (500-10,000, 95% C.I.) generations ago (Figure 2; that occurred approximately 1.4 million generations ago and persisted over approximately 1,114,000 generations (800-11,617,000, 95% C.I.; Figure 2; Table S1). A second slight increase to 1,578,000 (1,431,000-3,051,000, 95% C.I.) was estimated to have occurred approximately 302,000 (6,000-1,375,000, 95% C.I.) generations ago ( Figure 2; Table S1). Our timing estimates and the ancestral population size estimate especially for C. grandiflora contain considerable uncertainty ( Figure 2; Table S1), but nevertheless suggest that there have been more recent drastic population size changes in the A. lyrata population than in the C. grandiflora population. Hence, a difference in the state of recovery after population size change is expected.

| Weaker purifying selection in A. lyrata
The difference in the ratio of π 0 and π 4 may reflect a difference in the genome-wide efficacy of purifying selection between our C. grandiflora and A. lyrata populations. To assess purifying selection in TA B L E 2 Diversity estimates for 0-fold, 4-fold, intergenic (i), and intergenic high-rec (i2) sites (95% C.I. based on 10,000 bootstrapped datasets in parentheses) in Arabidopsis lyrata and   Figure S4). Instead of using the N e s value per se, it is more meaningful to describe the shape of the distribution with frequency of N e s values in bins .

Capsella grandiflora populations
We found significant differences in the estimated DFE between species (Figure 3). In general, larger proportions of new mutations in 0-fold sites were estimated to be effectively neutral (N e s < 1) in A. lyrata (A. lyrata, 30%; C. grandiflora, 15%). Conversely, in C. grandiflora intermediate fitness effect (1 < N e s < 10) or strongly deleterious (N e s > 10) mutations were estimated to be more common; in A. lyrata, 5% and 65% of mutations were assigned to intermediate and strongly deleterious categories, respectively, while in C. grandiflora, the corresponding proportions were 14% and 71% (Figure 3).

| Factors affecting genome-wide diversity
The best multiple linear regression model with model selection (BIC) explained 47% of the variation in diversity in the C. grandiflora dataset and 11% in the A. lyrata dataset (Table S2). In both species, recombination rate had significant positive effect on neutral diversity (Table 4), whereas exon density had significant negative effect on neutral diversity in C. grandiflora. In the A. lyrata dataset exon density also had negative effect on neutral diversity (Table S3) but it was not included in the best model (Table 4; Table S2). TE% had a positive effect whereas codon bias had a negative effect on neutral diversity in both populations studied. Comparing models including each explanatory variable separately indicated that for C. grandiflora recombination rate and for A. lyrata TE% explained the highest proportion of variance in neutral diversity and had the lowest BIC among the single variable models (Table S2). We observed significant correlation between the explanatory variables (Table S3) but variance inflation factor analysis did not indicate severe effect of multicollinearity (Table 3). Further, the distributions of residuals suggest that the model assumptions are met reasonably well ( Figures S5 and S6), and only a small deviation from normality was observed (Table 4).

| Diversity around functional regions
Selection on genes can also decrease diversity at nearby loci due to linkage disequilibrium. At equilibrium, a stronger decrease in diversity around genes is expected due to stronger efficacy of selection in populations of larger effective population size (Nordborg et al., 1996). In addition to this basic expectation, simulation results by TA B L E 3 Estimates of population size rescaling backward in time (N1/N2), mean selective effect (N e s), and shape parameter (β) of fitness effect distribution

| D ISCUSS I ON
In this study, we examined patterns of genome-wide diversity in coding and intergenic regions in populations of two outcrossing Brassicaceae species, A. lyrata and C. grandiflora. Our aim was to investigate the role of selection in shaping genome-wide variation and more specifically to compare patterns of linked selection in two populations of species with similar mating system but differing in their demographic history.
The basic finding was that C. grandiflora had higher diversity in comparison with A. lyrata. In the A. lyrata dataset, we observed lower intergenic diversity in comparison with the diversity at the 4-fold degenerate sites. This was not expected since under the assumption that intergenic regions evolve neutrally, the effect of direct and linked selection is expected to be weak. We also observed a higher proportion of low frequency variants at intergenic sites in both species which is not expected assuming that background selection affects the coding silent sites (4-fold degenerate sites) more than sites distant from genes. A possible confounding factor that may cause these unexpected patterns is selection at intergenic sites.
Indeed, many studies have suggested some degree of selection also in intergenic regions, especially close to genes (Andolfatto, 2005;Williamson et al., 2014;Woerner et al., 2018) although the strongest selection is usually found in coding regions. In addition, if mapping rate in the intergenic regions is low due to high divergence, this could cause frequent allelic dropout and hamper diversity estimation in these regions. This might be likely given the high repeat content of A. lyrata genome (Hu et al., 2011). However, after restricting our analyses to A. lyrata samples with only a slightly lower mapping rate, we still estimated a lower level of diversity in intergenic regions (Tables S4 and S5). Hence, it does not seem that this issue is responsible for the observed results.
In our demographic analysis, we attempted to limit the impact of linked selection by focusing on intergenic sites in regions of high recombination that were not close to genes. The most likely models suggested a strong effective population size decline start-  TA B L E 4 Statistics for the best multiple regression models, based on a BIC model selection, explaining variance in π 4 for populations of A. lyrata and C. grandiflora F I G U R E 4 Diversity level as a function of distance from the nearest gene for A. lyrata (coral) and C. grandiflora (green) (95% C.I. based on 100 bootstrap replicates) and corresponding two-sided pvalues (right y-axis) for species difference in π based on comparison of 100 bootstrap pairs. π estimates are plotted as line,  Slotte et al., 2010, St. Onge et al., 2011. Hence, this dataset offered a good contrast for studying the effect of demography on genome-wide selection in these species.
We contrasted the strength of purifying and positive selection in these populations by comparing patterns of diversity and divergence at 0-fold and 4-fold degenerate sites. The higher π 0 /π 4 in A. lyrata suggested relaxed efficacy of selection but can also be due to differences in dynamics of diversity recovery after population size change (Comeron, 2017;Torres et al., 2019). For instance, although at equilibrium we expect a decreased ratio of diversity at selected versus neutral sites, after a population size increase sites under stronger background selection reach equilibrium diversity faster than neutrally evolving sites (Comeron, 2017) and thus this selected versus neutral ratio may even increase at first.
To explicitly test selection strength, we estimated distribution of fitness effects of new mutations at 0-fold degenerate sites using a method that takes into account the demographic history. We found that the C. grandiflora population had more efficient purifying selection than the A. lyrata population. This is in line with the general prediction on the impact of effective population size on the efficacy of selection Ohta, 1973). The general observation of reduced purifying selection associated with smaller effective population size has been found also in other plant species, for example in Populus, Helianthus and Lactuca (Strasburg et al., 2011;Wang, Street, Scofield, & Ingvarsson, 2016). Other factors such as life-history traits have also been shown to correlate with the DFE (Chen, Glémin, & Lascoux, 2017). In plants, the strongest correlates were mating system and longevity, with selfing and long-lived species having weaker purifying selection in comparison with outcrossing annual species. In addition, longevity has also been found to be associated with lower diversity (Chen et al., 2017;Romiguier et al., 2014) which may be due to different life-history strategies among short-and long-lived species such as between annual C. grandiflora and perennial A. lyrata. Nevertheless, we found large differences in the DFE as well as in diversity between two outcrossing species both having relatively short lifespan but strong differences in the demographic histories. We hence suggest that demographic history is likely the ultimate cause of these observed patterns al- in C. grandiflora (Slotte et al., 2010(Slotte et al., , 2013Williamson et al., 2014) and A. lyrata Gossmann et al., 2010) and support generally stronger efficacy of selection in C. grandiflora.
We note that although the species are similar in many respects except the contrasting demographic histories, they also have differences with respect to their genome content, with, for example, shorter intergenic distances and fewer TEs in the Capsella genome (Hu et al., 2011;Slotte et al., 2013). Additionally, the effect of positive selection has been found to be stronger in C. grandiflora (Gossmann et al., 2010) which may also affect the diversity patterns. In turn, the recent colonization history of the studied A. lyrata population has been associated with local adaptation (Leinonen et al., 2009) which is likely to have shaped patterns of diversity at adaptive loci (Hämälä & Savolainen, 2019;Mattila et al., 2016;Mattila et al., 2017;Toivainen, Pyhäjärvi, Niittyvuopio, & Savolainen, 2014). However, the estimated proportion of adaptive substitutions relative to neutral divergence was <5% in both species. Hence, we suggest that background selection has likely stronger impact on the diversity in these study populations.

| Linked selection shapes the patterns of genome-wide variation in A. lyrata and C. grandiflora
Understanding the magnitude and factors contributing to the linked selection is important for evolutionary inferences using population genetics data (Burri, 2017;Comeron, 2017). We first investigated the role of linked selection for patterns of genomewide diversity using linear modeling, while accounting for the impact of a set of genomic features that might also affect diversity

levels. Previous theoretical modeling of background selection in
A. lyrata has suggested that the impact of background selection is expected to vary across the A. lyrata genome (Slotte, 2014).
The basic theoretical prediction on linked selection is a positive correlation between diversity and recombination rate (Begun & Aquadro, 1992;Kaplan et al., 1989;Thomson, 1977). In addition, variation in the frequency of selected sites along the genome, for example due to variation in gene density, can give rise to a negative relationship between diversity and gene density (Slotte, 2014). In this study, regression analysis confirmed independent effects of recombination rate and exonic bp on neutral diversity in both populations studied, suggesting that diversity is shaped by linked selection. The variables explaining the highest proportion of variance in neutral diversity were for C. grandiflora recombination rate and TE% for A. lyrata, both of which were associated with increased diversity. The large effect of TE% may be due to generally lower functional constraint in these regions, which allows accumulation of TEs as well as higher variation. In both study populations, effective number of codons was negatively associated with diversity, which may be due to differences in the strength of selection for optimal codon usage. In C. grandiflora, the number of exonic bp per window had a negative effect on diversity, as expected under background selection (Slotte, 2014), but it was not included in the best model for A. lyrata. This is likely due to overall lower diversity in A. lyrata and hence decreased explanatory power in the regression analysis.
In contrast to our results, a previous study testing for linked selection in A. lyrata found no evidence for a correlation between neutral diversity and recombination rate (Wright et al., 2006). However, the correlation in A. lyrata was relatively weak in comparison with that in C. grandiflora. Hence, the difference between previous work and the current study may be due to the small number of loci studied previously. Therefore, our large dataset allowed us to detect small effects undetectable in previous studies.

| Diversity around genes
To inspect the effect of linked selection further, we studied the level of diversity as a function of distance from annotated protein coding although the first population size decrease happened relatively long ago, our A. lyrata population has not had much time to recover from the second decrease which was also estimated to be very drastic (N e decrease from almost 200,000 to only 6,000). Immediately after a strong population size decrease, the expected reduction in diversity due to linked selection around deleterious loci is expected to be weak and it disappears over time (Torres et al., 2019). Since the diversity in our A. lyrata population is likely impacted by an ancient and a more recent bottleneck, a flat relationship of distance from genes and diversity is at least qualitatively in line with the simulation results and suggests that drift is mainly dominating diversity patterns at intergenic sites in A. lyrata. In addition, the population has had short time to recover from the more recent population size change and the effect of selection on diversity is still expected to decrease if the population size remains the same.
For C. grandiflora, the scaled time since the most recent population size change (a modest population size increase) was 0.6 N ANC , suggesting that this population is closer to equilibrium.
However, for C. grandiflora the uncertainty is much larger and the estimated scaled time since the last increase ranged from 0.01 to 1.7 N ANC . In such a demographic history, diversity is expected to dip around genes (Torres et al., 2019). In C. grandiflora, the observed magnitude of diversity reduction is comparable with results from simulations (Torres et al., 2019) and empirical results in teosinte and maize (Beissinger et al., 2016). It should however be noted that after population size expansion, diversity in regions under background selection may actually increase relative to that under neutrality, despite the theoretical expectation that population size increase should lead to a stronger impact of linked selection (Torres et al., 2019). Indeed, in their simulations, Torres et al.
(2019) observed that it could take up to 10 N ANC generations to reach equilibrium diversity level after a population size expansion.

| CON CLUS IONS
Despite the similar mating system of A. lyrata and C. grandiflora populations studied here, they differ drastically in their patterns of variation, demographic history, and selection; we estimated that the C. grandiflora population had stronger purifying selection, higher proportion of sites under positive selection, and more pronounced decrease of diversity around genes due to linked selection. In the A. lyrata population, we do not observe a reduction in diversity near genes in comparison with the regions distant from genes. This observation is consistent with simulation-based results and we hence conclude that nonequilibrium demographic history is likely to have modified the typical signature of linked selection close to genes in the studied A. lyrata population. However, the positive correlation of recombination rate and neutral diversity suggests that linked selection has had an effect on diversity to some degree in both study populations. Hence, examining further the effects of linked selection will be of utmost importance to understand differences in genetic diversity in natural populations.

CO N FLI C T O F I NTE R E S T
None declared.

AUTH O R CO NTR I B UTI O N S
TMM, BL, OS, and TS designed the study. TH provided data. TMM, BL, and RH conducted the data analysis. TMM and TS wrote the manuscript with input from all the other authors.

DATA AVA I L A B I L I T Y S TAT E M E N T
The analyzed sequence data are available at the European