Low concordance of short‐term and long‐term selection responses in experimental Drosophila populations

Abstract Experimental evolution is becoming a popular approach to study the genomic selection response of evolving populations. Computer simulation studies suggest that the accuracy of the signature increases with the duration of the experiment. Since some assumptions of the computer simulations may be violated, it is important to scrutinize the influence of the experimental duration with real data. Here, we use a highly replicated Evolve and Resequence study in Drosophila simulans to compare the selection targets inferred at different time points. At each time point, approximately the same number of SNPs deviates from neutral expectations, but only 10% of the selected haplotype blocks identified from the full data set can be detected after 20 generations. Those haplotype blocks that emerge already after 20 generations differ from the others by being strongly selected at the beginning of the experiment and display a more parallel selection response. Consistent with previous computer simulations, our results demonstrate that only Evolve and Resequence experiments with a sufficient number of generations can characterize complex adaptive architectures.


| 3467
LANGMÜLLER ANd SCHLÖTTERER Orozco-Terwengel et al., 2012) and even hundreds of generations (Burke et al., 2010). Computer simulations showed that the number of generations has a strong influence on the power of E&R studies in sexually reproducing organisms, and increasing the number of generations typically increased it (Baldwin-Brown, Long, & Thornton, 2014;. A larger number of generations increase the opportunity for more pronounced allele frequency changes and more recombination events, which results in a more refined mapping resolution (Baldwin-Brown et al., 2014;Kessner & Novembre, 2015;. Since simulations make simplifying assumptions, it is important to scrutinize their conclusions with empirical data. Until recently no suitable data sets of obligate outcrossing populations were available, which included multiple time points, were replicated and had starting allele frequencies matching natural populations. We use an E&R experiment , which reports allele frequency changes in 10 replicates over 60 generations in 10 generation intervals, to investigate the impact of the experimental duration on the observed genomic response. The time-resolved genomewide polymorphism data of this experiment allow to contrast putative selection targets, which are inferred at different time points, on three analysis levels (candidate SNPs, candidate windows, candidates SNPs shared with reconstructed haplotype blocks, Figure 1). By comparing selection signatures from different time points of the experiment, we show that only a subset of the selection targets are detected at earlier generations, which are not representative of the underlying adaptive architecture.

| Experimental Drosophila simulans populations
A detailed description of the Drosophila simulans E&R experiment can be found in Barghi et al. (2019) and Hsu et al. (2019). Pooled individuals (Schlötterer, Tobler, Kofler, & Nolte, 2014) from the evolving populations were sequenced every 10th generation starting with the founder population (generation 0) until generation 60 resulting in seven sequenced time points. This E&R experiment started from 202 isofemale lines, which were collected in Tallahassee, Florida.
10 replicate populations evolved in the laboratory at a "cycling hot" temperature regime (12 hr light and 28°C, 12 hr dark and 18°C).

| Genomic analysis hierarchy
To look for patterns of selection on different scales, we investigated the genomic response of the experimental Drosophila populations on three different levels: candidate SNPs, candidate SNPs in a window with a fixed number of SNPs and candidate SNPs shared with reconstructed selected haplotype blocks. A detailed description for each level is given below, and the different hierarchies are depicted in Figure 1. We performed the same analysis steps at different time points and compared the resulting time point-specific selection responses-either pairwise, or across multiple time points-to test the congruence in selection patterns. We evaluated the similarity of two time points with the Jaccard index-a dimensionless parameter ranging from 0 (no overlap) to 1 (sets are identical)-for both candidate SNPs and candidate windows. F I G U R E 1 Genomic analysis hierarchy: candidate SNPs (a), candidate SNPs in a window spanning a fixed number of SNPs (b) and candidate SNPs shared with reconstructed haplotype blocks (c). (a) Candidate SNPs (squares) were determined for each generation and were compared either pairwise (Figure 2), or across multiple time points (Figure 4). (b) Window-based approach to detect regions enriched for candidate SNPs. The same number of random SNPs as candidate SNPs were sampled onto the genome repeatedly (1,000 iterations). Windows (marked by vertical dashed lines) enriched for candidate SNPs contain at least as many candidate SNPs as the 99th percentile of the randomly sampled SNPs. Sets of candidate windows were either compared pairwise (Figure 2), or across multiple time points (Figure 4). (c) Haplotype block discovery rate (HADR) = the fraction of candidate SNPs (haplotype block marked by vertical dashed lines) that were also discovered at a given time point (framed squares) [Colour figure can be viewed at wileyonlinelibrary.com]

| Identification of candidate SNPs
In the original study, Barghi et al. (2019) applied various filtering steps to obtain a robust SNP set from the ancestral population, resulting in 5,096,200 SNPs on chromosomes X, 2, 3 and 4. A summary of the applied filtering steps can be found in theSupplementary Methods.
We used the already determined SNP set of the ancestral population to study the selection response at different time points. For this, we identified "candidate SNPs" based on the frequency difference between the ancestral and evolved populations for each of the six time points (Figure 1a). To identify SNPs with pronounced allele frequency change, we tested (as in Barghi et al. (2019)) replicates separately (Fisher's exact test) and jointly (Cochran-Mantel-Haenszel test, CMH) using PoPoolation2 (Kofler, Pandey, & Schlötterer, 2011 (Jónás, Taus, Kosiol, Schlötterer, & Futschik, 2016) using estimates of the effective population size (N e ) between generation 0 and the focal time point (Tables S1-S3). The neutral simulations further used the empirical starting allele frequencies and read depths. For the CMH test, N e estimates were averaged across replicates for autosomes and the X chromosome separately. For Fisher's exact test, we used replicate-specific N e estimates of the autosomes. Based on these neutral simulations, we determined candidate SNPs with a false discovery rate smaller than 5% .
We identified 56,166 candidate SNPs in generation 60, compared to 55,199 in Barghi et al. (2019). This small discrepancy can be explained by stochastic differences arising from the neutral simulations used to determine the significance threshold. We excluded six out of 99 haplotype blocks from Barghi et al. (2019) with <90% of the previously reported candidate SNPs.

| Identification of candidate windows
Because of linkage in the experimental populations (Nuzhdin & Turner, 2013;Tobler, Franssen, Nolte, & Schlötterer, 2014), the first genomic analysis level-the number of candidate SNPs ( Figure 1a)-will most likely suffer from an excess of candidate loci.
We used a window-based approach (Figure 1b) as second genomic analysis hierarchy to account for a potential excess of candidate SNPs. We split the main chromosomes (X, 2 and 3) into non-overlapping windows of 5,000 SNPs that are segregating in all generations and replicates. We chose SNPs instead of base pairs as window size measure in order to allow for variation in SNP density along the genome. To determine whether a given window displays a potential selection response(= it contains more candidate SNPs F I G U R E 2 Similarity measures for candidate SNPs, candidate SNPs in a window with a fixed number of SNPs and candidate SNPs shared with reconstructed selected haplotype blocks (a) Jaccard index (J) for pairwise comparisons of candidate sets. The top triangle shows candidate window sets, the bottom triangle candidate SNP sets. Significant similarities (p-value < .05 after multiple testing correction, 10,000 bootstraps) are written in bold. (b) The rate at which selected SNPs of 93 haplotype blocks from generation 60 were already discovered at earlier generations (haplotype discovery rate, HADR) [Colour figure can be viewed at wileyonlinelibrary.com] than expected), we sampled the same number of random SNPs as candidate SNPs in this window (1,000 iterations) onto the whole genome. "Candidate windows" contained at least as many candidate SNPs as the 99th percentile of randomly sampled SNPs. We received time point-specific candidate window sets (Figure 1b, Figure S1), by applying the procedure described above independently to candidate SNPs from all time points. To check for the robustness of candidate window patterns, we varied the window size and allowed SNPs to fix during the experiment, which both resulted in qualitatively similar results (Tables S4 and S5).
Candidate windows (=the number of candidate SNPs in a window) is a summary statistic which ignores the significance of the candidate SNPs. If a signal is robust between two time points, we expect the same p-value-based ranking of candidate SNPs assuming homogeneous read depth for all sites and time points and that candidate SNPs do not fix during the experiment. Although read depth heterogeneity among sites will change the confidence in the estimates of the allele frequency change, the average dynamics of relative significance allow us to determine whether the robustness of a putative selection response increases with time. Thus, we also evaluated whether candidate SNPs in a given window had a similar significance rank. For each candidate window, we created a ROC-like curve (similar to Jakšić and Schlötterer (2016)) by ranking the candidate SNPs by their p-values-the most significant SNP was assigned rank 1-and calculating the overlap in top-ranked SNPs between two time points.

| Haplotype block discovery rate (HADR)
The third genomic analysis level is based on reconstructed selected haplotype blocks of the original study ( Figure 1c): Barghi et al. (2019) clustered candidate SNPs from generation 60 into selected haplotype blocks based on similar allele frequency trajectories over time and replicates (Franssen, Schlötterer, & Barton, 2016) to assess the underlying haplotype structure in the experimental populations. The reconstructed haplotype blocks were further validated with experimentally phased haplotypes from ancestral and evolved populations , and 96% of the reconstructed haplotype blocks could be confirmed by the experimentally derived haplotypes. This suggests that reconstructed haplotype blocks provide a robust set of linked candidate SNPs that can be used to investigate time pointspecific patterns of selection.
Taking advantage of this additional confirmation of the candidate SNPs in a selected haplotype block, we developed a third measure of similarity between time points. We determined the fraction of candidate SNPs comprising a haplotype block that were also discovered at a given time point (haplotype block discovery rate, HADR, Figure 1c) using the poolSeq R-package (Taus, Futschik, & Schlötterer, 2017).
We note that the inference of selected haplotype blocks at each generation does not provide a good alternative to the HADR measure. It has been shown that the ability of the clustering method to group SNPs into haplotype blocks is dependent on the number of time points (Franssen et al., 2016). This results in less power at early generations where fewer time points are available.

| Early Detected Haplotype blocks (EDHAs)
We applied various clustering methods (hierarchical clustering (Pollard & Laan, 2005), PCA and k-means (Hartigan & Wong, 1979)) to group reconstructed haplotype blocks by their HADR patterns over time. The hyperparameter k (which determines the number of clusters) in the k-means clustering procedure was set to 5 based on the gap statistic (Tibshirani, Walther, & Hastie, 2001). Both the k-means clustering and the PCA revealed a clearly distinguishable group of 10 haplotype blocks with elevated HADR early on in the experiment ( Figures S2 and S3). We refer to the haplotype blocks in this cluster as early detected haplotype blocks (EDHAs). We investigated whether EDHAs have distinct characteristics that might explain why they are early detectable by comparing the following features between EDHAs, and non-EDHAs: haplotype block length, median starting allele frequency, average recombination rate (Howie, Mazzucco, Taus, Nolte, & Schlötterer, 2019), selection coefficient (s, estimated with poolSeq (Taus et al., 2017)) in generation 20, s in generation 60, selection coefficient ratio r s = s 20 /s 60 and number of rising replicates in generation 20 and ingeneration 60. To avoid that non-responding replicates bias the selection coefficient estimates downwards, we averaged selection coefficients over replicates that passed a certain allele frequency change threshold. Following Barghi et al. (2019), we classified a haplotype block as replicate-specific, if the allele frequency of candidate SNPs from a haplotype block increases on average by at least 10%. To test the robustness of the replicate-specific rising behaviour, we also used 5%, 15% and 20% thresholds, which all resulted in qualitatively similar results.

| Simulation with linkage
Despite the selected haplotype blocks were inferred with high confidence by Barghi et al. (2019), we were interested to confirm our conclusions with a simulated data set for which the selection targets are known. Because the distinction between true causative SNPs and linked polymorphisms is challenging (Nuzhdin & Turner, 2013;Tobler et al., 2014), we obtained a simulated data set that models linkage and resembles the Barghi et al. (2019) study using MimicrEE2 (version 206) (Vlachos & Kofler, 2018). Our ancestral population was built from 189 ancestral haplotypes published in Barghi et al. (2019) and Howie et al. (2019). We simulated a selective sweep scenario with 99 independent selection targets, where each of them is located in the inferred selected haplotype block, has the same starting frequency, and selection coefficient (both estimated as the median from all selected SNPs in the block; Barghi et al., 2019). Since we were primarily interested in the early genomic responses after the populations have been exposed to a new environment, we reasoned that selective sweeps approximate the selection trajectories for a polygenic trait with stabilizing selection (as presented in Barghi et al., 2019) rather well (Chevin & Hospital, 2008;Höllinger, Pennings, & Hermisson, 2019;Jain & Stephan, 2017a, 2017b. Using the Drosophila simulans recombination rate map (Howie et al., 2019), we simulated 10 outcrossing populations with constant census size adapting for 60 non-overlapping generations. We sampled the read depth for each site from a Poisson distribution with λ equal to the average read depth for each population and time point as reported in Barghi et al. (2019). For each site, we applied binomial sampling with the determined read depth and true allele frequency to mimic the sampling of reads out of a DNA pool (Jónás et al., 2016). This forward simulation results in a data set very similar to the original study that allows us to analyse selection patterns on all three genomic analysis hierarchy levels, but in contrast to the empirical data, the causative SNPs are known. The identification and analysis of candidate SNPs and windows of the simulation followed the protocol of the empirical data (as described above; Figure 1a

| Simulation without linkage
To explicitly test whether the highly parallel selection signature of EDHAs at generation 20 can be explained by synergistic effects of drift and selection, we simulated 100,000 codominant, unlinked loci in 10 replicate populations with equal starting allele frequency (10%, median starting allele frequency reported in Barghi et al. (2019)), and selection coefficient (s = 0.05, median selection coefficient reported in Barghi et al. (2019)) using the poolSeq R-package (Taus et al., 2017). Based on neutral simulations (conducted with the poolSeq R-package (Taus et al., 2017), we determined candidate loci with a false discovery rate smaller than 5% after 20 generations (CMH test; Barghi et al., 2019) and compared the parallelism of candidate loci to the parallelism of non-candidate loci. If the synergistic effect of drift and selection is causing elevated parallelism, we would expect candidate loci to rise in significantly more replicates than their non-significant counterparts, despite having the same starting allele frequency and selection coefficient.

| Subsequent time points are more similar for advanced generations
We studied the similarity of selection signatures for different time points using 10 replicates of a D. simulans population, which evolved for 60 generations to a novel hot environment . With Pool-Seq  data from every 10th generation, we evaluated the selection signature on three different levels: candidate SNPs, candidate SNPs in a window spanning a fixed number of SNPs, and candidate SNPs shared with reconstructed selected haplotype blocks (Figure 1). The similarity of inference between two time points was determined by the Jaccard index (J), ranging from 0 (no overlap between two SNP sets) to 1 (sets are identical). We found that all candidate SNP sets are more similar than expected by chance. The Jaccard index ranged tions. While this suggests that outliers at generation 60 provide the most reliable selection signature, we were interested to confirm this with computer simulations where all selection targets and the type of selection are known. We simulated an E&R experiment that resembles the data from Barghi et al. (2019) and repeated our analysis on simulated, time-resolved allele frequencies. Again, we observed that candidate SNP sets are more similar than expected by chance, the majority of subsequent time points are more similar than those separated for more than 10 generations, and the similarity of candidate SNP sets from subsequent time points increases with time ( Figure S4).
Since the analysis of single SNPs suffers from considerable stochasticity, and neighbouring SNPs are not independent (Howie et al., 2019;Tobler et al., 2014), we repeated the analysis of different time points using non-overlapping windows of 5,000 SNPs ( Figure 1b). Reasoning that windows containing a target of selection will harbour multiple candidate SNPs, we defined selected windows as those, which harbour more candidate SNPs than expected by chance. Consistent with higher stochasticity at the SNP level, a higher similarity was observed for candidate windows (from J = 0.26  Figure 2a). In contrast to the SNP level, the set of selected windows after 10 generations is only significantly similar to generation 20, but not to any other generation. Thus, the pattern of reduced similarity of selection targets in the early generations is confirmed at the window level, albeit with different significance levels. The same pattern was noticed for the simulated E&R data ( Figure S4).
For an alternative measure of similarity, we used the ranking of candidate SNPs in a specific window based on their p-values and compared it between different time points. If a signal is robust between two time points, we expect the same SNP ranking of segregating SNPs in a selected window. Consistent with the other tests, we found that the congruence in candidate SNP ranking increases with time in both empirical and simulated data (Figure 3, Figure S5).
To rule out that rare SNPs are responsible for the dissimilarity between early and late time points, we calculated similarity measures based on SNPs that are segregating at all generations and time points. Nevertheless, including SNPs which were lost in at least one replicate during the experiment did not result in a pronounced decrease in similarity ( Figure S6).
The analysis of selected haplotype blocks provides another possibility to control for non-independence of single candidate SNPs. We calculated the haplotype block discovery rate (HADR,  (Table S6). While this suggests that earlier time points harbour more false positives, we would like to point out that the low concordance in selection patterns between early, and late generations in the experiment could also be caused by a change (or even a reversal) in selection. It may be possible that some targets were only selected during the first generations and that the strength of selection changed later on. This may be caused for example either by epistatic interactions, or an unobserved change in our experimental environment. While this cannot be ruled out for the empirical data, this cannot explain the similarity dynamics in the sweep simulation with additive selection, and no epistasis modelled ( Figures S4 and S7).
It is important to note that the sweep simulation did not have a similar number of candidate SNPs over time, rather the number of false positives increased with the duration of the experiment (Table S7). We attribute this discrepancy of the simulated and empirical data outcome to the selection regime employed, which affects mainly later generations. Sweep simulations result in haplotype blocks with many selection targets (Barghi & Schlötterer, 2020), which in turn influences the number of candidate SNPs. Simulations of polygenic adaptation can become quite complex and rich in parameters (Thornton, 2019), which limits the power of computer simulations to scrutinize this result further. Rather, experimental F I G U R E 3 The rank of candidate SNPs becomes more congruent with time. In this ROC-like graph, the ranking of all candidate SNPs in candidate windows is compared. Each panel shows one intermediate time point compared to generation 60. The overlap (in percent) for each candidate window is indicated by a separate line. The median overlap (solid black line) monotonically increases with experimental duration, demonstrating that the ranking of candidate SNPs is more robust for advanced generations. The black, dashed lines show the expected overlap in SNP ranking if every variant at generation 60 is recapitulated in a previous time point validation of selection targets in secondary E&R studies (Burny et al., 2020) may be used to confirm selection signatures beyond statistical testing and could serve an important role to test early selection signatures that cannot be confirmed at later time points.

| Only few selection targets are shared across all generations
More than 27,000 candidate SNPs can be identified at each time point (Table S6), but only a small (5%) subset is consistently detected at every generation (Figure 4; including rare SNPs see Figure S8).
The small subset of consistently detected candidate SNPs cannot be explained by fixation of candidate SNPs of early generations before generation 60-the ratio of fixed candidate SNPs does not exceed 3.5% at any consecutive time point ( Figure S9). Apart from highlighting the more robust selection signatures with an increasing number of generations, this analysis raises an important concern about the usefulness of meta-analyses on the SNP level. With less than 5% of the SNPs being shared in the same selection experiment, it will be extremely difficult to compare studies that started from different founder populations and were selected for a different number of generations.
We repeated the analysis for windows and determined the number of selected windows that are shared across all generations. With 18 out of 74 candidate windows in generation 60 (24.3%, Figure 4) being detected at all generations, the window analysis shows more consistency across time points than a SNPbased analysis. This observation is independent of window size (Tables S4 and S5) and the inclusion of rare SNPs into the analysis ( Figure S8). Because we observed a similar trend in our simulated experiment ( Figure S10), we propose that meta-analyses of E&R data should be performed on the level of windows, or based on F I G U R E 4 Five percent of candidate SNPs in generation 60 are detected consistently at every generation. The bars depict the fraction of candidate SNPs (black) and candidate windows (white) at generation 60, which are candidates in all subsequent generations (e.g. 40% of generation 60 candidate SNPs are candidates in generation 50 and 40). Candidate windows are more consistent than candidate SNPs. Figure S8 depicts the ratios for candidate sets that are not restricted to SNPs segregating in all generations and time points F I G U R E 5 Early detectable haplotype blocks (EDHAs) differ from the other selected haplotype blocks. (a) The ratio of selection (r s ) coefficients determined for early generations (generation 20, s 20 ) and late generations (generation 60, s 60 ) is significantly higher for EDHAs. (b) EDHAs rise in more replicates than other haplotype blocks after 20 generations. Both observations are robust to different allele frequency change thresholds. Values above the boxplots represent the two-tailed Mann-Whitney test p-values corrected for multiple testing with the Benjamini-Hochberg procedure [Colour figure can be viewed at wileyonlinelibrary.com] selected haplotype blocks to avoid false negatives due to the high stochasticity of SNP-based analyses.

| Selection signatures detected early in the experiment are not representative of the underlying adaptive architecture
This study focused on the comparison of selection targets detected at early and late time points. Since analyses based on single SNPs are very stochastic, we investigated the fraction of candidate SNPs comprising a haplotype block that were also discovered at earlier time points (HADR, Figure 1c). We detected 10 haplotype blocks with elevated HADR in generation 20 (early detected haplotype blocks, EDHAs, Figures S2 and S3). We found that EDHAs do not differ in their starting allele frequency, haplotype block length, average recombination rate, absolute selection coefficients or number of rising replicates after 60 generations from other haplotype blocks ( Figure S11). EDHAs are, however, more strongly selected at the beginning of the experiment, but are equally strongly selected as the remaining haplotype blocks at later generations ( Figure 5a). Consistent with stronger selection at earlier time points, the selection signature of EDHAs is significantly more parallel across replicates after 20 generations of adaptation in both empirical and simulated data.
( Figure 5b, Figure S12). We attribute this observation to a phenomenon similar to the "winners curse," that is that loci where stochastic effects increased the frequency in multiple replicates to enhance the contribution by selection are more likely to be detected. We explicitly tested this interpretation with computer simulations. For this, we simulated unlinked loci with identical starting frequencies and selection coefficients. As expected not all selected loci are detected after 20 generations, but detected loci show a more parallel selection signature than not detected ones despite having the same starting allele frequency, and selection coefficient ( Figure S13).
All statistical tests, which are evaluating a parallel selection signature across replicates, are more likely to detect selection signatures shared across replicates, even with only moderate allele frequency changes. This "enhanced" parallelism could result in wrong or incomplete conclusions about the underlying genetic architecture (Höllinger et al., 2019;Jain & Stephan, 2017a;Thornton, 2019). The analysis of selection signatures in replicated experiments running for only a moderate number of generations is more likely to detect parallel than replicate-specific selection signatures. This bias is not restricted to our study, but also an experimental study of D. simulans populations adapting 10 to 20 generations to a new temperature regime (Kelly & Hughes, 2018) found more parallel selection responses. We propose that additional analyses contrasting selection signatures of early and late time points are needed to confirm the enrichment of parallel selection signatures in short-term experiments.

ACK N OWLED G EM ENTS
We thank the members of the Institut für Populationsgenetik for fruitful discussion and support. Special thanks to Neda Barghi, Sheng-Kai Hsu and Claire Burny for helpful comments on earlier versions of the manuscript. This work was supported by the Austrian Science Fund (FWF, grant number DK W1225-B20) and an European Research Council (ERC) grant (ArchAdapt).