In the presence of population structure: From genomics to candidate genes underlying local adaptation

Understanding the genomic signatures, genes, and traits underlying local adaptation of organisms to heterogeneous environments is of central importance to the field evolutionary biology. Mixed linear mrsodels that identify allele associations to environment, while controlling for genome-wide variation at other loci, have emerged as the method of choice when studying local adaptation. Despite their importance, it is unclear whether this approach performs better than identifying environmentally-associated SNPs without accounting for population structure. To examine this, we first use the mixed linear model GEMMA, and simple Spearman correlations, to identify SNPs showing significant associations to climate with and without accounting for population structure. Subsequently, using Italy and Sweden populations, we compare evidence of allele frequency differentiation (FST), linkage disequilibrium (LD), fitness variation, and functional constraint, underlying these SNPs. Using a lenient cut-off for significance, we find that SNPs identified by both approaches, and SNPs uniquely identified by Spearman correlations, were enriched at sites showing genomic evidence of local adaptation and function but were limited across Quantitative Trait Loci (QTL) explaining fitness variation. SNPs uniquely identified by GEMMA, showed no direct or indirect evidence of local adaptation, and no enrichment along putative functional sites. Finally, SNPs that showed significantly high FST and LD, were enriched along fitness QTL peaks and cis-regulatory/nonsynonymous sites showing significant functional constraint. Using these SNPs, we identify genes underlying fitness QTL, and genes linking flowering time to local adaptation. These include a regulator of abscisic-acid (FLDH) and flowering time genes PIF3, FIO1, and COL5.


Introduction
are functional and influence fitness. SNPs showing significant associations to climate were found 128 to be significantly enriched among nonsynonymous, but also synonymous variation (Hancock, et 129 al. 2011;Lasky, et al. 2012). The enrichment among synonymous variation (which largely 130 evolve neutrally) maybe the result of linkage disequilibrium due to neutral processes but also 131 background selection (Charlesworth, et al. 1993) and/or hitchhiking (Gillespie 2000). A stricter 132 enrichment test will be one that controls for sequence conservation along coding and non-coding 133 sites. Sites that are highly conserved among species (Miller, et al. 2007;Haudry, et al. 2013;134 Hupalo and Kern 2013), are assumed to be under functional/selective constraint and functionally 135 important -that is, due to purifying selection the number of tolerated mutations is limited 136 (Graur 2016). Therefore, SNPs showing significant evidence of local adaptation across highly 137 constraint sites are more likely to be true positives. 138

139
In the current study, we first examine how accounting for population structure in genome-wide 140 associations to climate may affect our ability to provide a comprehensive picture on the genetic 141 basis of local adaptation; and secondly, we use an approach that accounts for genetic signatures 142 of selection and functional constraint to detect potential genes underlying local adaptation.  Gienapp, et al. 2017;Oakley, et al. 2018)). Using the two sets of SNPs, we 152 first separated them into ones identified by both methods (referred to as "Common" hereafter), 153 and ones uniquely identified by each approach (referred to as "GEMMA" and "Spearman" 154 hereafter). Thereafter, we compared evidence of local adaptation and function underlying the 155 three sets of SNPs. More specifically, using Italy and Sweden re-sequenced genomes, and QTL (c) how they were distributed along nonsynonymous and cis-regulatory sites showing significant 161 functional constraint among plants species of the Brassicaceae family (Haudry, et al. 2013 (Lasky, et al. 2012). We also filtered out accessions that were likely laboratory 184 escapees or contaminants (Pisupati, et al. 2017), leaving 875 accessions. After we filtered for 185 biallelic SNPs with minor allele frequency >0.05, we tested association with home climate of 186 ecotype and tested for potential confounding effects of population structure using the software 187 Wald test (default) to test for significant associations to climate. We tested models where home 190 climate was a function of SNP allele, and the association p-values we report are for the null 191 hypothesis that the mean climate occupied by the two alleles is equal (Lasky, et al. 2014 − . |) and linkage disequilibrium (LD) between a SNP and its neighboring SNPs 206 with a 20kb window. LD was measured using the package 'PLINK' (Purcell, et al. 2007 To narrow down SNPs to ones that are more likely to underlie differences in function/expression 214 of protein-coding genes we focused on cis-regulatory and nonsynonymous variation that was 215 found along sites showing significant functional constraint. We regarded cis-regulatory SNPs as 216 those found within 1 kb upstream from the transcriptional start site of a gene (Zou, et al. 2011;217 Pass, et al. 2017), unless these sites were found in transcribed regions of other genes (in which 218 case they were excluded). To call nonsynonymous variation we used bi-allelic sites, we used a 219 publicly available python script (callSynNonSyn.py; archived at https://github.com/kern-lab/), 220 and gene models downloaded from the TAIR database (TAR10 genome release) (Berardini, et al. were grown in growth chambers with the following settings: after 6 days of stratification in the 268 dark at 4ºC, constant temperature of 16°C with 16 hours light / 8 hours darkness, 65% humidity. obtained by the two methods using different percentiles of the p-value distributions as cutoffs for 293 significance (10 th , 5 th , 1 st percentiles). For significant SNPs that were segregating between North 294 Sweden and South Italy populations the percent of SNPs that were identified by both approaches 295 ("Common") at the three significance levels was ~20% (Fig. S2). When comparing the average absolute allele frequency differentiation | .
− . | and 310 LD ( 2 ��� ) across SNPs that were unique to each approach ("Spearman" or "GEMMA") and SNPs 311 identified by both methods ("Common") the set of "Common" and "Spearman" SNPs showed 312 significantly higher allele frequency differentiation and LD than the genome average (Fig 2A-313 2B). On the other hand, SNPs that were unique to "GEMMA" did not show any large differences 314 in FST and LD from the genome average (Figs 2A-2B). Furthermore, across "Common" and 315 "Spearman" SNPs we see a decrease in (| . − . |) and LD as the threshold of 316 significance becomes less stringent (Figs 2A-2B). This is indicative that stronger associations 317 capture SNPs showing stronger evidence of local adaptation and recent selection. Therefore, for 318 any further tests we used the set of SNPs that were significant using the 1 st percentile of the p-319 value distributions as the cutoff. "Common" and "GEMMA" SNPs near QTL peaks was very low, while the proportion 372 "Spearman" SNPs was greater than the 80 th percentile of the distribution but less than the 95 th .

374
Given that "Common" SNPs did not show an enrichment within 100kb of the 20 fitness QTL 375 peaks examined (Supplementary data), we searched whether any of these were within the six 376 tradeoff QTL that each span a large genomic region (Supplementary data). As shown in Fig. S4, 377 most of the high FST and high LD "Common" SNPs (181/209) were found within a 60kb region 378 (14,719-14,781Mb) of a single genetic tradeoff (GT) QTL (GT QTL 2:2) (Fig. S4). Apart from 379 being limited to a single GT QTL, given the multiple fitness QTL within GT QTL 2:2 (which 380 spanned a length of ~4Mb); the multiple regions showing significant recent sweeps in Sweden; 381 and the large number of windows showing a high proportion of high FST and LD SNPs; it is 382 highly unlikely that the "Common" set of SNPs cover all the causative genetic variation within 383 GT QTL 2:2. genomes were called in a previous study (Price, et al. 2018). 428 429 Our analysis resulted in 25 genes within 100 kb of fitness QTL peaks and spanning three genetic 430 tradeoff QTL (2:2, 4:2, 5:5) ( Table S2). Many of these were involved in interesting biological 431 processes such as: response to different abiotic stress factors and the abscisic-acid signaling 432 pathway which is important in abiotic stress response (Tuteja 2007) (Table S2). Among these 433 genes, two of them (AT4G33360 (FLDH), AT4G33470 (HDA14)) showed strong expression 434 GxE interactions (GxE interactions were identified in a previous study (Price, et al. 2018)) when 435 Italy and Sweden plants were grown under cold acclimation conditions (4 °C) for two weeks 436 (Gehan, et al. 2015). Interestingly, FLDH is a negative regulator of the abscisic acid signaling 437 pathway (Bhandari, et al. 2010) . As shown in Fig. 5A this gene was within a region of a genetic 438 tradeoff QTL that showed a significantly high proportion of high FST and high LD SNPs. 439

Expression of FLDH under control and cold acclimation conditions was significantly lower in 440
Sweden than Italy plants (Fig. 5A). 441 442 Among the set of flowering time genes, we identified three (AT1G09530 (PIF3), AT2G21070 443 (FIO1), and AT5G57660 (COL5)) that contained high FST and LD nonsynonymous SNPs within 444 conserved coding regions. Among the three genes, PIF3 was found along a chromosomal region 445 that showed the highest CLR for a recent sweep in Sweden and a high density of SNPs showing 446 high FST and high LD (Fig. 5B). In the quest to study the genetic basis of local adaptation using genome wide associations to 490 environment, linear mixed models have emerged as a powerful tool given their ability to account 491 for population structure while testing for significant associations (Yu, et al. 2006;Kang, et al. 492 2008; Kang, et al. 2010;Zhou and Stephens 2012). Although they provide a robust statistical 493 framework for removing many false positives, the current study shows that such an approach 494 may significantly limit our ability to understand the polygenic basis of local adaptation. alleles having a small effect on fitness. A more likely explanation is that accounting for 501 population structure using genetic relatedness estimated from genome-wide SNPs, can lead to a 502 significant number of false negatives. Arabidopsis thaliana however, is a simple, highly inbred 503 species -to correctly assess the impact of accounting for population structure when examining 504 the genetic basis of local adaptation there needs to be examination of other species with more 505 complex life-history traits and evolutionary dynamics. and regulation of physiological responses to temperature (Jiang, et al. 2017). This highly 532 conserved transcription factor was found within a large region that showed significant evidence 533 of local adaptation. This chromosomal region may involve a single causative variant, or a group 534 of linked genes that interact with PIF3 and were under selection because they contributed to 535 building an advantageous phenotype (Barton and Bengtsson 1986; Yeaman and Whitlock 2011). Finally, the current study shows that we need a new statistical framework to examine genome 548 wide associations to environment, and furthermore, it increases our understanding on the genes 549 and traits that may underlie local adaptation in A. thaliana