Population genomics and the causes of local differentiation


Correspondence: Stephen J. Tonsor, Fax: 412 624 4759; E-mail: tonsor@pitt.edu


Exactly 50 years ago, a revolution in empirical population genetics began with the introduction of methods for detecting allelic variation using protein electrophoresis (Throckmorton 1962; Hubby 1963; Lewontin & Hubby 1966). These pioneering scientists showed that populations are chock-full of genetic variation. This variation was a surprise that required a re-thinking of evolutionary genetic heuristics. Understanding the causes for the maintenance of this variation became and remains a major area of research. In the process of addressing the causes, this same group of scientists documented geographical genetic structure (Prakash et al. 1969), spawning the continued accumulation of what is now a huge case study catalogue of geographical differentiation (e.g. Loveless & Hamrick 1984; Linhart & Grant 1996). Geographical differentiation is clearly quite common. Yet, a truly general understanding of the patterns in and causes of spatial genetic structure across the genome remains elusive. To what extent is spatial structure driven by drift and phylogeography vs. geographical differences in environmental sources of selection? What proportion of the genome participates? A general understanding requires range-wide data on spatial patterning of variation across the entire genome. In this issue of Molecular Ecology, Lasky et al. (2012) make important strides towards addressing these issues, taking advantage of three contemporary revolutions in evolutionary biology. Two are technological: high-throughput sequencing and burgeoning computational power. One is cultural: open access to data from the community of scientists and especially data sets that result from large collaborative efforts. Together, these developments may at last put answers within reach.

The raw material for Lasky et al.'s (2012) study consists of spatial, environmental and genetic information from more than one thousand genotypes of Arabidopsis thaliana collected from 447 locations throughout the Eurasian range of A. thaliana (Fig. 1). Each accession is genotyped for more than 200 000 single nucleotide polymorphisms or SNPs (see Horton et al. 2012 and http://bergelson.uchicago.edu/regmap-data/regmap.html/). This SNP density provides gene-level marker resolution, meeting the goal of surveying the entire genome for its participation in geographical patterning. In addition, the authors make use of three geo-referenced climate databases to provide a robust description of climate in each locale. Lasky et al. (2012) determine the extent of genome-wide geographical differentiation and dissect the two major causes of range-wide patterning. The first cause they address is in situ adaptation driven by local differences in the environment, approximated by Lasky et al.'s ‘climate’. Their second cause, ‘space’, includes all spatial patterns driven by forces other than local adaptation, including both random genetic drift and phylogeography driven by ice-age vicariance and subsequent spread. Space also may represent uncharacterized spatially autocorrelated environmental gradients that are not detected by the climate variables, including factors affecting microclimate such as slope and aspect of a site as well as soil and biotic community characteristics.

Figure 1.

Arabidopsis adapts to a broad variety of climates, from population sites with prolonged winters (Pyrenees Mountains, left panel) to sites that rarely freeze and experience hot, dry springs (Mediterranean shore near the French–Spanish border, right panel). Arabidopsis in natural populations (centre panel, author's ring for scale) exhibits phenotypes that can be vastly different from what is typically observed in laboratory settings.

Teasing apart this mix of potential causes across both the species' geographical range and its genome presents an analytical challenge because of the high dimensionality of the system. Lasky et al. (2012) essentially follow on Wright's (1932) seminal vision of genomic variation as describing a multidimensional space in which each gene represents one or more axes of variation. For the genome of A. thaliana, this space has well over 30 000 dimensions if all transcribed and additional non-transcribed regulatory sequences are counted. For Lasky et al., this space is described by the allelic variation in 200 000 SNPs. Geographical space may seem to be simpler, describable in two dimensions, but what matters here is the dimensionality of the matrix of distances between genotype locales on Earth's approximately spherical surface. For n = 447 locales in this study, this geographical space has n(n−1)/2 = 99 681 dimensions, nearly as daunting as the genetic space. And environmental variation, even if confined strictly to climatic variation, also has many dimensions. With each type of data, however, the many dimensions are not entirely orthogonal. What to do? Lasky et al. (2012) use sophisticated approaches to first find the important dimensions in each data matrix and then dissect key relationships amongst them.
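
The pairwise-distance count above is easy to verify. The following is a minimal sketch, not taken from Lasky et al. (2012): the function names are hypothetical, and the haversine formula with a mean Earth radius of 6371 km is an assumed stand-in for whatever distance measure the authors actually used.

```python
import math

def pair_count(n: int) -> int:
    """Number of distinct pairwise distances among n locales: n(n-1)/2."""
    return n * (n - 1) // 2

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two points given in decimal degrees.

    Assumes a spherical Earth of radius radius_km (an approximation).
    """
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # Clamp to 1.0 to guard against rounding error before taking asin.
    return 2 * radius_km * math.asin(min(1.0, math.sqrt(a)))

# 447 locales give 99 681 pairwise distances, as stated in the text.
n_pairs = pair_count(447)
```

Each of those distances would be one entry in the lower triangle of the matrix of distances between locales.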

Using methods that have previously been applied in community ecology (see Dray et al. 2006), and building on Manel et al.'s (2010) approach, Lasky et al. (2012) first reduce the distance matrix using principal coordinates of neighbour matrices (PCNM—Borcard & Legendre 2002). Much like principal components analysis (and Moran's eigenvector maps; see Dray et al. 2006), PCNM calculates eigenvectors that reflect the hierarchy of spatial scales in which the genotypes are clustered. This provides the spatial framework in which associations of climate patterning and genetic variation can be addressed. The authors then use redundancy analysis (RDA—see Makarenkov & Legendre 2002) to ask how much of the SNP variation can be predicted by climate and space. RDA is to canonical correlation as linear regression is to simple product-moment correlation; it predicts values of a multivariate set of response variables (SNPs) based on the predictor variables (space and climate), taking into account covariation in both predictor and response variables. RDA does this by maximizing the variance explained in the response variables by the predictors and creates orthogonal axes of linear combinations of response and predictor variables. The variance in response variables accounted for by predictor variables can be partitioned into components attributable to climate, space and climate–space confounded. Thus, this study provides an invaluable primer of multivariate statistical approaches for dissecting population genomic spatial patterning and its causes.
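
The variance partitioning described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' pipeline: the function names are hypothetical, SNPs are assumed to be coded numerically (for example as 0/1 allele indicators), the predictor matrices are assumed to have full column rank, and the adjustment shown is the standard adjusted R² formula, which may differ in detail from what Lasky et al. (2012) used.

```python
import numpy as np

def rda_r2(Y, X):
    """Share of total variance in response matrix Y (samples x SNPs)
    accounted for by multivariate linear regression on predictors X."""
    Yc = Y - Y.mean(axis=0)          # centre responses
    Xc = X - X.mean(axis=0)          # centre predictors
    coef, *_ = np.linalg.lstsq(Xc, Yc, rcond=None)
    fitted = Xc @ coef               # RDA ordinates these fitted values
    return (fitted ** 2).sum() / (Yc ** 2).sum()

def adjusted_r2(r2, n, p):
    """Standard adjustment for n samples and p predictors."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

def partition_variance(Y, climate, space):
    """Split explained variance into climate-only, space-only and shared."""
    n = Y.shape[0]
    both = np.hstack([climate, space])
    r_c = adjusted_r2(rda_r2(Y, climate), n, climate.shape[1])
    r_s = adjusted_r2(rda_r2(Y, space), n, space.shape[1])
    r_cs = adjusted_r2(rda_r2(Y, both), n, both.shape[1])
    return {
        "climate_only": r_cs - r_s,      # climate after removing space
        "space_only": r_cs - r_c,        # space after removing climate
        "shared": r_c + r_s - r_cs,      # confounded climate-space fraction
        "total": r_cs,
    }
```

Here `space` would hold the retained PCNM eigenvectors and `climate` the climate variables for each accession. The three fractions sum to the total by construction, and the shared fraction can be negative when the two predictor sets suppress one another.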

Lasky et al. (2012) show that climate and spatial associations together explain 23% of SNP variation. Climate variation that is independent of spatial variation accounts for about 25% of the total adjusted r² of 0.23 for climate and space considered jointly, that is, about 6% of the total SNP variation. Spatial patterning that is independent of climate accounts for a nearly equal amount, 31% of the combined r² and 7% of the total SNP variance. The remainder of the explained variation, a little under half, is associated with both climate and space, covarying in such a way that they cannot be teased apart (see Lasky et al.'s supplemental materials for details). Climatically, minimum growing season temperatures and summer precipitation explained the greatest amount of SNP variation, in agreement with Hoffman's (2005) previous bioclimatic study. Spatially, the greatest variation was explained by an axis that separates Northern from Western European genotypes.
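
The conversion from shares of the explained variance to shares of the total SNP variation is simple arithmetic. The sketch below uses only the rounded figures quoted in this paragraph; the variable and function names are hypothetical, and the shared fraction is inferred as the remainder rather than taken from Lasky et al.'s tables.

```python
TOTAL_ADJ_R2 = 0.23  # climate and space jointly, as quoted in the text

# Shares of the explained variance quoted in the text.
shares_of_explained = {"climate_only": 0.25, "space_only": 0.31}
# The remainder is confounded between climate and space (inferred here).
shares_of_explained["shared"] = 1.0 - sum(shares_of_explained.values())

def as_percent_of_total(share: float, total_r2: float = TOTAL_ADJ_R2) -> float:
    """Convert a share of explained variance to a percentage of all SNP variation."""
    return 100.0 * share * total_r2

percent_of_total = {k: as_percent_of_total(v) for k, v in shares_of_explained.items()}
```

With these inputs, climate alone comes to roughly 6% of total SNP variation, space alone to roughly 7%, and the confounded fraction to roughly 10%.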

The magnitude of the variation that cannot be separated into the two focal components illuminates the difficulty of separating the effects of population structure from those of environmental gradients. A complementary approach to that of Lasky et al. (2012) is to focus on a finer spatial scale in which strong environmental gradients can partially or wholly dissociate these factors, as in the studies by Montesinos-Navarro et al. (2011) and Lewandowska-Sabat et al. (2012).

How much variation in SNPs should we expect to be explained by associations with climate and spatial patterning? There is no theoretical expectation available. That 23% of the SNP variation, spread around the genome, is predicted by climate and space suggests geographical differentiation throughout the genome. Widespread differentiation within the genome because of drift and phylogeography is not surprising. Adaptive differentiation to climate is another story. If this were a highly vagile outcrossing species, one could conclude that something in the range of 6–16% of the genome is included in the genetic architecture of climatic adaptations (i.e. variation strictly associated with climate or associated with both climate and space, respectively; see Lasky et al.'s Table S4), suggesting a highly polygenic and highly genomically distributed basis to adaptation. However, A. thaliana is highly inbred. It is therefore unknown how much the linkage disequilibrium amongst SNPs inflates the apparent extent of genomic involvement in the genetic architecture of adaptation. Some light is shed on this question by the observation that there is an excess of SNP variation explained by climate in coding vs. non-coding regions. This suggests that much of the detected association with climate is adaptive rather than being a result of many SNPs hitchhiking on a few adaptive differences. Lasky et al. (2012) have shown us that many genes are involved in climate adaptation in this widespread species.

The PCNM eigenvector associated with the smallest spatial scale explained little SNP variation. This contrasts with other recent studies. Manel et al. (2010), using very similar approaches to Lasky et al. but in the closely related Arabis alpina, found the highest number of loci to be adaptively differentiated at their most local scale. Two studies also in A. thaliana are noteworthy in demonstrating local adaptive divergence in functional traits, suggesting that one will find underlying differentiation in SNPs as well. Montesinos-Navarro et al. (2011, 2012) observe populations on Spanish Mediterranean to Pyrenean climate and altitude gradients that are strongly adaptively differentiated in life history, morphological and functional traits. Lewandowska-Sabat et al. (2012) report substantial variation in vernalization requirements on coastal-inland and elevation gradients in Norway. Lasky et al. (2012) suggest that the lack of climate and space patterning of SNPs at their lowest spatial scale may be due to the preponderance of European accessions, where the greatest modern admixture has occurred. Montesinos-Navarro et al. (2011, 2012) and Lewandowska-Sabat et al. (2012) studied populations at the southern and northern limits of the range, respectively, where admixture is greatly reduced (see Picó et al. 2008 for the Spanish populations), and the opportunity for local adaptive differentiation therefore may be enhanced. Lasky et al. (2012) similarly find the greatest degree of independent explanatory power for climate in Scandinavian genotypes. It may be that adaptation to local conditions is more strongly evolvable at the range limits because of differences in population isolation, strength/divergence of selection or both.

The next step is perhaps the most difficult: elucidating the functional genomics of climate-associated differentiation. Whether conducted bottom-up by starting with candidate genomic regions identified through the approach of Lasky et al. (alternative top-down approaches: Hancock et al. 2011; Fournier-Level et al. 2011) or top-down by locating genes associated with functional differentiation across climate gradients, the biggest challenge will be to bring field phenotyping to a level of throughput that matches the impressive prowess of collective genotyping and statistical computation demonstrated in this study.

Although S.J.T. was the sole author, he is indebted to too many people to enumerate for any ideas that would otherwise appear to be original.