Models are simplifying constructs that attempt to make the complexity of the real world amenable to mathematical exploration. In evolutionary biology, the neutral model of molecular evolution (Kimura 1968) is the null against which alternative hypotheses are tested. In the standard neutral model, natural populations are large, of constant size and panmictic, and segregating mutations are selectively neutral. Both demography and selection can create deviations from the expectations of the neutral model, but while the former is expected to create deviations throughout the genome, natural selection affects specific loci (and linked sites) underlying advantageous traits. Thus, in addition to its inherent interest, establishing a baseline demographic model is often a crucial first step prior to scanning for loci under selection.
Under neutrality, the amount of genetic variability in a population at a given locus depends only on the rate at which mutations arise and the effective population size. Two commonly used estimators of nucleotide variability are the proportion of segregating sites (Watterson 1975) and the average number of pairwise differences (Nei & Li 1979). At equilibrium, these two estimators should yield similar values, even though they have different properties. Namely, the proportion of segregating sites is highly influenced by rare alleles in the population, while the average number of pairwise differences is more influenced by intermediate frequency variants. Tajima’s D (TD, Tajima 1989) uses the difference between the two to detect deviations from the expected site frequency spectrum (SFS). The test statistic is expected to be close to zero if the SFS conforms to neutral expectations, but it will take negative values when the SFS is skewed towards rare variants (as would be expected under strong positive directional selection, the presence of mildly deleterious mutations or population expansions) and positive values when the SFS has an excess of intermediate frequency alleles (as expected under balancing selection or a population bottleneck). Other tests have been designed to detect deviations from the expected SFS, among them Fay and Wu’s H (FWH, Fay & Wu 2000), which is designed to detect the excess of high frequency derived variants expected after an allele hitchhikes to fixation under positive directional selection.
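To make the behaviour of these statistics concrete, the following is a minimal sketch that computes both diversity estimators, TD and an unnormalized FWH from an unfolded SFS. The function name `sfs_statistics`, its input layout and the toy spectra are assumptions made for illustration and are not taken from any of the studies discussed here. The variance constants follow the standard formulation of Tajima (1989), and FWH is taken as the difference between the average number of pairwise differences and an estimator weighted towards high frequency derived variants.

```python
from math import sqrt


def sfs_statistics(sfs):
    """Summary statistics from an unfolded site frequency spectrum.

    sfs[i - 1] is the number of sites at which the derived allele is
    carried by i of the n sampled sequences (i = 1 .. n - 1).
    Hypothetical helper written for illustration only.
    """
    n = len(sfs) + 1                      # number of sampled sequences
    s = sum(sfs)                          # number of segregating sites
    pairs = n * (n - 1) / 2               # number of pairwise comparisons

    # Average number of pairwise differences (weights intermediate frequencies).
    pi = sum(x * i * (n - i) for i, x in enumerate(sfs, start=1)) / pairs
    # Estimator weighted towards high frequency derived variants.
    theta_h = sum(x * i * i for i, x in enumerate(sfs, start=1)) / pairs

    # Watterson's estimator (weights every segregating site equally).
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    theta_w = s / a1

    # Constants for the variance of (pi - theta_w), after Tajima (1989).
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * s + e2 * s * (s - 1)

    tajima_d = (pi - theta_w) / sqrt(variance) if variance > 0 else float("nan")
    return {
        "theta_w": theta_w,
        "pi": pi,
        "tajima_d": tajima_d,
        "fay_wu_h": pi - theta_h,         # unnormalized H
    }


if __name__ == "__main__":
    # Toy spectra for eight sequences; these are not real data.
    toy_spectra = {
        "excess of rare variants": [10, 0, 0, 0, 0, 0, 0],
        "excess of intermediate variants": [0, 0, 0, 10, 0, 0, 0],
        "excess of high frequency derived variants": [0, 0, 0, 0, 0, 0, 10],
    }
    for label, spectrum in toy_spectra.items():
        stats = sfs_statistics(spectrum)
        print(f"{label}: TD = {stats['tajima_d']:.2f}, FWH = {stats['fay_wu_h']:.2f}")
```

In these toy cases an excess of singletons gives a negative TD, an excess of intermediate frequency variants gives a positive TD, and an excess of high frequency derived variants gives a negative FWH. A spectrum in which the count in class i is proportional to 1/i, the neutral expectation, gives values of zero for both statistics.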
Sitka spruce (Picea sitchensis) is the largest species of spruce and the third tallest conifer. It is endemic to the Pacific coast of North America. The range of the species is strikingly one-dimensional as it extends from California to Alaska covering 20° of latitude, yet it is not found more than 80 km east of the Pacific Ocean or its inlets (Harris 1990). During the last glacial maximum, it could be found from just south of San Francisco Bay, CA, to Puget Sound, WA (Daubenmire 1967). Within the last 13 000 years, it has retreated north slightly from its southern limit and advanced remarkably far north, having reached Kodiak Island, AK within the last 400 years (Fig. 1). This rapid expansion has resulted in low to moderate genetic differentiation among populations (0.11 > Fst > 0.03, Yeh and El-Kassaby 1980, Gapare et al. 2005, Mimura and Aitken 2007) but strong phenotypic differentiation for many growth traits (0.89 > Qst > 0.28). In particular, individuals from southern populations are characterized by growing taller, having a longer growing season and being less cold tolerant (Mimura and Aitken 2007). This suggests that Sitka spruce populations have adapted rapidly to living in extreme northern environments, and that studies of this species can be very informative in understanding and predicting how this and other species may react to future environmental changes, whether owing to natural range expansions, human-mediated introductions, or climate change.
In an effort to better understand the tempo and mode of this rapid colonization, Holliday et al. (2010) re-sequenced 153 nuclear genes in 24 trees from six populations (n = 4 per population studied), from the southernmost to the northernmost limits of the distribution of Sitka spruce (see Holliday et al. 2010, Fig. 1). The authors provide estimates of average silent site diversity and average TD and FWH for each population and the entire sample (Fig. 1). For the entire sample, the proportion of segregating sites (1.52%) is higher than the average number of pairwise differences (1.20%), indicating a skew in the SFS towards rare variants resulting in a negative average TD value (−0.56). This by itself would suggest population growth since the last glacial maximum, as seen in other tree species (e.g. Ingvarsson 2008). Using publicly available data for a closely related spruce species (Picea glauca), the authors determined that there was also an excess of high frequency derived variants in relation to intermediate frequency variants resulting in a negative average FWH (−0.36). When each population is analysed independently, a different picture emerges, and we see substantial variation among populations. Both TD and FWH show strong clinal variation, with TD becoming more positive from south to north and FWH becoming increasingly negative (Holliday et al. 2010, Fig. 2). A possible explanation for the observed pattern is a northward range expansion through successive bottlenecks from a refugium in the southern part of the distribution. The two southernmost populations have negative average TD, the two central populations have average TD near zero, and the two northern populations have positive average TD values, consistent with bottlenecks during the northward expansion. When the data are further dissected and we look at the distribution of TD values across the 153 loci in each population, we see that multiple factors are likely in play (Fig. 2a).
Every population has a bimodal distribution of TD values (with the possible exception of one southern population, which appears to be close to equilibrium), as would be expected if selection were driving at least part of the signal. In the case of Sitka spruce, though, the bimodality is likely to reflect demography because it was resilient to the removal of loci with extreme values (the top 10% of the distribution). FWH does not show such high intra-population variation (Fig. 2b). It could also be that the bimodality in TD within populations is simply stochastic, reflecting the variance in the coalescent process. Indeed, likely due to the low number of individuals sampled per population, the authors did not test whether TD and FWH were significantly different from zero.
Following up on the hypothesis that demography is the best explanation for the observed patterns, the authors use approximate Bayesian computation analyses. They show that the pattern of nucleotide polymorphism for the three northernmost populations fits the expectations under a recent bottleneck scenario, while for the three southernmost populations a model of an equilibrium population offers the best fit. Furthermore, the authors estimate the time since the bottleneck for each population and, as expected, the most northern populations have the most recent estimates, while more southern populations have much older estimates, again consistent with a scenario of a northward expansion with successive bottlenecks. While the relative estimates of bottleneck time are in good agreement with the fossil record, the absolute values are not. To estimate the age of the bottleneck, Holliday and colleagues used a mutation rate per year from Pinus. Differences in mutation rate between species could in part explain the discrepancy for the northernmost populations. However, estimating the timing of a bottleneck for the southern populations might be altogether inappropriate given that a bottleneck model was a poor fit to these populations: they may not have experienced a bottleneck in the first place or, if they did, the bottleneck is so old that it can no longer be detected in the SFS. This finding highlights the fact that natural selection and demographic shifts have only transient effects on patterns of DNA polymorphism, which are eroded as populations move towards equilibrium.
This study shows that it is increasingly feasible to move from locus-centric studies to genomic studies even in non-model species where genomic resources, let alone genome sequences, are not yet available. It also highlights the importance of the sampling scheme in population genetics and how the pooling of samples from diverse geographical areas can influence inferences about demographic history. Additionally, it reinforces the idea that the genetic and fossil records are complementary and that together they can provide information regarding the mode (through consecutive bottlenecks), timing and tempo of range expansions and the colonization of new habitats. Finally, the study shows the importance of using as many loci as possible, as individual loci show many conflicting patterns, and only when viewed in aggregate do the overall trends emerge.
Despite the strengths of this work, questions remain. Looking at individual loci reveals many layers of unexplained complexity, seemingly indicating that non-demographic factors are shaping the variation at loci throughout the genome. The natural history of this (and most) species is not simple, with bottlenecks, population expansions, selection and interspecific hybridization all playing a role. Such a complex demographic history makes it harder to identify ‘outlier’ loci that may have been subject to natural selection, as demography by itself creates considerable variability in patterns of nucleotide variation throughout the genome.