Evolutionary inference from QST


Michael C. Whitlock, Fax: (604) 822-2416; E-mail: whitlock@zoology.ubc.ca


QST is a commonly used metric of the degree of genetic differentiation among populations displayed by quantitative traits. Typically, QST is compared to FST measured on putatively neutral loci; if QST > FST, this is taken as evidence of spatially heterogeneous and diversifying selection. This paper reviews the uses, assumptions and statistics of QST and FST comparisons. Unfortunately, QST/FST comparisons are statistically challenging. For a single trait, QST must be compared not to the mean FST but to the distribution of FST values. The sources of biases and sampling error for QST are reviewed, and a new method for comparing QST and FST is suggested. Simulation results suggest that the distributions of neutral FST and QST values are little affected by various deviations from the island model. Consequently, the distributions of QST and FST are well approximated by the Lewontin–Krakauer prediction, even with realistic deviations from the island-model assumptions.


When a species is spread over a heterogeneous landscape, that heterogeneity has two large effects. First, individuals in different parts of a species range are likely to experience different environments, which is in turn likely to cause differences in the selective pressures the organisms face. Second, heterogeneity across the landscape may be so severe that habitats that are capable of sustaining a population of a particular species are spatially disjunct, causing that species to be subdivided over space. Local adaptation is enhanced by strong selective differences among populations (which create genetic differences among populations) and opposed by migration (which reduces the genetic differentiation of populations).

The study of local adaptation is complicated by the fact that genetic differentiation among populations can arise by genetic drift. To account for this alternative hypothesis, Spitze (1993) used the work of Wright (1951) and Lande (1992) to develop a measure of the genetic differentiation among populations for quantitative traits. This measure was called QST, as it is intentionally defined to parallel the single locus measure of a metapopulation's genetic differentiation, FST. Lande (1992) showed that, under some circumstances, the mean of QST would be the same as the mean FST, if the trait and the locus were neutral with respect to selection.

In principle, the QST of a trait or of a suite of traits can be compared to the FST calculated from a set of loci thought to be selectively neutral. If the QST of the trait(s) is significantly greater than the FST of the neutral loci, this would provide evidence that the trait has diversified more than would be expected by genetic drift alone. Conversely, if QST were significantly lower than the FST values would predict, we would have evidence that the traits were under stabilizing selection that maintained the same value in each population even in the face of drift. If QST values are approximately the same as FST, then we would have little evidence that selection acted in a spatially heterogeneous way. Thus, a growing number of authors have followed the leads of Spitze (1993) and Prout & Barker (1989) to measure and compare FST and QST in a large number of systems (see reviews by Lynch et al. 1999; Merilä & Crnokrak 2001; McKay & Latta 2002; Howe et al. 2003; Leinonen et al. 2008). In the past year alone, there have been a large number of QST and related studies (e.g. Evanno et al. 2006; Jorgensen et al. 2006; Leinonen et al. 2006; Johansson et al. 2007; Knopp et al. 2007; Manier et al. 2007; Notivol et al. 2007; Raeymaekers et al. 2007). Approximately 70% of reported QST values are greater than the average FST in the same study (Leinonen et al. 2008).

In practice, the estimation of QST and FST is technically challenging, and comparing them effectively is difficult. This paper reviews the challenges of using QST and discusses the implications when these challenges are insufficiently met. I shall start by discussing the conclusion that neutral FST should equal QST for neutral traits. A major challenge is that the values of FST for neutral loci and QST for neutral traits are expected to be extremely variable, and it therefore becomes essential to show not only that a given QST value is greater than the mean FST, but also that it can be found in the tail of the distribution of possible values. I will discuss the sources of variation in FST and QST among loci and traits. I will examine some of the ways in which QST can be used and some of the alternative approaches for the questions addressed by QST studies. The comparison of QST and FST is a very useful technique for some questions, but for other questions, other methods are more efficient.

Does FST = QST for neutral traits?

The calculation of QST for a trait requires two quantities: the additive genetic variance of the trait within a population (VA,within) and the genetic variance among populations (VG,among). For diploids, QST is calculated as

$$Q_{ST} = \frac{V_{G,\mathrm{among}}}{V_{G,\mathrm{among}} + 2V_{A,\mathrm{within}}}$$

For haploids, the same equation applies, but without the ‘2’ in the denominator. [That ‘2’ for the diploid case comes from the fact that the quantitative genetic variance among populations is proportional to two times FST (Wright 1951).]
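As a minimal sketch, the calculation just described (the among-population genetic variance divided by itself plus twice the within-population additive variance for diploids, or plus once that variance for haploids) can be written as a small Python function. The function name `qst` and the `ploidy` argument are illustrative choices rather than part of any published software, and the variance components are assumed to have been estimated already.

```python
def qst(v_g_among, v_a_within, ploidy=2):
    """Return QST from two variance components.

    ploidy=2 uses the diploid form V_G,among / (V_G,among + 2 * V_A,within);
    ploidy=1 drops the factor of 2, as described for haploids.
    """
    if ploidy not in (1, 2):
        raise ValueError("ploidy must be 1 (haploid) or 2 (diploid)")
    if v_g_among < 0 or v_a_within < 0:
        raise ValueError("variance components must be non-negative")
    denominator = v_g_among + ploidy * v_a_within
    if denominator == 0:
        raise ValueError("QST is undefined when both variances are zero")
    return v_g_among / denominator


# Hypothetical variance components, chosen only for illustration.
print(qst(1.0, 2.0))            # diploid case
print(qst(1.0, 2.0, ploidy=1))  # haploid case
```

With the same variance components, the haploid value is always at least as large as the diploid one, because the within-population term carries less weight in the denominator.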

For neutral traits that are genetically controlled by purely additive genes, the mean QST is equal to the mean FST of neutral genetic loci. This has been shown for the island model by Lande (1992) and extended to any population structure by Whitlock (1999). These analyses, however, make several assumptions. First, they assume that the traits and genes are neutral; this assumption is the null hypothesis for the traits, but it is usually an untested assumption for the loci being measured for FST (Whitlock & McCauley 1999).

Second, these analyses assume that the traits are determined by genes that interact additively both within (i.e. no dominance) and between (i.e. no epistasis) loci. The simplest case of epistasis (additive-by-additive epistasis) was considered briefly by Whitlock (1999), who found that neutral traits exhibiting additive-by-additive epistasis would on average have QSTs less than neutral FST. In this case, the comparison of QST and FST to look for diversifying selection would be conservative, but testing for spatially uniform selection would be biased anticonservatively.

With dominance, QST is equal to or less than FST for neutral traits and loci, if the population is evolving by the assumptions of an island model (Goudet & Büchi 2006; Goudet & Martin 2008). However, other population histories may not give the same result. López-Fanjul et al. (2003, 2008) have shown that under a pure drift model of divergence from a common ancestral population (without any migration), neutral QST can be expected to exceed FST under some circumstances, although this is unlikely for traits affected by multiple loci (Goudet & Martin 2008). Other models of population demography have yet to be considered for the effect of dominance on QST. In general, it seems that dominance and simple epistasis are unlikely to cause a large increase in QST for a neutral trait, but these nonadditive genetic factors can easily cause the QST of a neutral trait to be much lower than FST. Given that dominance and epistasis affect a large fraction of traits (Crnokrak & Roff 1995; Whitlock et al. 1995), this is a major obstacle to interpreting low QST as evidence for spatially uniform stabilizing selection.

Other genetic complications can also make the comparison between QST and FST difficult. Quantitative traits may be affected by genes located on the sex chromosomes or by cytoplasmic factors. Genes on sex chromosomes and cytoplasmic factors that are inherited uniparentally are subject to greater genetic drift than autosomal loci. If the nonautosomal genetic component of a neutral trait is large, then on average its QST will be larger than expected by FST of neutral autosomal loci. Linkage disequilibria can contribute strongly to QST (Latta 1998, 2003), but on average the effects of linkage disequilibria are zero for neutral traits. As a result, under the null hypothesis that traits are neutral, linkage disequilibrium should not affect the mean value of QST.

Experimental artefacts can also cause FST and QST to differ. As we shall discuss below, FST, and particularly QST, can be estimated with bias. Given that the sources of bias are different in the two measures, in practice, estimates of FST are unlikely to equal estimates of QST even for neutral loci and traits. Many of these biases can be avoided with careful experimental design. It is also essential that the methods used to calculate FST and QST both calculate variance among groups in the same way, e.g. by dividing by the number of populations minus one. Weir & Cockerham's (1984) method will match a QST based on a typical analysis of variance, for example.

Moreover, FST and QST should not be considered an intrinsic property of the species being studied; the demographic parameters that affect one part of a species range are not necessarily the same as those that affect another region. Therefore, it is crucial for valid comparisons that the FST and QST measurements be taken on exactly the same collection of populations.

Finally, the FST and QST of neutral traits and loci are expected to vary greatly, even for a given mean value. Figure 1 shows the distributions expected from simulations of a simple island model of neutral genes and traits. While the mean FST and QST values in these simulations are as expected by theory, any given locus or trait can be very different from that expectation. These simulations (along with all others reported in this study) used exclusively biallelic loci, with FST measured by the Weir & Cockerham (1984) method. The next two sections will explore the sources of variation in FST and QST, and the remainder of this review will examine the consequences of this variation.

Figure 1.

The distribution of neutral FST and QST. Each value in these histograms is derived from a single neutral locus or a single neutral trait in an island model with 10 demes sampled. The solid lines are the χ2 distribution predicted by Lewontin & Krakauer (1973) and are the same in both figures. For these simulations, the local population size was N = 100, and the migration rate was m = 0.05. The traits were controlled by five unlinked loci with mutational effects chosen from an exponential distribution.

Distribution of FST

Even though FST is often assumed to be the same for all loci, in fact, estimates of FST are extremely heterogeneous among loci. For example, Fig. 2 shows the distribution of FST over a large number of loci from three human populations (Akey et al. 2002) and 12 Drosophila populations (Singh & Rhomberg 1987). In both cases, the range of observed values is extremely large, and the distribution is skewed to the right.

Figure 2.

Distributions of FST from (left panel) 25 549 autosomal SNPs from humans (Akey et al. 2002) and (right panel) 61 allozyme loci from Drosophila melanogaster (Singh & Rhomberg 1987).

FST estimates vary among loci for four reasons: direct selection, indirect effects of selection, sampling error during observation, and genetic drift. The first two of these cause FST to differ on average among loci, the third causes noise in our estimates of FST, and the fourth implies that even with perfect knowledge of the allele frequencies of a given set of populations, FST as a measure of the genetic differentiation among populations will differ from locus to locus. We will consider each in turn, with particular emphasis on the variation induced by heterogeneity in the genealogical history of the populations.

Selection on the measured loci

If some loci are exposed to spatially heterogeneous selection but other loci are not, then FST will be heterogeneous among loci. Such heterogeneity is often used as a signal of selection. For example, of the 61 allozyme loci surveyed by Singh & Rhomberg (1987), eight show significant evidence of geographically heterogeneous selection, and two of the 21 loci studied in Atlantic cod by Pogson et al. (1995) had FST values much greater than expected by a neutral model (Beaumont & Nichols 1996; Beaumont & Balding 2004). However, for most of these loci, it is not known whether selection is directly on the genotyped locus or at a closely linked site (see the next point).

Selection is only expected to systematically and strongly affect FST when the difference in fitness among alleles across populations is greater than the migration rate (Whitlock 2002). Exceptionally high or low values of FST caused by selection will only occur when selection is strong relative to migration.

Selection on linked loci

Selection on linked loci can affect the FST of a locus (Charlesworth et al. 1997). The magnitude of the effect depends on the strength of selection, the recombination rate between the selected site and the marker locus, and the migration rate (Ingvarsson & Whitlock 2000). Selection must be fairly strong on tightly linked loci for the FST of a particular locus to be substantially increased by indirect selection.

Sampling error

The samples we use to estimate FST are almost always a small subset of the total population, and as a result the estimates are an imperfect reflection of the true state of the population. The measured variance among loci in their FST will be decreased by increasing the number of individuals sampled per population, increasing the number of populations sampled, and/or choosing loci with higher heterozygosity (Beaumont & Nichols 1996). In addition, we will have a more robust estimate of the expected variance in FST if the number of loci used is higher, because this will increase the precision of the estimate of the mean FST. In general, to get reliable estimates of FST it is better to include more demes (> 10) and more loci (> 20) than to have large (i.e. > 25) numbers of individuals for each deme (Beaumont & Nichols 1996).

The variance in FST among neutral loci due to sampling and evolutionary history can be estimated using coalescent simulations (Beaumont & Balding 2004). Given that the distribution of FST is roughly equivalent across different demographies (for a given expected FST; see below), this coalescent process is an efficient means to calculate the null distribution in the absence of selection.

Drift and heterogeneous coalescent histories

FST is a function of local allele frequencies, and genetic drift can cause any function of allele frequency to change stochastically. Different loci may share the same demographic history, but each locus will have its own idiosyncratic genealogical history. As a result, FST can vary from locus to locus because of heterogeneity in the details of the coalescent process for each locus. Even for loci that are unaffected by selection and that are exhaustively sampled, variation in evolutionary history causes heterogeneity among loci in FST.

This heterogeneity may be illustrated with a simple example. Imagine two populations connected by limited gene flow and with small effective population sizes (Fig. 3). By chance, the allele frequencies sometimes drift to be more different than expected on average, which leads to a higher than expected FST. At other times in the history of the two populations, the allele frequencies at a locus converge, causing the FST at that time point to be small. Over time, the FST at the locus will average out to a predictable value, but at any given moment, FST will differ by drift from that expectation. Similarly, when we measure FST at a given point in time at multiple loci, some loci will be caught at an FST value higher than the overall mean, and others will be caught at a point of low FST. This heterogeneity in FST caused by drift is in many cases the largest component of heterogeneity in FST among loci.

Figure 3.

Drift in allele frequency causes FST to vary. During the period of time marked by the left gray area, these two demes had allele frequencies that were very different, and so FST was high. During the time marked by the gray box on the right, allele frequencies happened to be quite similar, resulting in a low FST. Parameters were held constant at N = 100 and m = 0.01.
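The kind of trajectory shown in Fig. 3 can be sketched with a short Wright–Fisher simulation of one biallelic locus in two demes, using the parameter values given in the caption (N = 100, m = 0.01). This is only an illustrative sketch and not the code used for the figure: the function name `simulate_two_deme_fst` is hypothetical, mutation is omitted, migration is applied deterministically before binomial drift, and FST is computed as the simple ratio Var(p) / [p̄(1 − p̄)] rather than by the Weir & Cockerham (1984) estimator.

```python
import random


def simulate_two_deme_fst(n=100, m=0.01, generations=300, p0=0.5, seed=3):
    """Track FST at one neutral biallelic locus in two demes over time.

    n is the number of diploid individuals per deme and m is the migration
    rate between the demes. Both demes start at allele frequency p0.
    """
    rng = random.Random(seed)
    copies = 2 * n                      # gene copies per diploid deme
    p = [p0, p0]
    fst_series = []
    for _ in range(generations):
        # Migration: each deme receives a fraction m of its genes from the other.
        mixed = [(1 - m) * p[0] + m * p[1], (1 - m) * p[1] + m * p[0]]
        # Drift: binomial sampling of 2N gene copies in each deme.
        p = [sum(rng.random() < q for _ in range(copies)) / copies for q in mixed]
        p_bar = (p[0] + p[1]) / 2
        het = p_bar * (1 - p_bar)
        if het == 0:                    # allele fixed or lost everywhere
            break                       # FST is undefined from here on
        var_p = ((p[0] - p_bar) ** 2 + (p[1] - p_bar) ** 2) / 2
        fst_series.append(var_p / het)
    return fst_series


series = simulate_two_deme_fst()
print(f"generations recorded: {len(series)}")
print(f"min FST = {min(series):.3f}, max FST = {max(series):.3f}")
```

Changing the seed gives a different trajectory with the same demographic parameters, which is the single-locus analogue of measuring FST at different neutral loci at one point in time.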

The distribution of FST depends on the number of populations in the sample. As the number of local populations included in a study increases, we get a larger sample of the possible range of the evolutionary process, and the resulting FSTs become less heterogeneous. The variance can be reduced substantially by increasing the number of demes sampled (see Fig. 4). However, if the number of demes in the real population is limited, then there is a minimum variance in FST caused by heterogeneity in the populations’ histories of drift which cannot be avoided by larger samples. If the number of populations is relatively small, then this heterogeneity is usually the dominant source of variance in FST among loci. For example, in the human data of Akey et al. (2002) shown in Fig. 2, only three populations (Africa, Europe and Asia) were sampled, and at that spatial scale these represent the bulk of human history. Due to the small number of populations sampled, nearly all of the variance among loci that they observed is likely to be due to variance in genealogical history among loci.

Figure 4.

With larger numbers of demes, FST will not vary as much over time or over loci. The dashed lines show the FST measured across two demes in successive generations (using the same data as Fig. 3). With 20 demes, FST is much more constant over time and across loci. Parameters were held constant at N = 100 and m = 0.01.

Lewontin & Krakauer (1973), looking for a test to discriminate loci that may be under geographically heterogeneous selection, found that the distribution of FST among loci can be predicted with a χ2 distribution. More specifically, $(n_{demes} - 1)F_{ST}/\bar{F}_{ST}$ has a χ2 distribution with (ndemes – 1) degrees of freedom, where ndemes is the number of demes in a study and $\bar{F}_{ST}$ is the mean FST value. As a result, Lewontin and Krakauer predict that the variance among neutral loci should be $2\bar{F}_{ST}^{2}/(n_{demes} - 1)$.
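The Lewontin–Krakauer prediction can be explored numerically. The sketch below assumes that (ndemes – 1) times FST divided by the mean FST follows a χ2 distribution with (ndemes – 1) degrees of freedom, so that the predicted variance among neutral loci is twice the squared mean FST divided by (ndemes – 1). It draws values from that distribution using only the Python standard library. The function names are hypothetical, and the inputs (mean FST = 0.048, 10 demes) are taken from the island-model example in this paper purely for illustration.

```python
import random
import statistics


def lk_predicted_variance(mean_fst, n_demes):
    """Variance of FST among neutral loci predicted by Lewontin-Krakauer."""
    return 2.0 * mean_fst ** 2 / (n_demes - 1)


def sample_lk_fst(mean_fst, n_demes, n_loci, seed=1):
    """Draw FST values such that (n_demes - 1) * FST / mean_fst is chi-square."""
    rng = random.Random(seed)
    df = n_demes - 1
    # A chi-square variate with df degrees of freedom is Gamma(shape=df/2, scale=2).
    return [mean_fst * rng.gammavariate(df / 2.0, 2.0) / df for _ in range(n_loci)]


mean_fst, n_demes = 0.048, 10
draws = sample_lk_fst(mean_fst, n_demes, n_loci=100_000)

predicted = lk_predicted_variance(mean_fst, n_demes)
simulated = statistics.pvariance(draws)
upper = sorted(draws)[int(0.975 * len(draws))]   # empirical 97.5th percentile

print(f"predicted variance: {predicted:.6f}")
print(f"simulated variance: {simulated:.6f}")
print(f"97.5th percentile:  {upper:.3f}")
```

Because the χ2 distribution is skewed to the right when the number of demes is small, the upper percentile lies much further from the mean than the lower one, which is why a value above the mean FST is not by itself unusual.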

However, the Lewontin–Krakauer test has attracted much criticism. Nei & Maruyama (1975) and Robertson (1975) argued that if some of the demes included in a study were correlated (thereby reducing the effective number of demes), or if some demes had a substantially different demographic history than others, then the Lewontin–Krakauer calculations would underestimate the expected variance among neutral loci in FST. Lewontin & Krakauer (1975) agreed, and as a result the method quickly fell from favour. Recently, however, Beaumont & Nichols (1996, see also Beaumont 2005) have shown that for many reasonable models of population structure, FST is approximately distributed as predicted by Lewontin–Krakauer, even if the conditions specified above do not strictly hold. They considered several models of population subdivision, including the island model, a pure colonization model, ‘heterogeneous’ models in which population size and/or migration rate varied greatly among local populations, and a two-dimensional stepping-stone model. The stepping-stone model in particular addressed the specific concerns of Nei & Maruyama (1975); the variance in FST was not much greater with the stepping-stone model than expected by Lewontin–Krakauer's result. If the sampling scheme avoids choosing adjacent populations, then the Lewontin–Krakauer method does very well at predicting the distribution of FST among loci, even with the two-dimensional stepping-stone model. I conducted new simulations of a one-dimensional stepping-stone model (see Fig. 5, top row, and Table 1). The one-dimensional stepping-stone model has an even stronger correlation structure than the two-dimensional case, but the results confirm the conclusions of Beaumont & Nichols (1996).

Figure 5.

FST and QST distributions for neutral loci and traits under alternative metapopulation demographies. Parameters were chosen so that the expected FST would be approximately equal to 0.045 for each case. The top row shows results for a one-dimensional stepping-stone model, with every third population sampled (N = 100, migration among adjacent demes m′ = 0.08, migration among random demes m = 0.01). The middle row shows a ‘two-type’ model: a heterogeneous island model with two types of demes (N1 = 100, m1 = 0.2, N2 = 30, m2 = 0.067, where m defines the immigration rates). The bottom row shows results for a model with local extinctions and colonizations (N = 100, m = 0.1, extinction rate = 0.1, propagule pool colonization with k = 12; see Whitlock & McCauley 1990). In all cases, the variance among traits and among loci is slightly greater than predicted by the Lewontin–Krakauer χ2 distribution, but the deviation is small. The greatest discrepancy is for the ‘two-type’ model.

Table 1. Summary statistics for FST and QST from various demographic models, based on 10^5 replicates per model

| Demographic model | Mean FST | Mean QST | Variance FST | Variance QST | Lewontin–Krakauer-predicted variance | Type I error rate* (α = 0.025) | Predicted 97.5% percentile QST (based on Lewontin–Krakauer) | Observed 97.5% percentile QST |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Island model† | 0.048 | 0.047 | 0.00047 | 0.00045 | 0.00050 | 0.0195 | 0.101 | 0.097 |
| Island model‡ | 0.195 | 0.189 | 0.00656 | 0.0054 | 0.0084 | 0.0038 | 0.412 | 0.350 |
| Two-type island model§ | 0.050 | 0.049 | 0.00075 | 0.00071 | 0.00055 | 0.0385 | 0.106 | 0.115 |
| Stepping stone†† | 0.052 | 0.051 | 0.00092 | 0.00086 | 0.00088 | 0.0244 | 0.124 | 0.123 |

  • *For a one-sided 2.5% rejection region for high values.

  • †Island model with 10 demes of 100 individuals each with migration rate equal to m = 0.05.

  • ‡Island model with 10 demes of 100 individuals each with migration rate equal to m = 0.01.

  • §Island model composed of five demes of size 100 with m = 0.2 and five demes of 30 individuals with m = 0.067.

  • ¶Island model with addition of local extinction with probability 0.1 per generation per population. Each deme has 100 individuals and m = 0.1; colonization happens in extinct demes with twelve individuals by the propagule pool model.

  • **Island model with addition of local extinction with probability 0.08 per generation per population. Each deme has 100 individuals and m = 0.1; colonization happens in extinct demes with two individuals by the propagule pool model.

  • ††One-dimensional stepping-stone model with 100 individuals for each of 20 demes, m = 0.08 for migration among adjacent demes, and m = 0.01 random migration. Every third deme was sampled.

The only demographic model considered by Beaumont & Nichols (1996) that did not match the Lewontin–Krakauer prediction was the ‘heterogeneous’ model, in which demes were expected to have either a high FST (0.67) or a lower FST (0.048). With this heterogeneous demography over space, loci were more heterogeneous than expected by the Lewontin–Krakauer calculations. With such a population structure, the distribution of allele frequency across populations should be leptokurtic, which should cause the heterogeneity of FST to increase. However, with less extreme variation among populations (FST = 0.012–0.12) than that studied by Beaumont and Nichols, the Lewontin–Krakauer expectation is not far off the true distribution (Fig. 5, middle row; Table 1).
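A rough sense of how different the two deme types in the ‘two-type’ model are can be had from Wright's classical island-model approximation, FST ≈ 1/(1 + 4Nm). That formula is not stated in this paper and is used here only as an assumption: it treats each deme type as if it belonged to its own infinite island model with small m, so the numbers are indicative rather than exact. The function name `island_fst` is hypothetical; the parameter values are those given for the two-type model (N1 = 100, m1 = 0.2; N2 = 30, m2 = 0.067).

```python
def island_fst(n, m):
    """Approximate equilibrium FST for diploids in an island model.

    Uses Wright's approximation 1 / (1 + 4 * N * m), which assumes many
    demes, a small migration rate and negligible mutation.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 < m <= 1:
        raise ValueError("m must be in (0, 1]")
    return 1.0 / (1.0 + 4.0 * n * m)


# Deme types from the 'two-type' heterogeneous island model.
deme_types = {"type 1 (N=100, m=0.2)": (100, 0.2),
              "type 2 (N=30, m=0.067)": (30, 0.067)}

for label, (n, m) in deme_types.items():
    print(f"{label}: approximate FST = {island_fst(n, m):.3f}")
```

Under this approximation the two deme types come out at roughly 0.012 and 0.11, an order-of-magnitude difference in expected differentiation that is still far less extreme than the contrast between 0.67 and 0.048.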

One model of population structure that has not been considered from this perspective is the extinction/recolonization model of Slatkin (1977) and Whitlock & McCauley (1990). Under this model, some local populations go extinct and other new populations are founded in each generation. Fig. 5 (bottom row; see also Table 1) shows the distribution of FST over neutral loci and QST for neutral traits. While the variance over traits or loci is slightly higher than predicted by the Lewontin–Krakauer χ2 predictions, the difference is not great.

These simulations imply that, at least for a wide range of plausible demographic histories and for typical intraspecific values of FST, the distribution of neutral FST can be fairly well predicted from the mean FST of neutral loci. This is true even in the absence of specific knowledge about the evolutionary and demographic history of the populations. These predictions will do well for populations that match the assumptions of Lewontin–Krakauer, with roughly equal demographic histories for each population. The predictions will provide only an approximate guide for populations that deviate from these assumptions, when the demes vary greatly in their demographic histories, but these approximate predictions for the distribution of FST will be close enough for some kinds of inference. The distribution among loci becomes less variable if the number of populations sampled is increased and is more variable if the mean FST value is larger. For situations where FST is greater than 0.1, however, Lewontin–Krakauer performs less well (Table 1). The use of Lewontin–Krakauer for high FST values should therefore be avoided.

Distribution of QST

For neutral traits, the distribution of possible QST values is approximately the same as the distribution of neutral FST values for a single locus (see Figs 1 and 5). Even though a trait may be controlled by many genes, the variation in evolutionary history does not reduce the heterogeneity in QST relative to FST, because drift also induces linkage disequilibrium at random among loci affecting a trait (Rogers & Harpending 1983). In effect, the increases in trait value in a deme caused by drift at one locus are balanced by decreases by drift at other loci; the consequence is that QST of one trait has the same amount of information about evolutionary history as does the FST of a single locus.

While this generalization is approximately true, simulations show that the variance in neutral QST is usually slightly lower than that of neutral FST, when sampling error within demes in estimating FST and QST is ignored (see Table 1). For the island model, the mean FST and mean QST match fairly well, but the variance in QST is very slightly smaller than predicted from the distribution of FST. For extinction–colonization models, stepping-stone models, and heterogeneous island models, QST has a distribution with somewhat lower variance than the distribution of FST in the cases that have been simulated (see Fig. 5). In some cases, the variance of QST is somewhat larger than predicted by the Lewontin–Krakauer χ2 distribution, but the deviation is lower for QST than FST (see Fig. 6). The cases that I simulated cover a broad range of possible deviations from the simple island model, with large variation in demographic parameters over space and time, high growth rates of some populations and not others, and geographical isolation by distance (sampling every third deme). In all cases, however, the simulated distribution of QST is quantitatively very similar to the χ2 distribution expected by the Lewontin–Krakauer result. In particular, the critical values demarcating 5% in the extreme tails of the distribution are reasonably accurate. As a result, the expected distribution of neutral QST can be relatively reliably inferred from the mean value of FST calculated from several neutral loci. To be precise, the inference is that for neutral traits $(n_{demes} - 1)Q_{ST}/\bar{F}_{ST}$ has an approximately χ2 distribution with (ndemes – 1) degrees of freedom.

Figure 6.

Comparing a single QST value to the distribution of FST. The QST value for Trait 1 is greater than the mean FST (= 0.0475), but it is not unrepresentative of the distribution of FST values. It would be quite likely to generate a QST value like QST,1 from a neutral trait. Trait 2 has a QST value that is in the tail of the distribution of FST; this trait has a QST value that would be very unusual for a neutral trait.
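The comparison illustrated in Fig. 6 can be sketched by asking how far into the upper tail of the Lewontin–Krakauer χ2 approximation a single QST value falls. This is a direct use of the approximation and should not be read as the comparison method proposed in this paper; in particular it ignores sampling error in both QST and the mean FST. The mean FST of 0.0475 is taken from Fig. 6, whereas the number of demes (10) and the two QST values are hypothetical. The χ2 tail probability is computed with a standard series for the incomplete gamma function so that no external library is needed.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variate with df d.f."""
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    # Series for the regularized lower incomplete gamma function P(a, z):
    # P(a, z) = z**a * exp(-z) / Gamma(a) * sum_n z**n / (a * (a+1) * ... * (a+n))
    term = 1.0 / a
    total = term
    n = 1
    while term > total * 1e-15:
        term *= z / (a + n)
        total += term
        n += 1
    cdf = total * math.exp(a * math.log(z) - z - math.lgamma(a))
    return max(0.0, 1.0 - cdf)


def qst_upper_tail_p(qst, mean_fst, n_demes):
    """P(neutral QST >= observed qst) under the chi-square approximation."""
    df = n_demes - 1
    return chi2_sf(df * qst / mean_fst, df)


mean_fst, n_demes = 0.0475, 10                     # n_demes is an assumption
for name, q in [("Trait 1", 0.07), ("Trait 2", 0.15)]:   # hypothetical QST values
    p = qst_upper_tail_p(q, mean_fst, n_demes)
    print(f"{name}: QST = {q:.2f}, upper-tail probability = {p:.4f}")
```

Both hypothetical traits have QST above the mean FST, but only the second lies far enough into the tail to be unusual for a neutral trait.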

This distribution does not account for sampling error in the estimation of QST, however; and QST is very difficult to measure precisely (O’Hara & Merilä 2005). Various methods have been proposed to estimate the error of QST estimates, and many fail to capture the true heterogeneity of the samples (O’Hara & Merilä 2005). Precision is improved by sampling more demes and more families per deme, and by carefully controlling rearing conditions for the experimental organisms.

Sources of bias in QST estimation

Unfortunately, there are several biases which may affect QST estimation. Most of these are avoidable, but the fixes are labour-intensive. Many published studies on QST are regrettably not done at a standard that prevents these problems.

One difficult issue is that traits chosen for QST analysis are often traits already thought to be under spatially varying selection (J. R. Miller, D. E. McCauley, personal communication). This can, under some circumstances, introduce an ascertainment bias to the measurement of spatially heterogeneous selection. If only traits that are known to be spatially variable are analysed, then a statistical model that assumes a random selection of traits will be meaningless. The extent of the bias depends on the question being asked and the specific manner in which the traits are chosen. If a trait is chosen for QST analysis based on the observation that it seems to be geographically variable, then the statistical comparison of its QST to FST cannot reliably tell us whether selection has caused the divergence. Traits chosen in this way will almost automatically have higher QST than average; and without knowledge of the specific process of choosing traits, we cannot correct for this bias. In cases when an average QST is calculated to describe the species as a whole, this ascertainment bias is particularly important. Because the distribution of QST of neutral traits is expected to be so broad, it will always be possible to choose a set of traits that have higher than average QST values. Traits chosen in this way cannot reliably be used to infer the extent of spatially heterogeneous selection. Examination of the traits chosen for many QST studies makes one wonder whether traits are in fact always chosen without previous knowledge of the likely results.

However, if a trait is chosen because it is known to be under geographically variable selection in another species or if alternative information is used to predict that the trait is subject to spatially heterogeneous selection, then it is appropriate to test for spatially heterogeneous selection via QST.

Most additional biases associated with QST are well known from classical quantitative genetics. Estimating QST requires unbiased estimates of both the additive genetic variance within populations and the genetic variance among populations. Any positive bias in the additive genetic variance (VA,within, which appears only in the denominator) will cause QST to be biased downwards. For example, in some studies the total phenotypic variance (VP) within a population is used instead of the additive genetic variance in the denominator. [This quantity is sometimes called PST (e.g. Storz 2002; Baruch et al. 2004; Streisfeld & Kohn 2005; Leinonen et al. 2006; Raeymaekers et al. 2007); but unfortunately, some authors confusingly also refer to this as QST.] The resulting estimate is lower than QST:

$$P_{ST} = \frac{V_{G,\mathrm{among}}}{V_{G,\mathrm{among}} + 2V_{P}} = \frac{V_{G,\mathrm{among}}}{V_{G,\mathrm{among}} + 2V_{A,\mathrm{within}}/h^{2}}$$

where h2 is the narrow sense heritability (VA,within/VP) of the trait being measured. Given that heritabilities are often less than a half, this is a substantial source of downward bias if PST is interpreted as QST.
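The size of this downward bias is easy to see with a worked example. The sketch below assumes the diploid form of QST and takes PST to be the same expression with the phenotypic variance VP = VA,within / h2 in place of VA,within. The variance components and the heritability are hypothetical values chosen only for illustration, and the function names are not from any published software.

```python
def qst_diploid(v_g_among, v_a_within):
    """QST for diploids: V_G,among / (V_G,among + 2 * V_A,within)."""
    return v_g_among / (v_g_among + 2.0 * v_a_within)


def pst_diploid(v_g_among, v_p_within):
    """PST: as QST, but with the phenotypic variance in the denominator."""
    return v_g_among / (v_g_among + 2.0 * v_p_within)


# Hypothetical values, for illustration only.
v_g_among = 1.0      # genetic variance among populations
v_a_within = 2.0     # additive genetic variance within populations
h2 = 0.4             # narrow-sense heritability, V_A,within / V_P

v_p_within = v_a_within / h2   # implied phenotypic variance within populations

q = qst_diploid(v_g_among, v_a_within)
p = pst_diploid(v_g_among, v_p_within)
print(f"QST = {q:.3f}")
print(f"PST = {p:.3f}")
print(f"PST / QST = {p / q:.2f}")
```

With a heritability of 0.4, PST in this example is less than half of QST, so a trait whose true QST exceeds neutral FST could easily appear unremarkable if PST were interpreted as QST.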

The additive genetic variance within populations has to be estimated with a breeding design that allows the phenotype to be correlated with the relatedness of individuals. Many studies use full-sib correlations, which are biased upwards if there is any dominance variance in the population (Falconer & Mackay 1996). Dominance variance is less often measured than additive variance, but reviews have shown that dominance variance is comparable in magnitude to the additive genetic variance (Crnokrak & Roff 1995). It is rare in QST studies to account for maternal effects, which can upwardly bias estimates of VA,within in parent–offspring regression studies. Common environmental effects of course also have to be experimentally controlled, or the estimate of VA,within will also be biased upwards. Given that most experimental biases will artificially increase the estimate of VA,within, these factors are likely to cause QST estimates to be too low.

The genetic variance among populations is also challenging to measure accurately. The phenotypic variation among populations is potentially affected both by differences in the mean genotypes of the populations and by plastic responses to varying environmental cues. It is important that the measurements of VG,among used for QST estimates include only the genetic differences, because if we are to compare the resulting QST to FST the null expectation of equality assumes that only genetic effects were measured. As a result, individuals from all populations must be raised in a ‘common garden’ where the same environmental conditions are applied to individuals from all populations being measured.

In some ‘QST’ studies, individuals are measured after being raised under their native local conditions (e.g. Storz 2002; Baruch et al. 2004; Streisfeld & Kohn 2005; Leinonen et al. 2006; Raeymaekers et al. 2007). This experimental design confounds the effects of plasticity with the genetic variance among populations, but the effects are complicated to predict. The most obvious effect of such a design would be the possibility that local environmental cues cause more divergence in mean phenotypes among populations via phenotypic plasticity. In such a case, the measured QST would be an overestimate of the true value.

Common-garden experiments are not perfect either, however. It is possible that plastic responses to a habitat are part of the mechanism for local adaptation. In that case, the genetic divergence among populations may not be properly expressed in a common-garden setting. For example, maternal effects in moor frogs contribute to local adaptation (Räsänen et al. 2003). If there is plasticity for the trait under study, then the trait may respond unpredictably to the common-garden environment. A typical common-garden experiment involves growing the organisms in the lab or greenhouse, which may not evoke typical development. Locally evolved plasticity will be missed (see above), and the response to the novel environment may prove to be so idiosyncratic that the results are not easy to interpret. A common (but certainly not universal) finding is that genetic variance is increased in novel environments such as the lab (Hoffmann & Parsons 1991; Hoffmann & Merilä 1999); these increases are not predictable and may lead to biased estimates of QST.

Moreover, despite our best wishes to the contrary, lab environments sometimes impose quite severe selection on wild-caught organisms. This selection can potentially change gene frequencies substantially and in the short term. This effect is likely to reduce real differences between populations, as selection in the lab shifts all populations towards the optimum for the new conditions. This artefact can be ruled out if the mortality in the lab is small.

Methods of testing QST vs. FST

QST has been used to answer two types of questions, which are not mutually exclusive: (i) to ask whether particular traits are under spatially divergent (or spatially uniform) selection; and (ii) to ask whether a series of populations relates to their environment in such a way as to produce local adaptation in general. For the first type of question, we are interested in asking whether a particular trait has local adaptation, while for the second type of question, we may wish to know whether traits on average adapt to local conditions. These two questions, while related, require quite distinct statistical methods. Consider the second type of question first, as it is simpler.

Mean QST vs. mean FST

Some studies have measured whether the mean QST of several traits is greater than the mean FST of several loci. It is not always clear what this sort of mean QST might tell us, but it can be viewed as a measure of the overall importance of local adaptation in the species. For example, in conservation genetics, if a species that lives in a subdivided habitat has high QST values, it may be more important to account for local adaptation in making conservation plans.

To ask whether a species is significantly affected by local selection in general, we might compare the mean FST of putatively neutral independent loci to the mean QST of a series of independent quantitative traits. To do so requires no special statistical techniques, because under the null hypothesis of no effect of selection the distributions of FST and QST should be nearly identical. As a result, a Mann–Whitney test (or even perhaps a two-sample t-test) is appropriate, treating each locus and each trait as a replicate.
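As an illustration of such a comparison, the sketch below computes a Mann–Whitney U statistic with a normal-approximation P-value, treating each trait and each locus as a replicate. It is a rough sketch only: the function names and the data are hypothetical, and no tie or continuity correction is applied, so a statistics package would be preferable for real analyses.

```python
import math


def midranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney(x, y):
    """U statistic for x and a two-sided P-value (normal approximation)."""
    n1, n2 = len(x), len(y)
    ranks = midranks(list(x) + list(y))
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean_u) / sd_u
    return u, math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical per-trait QST and per-locus FST estimates.
qst_traits = [0.21, 0.15, 0.30, 0.13, 0.25, 0.18]
fst_loci = [0.05, 0.08, 0.11, 0.07, 0.09, 0.12, 0.06, 0.10]
u, p = mann_whitney(qst_traits, fst_loci)
print(f"U = {u}, two-sided P = {p:.4f}")
```

The test is only as good as its inputs: it assumes that the traits and loci are independent replicates and that the QST and FST estimates are unbiased.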

There are two problems with this approach. First, if either mean QST or FST is estimated with bias, then of course the comparison is meaningless. Sources of bias in QST were reviewed in the last section; of particular concern is the bias caused by selection of traits thought a priori to be under spatially divergent selection.

Another problem with a comparison of mean QST and FST is that quantitative traits normally are not independent of each other, which means that they are not the random sample assumed by simple statistical methods. This nonindependence arises because of pleiotropy and linkage of the genes affecting these traits. This issue may be minimized by choosing traits that have low genetic correlation. It is tempting to believe that using QST calculated on principal component scores could solve this problem; the difficulty is that the phenotypic values of traits may have a different correlation structure than the genetic components, and that the genetic components within populations may be correlated in a different way than the genetic components among populations. Finding a transformation that leaves QST measures independent for all traits is a challenging task and a fruitful area for future exploration.

QST of single traits

When we are studying the evolution of particular traits, it may be interesting to know whether those traits are locally adapted. For a single trait, the question becomes: ‘is the QST of this trait significantly greater than (or less than) we might expect for a trait evolving neutrally’? In practical terms, this means comparing the QST of the trait to the distribution of FST of putatively neutral markers. However, even neutral loci are very heterogeneous in their FSTs (Figs 1 and 5, left sides), and QSTs of neutral traits are expected to be just as variable (Rogers & Harpending 1983; Figs 1 and 5, right sides).

As a result, if the QST of a trait is greater than the mean FST of neutral loci, this cannot be interpreted as sufficient evidence for local adaptation in that trait. We expect nearly half of all neutral traits to have a QST value that is greater than the mean FST. Instead, what is important is to ask whether the QST for a trait is unlikely to have arisen from the distribution of neutral FST values. This implies that we need to ask whether the QST of a particular trait is in the tail of the distribution of possible values, under the null expectation of neutrality (Fig. 6).

To do so requires an unbiased estimate of QST of the trait, information about the sampling properties of QST, and an estimate of the distribution of neutral FST values. There are two known ways to estimate the distribution of FST. First, the distribution of FST can be empirically determined, by measuring FST at enough loci (> 50) to generate a good description of the true distribution. Alternatively, the distribution of FST can be predicted theoretically, either by simulation or by the χ² distribution predicted by the Lewontin–Krakauer test and discussed above.

The most straightforward procedure, on the surface, is the direct empirical approach; but this approach leads to two problems. First, the number of loci required exceeds the scale of most QST studies by an order of magnitude. This problem is surmountable, especially as genotyping becomes less expensive with improving technology. The second problem is more difficult, because the distribution of FST observed empirically includes variation due to the sampling error in estimating each FST value. Given that the sampling error for FST and QST is quite different, even when calculated on similar sample sizes, we must account for this sampling error in predicting the true distribution of FST. This is a potentially difficult problem that has not yet been solved.

The alternative approach generates the expected distribution of a neutral FST from the χ² distribution, which requires a reasonably good estimate of the mean FST of the system. As we have seen above, this distribution predicts the distribution of FST reasonably well for most biologically plausible scenarios, even for metapopulations that do not match the assumptions of the derivation well. Currently, it is not clear how many loci are required to get a sufficiently good estimate of FST for this approach to work reliably, but it must be far fewer than required by the direct approach.

If the value of QST for a particular trait fell into the tail of the predicted distribution of FST, we could infer that that trait is likely to be under selection. Unfortunately, QST is never estimated without error, and usually the error in estimating QST is relatively large. To account for the sampling error in QST, a bootstrapping approach may be the most successful means of making this comparison. The appropriate method would be, for each bootstrap replicate, to resample loci to generate the mean FST and to resample populations, families and individuals within families to generate the quantitative genetic data. For each bootstrap replicate, the mean FST value should be calculated from the neutral loci sampled, and from that the predicted χ² distribution of FST can be calculated from the Lewontin–Krakauer approach. A QST bootstrap replicate can be calculated from the resampled quantitative genetic data, and we record the probability of getting a QST value that large or larger under the predicted FST distribution (the ‘tail probability’). The resulting bootstrap distribution gives the probability distribution of this tail probability of QST. The average tail probability of the bootstrap replicates gives the one-tailed P-value of the test of neutrality for that trait. However, O’Hara & Merilä (2005) have shown that nonparametric bootstrapping for confidence intervals of QST gives exaggerated estimates of the confidence we should place in an estimate, particularly if the number of demes is small. The results of the proposed method should therefore be taken with some caution until tested; a parametric bootstrap or simulation approach may be better (O’Hara & Merilä 2005). Further testing by simulation is necessary to ensure that this method works adequately.
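The tail-probability step of this procedure might be sketched as follows. This is an illustration under stated assumptions, not a tested implementation: the QST bootstrap replicates are assumed to have been computed elsewhere (by resampling populations, families and individuals), the neutral distribution is taken to be the Lewontin–Krakauer prediction that (n − 1)FST/mean(FST) follows a χ² distribution with n − 1 degrees of freedom for n populations, and the function names and input values are hypothetical.

```python
import math
import random


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable (series for the lower
    regularized incomplete gamma function; adequate for modest x and df)."""
    if x <= 0.0:
        return 1.0
    a, half_x = df / 2.0, x / 2.0
    term = total = 1.0 / a
    n = a
    while abs(term) > 1e-15 * abs(total):
        n += 1.0
        term *= half_x / n
        total += term
    lower = total * math.exp(-half_x + a * math.log(half_x) - math.lgamma(a))
    return max(0.0, 1.0 - lower)


def qst_tail_p(fst_loci, qst_boot, n_pops, rng):
    """Average tail probability of QST under the predicted FST distribution."""
    df = n_pops - 1
    tail_probs = []
    for qst_rep in qst_boot:
        # Resample neutral loci with replacement for this replicate's mean FST
        # (assumes the resampled mean is greater than zero).
        loci = [rng.choice(fst_loci) for _ in fst_loci]
        mean_fst = sum(loci) / len(loci)
        # Probability of a neutral value this large or larger.
        tail_probs.append(chi2_sf(df * qst_rep / mean_fst, df))
    return sum(tail_probs) / len(tail_probs)


# Hypothetical inputs: per-locus FST estimates and precomputed QST replicates.
fst_loci = [0.05, 0.08, 0.11, 0.07, 0.09, 0.12, 0.06, 0.10]
qst_boot = [0.35, 0.42, 0.28, 0.39, 0.31]
print(qst_tail_p(fst_loci, qst_boot, n_pops=10, rng=random.Random(1)))
```

In practice the number of bootstrap replicates would be far larger than the five shown, and the caveats about nonparametric bootstrapping with few demes apply equally to this sketch.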

Identifying traits that have QST significantly in the tail of the distribution will be extremely difficult, or even impossible, for studies that consider only a few populations. For example, with two populations, the critical value of the distribution of FST associated with a two-tailed test of QST with 95% confidence is 5.02 times the mean FST value, even with no uncertainty in the estimates of FST and QST per locus. With an FST of 0.1, a trait would have to have a QST value greater than 0.5, even with perfect measures, to be considered to be in the tail of the distribution of FST. With 10 populations, this critical value is only about twice the mean FST value, although with realistic error in estimating these values, the critical value could be much higher.
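These critical values can be checked with a short calculation. The sketch below assumes, following the Lewontin–Krakauer prediction, that (n − 1)FST/mean(FST) follows a χ² distribution with n − 1 degrees of freedom for n populations, and takes the upper 2.5% point as the critical value. The function names are hypothetical, and the χ² quantile is obtained by a simple series and bisection rather than a library routine.

```python
import math


def chi2_cdf(x, df):
    """Chi-square CDF via the series for the regularized incomplete gamma."""
    if x <= 0.0:
        return 0.0
    a, half_x = df / 2.0, x / 2.0
    term = total = 1.0 / a
    n = a
    while abs(term) > 1e-15 * abs(total):
        n += 1.0
        term *= half_x / n
        total += term
    return total * math.exp(-half_x + a * math.log(half_x) - math.lgamma(a))


def chi2_quantile(p, df):
    """Invert the CDF by bisection."""
    lo, hi = 0.0, 1.0
    while chi2_cdf(hi, df) < p:
        hi *= 2.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi2_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def critical_ratio(n_pops, upper_p=0.975):
    """Critical value as a multiple of mean FST for n_pops populations."""
    df = n_pops - 1
    return chi2_quantile(upper_p, df) / df


for n_pops in (2, 10):
    print(n_pops, round(critical_ratio(n_pops), 2))
```

For 2 and 10 populations this should give ratios of approximately 5.02 and 2.11, matching the values quoted in the text; the calculation ignores all estimation error in FST and QST.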

It is important to note that QST has all of the same problems that plague other techniques for the estimation of natural selection in the wild (see, e.g. Endler 1986 for a discussion of these difficulties). In particular, if spatially divergent selection is acting on a trait correlated to the measured trait, then a neutral trait may have a high QST value simply because of this correlation.

Finally, QST may be used in an exploratory manner, and indeed it has been suggested that this is its main function (Leinonen et al. 2008). If the QSTs of several traits are measured, then the traits with the highest QST values may be good candidates for further study measuring genetic differentiation by selection. If the data are used to provoke ideas rather than to test a priori hypotheses, then the difficult statistical properties of QST become less important. Traits with high QST values are more likely to be under spatially diversifying selection than traits with low QST values. Such a comparative, exploratory approach does not absolutely require much information about FST, because the candidate traits can be compared relative to other traits.


While useful, QST is a crude measure of the amount of genetic differentiation of a trait caused by local adaptation. It is difficult to estimate accurately and precisely, and its value depends on comparison with another statistically difficult quantity, FST. However, in some circumstances the comparison of FST and QST is a valuable technique that allows us to examine the null hypothesis of neutral divergence among populations.

If a random selection of traits on average has QST values greater than the mean FST value, then we might infer that the sampled populations experience geographically variable selection sufficiently strong relative to migration so that local adaptation is possible. Thus, genetic variation for those traits is structured geographically, and spatial population structure is an important part of the maintenance of genetic variation in that system.

If the QST of a particular trait is sufficiently large relative to the distribution of FST, we may similarly infer (with some caveats) that selection acting on that trait is likely to be geographically heterogeneous. Having reached that conclusion, the next steps may include finding the mechanistic basis of that selection, including the environmental axes that control the variation in selection pressure. QST should be only a beginning for these trait-specific studies; QST may be a good screen to discover traits that experience spatially heterogeneous selection, but to understand the nature of that selection requires other techniques, including correlation of environmental features with morphology, reciprocal transplants, biomechanical or physiological studies, local selection measures à la Lande & Arnold (1983), etc.

Alternative techniques for studying heterogeneous selection offer other kinds of information that we cannot get from QST. For example, we may identify that the mean phenotype (for individuals grown in a common garden) varies systematically with an environmental measure; for example, root-to-shoot ratio may be higher in plants from relatively arid areas. If we find such correlations in the data (provided that we have suitably corrected for multiple comparisons), then we learn not only that the pattern is caused by selection but also what the nature of that selection might be. The value of QST compared to FST is that it can eliminate drift as an explanation for divergence, but drift also can be ruled out when there are systematic differences among populations. A correlation of a trait with an environmental gradient, if measured over spatially independent replicates, provides such evidence of a systematic pattern.

Similarly, reciprocal transplants can tell us not just that there is spatially heterogeneous selection, but they can begin to quantify the intensity of that selection, unlike current inference based on QST. If reciprocal transplants are done in concert with a QST study, then we would be able to measure the effects of plasticity on the expression of the traits under study.

The main value of QST may be as a screen allowing us to identify promising traits for further study. If this is the goal, then it may be worth considering the less labour-intensive approach based on PST. PST, based on measuring phenotypes grown in the common garden, gives a value that is biased downwards relative to QST and FST. However, this bias can be partially corrected if we have sufficient information about the heritability of the trait. PST can be measured more rapidly than QST with less expense, and, as such, may be suitable for use as an exploratory technique to screen traits to find those that are most likely to be under heterogeneous selection. However, this approach would be error-prone because of the biases of PST, and therefore it would only be suitable as a first screen. Errors in heritability estimates (either due to sampling error or to differences in heritability among populations) or plastic response differences among populations in nature could significantly bias PST away from the expected QST.
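The partial correction mentioned here can be written out explicitly. The derivation below is a sketch that assumes the usual definition of QST, a PST that differs only by substituting VP = VA,within/h² for VA,within, a heritability h² that is known without error and equal across populations, and no environmental contribution to the among-population variance.

```latex
\begin{align}
Q_{ST} &= \frac{V_{G,among}}{V_{G,among} + 2V_{A,within}}, &
P_{ST} &= \frac{V_{G,among}}{V_{G,among} + 2V_{A,within}/h^2} \\
\intertext{Solving the expression for $P_{ST}$ for $V_{G,among}$ gives}
V_{G,among} &= \frac{2V_{A,within}\,P_{ST}}{h^2\,(1 - P_{ST})}, \\
\intertext{and substituting into $Q_{ST}$ yields}
Q_{ST} &= \frac{P_{ST}}{P_{ST} + h^2\,(1 - P_{ST})}.
\end{align}
```

For example, with h² = 0.5 a measured PST of 0.2 corresponds to a QST of 0.2/(0.2 + 0.4) = 1/3. Any error in h² propagates directly into the corrected value, which is one reason the approach is suitable only as a first screen.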

QST is a useful quantity, but inferring selection by comparing QST to FST requires careful control of potential sources of bias and large sample sizes. Most contemporary studies of QST do not reach these standards.


This paper was inspired by the European Science Foundation workshop on Adaptive vs. Neutral Genetic Variability in Conservation Genetics, hosted by the University of Helsinki. My thanks go to all of the participants at that workshop for interesting discussions on QST. Fred Guillaume, Sally Otto, Bob O’Hara and three anonymous reviewers have all kindly offered very useful comments on drafts of this paper. This work was funded by a grant from the Natural Sciences and Engineering Research Council (Canada) and a sabbatical leave at the National Evolutionary Synthesis Center (NSF #EF-0423641).

Michael C. Whitlock studies the effects of population subdivision on the evolutionary properties of populations.