Ever since the paper of Lewontin & Krakauer (1973), there has been debate about the ability of surveys of genetic loci to tell us about the action of natural selection. Since that time, our understanding of the effect of natural selection on gene genealogies has increased (Kaplan et al. 1989; McVean 2007), leading to the development of methods of increasing sophistication and power (Nielsen et al. 2005). Many of these methods require a fine-scale linkage or physical map in which the power to detect selection is increased through patterns of disequilibria. Typically, they are based on the idea of a ‘selective sweep’ in which a novel mutation increases rapidly in frequency, leading to reduced variability of linked markers. Other approaches in population genetics can trace their ancestry back to that of the original Lewontin–Krakauer test, and are based on the assumption that local selection will act as if to reduce the immigration rate of nonadaptive alleles (Petry 1983; Charlesworth et al. 1997), and thereby increase genetic differentiation among populations. This is much more of an ‘equilibrium’ view of selection in which the populations are held in migration–selection–drift balance. The study by Mäkinen et al. (2008a) uses a recent version of this type of test (Beaumont & Balding 2004) and another test that compares levels of gene diversity in pairs of populations (Kauer et al. 2003). It is interesting that these two types of test encapsulate the two rather different views of selection described above.
A useful framework for interpreting such studies is provided by the many-demes approximation (Wakeley 1999). The idea here is that in the genealogy of a sample of genes from a metapopulation, there are only two possible types of event: either a pair of genes share a common ancestor within the deme (coalesce) or a gene is an immigrant (i.e. migrates out of the deme, backwards in time). Events occur until all genes have migrated out. It is then possible to show that the genealogy of all these migrant genes is exactly the same as for the standard coalescent in a single population. The probability that two genes coalesce within a deme is FST. FST and the scaled mutation rate in the metapopulation genealogy are the only parameters that determine the allele frequency distribution.
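The statement that FST is the probability that two genes coalesce within a deme can be illustrated with a small numerical sketch. It is not taken from Wakeley (1999); it assumes a continuous-time approximation in which a pair of lineages coalesces at rate 1/(2N) and either lineage migrates out at rate m, so that the pair leaves the deme at rate 2m. The function names and the parameters `n_diploid` and `m` are hypothetical labels chosen for illustration.

```python
import random


def prob_coalesce_within_deme(n_diploid, m):
    """Analytic probability that two lineages coalesce before either migrates.

    Assumes competing exponential waiting times: coalescence at rate
    1/(2N) and migration of either lineage at total rate 2m.
    """
    coal_rate = 1.0 / (2.0 * n_diploid)
    mig_rate = 2.0 * m
    return coal_rate / (coal_rate + mig_rate)


def simulate_pair(n_diploid, m, reps=100_000, seed=1):
    """Monte Carlo estimate of the same probability.

    For each replicate, draw a coalescence time and a migration time and
    record which event happens first (backwards in time).
    """
    rng = random.Random(seed)
    coal_rate = 1.0 / (2.0 * n_diploid)
    mig_rate = 2.0 * m
    coalesced_first = 0
    for _ in range(reps):
        t_coal = rng.expovariate(coal_rate)
        t_mig = rng.expovariate(mig_rate)
        if t_coal < t_mig:
            coalesced_first += 1
    return coalesced_first / reps


if __name__ == "__main__":
    n_diploid, m = 100, 0.01
    print("analytic :", prob_coalesce_within_deme(n_diploid, m))
    print("simulated:", simulate_pair(n_diploid, m))
```

Under these assumptions the analytic value simplifies to 1/(1 + 4Nm), and the simulated proportion should approach it as the number of replicates grows.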
However, we can be more general than this: if we know the metapopulation gene frequencies exactly, then it is possible to show that the genealogical process described above will lead to a particular formula, a multinomial-Dirichlet distribution, for the allele frequency distribution within the demes (Balding & Nichols 1995). Interestingly, this formula can also be obtained from the diffusion theory of Sewall Wright (Rannala & Hartigan 1996). This formula underlies the method by Beaumont & Balding (2004), used in the present paper (see also Riebler et al. 2008). It is also used in other contexts for the analysis of population structure (e.g. Ciofi et al. 1999; Holsinger et al. 2002; Foll & Gaggiotti 2006).
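As a rough illustration of the multinomial-Dirichlet formula, the sketch below evaluates the log-probability of a vector of allele counts in one deme, given the metapopulation frequencies and FST. It assumes the common parameterization in which the Dirichlet parameters are λp_i with λ = 1/FST − 1; the function name and argument names are hypothetical, and this is not the implementation used by any of the cited methods.

```python
import math


def multinomial_dirichlet_logpmf(counts, freqs, fst):
    """Log-probability of allele counts in a deme.

    counts : observed allele counts in the deme sample
    freqs  : metapopulation allele frequencies (all > 0, summing to 1)
    fst    : FST for the deme, strictly between 0 and 1

    Assumes Dirichlet parameters lam * p_i with lam = 1/fst - 1.
    """
    if not 0.0 < fst < 1.0:
        raise ValueError("fst must lie strictly between 0 and 1")
    if len(counts) != len(freqs):
        raise ValueError("counts and freqs must have the same length")
    lam = 1.0 / fst - 1.0
    n = sum(counts)
    # Multinomial coefficient numerator and the ratio Gamma(lam)/Gamma(n+lam).
    logp = math.lgamma(n + 1) + math.lgamma(lam) - math.lgamma(n + lam)
    for x, p in zip(counts, freqs):
        a = lam * p
        logp += math.lgamma(x + a) - math.lgamma(a) - math.lgamma(x + 1)
    return logp


if __name__ == "__main__":
    # Two genes sampled at a biallelic locus with frequencies 0.5/0.5.
    for counts in [(2, 0), (1, 1), (0, 2)]:
        prob = math.exp(multinomial_dirichlet_logpmf(counts, (0.5, 0.5), 0.2))
        print(counts, round(prob, 4))
```

With a sample of two genes, the probability that both carry allele i works out to p_i[FST + (1 − FST)p_i] under this parameterization, which matches the reading of FST as a probability of identity by descent within the deme.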
As noted by Nei & Maruyama (1975), variability in mutation rate among markers may lead to variability in FST and therefore increase the false-positive rate. In the many-demes setting, the key assumption is that there are only migration and coalescent events within demes; genetic variability arises at the metapopulation level. It is not possible to introduce a within-demes mutation rate in the many-demes setting because this would imply an infinite mutation rate at the metapopulation level. However, the continent-island model, in which all the lineages are drawn from a single gene pool, is conceptually very similar to the many-demes model, and gives the same multinomial-Dirichlet formula. In this framework, it is possible to define an extra mutation rate term (Nichols & Freeman 2004) in which FST = 1/(1 + 4Nm + 4Nµ). If the stationary distribution of the mutation process is known, then it is straightforward to jointly infer the variation in mutation rate among loci and the migration rate (Nichols & Freeman 2004). If µ << m, then ignoring the mutation rate has little effect. However, if µ is large for a particular locus, and ignored, then estimates of FST for that locus will tend to be depressed.
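The effect of the extra mutation term can be seen by evaluating FST = 1/(1 + 4Nm + 4Nµ) for a few mutation rates. The sketch below is only a direct transcription of that expression; the function name and the numerical values of N, m and µ are arbitrary choices for illustration, not estimates from any study.

```python
def fst_with_mutation(n_diploid, m, mu=0.0):
    """FST = 1 / (1 + 4Nm + 4N*mu) in the continent-island model."""
    return 1.0 / (1.0 + 4.0 * n_diploid * m + 4.0 * n_diploid * mu)


if __name__ == "__main__":
    n_diploid, m = 100, 0.01
    # Compare a negligible mutation rate with one comparable to m.
    for mu in (0.0, 1e-6, 1e-4, 1e-2):
        print(f"mu={mu:g}  FST={fst_with_mutation(n_diploid, m, mu):.4f}")
```

With these illustrative values, a mutation rate far below m leaves FST essentially unchanged, whereas a mutation rate equal to m lowers it from 1/5 to 1/9.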
Note that this does not imply any straightforward relationship between total genetic variability and FST because the amount of total variability depends jointly on the mutation rate, the stationary distribution of the mutation process, and the frequency distribution in the continent. It is by no means inevitable that loci with many alleles should have low FST. Hedrick (2005) notes that FST within a population cannot be greater than the homozygosity within the population. This has been interpreted to imply that highly variable loci must necessarily have low FST. Hedrick's argument applies to homozygosity within a population. However, the many-demes approximation predicts that there is no relationship between FST and genetic diversity between populations. Indeed, as the diversity between populations increases, FST (probability of identity by descent) and homozygosity within populations (probability of identity by state) become the same.
Where is this tortuously convoluted argument leading us? Well, simply to point out that the same, or very similar, likelihood formulae have been used to identify variability in mutation rate among loci (Nichols & Freeman 2004), and to identify putative signatures of balancing or local selection (Beaumont & Balding 2004; Riebler et al. 2008), but not both jointly. That is, under one modelling construction, you view ‘outliers’ to have an unusual mutation rate, and under the other, you view them to be under selection. Does this matter? It is unlikely to vitiate the conclusions drawn by Mäkinen et al. (2008a) concerning directional selection because the highly divergent microsatellites are inferred to have similar levels of FST to the indel markers, which presumably are unique events. Given the known relationship of the Eda gene to morphological differentiation, it is reasonable to discount the hypothesis that these give the ‘background’ FST, while the remainder are affected by high mutation rate. However, as Mäkinen et al. (2008a) rightly point out, it is possible that high mutation rate may be responsible for the inference of balancing selection. Clearly, further research is needed to tease out more conclusively the differing effects on genetic differentiation at microsatellite loci.
When surveying many genetic markers, care must be taken to filter out the false positives that necessarily occur when performing a large number of statistical tests, a problem that may otherwise damage the credibility of such studies. A nice feature of the current paper is that it tries hard to be cautious about levels of significance, and works with a false discovery rate of 5%. It is worth noting that the Bayesian P values used in the method of Beaumont & Balding (2004) are somewhat problematic with a view to determining the false-positive rate. They are not posterior probabilities that a locus is under selection, yet neither can they be viewed as frequentist P values [the probability of obtaining as extreme or more extreme a value of some statistic if the null hypothesis (of no selection) were true]. The recent study by Riebler et al. (2008) extends the method of Beaumont & Balding (2004) to directly compute the posterior probability that a locus is under selection, which may be easier to interpret. The simulation experiments in Beaumont & Balding (2004) indicate that the 0.05 level had a false-positive rate of 0.003, which would make the conclusions of Mäkinen et al. (2008a) more conservative.
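For readers unfamiliar with false discovery rate control, the sketch below shows one widely used procedure, the Benjamini-Hochberg step-up rule, applied to a list of per-locus P values. This is an assumption made for illustration: the text above does not say which procedure was used, and, as noted, the Bayesian P values are not strictly frequentist P values, so applying such a rule to them is only approximate. The function name and the example P values are hypothetical.

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Return a list of booleans marking which tests are declared significant.

    Step-up rule: sort the P values, find the largest rank k such that
    p_(k) <= q * k / m, and reject the k smallest P values.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= q * rank / m:
            k_max = rank
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


if __name__ == "__main__":
    # Made-up P values for ten loci, purely for illustration.
    pvals = [0.001, 0.008, 0.039, 0.041, 0.042,
             0.060, 0.074, 0.205, 0.212, 0.216]
    print(benjamini_hochberg(pvals, q=0.05))
```

In this made-up example only the two smallest P values pass the 5% threshold, even though five of the ten fall below 0.05 individually.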
Looking to the future, it would be nice to get a better understanding of how far the regions of increased differentiation extend around a locally selected gene. The approximation of Petry (1983) is typically used, but simulations suggest that it is not particularly accurate, and tends to predict that the regions of increased differentiation can extend over a centimorgan or more (Charlesworth et al. 1997). However, the current paper shows that only the markers within the intronic region of the Eda gene are strongly differentiated, and a microsatellite 1.5 cM away showed an FST level similar to the background. Mäkinen et al. (2008b) describe a more detailed survey of microsatellites around the Stn90 locus that indicates a small region of increased differentiation, extending 20–90 kb, depending on the test used. A similar narrow region is reported in the recent study by Wood et al. (2008), and further findings are discussed in Mäkinen et al. (2008a). The narrow range makes detection more difficult and raises the question of why so many studies appear to show the effects of local selection. Perhaps the false-positive rate is higher than believed because of demographic or mutational effects. Perhaps selection is indeed rampant throughout the genome. In any case, this observation then raises the question of what one can do with these markers, once identified. It seems doubtful that construction of linkage maps with anything other than thousands of markers is going to be particularly useful for interpreting the results because the regions of differentiation are so narrow. For nonmodel organisms, the construction of bacterial artificial chromosome (BAC) libraries may be a useful approach, combined with 454 sequencing. If indeed the region of differentiation extends over only a few kilobases, as suggested by the study of Wood et al. (2008), it may be relatively straightforward to identify genes involved in the adaptation, and this may become more routine in future.