This paper reviews how statistical tests of neutrality have been used to address questions in molecular ecology are reviewed. The work consists of four major parts: a brief review of the current status of the neutral theory; a review of several particularly interesting examples of how statistical tests of neutrality have led to insight into ecological problems; a brief discussion of the pitfalls of assuming a strictly neutral model if it is false; and a discussion of some of the opportunities and problems that molecular ecologists face when using neutrality tests to study natural selection.
The fields of evolutionary biology and population genetics have been dominated by molecular studies for the last quarter century (Lewontin 1974; 1991). Over much of that time, there has been a persistent debate about whether natural selection or random drift is the dominant force in molecular evolution (Kimura 1983; Gillespie 1991). In recent years, considerable progress has been made in the ability to detect natural selection from patterns of DNA sequence variation, and the ‘selectionist/neutralist debate’ has matured into an effort to estimate the distribution of selective effects on genetic variation (Kreitman 1996; Hey 1999). The strict neutral theory, on the other hand, has become the standard null hypothesis used to approach the study of molecular evolution, even if it is often rejected.
In this work, how molecular inferences about natural selection have been and can be used to address questions in ecology and conservation biology are reviewed. The fields of conservation genetics and molecular ecology make heavy use of molecular techniques, but generally do so under the assumption that essentially all observed variation is selectively neutral. For example, of the papers published in Molecular Ecology over the last 2 years, only a handful made use of statistical tests of neutrality. The primary goal of this review is to convince molecular ecologists that molecular population genetics can be a fruitful way to study natural selection and adaptation.
There have been several recent reviews of the statistical techniques used to test for selection on DNA sequences (Kreitman & Akashi 1995; Hughes 1999; Kreitman 2000). Therefore, these techniques are reviewed in sufficient detail only to make this paper reasonably self-contained (Information box), and interested readers are referred to the reviews cited above and the primary literature for more detailed treatments. Similarly, there have been numerous reviews of selection on protein electrophoretic variation (e.g. Gillespie 1991; Mitton 1997); in this review I therefore focus exclusively on DNA sequence variation and, in particular, on those aspects of adaptive molecular variation likely to be of most interest to molecular ecologists.
This review consists of four parts. First, the status of the neutral theory after 15 years of DNA sequence surveys is reviewed briefly. Second, I review how neutrality tests have been used to study ecological questions, focusing on what types of genes have been found to be subject to selection resulting in rates of evolution or levels of diversity greater than expected under neutrality (positive selection) and discussing in detail several examples likely to be of particular interest to molecular ecologists. Third, the review briefly discusses how failure to test the neutrality hypothesis adequately can produce misleading results focusing particularly on recent developments such as the use of microsatellites. Finally, the review concludes with a discussion of the positive implications of using DNA sequence data to study natural selection, and discusses some of the obstacles that will limit its near-term utility.
The current status of the neutral theory
Studies that use molecular markers to address questions in ecology and conservation biology often assume a strictly neutral model of molecular evolution as the basis for analysing and interpreting results. It is therefore worthwhile to review briefly the status of the neutral theory after nearly 20 years of studies of DNA sequence variation within and among species. This has been the topic of several recent reviews (Kreitman 1996; Hey 1999; Hughes 1999), so this section will concentrate only on a few key findings.
The strictly neutral theory proposes that the vast majority of new mutations fall into one of two categories: deleterious or selectively neutral (Kimura 1983). Deleterious mutations are expected to be eliminated rapidly due to natural selection against them and therefore presumably contribute little to variation within or among species. Mutations that are selectively equivalent to the allele(s) already present in the population, on the other hand, are expected to have dynamics governed by genetic drift and to make up the vast majority of the observed variation both within and among species. Beneficial mutations are expected to be extremely rare and to contribute little to observed patterns of DNA sequence variation.
Overall, patterns of DNA sequence variation generally support one key aspect of the neutral theory — that many of the observed differences within and between species are nonadaptive (Kimura 1983; Hughes 1999). For example, there is a broad negative correlation between a nucleotide site's functional importance and its substitution rate. Sites that involve a change in a protein or a change in gene regulation are on average more conserved both within and among species than nonfunctional sites, exactly as predicted by the neutral theory. Additionally, vast parts of the genomes of many organisms contain noncoding DNA that serves no known purpose, and most mutations in these areas are presumably neutral.
A second key aspect of the neutral theory, that most variation has dynamics governed predominantly by genetic drift, is not supported by observed patterns of DNA variation. An important general finding from studies in Drosophila is that there is a positive correlation between a gene's average recombination rate and its level of variability within species (Begun & Aquadro 1992; Aquadro et al. 1994). Similar correlations have been found in humans (Nachman et al. 1998, 2001), sea beets (Kraft et al. 1998) and maize (Tenaillon et al. 2001). The difference cannot be explained by differences in the neutral mutation rate because there is no correlation between a gene's interspecific divergence and its recombination rate (Begun & Aquadro 1992). The explanation for the correlation appears to be due to the ‘hitchhiking’ of putatively neutral variation with sites that are under selection. One hitchhiking hypothesis that fits fairly comfortably within the paradigm of the neutral theory is that hitchhiking is due mainly to selection against new deleterious mutations that never reach high frequency (the background selection hypothesis —Charlesworth et al. 1993; Charlesworth 1996). The other hypothesis is that the hitchhiking is due mainly to selection of new beneficial mutations that, because of their rapid fixation, ‘selectively sweep’ away large swaths of neutral variation in regions of low recombination (Begun & Aquadro 1992; Aquadro et al. 1994; Hudson 1994; Stephan 1995). The relative degree to which selective sweeps and background selection shape patterns of variation across the genome remains unknown, and distinguishing between the two modes is active area of research (e.g. Begun & Whitley 2000; Fay & Wu 2000; Kim & Stephan 2000; Carr et al. 2001; Kauer et al. 2002; Wang et al. 2002).
The correlation between recombination and variation has two important implications for the neutral theory as it is used typically by molecular ecologists. First, regardless of whether the background selection or selective sweep hypothesis is more important, it means that a gene's recombinational environment is a large contributing factor in determining patterns of variation within species. Second, if the selective sweep hypothesis emerges as a major factor in creating the correlation, this would mean that even if much observed variation is selectively neutral, the dynamics of this neutral variation could well be governed more by linkage to selected sites than genetic drift. This is almost certainly true of genes in regions of low recombination, but linkage to selected sites can in theory be important even in regions of high recombination (Gillespie 2000).
Another aspect of the neutral theory that is not well supported by patterns of DNA sequence data is the hypothesis that adaptive DNA variation is expected to be exceedingly rare. Strictly speaking, this may be true on a per nucleotide site basis, considering that many changes in third codon position, introns and noncoding regions may often be selectively equivalent. However, the fraction of genes for which there is evidence of positive selection may be larger than postulated by the neutral theory. The only published study to attempt to estimate systematically the proportion of genes subject to positive selection is that of Endo et al. (1996), who estimated nonsynonymous/synonymous substitution rate ratios (dn/ds) for 3595 groups of genes extracted from public databases. A dn/ds ratio > 1 provides evidence for positive selection (see Information box). Of the 3595 homologous groups of genes, Endo et al. (1996) found only 17 with evidence for positive selection — certainly a small fraction. However, their criteria for positive selection, dn/ds > 1 averaged over entire genes, was extremely stringent, considering that in most cases only a small fraction of the sites in a gene may be subject to positive selection (Yang & Bielawski 2000). Simple dn/ds ratios are also far from the only method of detecting adaptive evolution at a gene (Information box). Other methods of detecting selection, particularly those that compare dn/ds ratios within and between species, are expected to be more powerful (Charlesworth et al. 2001; Fay et al. 2001), and have been used to find selection in many genes with simple dn/ds ratios < 1 (Table 1). In addition, there are several examples of selection on noncoding sequences (e.g. Schulte et al. 1997; Wang et al. 1999) which would not be detected with any sort of dn/ds comparison.
Table 1. Genes with evidence for positive selection
The proportion of genes subject to positive selection therefore remains unknown, but it is sufficiently high that it has become relatively common to find evidence for positive selection. For example, in a literature search for papers documenting positive selection using DNA sequence data (Table 1), three studies were found published in the 1980s, 57 published in the 1990s and 45 published in 2000 and the first half of 2001. The rapid increase in the number of papers documenting positive selection is due both to advances in DNA sequencing techniques that allow more rapid acquisition of data and advances in statistical methods for detecting positive selection (Kreitman & Akashi 1995; Yang & Bielawski 2000).
In summary, the ‘neutralist/selectionist’ debate of the 1970s and 1980s has died away (Kreitman 1996; Hey 1999). It has been replaced by a more complicated view in which selection, both positive and negative, appears to leave deep footprints in the genome and evidence for positive selection at many specific genes is accumulating. The strictly neutral theory, rather than being a simple explanation for patterns of genetic diversity, has instead become the primary null hypothesis used to test for the effects of natural selection.
Using inferences about selection to study ecology
The field of ecology encompasses a broad range of questions about why species and populations are where they are, and how they interact with other organisms and their environment. The fields of molecular ecology and molecular conservation genetics have involved primarily the application of molecular data to questions of biogeography, gene flow, hybridization and relatedness among individuals (Avise 1994, 1995). These questions are important and can be addressed to greater or lesser degrees within the framework of the neutral theory. Other ecological questions address adaptation and natural selection, however. If inferences about natural selection could be made readily from molecular data, this would be of enormous importance to the field of molecular ecology.
What types of genes are subject to positive selection?
To address this question, a literature search was conducted for studies documenting positive selection. Only studies that used statistical tests on DNA sequence variation, to infer the action of natural selection at particular genes, were included; direct functional studies, studies documenting hitchhiking over broad genomic areas and studies based solely on protein variation were not included. The search resulted in over 100 studies published in the last 15 years documenting statistical evidence for positive selection on over 119 genes or gene groups (Table 1). Table 1 is not an exhaustive list of all studies documenting positive selection, but certainly includes the bulk of studies published prior to early 2001. The only types of genes that are probably systematically underrepresented in Table 1 are microbial and viral genes, at which observations of high dn/ds ratios are common and therefore difficult to compile comprehensively. Studies documenting positive selection are being published at an increasing rate (over half of the studies cited in Table 1 are from 2000 and 2001), and the list of positively selected genes can be expected to grow dramatically in the coming years.
Of the 119 positively selected genes or gene groups, 92 had dn/ds significantly > 1 for all or part of the gene, nine had significant McDonald–Kreitman tests and 25 had statistically non-neutral patterns of overall polymorphism or divergence indicative of natural selection (e.g. significant HKA, Tajima’s, or similar tests — Information box I).
Of the genes that show evidence for positive selection, 47 have been shown or hypothesized to be involved in host–parasite interactions, either as part of the host's defence system or the parasite's system for evading host defences (Table 1). Examples of host immune response genes with evidence for positive selection include multiple loci in the major histocampatibility complex (MHC) in various mammals and fishes; interleukin-2, ribonucleases and CD45 in mammals, transferrin in salmonids; and chitinases, a polygalacturonase inhibitor, and various other disease resistance genes in plants (Table 1). With regard to pathogens, there is evidence for positive selection in multiple viral, bacterial, fungal and protist genes (Table 1).
A second large group of genes with evidence for positive selection consists of genes that are, or have been hypothesized to be, involved in sexual reproduction. In particular, there is evidence for positive selection in at least 35 reproductive genes (Table 1). Positive selection on sex-related genes has been reported across a wide variety of taxa, including three sperm proteins from several marine invertebrates (sea urchins; abalone; sea snails); a large number of Drosophila ejaculatory proteins; a mouse androgen binding protein; genes related to interspecific mating isolation in Drosophila; mammalian egg surface proteins; and genes involved in self-mating avoidance in plants and fungi (Table 1). The specific functions of these genes vary dramatically, and selective mechanisms have been documented directly in only a few cases (e.g. S-loci in plants, Clark & Kao 1991).
Genes whose products code for enzymes involved in energy metabolism form a third group of genes with statistical evidence for positive selection (reviewed by Mitton 1997; Eanes 1999). There is statistical evidence for positive selection in least 15 metabolic genes (Table 1), most of which have been surveyed extensively in many species using protein electrophoresis. Most of the studies documenting positive selection at these genes have focused on variation within and among closely related species (Table 1), and violations of the neutral model at these genes generally involve non-neutral patterns of overall polymorphism and divergence or significant McDonald–Kreitman tests (see Information box). Many of the genes targeted for DNA studies had been hypothesized previously to be subject to selection based on either clinal patterns of protein variation or function studies of protein variants (Mitton 1997; Eanes 1999).
The remaining genes with strong evidence for positive selection do not fall into any obvious functional category. These genes include lysozyme and K-casein, which are involved in digestion; genes for various animal toxins; a melanocortin receptor gene involved in hair and skin colour; regulatory genes involved in plant morphology; several genes expressed in artiodactyl placentas; haemoglobin in Antarctic fishes; the cod pantophysin gene; odour receptors in humans and catfish, and the pyrin gene in primates (Table 1).
What are the selective mechanisms?
The number of reported cases of positive selection has grown remarkably over the last several years, but in most cases the actual source of selection remains unknown. In some cases, even the functions of the genes in question are obscure. This situation has come about because it has become relatively easy to collect and analyse DNA sequences but it is much more difficult to elucidate a gene's functions, let alone determine how variation in those functions was influenced by natural selection in current or past populations. In some cases, particularly when a gene's function(s) are at least partially understood, it is possible to postulate a plausible adaptive hypothesis for why a gene has been subject to positive selection. These hypotheses can then suggest further experiments or observations that could be used to confirm or refute them as explanations for observed patterns of genetic variation. On the other hand, some of the cases of positive selection reported in the literature are inferred by comparing sequences from species that have diverged tens or sometimes even hundreds of millions of years ago. In most of these cases, a detailed understanding of the selective mechanisms leading to observed patterns of variation may well be unattainable (Hughes 1999).
Even in cases where there is little information on the ecological or functional mechanisms leading to patterns of DNA variation indicative of positive selection, some useful information can be gleaned directly from the DNA sequence data. For example, patterns of DNA sequence polymorphism can be used to determine if selection acting on a gene is directional (e.g. G6pd: Eanes et al. 1993), balancing (e.g. PgiC: Liu et al. 1999) or both (e.g. Adh: Kreitman & Hudson 1991). High dn/ds ratios among genes that make up a multigene family provide strong evidence that there has been selection for new functions after gene duplication (reviewed by Hughes 1999). The observation of strong directional or balancing selection at a gene or gene region can be a useful tool for determining whether the gene is a good candidate for future functional studies. This approach is particularly powerful if specific nucleotide sites that have been the target of positive selection can be identified. Molecular biologists have long made use of a key paradigm of the neutral theory — that functionally important parts of a gene will be conserved — to identify candidate gene regions for further functional study. A similar approach can now be taken by identifying specific variable sites that show statistical evidence for adaptive evolution (Nielsen & Yang 1998; Suzuki & Gojobori 1999; Fig. 1).
Examples of ecologically interesting positive selection
Selection on allozyme loci
A basic goal of physiological ecology is understanding how an organism's metabolism allows it to live in its environment. Many of the proteins that can be studied electrophoretically are involved in energy metabolism. After high levels of allozyme variation were discovered in the late 1960s (Harris 1966; Lewontin & Hubby 1966), many studies were aimed at determining the functional and adaptive significance, if any, of observed variation (reviewed by Mitton 1997). Functional studies have documented that variation in metabolic proteins can affect ecologically important traits such as flying or swimming performance and mating ability (see Powers et al. 1991; Mitton 1997; Watt & Dean 2000 for reviews), and DNA sequence studies have found that many metabolic genes have patterns of variation inconsistent with neutrality (Table 1). Studying allozyme loci at the DNA level is attractive because many of the genes code for proteins whose biochemical roles are well understood and which may contain variation important for understanding the distribution or behaviour of organisms (Mitton 1997; Eanes 1999). Studies of DNA sequence variation at alloyzme genes are particularly useful for acquiring additional insight into the evolutionary history of previously documented allozyme variation, although the inferred histories can complicated (reviewed by Eanes 1999). For example, the first allozyme locus to be studied extensively at the DNA level, alcohol dehydrogenase in Drosophila (Adh), has a peak of synonymous site polymorphism centred on the site that causes the allozyme polymorphism (Fig. 2: Kreitman & Hudson 1991). This pattern of variation is inconsistent with a simple neutral model, and consistent with a balancing selection model (Kreitman & Hudson 1991). In a more recent analysis of additional Adh sequences, however, Begun et al. (1999) suggest that the evidence of an old balanced polymorphism at Adh is not compelling, in part because the peak of synonymous polymorphism appears to predate the alloyzme polymorphism. Other Drosophila protein polymorphisms also have quite complicated inferred histories based on patterns of DNA sequence variation. For example, Sod, G6pd, Pgm, Tpi, Pgd and Amy all have patterns of variation that are inconsistent with neutrality but that are not suggestive of old balanced polymorphisms (see Table 1 for references). The type of selection occurring at these loci remains obscure, but may involve transient or regionally varying selection or evolutionarily young balanced polymorphisms. One unifying factor in studies of selection in D. melanogaster is its relatively recent migration from Africa, which may have created new selection pressures in non-African populations (Begun & Aquadro 1995; Takahashi et al. 2001; Kauer et al. 2002). Outside Drosophila, other cases in which DNA sequence studies have found or confirmed evidence of positive selection at allozyme loci include Ldh-B in Fundulus (Powers et al. 1991; Schulte et al. 1997; Powers & Schulte 1998), and PgiC in crickets (Katz & Harrison 1997) and several plant species (Arabidopsis thaliana: Kawabe et al. 2000; Leavenworthia sp.: Liu et al. 1999).
Domestication selection in agricultural systems
Understanding the ecology and evolution of agricultural systems is important for solving many applied agricultural problems. For example, concerns have been raised that modern agricultural plants and animals are becoming increasingly depauperate of genetic diversity and therefore vulnerable to environmental or biological insults their wild ancestors would have resisted (e.g. Notter 1999). Understanding the genetic architecture and history of the domestication process provides useful information for the feasibility and utility of introducing new genetic material from a crop's wild ancestors (Tanksley & McCouch 1997). Agricultural crops have also provided a useful experimental context for addressing fundamental ecological and evolutionary questions applicable to natural systems, including the genetic basis of morphological variation, genetic and ecological interactions between hosts and pathogens and the evolution of phenotypic plasticity. Finally, many agricultural crops are amenable to extensive transmission genetic analysis such that genes of interest can be located in the genome and cloned. This makes agricultural systems particularly attractive for studying the population genetics of genes that control or influence complex traits (Tanksley & McCouch 1997).
The domestication of maize (Zea mays) from its wild grass ancestor teosinte (Z. mays parviglumis) provides a superb example of how molecular inferences about natural selection provided insight into an important ecological and evolutionary question. Recently, Doebley and colleagues used genetic mapping techniques to show that the teosinte branched-1 (tb1) locus is primarily responsible for one of the key morphological differences between maize and teosinte (Doebley et al. 1995, 1997). They then used a prediction of the neutral theory — that loci that have recently experienced strong directional selection are expected to have low levels of intraspecific variation — to see if there was evidence for selection at tb1 during the domestiction of maize (Wang et al. 1999). Wang et al. (1999) tested this expectation by sequencing a sample of tb1 alleles from maize and teosinte. They found that the 5′ untranslated region of tb1 had much lower levels of variability in maize than in teosinte, consistent with a selective sweep of that gene region in maize (Fig. 2). An HKA test ( Information box) confirmed that the reduction in variability in maize was inconsistent statistically with neutral evolution.
The genealogy of the tb1 gene also provided insight into the biogeography of maize domestication. The consistent clustering between maize tb1 5′ alleles and teosinte alleles sampled from a specific teosinte subspecies native to particular river valley provided evidence for the location in which maize was initially domesticated (Wang et al. 1999). In this case, genealogical analysis of a gene under strong selection provided a much clearer biogeographical signal than genes evolving neutrally (Hilton & Gaut 1998). The analysis of genetic variation at the CAULIFLOWER locus (BoCal) in wild and domesticated Brassica oleracea provides another good example of how neutrality tests were used to gain insight into selection at a gene involved in plant domestication (Purugganan et al. 2000).
Cuticular hydrocarbon variation in D. melanogaster
Understanding the reasons for intraspecific variation in behaviour, morphology or physiology is an important goal of ecology. For example, populations of D. melanogaster vary geographically in the components of their cuticles, with African and Caribbean populations having a high ratio of two major hydrocarbon isomers and all other populations worldwide having a low ratio of the same isomers. The functional significance of these differences is not well understood, but sexual selection or adaptation for desiccation resistance have been hypothesized as possible explanations (Ferveur et al. 1996; Takahashi et al. 2001).
Recently, two groups determined independently that the variation in the isomer ratios was due to a difference at a single gene, desat2 (Coyne et al. 1999; Dallerac et al. 2000; Takahashi et al. 2001). Based on patterns of variation at desat2, Takahashi et al. (2001) found that the cosmopolitan low ratio alleles were evolutionarily derived from an African/Caribbean high ratio allele. The low ratio alleles displayed little intra-allelic diversity and had a frequency spectrum inconsistent with an equilibrium neutral model but qualitatively consistent with a recent selective sweep. Takahashi et al. (2001) concluded that the most probable explanation for their data was that the low ratio allele originated recently and increased in frequency in non-African populations due to natural selection either during or subsequent to D. melanogaster’s recent emigration from Africa.
Understanding interactions among species is another important goal of ecology. One type of interspecific interaction that appears to lend itself readily to molecular analysis is the relationship between hosts and parasites. Recently, several investigators have used DNA sequence variation in plant genes involved in resistance to pathogens to gain insight into host/pathogen coevolution (reviewed by Bergelson et al. 2001; Holub 2001). These studies have found that positive natural selection has played a role in the evolution of plant resistance genes in a variety of ways, including diversifying selection among members of resistance gene families (e.g. Botella et al. 1998); directional selection on genes among different species (e.g. Bishop et al. 2000); and balancing selection among alleles within species (Fig. 2; Stahl et al. 1999; Bittner-Eddy et al. 2000).
Stotz et al.'s (2000) study of fungal polygalacturonases (PGs) and their plant inhibitors (PGIPs) is a particularly good example of how neutrality tests can contribute to the understanding of interactions between hosts and pathogens. Stotz et al. (2000) analysed interspecific variation among 19 fungal PGs and 22 plant PGIPs, and found that many comparisons within each group had dn/ds significantly > 1, indicating that both the pathogen proteins and plant inhibitors were evolving by positive selection. More detailed analysis of dn/ds ratios using Nielsen & Yang's (1998) maximum likelihood method allowed Stotz et al. (2000) to identify nine codons in each group of genes with high posterior probabilities of being subject to positive selection.
One strength of the Stotz et al. (2000) study is that the molecular structures of both PGs and PGIPs are known, and the physical locations of sites identified as statistically likely to be subject to positive selection can be visualized three-dimensionally. This three-dimensional physical information can provide additional insight into the mechanism of natural selection and can generate hypotheses for further experimental study (Nielsen & Yang 1998, Fig. 1).
Using neutrality tests to study natural selection
The conclusion that variation at many genes is influenced by positive selection, either directly on the genes themselves or indirectly through the effects of linkage, has important implications for molecular ecology and conservation genetics. For example, the finding of pervasive effects of positive selection suggests that molecular ecologists must be cautious about analyses that rely upon untested assumptions of neutrality (Rand 1995). For example, molecular ecologists and conservation geneticists use molecular genetics routinely to help address questions such as defining conservation units, studying biogeography and estimating effective population sizes or rates and patterns of gene flow. In most cases the genetic variants that are analysed are of no particular interest in themselves, but are used instead as putatively neutral markers of demographic or evolutionary histories (Avise 1994). If a marker, or even an entire group of markers, that is assumed to be evolving neutrally is actually subject to selection, conclusions based on patterns of variation at the marker could be misleading. On the other hand, the ability to make strong inferences about the selective history of a gene opens up exciting new research areas for molecular ecologists to explore, although for most organisms there are several roadblocks standing in the way of research programmes based on studying natural selection at the molecular level. In what follows, the implications are discussed for molecular ecology if many putatively neutral markers turn out to be non-neutral.
Implications for ignoring selection
Rand (1995) warned conservation geneticists that assuming selective neutrality for allozyme and mtDNA variation routinely was problematic due to considerable evidence for the effects of natural selection at these classes of genetic markers. Assuming strict neutrality when in fact variation is affected by selection is particularly worth avoiding when conducting quantitative analyses, such as estimating divergence times, rates of gene flow or effective population sizes. In the 6 years since Rand's review, the number of published examples of positive selection has increased dramatically (Table 1), so the need to examine carefully the consequences of false assumptions of neutrality appears even more justified now than it did in 1995.
One development that has come about after Rand's review is the now ubiquitous use of microsatellites in molecular ecological studies. Variation at these loci is usually assumed to be selectively neutral, in part because microsatellites are often located in noncoding regions. Several recent results suggest that assumptions of strict selective neutrality at microsatellite loci must be viewed with caution, however. For example, some microsatellites are located in coding or promoter regions and may be under direct selection (e.g. microsatellites in the promoter region of Ldh-B in Fundulus heteroclitus: Schulte et al. 1997).
Even if variation at a specific microsatellite locus is selectively equivalent, it may not fit a strict neutral model due to selection at linked sites. Furthermore, because microsatellite mutation rates vary across taxa, the power to detect the effects of hitchhiking selection on equilibrium levels of variation will vary as well (Schug et al. 1998; Wiehe 1998). For example, in D. melanogaster there is a strong correlation between recombination rate and microsatellite variation similar to that observed for single nucleotide variation and due presumably to the combined effects of background selection and positive hitchhiking (Schug et al. 1998). In contrast, in humans there is no evidence of a correlation between microsatellite variation and recombination, but there is a positive correlation between recombination and single nucleotide polymorphism (Nachman et al. 1998; Payseur & Nachman 2000; Nachman 2001). The lack of a correlation between microsatellite variation and recombination in humans is due apparently to the high mutation rate of human microsatellites (∼10−4, references in Payseur & Nachman 2000). Microsatellites in regions of low recombination will have the same reduced coalscent times as other sequences in the same regions, but may not have reduced levels of variation in humans due to mutational saturation (Wiehe 1998). In contrast, in Drosophila microsatellite mutation rates are much lower (∼10−6), take longer to become mutationally saturated and therefore have patterns of variation that are influenced more readily by background selection or selective sweeps (Schug et al. 1998).
The effects of violations of selective neutrality will vary from study to study depending on the specific questions being addressed, and it is beyond the scope of this review to address the topic in any detail (see Rand 1995 for a comprehensive treatment, particularly with regard to mtDNA). The point is that it is important to test assumptions about selective neutrality before using ‘neutral’ markers to make inferences about populations.
Studying natural selection at the molecular level
The ability to make strong inferences about the action of selection from patterns of DNA sequence variation is a potentially powerful way of studying adaptation. For example, one might be interested in knowing if geographical patterns of variation in a particular trait are adaptations that arose through natural selection. Conceptually, a way to answer this question would be to survey DNA sequence variation at a gene or genes with major effects on the trait and determine if the null hypothesis of selective neutrality can be rejected (i.e. a variation of Endler's (1986) model fitting method of studying natural selection). If the patterns of variation are statistically inconsistent with the neutral theory, but statistically consistent with positive natural selection, this would provide evidence that the variation in the trait of interest arose in response to natural selection. It may also be possible to make inferences about how, when and perhaps even where the selection occurred. Several of the studies described above used successfully this approach of studying the history of natural selection on specific traits.
Using DNA sequence data to study natural selection is useful but there are several obstacles, both practical and conceptual, that will limit its near-term applicability. The primary practical limitation for most molecular ecologists studying nonmodel organisms will be finding appropriate genes to study. Many of the traits that ecologists find interesting are complex or quantitative traits of morphology, physiology or behaviour. For most organisms, the genetic basis of these traits is unknown and likely to be influenced by many genes. Finding the genes that influence variation in quantitative (or for that matter even simple) traits is difficult and time-consuming even in ‘model’ genetic organisms such as Drosophila and maize (Flint & Mott 2001; Mackay 2001), and is not currently feasible for many of the less genetically studied organisms that are usually of interest to ecologists or conservation biologists.
In addition to the practical problem of isolating genes, a related conceptual problem is that genetic variation in traits of interest may not always involve variation at a gene of large effect. If a trait of interest is influenced only by many genes of small effect, it is more likely that the effects of selection on the trait would also be dispersed widely around the genome and therefore more difficult to detect from patterns of DNA sequence variation. In Drosophila, where considerable progress has been made dissecting the genetic basis of quantitative traits, the distribution of genic effects for several traits is roughly exponential with a few genes of large effect and many genes of small effect (Mackay 2001). If this patterns holds for other traits and other species, it would mean that most traits will be influenced by some genes of large effect. Genetic mapping experiments in other species do generally find quantitative trait loci (QTL) of large effect (e.g. Schemske & Bradshaw 1999; Peichel et al. 2001), but in these studies individual QTL almost always encompass large chromosomal regions containing hundreds of genes, so the effects of individual genes cannot be evaluated without extensive additional genetic analysis (Flint & Mott 2001).
If isolating genes of interest is a struggle even in model genetic organisms, is there any hope for the ecologist studying an organism with no genetic map, no sophisticated genetic tools and no probable prospect for a government-funded genome project? There are several potential approaches. First, for many organisms it will be more fruitful to examine variation in genes whose functions are already well understood than to start with a phenotypic trait of interest and try to find the genes that influence it. A place to start might be genes whose products are involved in basic energy metabolism, such as the enzymes in the glycolytic pathway and the Krebs cycle. These genes code for products with well-understood biochemical functions that are similar across many organisms. Variation in at least some of these genes has been shown to be subject to positive natural selection and to contribute to ecological interesting variation in a variety of species (see above), making them good candidates for studying adaptive variation. Cloning these genes using heterologous probes or degenerate PCR primers is relatively straightforward (e.g. Katz & Harrison 1997) and will become more so as more sequences become available. For a great many organisms, variation at these genes has been surveyed using protein electrophoresis, and in some cases patterns of electrophoretic variation may suggest good candidates to study at the DNA level.
Model genetic organisms provide a second potential source of candidate genes for molecular adaptation studies (reviewed by Haag & True 2001). Ford & Aquadro (1996), for example, in their study of speciation in D. athabasca, surveyed variation in candidate mating song genes isolated originally from D. melanogaster. Lukens & Doebley (2001) studied patterns of variation in the tb1 gene among members of the grass family to determine if this gene, which influences morphology in maize, had also been evolving adaptively in other grass species. A coat colour gene, Mc1r, isolated originally in mice, has been shown to affect coat colour in a variety of farm animals (Andersson 2001), and there is evidence for positive selection acting on the same gene during primate evolution (Rana et al. 1999). There is even weak evidence that the gene is subject to positive selection in humans, where it may influence skin and hair colour (Rana et al. 1999). The CAULIFLOWER gene, isolated originally in Arabidopsis, was also found to affect floral morphology in domesticated Brassica sp. (Purugganan & Suddith 1998; Purugganan et al. 2000). MHC genes have been isolated from, and found to be subject to positive selection in, a variety of species (Table 1). As the functions of more and more genes are elucidated in a few ‘model’ organisms, molecular ecologists will have a large pool of candidate genes to draw upon for population genetic study. Confirming that candidate genes actually influence the trait of interest in the organism studied is rarely a trivial task, however (Flint & Mott 2001).
A third potential source of genes to study could come from randomly or systematically sampling a species’ genome for genes that show evidence for positive selection. Pogson et al. (1995) took this approach in their study of genetic variation in Atlantic cod. They surveyed variation initially within and among cod populations at a group of randomly cloned cDNA fragments, and found that as a class the random clones had greater levels of diversity among populations than did allozyme loci, suggesting the action of natural selection on one or both classes of loci. Pogson (2001) conducted a detailed analysis of sequence variation at one of the random clones and found that the gene (pantophysin) had dn/ds > 1 within cod populations. That observation provided additional evidence for positive selection, although the form of selection and even the gene's function remain obscure (Pogson 2001). There is no way to know how often such random searches can be expected to find genes of interest, but this example suggests that developing methods of scanning genomes for positively selected genes would be worthwhile.
Finally, the availability of complete genome sequences combined with well-developed genetic tools provides a compelling reason to study the ecology of genetically well-characterized organisms. In plants, Arabidopsis thaliana is rapidly becoming a model plant for ecological genetic studies (Pennisi 2000). D. melanogaster is one of the best understood animals genetically and developmentally, and studies in the species have led to fundamental advances in population genetics (Aquadro et al. 1994; Hey 1999). Despite this wealth of knowledge, D. melanogaster has been the subject of relatively few ecological studies and the ecological factors causing selection on even the famous Adh polymorphism are not well understood (Hughes 1999). More intensive ecological study of D. melanogaster, and other genetically well-characterized organisms, would clearly be beneficial.
In summary, statistical analysis of DNA sequence diversity within and among species is a powerful way to study natural selection and adaptation. In large part, this is because DNA sequences can provide an informative record of their own evolutionary history. Over the last 20 years, investigators have made substantial progress in their ability to read and make sense of the historical information stored in DNA sequences. As more genes are sequenced and their allelic diversity analysed within and among species, the challenge will be not so much to find genes that have evolved under positive selection but to understand causes of the selective pressures.
The author thanks Paul Moran, Robin Waples, Peter Kareiva and two anonymous reviewers for helpful comments on previous versions of the paper.
Michael J. Ford heads the Genetics and Evolution Program in the Conservation Biology Division of the Northwest Fisheries Science Center. The Conservation Biology Division studies the processes maintaining biological diversity so that diversity can be sustained in the face of human impacts. The purpose of the Genetics and Evolution Program is to understand how genetic processes contribute to species viability, and to develop and use molecular genetic tools for addressing evolutionary, ecological and resource management questions.
Methods of detecting positive selection using DNA sequence data
Soon after protein electrophoretic studies began demonstrating that natural populations were polymorphic at enzyme loci, investigators developed methods of testing the fit of observed patterns of variation to expectations under neutrality (e.g. Ewens 1972; Lewontin & Krakauer 1973; Watterson 1978). These tests can suffer from low power and their assumptions are frequently violated, making strong inferences about selection difficult. Many of the statistical tests of the fit of DNA sequence data to neutral models have been more successful, due largely to the greater information content available in the data (Kreitman & Akashi 1995; Hey 1999; Kreitman 2000). A plethora of neutrality tests have been developed for DNA sequence data, and four important and representative ones are illustrated briefly.
The HKA test.Hudson et al. (1987) developed the first statistical test of neutrality to take advantage of the greater information content available from DNA sequence data compared to protein electrophoretic data. The test is based on the expectation that under an infinite sites neutral model (appropriate for DNA sequence variation within and among closely related species) the level of polymorphism at a gene within a species is proportional to the amount of divergence at that gene between species. This expectation can be tested by comparing levels of polymorphism and divergence at two or more genes using Hudson et al.′s (1987) test statistic. The HKA test can be applied to both DNA sequence and RFLP data, and has been used commonly to test for departures from neutral expectations (Table 1). The HKA and related tests, like the earlier tests designed for allozyme data, makes specific assumptions about population structure which, if violated, can lead to rejection of the neutral model for reasons other than positive selection (Ford & Aquadro 1996; Ford 1998). Sliding window graphs (Fig. 2) are a method of visualizing the HKA concept, and identifying specific regions of a gene responsible for the departure from neutral expectations. McDonald (1996, 1998) has developed a formal statistical test using the sliding window approach.
Tajima's D-test.Tajima (1989) developed a statistical test of neutrality that uses only polymorphism data within a population. The test statistic, D, is based on the difference between two estimators of the neutral polymorphism parameter 4Neµ, one based on the total number of polymorphic nucleotide sites observed and the other based on the average number of differences between all pairs of sequences sampled. Under neutrality, both methods are expected to produce the same value, and Tajima's D statistic is designed to test whether the difference between them is unlikely to be observed by chance alone. The test is essentially a method of testing the fit of the data to the expected neutral site frequency distribution, with negative D-values indicating an excess of rare variants and positive values indicating an excess of intermediate frequency variants. Like the HKA test, Tajima's test can be used on both RFLP and full sequence data. The test is sensitive to violations of its demographic assumptions and is not particlarly powerful (Tajima 1989;Simonsen et al. 1995), so interpretation of test results can be uncertain. Fu (1997) discusses a number of similar tests.
Dn/ds tests.Hill & Hastie (1987) and Hughes & Nei (1988) were the first to use the ratio of nonsynonymous differences to synonymous differences among DNA sequences as a test for positive selection. The idea behind the test is that if synonymous mutations are essentially neutral because they do not result in a change in a protein, the rate of synonymous site evolution (ds) will equal the mutation rate (Kimura 1983). Nonsynonymous mutations, because they result in a change in a protein product, are more likely to be subject to natural selection. If most nonsynonymous mutations are deleterious, then the rate of nonsynonymous evolution (dn) will be lower than neutral rate, resulting in dn/ds < 1. If a substantial fraction of nonsynonymous mutations are beneficial, however, the average rate of nonsynonymous evolution can be higher than the neutral rate, resulting in dn/ds > 1. Conceptually, all that is involved in performing a dn/ds ratio test is estimating the number of synonymous and nonsynonymous substitutions between two sequences and the number of synonymous and nonsynonymous sites in the sequences and then performing a statistical test to determine if the difference between the two ratios is greater than expected by chance alone. In practice, complications such as accounting for multiple substitutions at the same site, estimating the number synonymous and nonsynonymous sites, and accounting for transition/transversion ratios add considerable complexity to the problem, especially for highly divergent sequences (Nielsen 1997; Yang & Nielsen 2000).
An exciting recent development in the application of dn/ds tests has been the development of maximum likelihood approaches for estimating dn/ds and related parameters (Goldman & Yang 1994; Nielsen & Yang 1998; Yang & Bielawski 2000). In addition to providing potentially more accurate dn/ds estimates (Yang & Nielsen 2000), the likelihood approaches allow estimation of dn/ds for individual branches of a phylogenetic tree, estimation of the proportion of codons in a sequence that are subject to positive selection, and identification of specific codon sites that are likely to have been subject to positive selection (Nielsen & Yang 1998, Fig. 1). Suzuki & Gojobori (1999) have developed an different method of detecting positive selection at single codons that uses a parsimony approach (see Suzuki & Nei 2001 for an initial comparison of the two methods).
The dn/ds ratio tests do not require any assumptions about population structure or equilibrium conditions, and therefore can provide particularly compelling evidence for positive selection. Various versions of the dn/ds ratio test do make assumptions about codon usage bias, transition/transition transversion rate ratios and selective neutrality of synonymous mutations. If these assumptions are violated the tests can produce misleading results, but these problems can be potentially avoided by using appropriate evolutionary models when conducting the tests (Yang & Bielawski 2000; Yang & Nielsen 2000).
McDonald–Kreitman test. McDonald & Kreitman (1991) proposed a two-by-two contingency test using the numbers of nonsynonymous and synonymous polymorphisms polymorphic within species and the numbers of nonsynonymous and synonymous differences between species. Under neutrality, the ratio of nonsynonymous to synonymous polymorphisms with species is expected to be the same as the ratio of nonsynonymous to synonymous differences between species (Sawyer & Hartl 1992). One striking finding that has emerged from using the test is that animal mtDNA genes generally have a greater number of nonsynonymous polymorphisms within species than expected compared to nonsynonymous divergence among species (e.g. Rand & Kann 1996), apparently due to weak selection against slightly deleterious nonsynonymous mutations (Nielsen & Weinreich 1999).