Evidence for complex selection on four-fold degenerate sites in Drosophila melanogaster

Authors

  • F. Clemente,

    1. Institute of Population Genetics, Veterinärmedizinische Universität Wien, Vienna, Austria
  • C. Vogl

    Corresponding author
    1. Institute of Animal Breeding and Genetics, Veterinärmedizinische Universität Wien, Vienna, Austria
    • Institute of Population Genetics, Veterinärmedizinische Universität Wien, Vienna, Austria

Correspondence: Claus Vogl, Institute of Animal Breeding and Genetics, Veterinärmedizinische Universität Wien, Veterinärplatz 1, A-1210 Vienna, Austria. Tel.: +43 1250 775 626; fax: +43 1250 775 690; e-mail: claus.vogl@vetmeduni.ac.at

Abstract

We considered genome-wide four-fold degenerate sites from an African Drosophila melanogaster population and compared them to short introns. To include divergence and to polarize the data, we used its close relatives Drosophila simulans, Drosophila sechellia, Drosophila erecta and Drosophila yakuba as outgroups. In D. melanogaster, the GC content at four-fold degenerate sites is higher than in short introns; compared to its relatives, more AT than GC is fixed. The former has been explained by codon usage bias (CUB) favouring GC; the latter by decreased intensity of directional selection or by increased mutation bias towards AT. With a biallelic equilibrium model, evidence for directional selection comes mostly from the GC-rich ancestral base composition. Together with a slight mutation bias, it leads to an asymmetry of the unpolarized allele frequency spectrum, from which directional selection is inferred. Using a quasi-equilibrium model and polarized spectra, however, only purifying and no directional selection is detected. Furthermore, polarized spectra are proportional to those of the presumably unselected short introns. As we have no evidence for a decrease in effective population size, relaxed CUB must be due to a reduction in the selection coefficient. Going beyond the biallelic model and considering all four bases, signs of directional selection are stronger. In contrast to short introns, complementary bases show strand specificity and allele frequency spectra depend on mutation directions. Hence, the traditional biallelic model to describe the evolution of four-fold degenerate sites should be replaced by more complex models assuming only quasi-equilibrium and accounting for all four bases.

Introduction

Synonymous sites have frequently been considered as potentially neutrally evolving, as the translated protein is unaffected by synonymous mutations (McDonald & Kreitman, 1991; Andolfatto & Przeworski, 2000; Smith & Eyre-Walker, 2002; Andolfatto & Wall, 2003; Sawyer et al., 2007). A special class of synonymous sites are four-fold degenerate sites, which are synonymous for all four nucleotides at the third codon position. Nevertheless, previous studies have revealed that in the genus Drosophila, synonymous sites in general and four-fold degenerate sites in particular are subject to selective constraint (reviewed by Hershberg & Petrov, 2008).

This is especially apparent when four-fold degenerate sites are compared to other putatively neutrally evolving sites, e.g. those in short introns. In a Drosophila melanogaster and Drosophila simulans data set, Parsch et al. (2010) found that short introns exhibit high levels of intraspecific polymorphism and interspecific divergence compared to four-fold degenerate sites, indicating little selective constraint on short introns. The relatively high AT content of introns, between 59% and 65% among the twelve sequenced Drosophila species, has been explained by a general mutation bias towards AT (Vicario et al., 2007). In contrast, synonymous sites are GC-rich. This is likely due to codon usage bias (CUB), i.e. directional selection for ‘preferred’ over ‘unpreferred’ synonymous codons. In most Drosophila species, preferred codons end in C or G (Shields et al., 1988; Akashi, 1994; Duret & Mouchiroud, 1999; Carlini & Stephan, 2003).

A simple biallelic population genetic model with a balance of mutation, directional selection and drift can account for many observed patterns reasonably well. In essence, this model consists of a mutation bias favouring AT alleles, counteracted by selection favouring GC alleles (Bulmer, 1991; Akashi, 1996; McVean & Charlesworth, 1999; Vicario et al., 2007; Zeng & Charlesworth, 2010). Such a process might actually elevate the amount of polymorphism and increase the number of fixations in the population over that expected under neutral equilibrium (McVean & Charlesworth, 1999; Lawrie et al., 2011; Vogl & Clemente, 2012). Although an equilibrium model of mutation, selection and drift may explain many observations, it is clear that in D. melanogaster allele frequency spectra and substitutions are not at equilibrium. D. melanogaster shows a relative increase in AT content at synonymous sites compared to D. simulans (Akashi, 1995, 1996; Kliman, 1999; Singh et al., 2009). This has been explained by relaxed selection for GC in D. melanogaster. Based on simulation studies, Akashi (1996) suggested at least a five-fold reduction in the scaled selection coefficient γ = Ns at synonymous sites in D. melanogaster, where N is the effective population size and s the (unscaled) selection coefficient. Alternatively, the mutation bias may have shifted even more towards AT in D. melanogaster, which results in a departure from mutation–selection–drift equilibrium (Takano-Shimizu, 2001; Kern & Begun, 2005; Akashi et al., 2006). These two explanations do not exclude each other.

Most tests to infer deviations from neutrality assume equilibrium (e.g. Tajima, 1989; Fu & Li, 1993; Fu, 1995; Fay & Wu, 2000; Achaz, 2008) and are therefore not applicable to the D. melanogaster data. However, even in the absence of equilibrium, powerful tests can be performed if the effects of recurrent mutations can be neglected and the ancestral state can be determined to polarize mutations, i.e. to infer their direction. For this, the mutation rate must be low, i.e. θ = 4Nμ < 0.05, where N is the effective population size and μ the mutation rate per site per generation. With such a low θ, recurrent mutations become unlikely, such that allele frequency spectra are largely independent of mutation biases and only governed by selection and population demography (e.g. Desai & Plotkin, 2008; Vogl & Clemente, 2012). In this case, parsimony can be assumed to determine the ancestral state, allowing the comparison of allele frequency spectra from different mutation directions.

Many earlier studies considered only relatively small data sets. With whole-genome data, greater power can be achieved. Using whole-genome data from related Drosophila species, Singh et al. (2009) provided evidence that rates and patterns of nucleotide substitution differ significantly among species and also among loci within genomes. These authors also provide evidence that not only reduced selection for preferred codons but also a mutational shift towards more AT could explain the high numbers of AT fixations at synonymous sites in D. melanogaster. Interestingly, a few genes were found to be under positive selection in favour of the otherwise unpreferred codon (DuMont et al., 2004; Singh et al., 2007; DuMont et al., 2009). With genome-wide short intron data, evidence for the effects of directional selection or other selective constraints is absent in D. melanogaster, whereas polymorphism and substitution patterns show a slight shift in bias towards AT (Clemente & Vogl, 2012). This supports the hypothesis of a shift in mutation bias towards AT in the D. melanogaster lineage. As no difference in the proportions of site-frequency spectra could be found in both mutation directions, we concluded that this shift must have occurred early after the split from the D. simulans/Drosophila sechellia lineage. The bias in the substitution pattern towards AT mutations is, however, far stronger at four-fold degenerate sites than in introns (Singh et al., 2009).

Genome-wide data not only provide greater power to decide among models that were developed earlier, but also allow for the evaluation of more sophisticated models. With genome-wide data, it is not necessary to reduce the complex mutation patterns to a biallelic model by lumping A with T and G with C alleles. Thus, strand-specific polymorphism and substitution patterns may now be revealed.

In this article, we provide a genome-wide within and among species analysis of four-fold degenerate sites. We use short intron sequences as a neutral reference (Clemente & Vogl, 2012) to infer the evolutionary forces acting on four-fold degenerate sites. We combine data from the African (Malawi) D. melanogaster population of the Drosophila Population Genomics Project (Langley et al., 2012) with D. simulans, D. sechellia, Drosophila erecta and Drosophila yakuba (Begun et al., 2007; Clark et al., 2007) as outgroups to polarize mutations with higher accuracy than in previous studies and to include divergence. Compared to cosmopolitan populations, African populations show little evidence of a bottleneck, which would interfere with the detection of patterns of selection. In addition to the studies of AT vs. GC mutation patterns, we also consider mutations among all four bases. We only consider autosomes and ignore the X-chromosome, because the X-chromosome differs systematically from the autosomes in D. melanogaster (e.g. Singh et al., 2005a, c, 2008). Because within four-fold degenerate sites we cannot assume equilibrium for our tests and inferences, we apply quasi-equilibrium models throughout using polarized frequency spectra and distinguishing mutational directions of substitution patterns.

Materials and methods

Sequence data

We analysed genome-wide four-fold degenerate sites from an African (Malawi) D. melanogaster population (Release 1.0), provided by the Drosophila Population Genomics Project (http://www.dpgp.org/; Langley et al., 2012). To obtain outgroup sequences, we downloaded (http://genome.ucsc.edu/) aligned single genome-wide sequences of D. simulans, D. sechellia, D. erecta and D. yakuba (Begun et al., 2007; Clark et al., 2007) (Release 5), and combined them with the D. melanogaster sequences for all autosomes. Because there are six D. melanogaster individuals for the second chromosome and five for the third, we considered both chromosomes separately and show the results from the second and third chromosomes in the main text and supplement, respectively.

On the second chromosome, the alignment consists of 1 246 810 four-fold degenerate sites. We considered only sites with complete data, i.e. where the states of all sequences are known, such that 1 183 781 sites remained in the analyses. We wrote Python and R scripts to extract the data according to the annotation of the D. melanogaster genome reference file (Release 5.31) from Flybase and to perform the analyses.
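The site selection described above can be illustrated with a minimal Python sketch. It assumes the standard genetic code, in which eight codon families are four-fold degenerate at the third position, and an in-frame coding sequence. The function names and the toy sequence are hypothetical and are not taken from the authors' scripts.

```python
# Codon families of the standard genetic code in which any base at the third
# position encodes the same amino acid (Leu, Val, Ser, Pro, Thr, Ala, Arg, Gly).
FOURFOLD_PREFIXES = {"CT", "GT", "TC", "CC", "AC", "GC", "CG", "GG"}

def fourfold_positions(cds):
    """Return 0-based indices of four-fold degenerate third positions in an in-frame CDS."""
    cds = cds.upper()
    return [i + 2 for i in range(0, len(cds) - 2, 3) if cds[i:i + 2] in FOURFOLD_PREFIXES]

def has_complete_data(column):
    """True if every sequence in an alignment column carries a called base (hypothetical filter)."""
    return all(base in ("A", "C", "G", "T") for base in column)

# Toy example: ATG (Met), GCT (Ala, four-fold), AAA (Lys), GGC (Gly, four-fold)
toy_cds = "ATGGCTAAAGGC"
positions = fourfold_positions(toy_cds)
```

In a multi-species alignment, the first two codon positions can themselves differ among sequences; how such codons were treated is not stated here, so the sketch only covers a single reference sequence.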

We compared the data of four-fold degenerate sites to short introns (bases 8–30 of introns < 66 bp, following Halligan & Keightley (2006) and Parsch et al. (2010)) from the same data set. A more detailed description of short introns can be seen in the study by Clemente & Vogl (2012).
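As a hedged illustration of the short-intron definition, the following Python sketch extracts bases 8 to 30 from introns shorter than 66 bp. It assumes that positions are 1-based, inclusive and counted from the 5' end of the intron; the function name and the toy sequence are hypothetical.

```python
def short_intron_sites(intron, max_len=66, first=8, last=30):
    """Return bases first..last (1-based, inclusive) of an intron shorter than max_len, else ''."""
    if len(intron) >= max_len:
        return ""
    return intron[first - 1:last]

# Hypothetical 52-bp intron with canonical GT...AG ends
toy_intron = "GTAAGT" + "A" * 40 + "TTTCAG"
sites = short_intron_sites(toy_intron)
```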

The genomes of the Malawi flies were sequenced using Solexa/Illumina technology (Bentley et al., 2008). Using singly sequenced inbred lines, as in the 50 genomes data, allows for relatively clean detection of sequencing errors. The quality of these data, and especially of the alignment, was thoroughly evaluated, resulting in empirical quality scores, i.e. ‘Phred-scores’ (Langley et al., 2012). As in the study by Clemente & Vogl (2012), we tested the influence of sequencing errors by comparing the results for high-quality sites only (average Phred ≥ 40) with those for all sites. We found no qualitative differences between these analyses (see Data S1) and thus used all sites irrespective of their quality score for this study.

Polarization and the low mutation rate assumption

We polarized the data with conservative criteria, i.e. by requiring that all four outgroup sequences have the same state. This provides relatively high confidence in the identification of the ancestral state, but eliminates rapidly evolving sites. It also removes mutations shared between D. melanogaster and D. simulans/D. sechellia, which may bias the data towards young polymorphisms. Recurrent mutation, however, may still confuse the inference of ancestral vs. derived states. Because most derived mutations are rare, misinferred ancestral states will mainly cause low-frequency variants to appear as high-frequency-derived alleles (Watterson, 1975; Fu, 1995). We tested our data for misinference of the ancestral state by comparing the pattern of polarized allele frequencies to the pattern of folded unpolarized allele frequencies. Moreover, we compared the allele frequency spectra of low and high mutating sites. For the latter, recurrent mutations are more likely, and thus, we expect a higher chance of polarization errors.

We also tested the mutational pattern between all four nucleotides to see whether mutations of each of the individual nucleotides satisfy the low mutation rate assumption, i.e. θ = 4Nμ < 0.05, where N is the effective population size and μ the mutation rate per site per generation.

Simplified representation of nucleotides in a binary system

We also simplified the data set by lumping A with T nucleotides into the class ‘AT’ and G with C nucleotides into the class ‘GC’. Generally, in Drosophila, the state AT is thought to correspond to the unpreferred state, and GC to the preferred. Due to this simplification, a site is only found to be polymorphic when there has been an AT⇆GC mutation. This simplification avoids overparameterization in statistical inference and allows for comparison to previous analyses. Nevertheless, information is lost and it might introduce a bias.

Polarized, folded and GC-site allele frequency spectra

We distinguished three different types of allele frequency spectra in D. melanogaster. The polarized spectrum uses outgroup information to determine and count the derived state. Summing the corresponding low- and high-frequency classes of the polarized spectrum, such that the information about ancestral and derived states is lost, creates the folded spectrum. For the GC-site spectrum, the site frequencies of GC are chosen to be counted, irrespective of being ancestral or derived.
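The relation between the three spectra can be sketched in a few lines of Python. The counts are the derived-allele classes 1 to 5 for four-fold degenerate sites on the 2nd chromosome as given in Table 4; the function names are hypothetical and the code is an illustration rather than the authors' implementation.

```python
def fold_spectrum(polarized):
    """Fold derived-allele counts for classes 1..n-1 into minor-allele classes 1..n//2."""
    n = len(polarized) + 1
    folded = []
    for k in range(1, n // 2 + 1):
        if k == n - k:
            folded.append(polarized[k - 1])          # middle class has no partner
        else:
            folded.append(polarized[k - 1] + polarized[n - k - 1])
    return folded

def gc_site_spectrum(at_to_gc, gc_to_at):
    """Count sites by number of GC alleles (1..n-1), irrespective of the ancestral state."""
    n = len(at_to_gc) + 1
    # A derived GC allele at frequency k gives k GC alleles;
    # a derived AT allele at frequency n-k also leaves k GC alleles.
    return [at_to_gc[k - 1] + gc_to_at[n - k - 1] for k in range(1, n)]

at_to_gc = [2281, 764, 554, 410, 420]
gc_to_at = [10681, 4485, 3079, 2318, 2034]

polarized_all = [a + b for a, b in zip(at_to_gc, gc_to_at)]
folded = fold_spectrum(polarized_all)
gc_sites = gc_site_spectrum(at_to_gc, gc_to_at)
```

With these counts, the GC-site spectrum is weighted towards high GC frequencies, i.e. towards low-frequency AT variants.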

Distribution of alleles in a sample from a population

To test for deviations from the expectations in a mutation–drift equilibrium, we performed chi-square tests on the polarized and unpolarized site-frequency spectra. A detailed description of the expectations in mutation–drift equilibrium with biased mutation rates is given in the study by Roychoudhury & Wakeley (2010); Vogl & Clemente (2012) extended this model to mutation, selection and drift and also accounted for polarization. We performed these tests mainly to detect where the site-frequency spectra deviate from neutrality.
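A minimal sketch of such a goodness-of-fit comparison is given below. It assumes the simplest neutral expectation for a polarized spectrum under small scaled mutation rates, with derived-allele class i expected in proportion to 1/i; this is a simplification of the biased-mutation expectations of Roychoudhury & Wakeley (2010) and Vogl & Clemente (2012). The observed counts are the GC to AT classes of four-fold degenerate sites in Table 4; the function names are hypothetical.

```python
def neutral_expectation(total, n):
    """Expected counts in derived-allele classes 1..n-1, proportional to 1/i (assumed model)."""
    weights = [1.0 / i for i in range(1, n)]
    norm = sum(weights)
    return [total * w / norm for w in weights]

def chi_square_statistic(observed, expected):
    """Pearson chi-square statistic summed over frequency classes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [10681, 4485, 3079, 2318, 2034]      # GC -> AT, four-fold sites, 2nd chromosome
expected = neutral_expectation(sum(observed), n=6)
statistic = chi_square_statistic(observed, expected)
residuals = [o - e for o, e in zip(observed, expected)]
```

The signs of the residuals show where the spectrum departs from the assumed expectation: a positive first entry corresponds to an excess of singletons, negative middle entries to a deficit of intermediate-frequency variants.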

Asymmetry of the GC-site spectrum is a consequence of a directional force (e.g. directional selection or biased gene conversion or a change in the mutation bias). We used a simple biallelic model of directional selection to estimate the strength of selection from polarized and unpolarized polymorphisms under equilibrium assumptions.

On the population level, it has been shown that the relative frequency of the preferred allele p in mutation–selection–drift equilibrium is distributed as follows (Wright, 1931; McVean & Charlesworth, 1999):

display math(1)

For small-scaled mutation rates, the constant of proportionality is approximately math formula, where math formula and math formula are the scaled mutation rates towards the unpreferred (AT) and preferred (GC) alleles, respectively, and γ is the scaled selection coefficient (Vogl & Clemente, 2012).

For a sample of size n, the number of preferred alleles y is binomially distributed conditional on p, and thus, the joint distribution of p and y is given by

display math(2)

By integrating over all values of p, we obtain the distribution of preferred polymorphic alleles y in the sample, where y=(1,…,n−1), which can be used for inferring γ. For small mutation rates and γ=0, the shape of the allele frequency spectrum is symmetrical. Thus, γ can be seen as a measure of the asymmetry of the spectrum.
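The sampling step can be sketched numerically. The sketch below assumes the standard Wright form for the population density, proportional to p^(θ1−1) (1−p)^(θ0−1) e^(γp), and takes the limit of small scaled mutation rates, so that the polymorphic sample classes are proportional to C(n, y) ∫ p^(y−1) (1−p)^(n−y−1) e^(γp) dp. The integral is evaluated with a midpoint rule. The symbols θ0 and θ1, the function name and the scaling of γ are assumptions made for illustration and may differ from the parameterization of Vogl & Clemente (2012).

```python
from math import comb, exp

def expected_gc_spectrum(n, gamma, steps=20000):
    """Relative probabilities of y = 1..n-1 preferred alleles in a sample of size n."""
    h = 1.0 / steps
    probs = []
    for y in range(1, n):
        integral = 0.0
        for j in range(steps):
            p = (j + 0.5) * h                     # midpoint of the j-th subinterval of (0, 1)
            integral += p ** (y - 1) * (1 - p) ** (n - y - 1) * exp(gamma * p)
        probs.append(comb(n, y) * integral * h)
    total = sum(probs)
    return [x / total for x in probs]             # normalized over polymorphic classes only

neutral = expected_gc_spectrum(6, 0.0)
selected = expected_gc_spectrum(6, 1.44)          # 1.44: unpolarized estimate reported in Table 3
```

For γ = 0 the classes are proportional to n/(y(n−y)) and therefore symmetric; a positive γ shifts weight towards high frequencies of the preferred allele, which is the asymmetry used for inference.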

Similarly, we can estimate the scaled selection coefficient γ from the polarized spectrum. On the population level, the frequency distribution of the preferred allele for mutations towards the preferred allele is given by (Vogl & Clemente, 2012)

display math(3)

For mutations towards the unpreferred allele, the frequency distribution of the preferred allele is

display math(4)

Note that the equilibrium assumption might not hold in D. melanogaster, and thus, inference of γ that relies on it might be inaccurate. By splitting the equation of the unpolarized allele frequencies into the two equations of the polarized allele frequencies, the assumption of equilibrium is relaxed. Whereas quasi-equilibrium can still be assumed for each polarization direction independently, the two directions together may not be at equilibrium. Please note that here we focus on the comparison between short introns and four-fold degenerate sites and the qualitative interpretation of the analyses, rather than on parameter estimation per se.

Direct comparison of four-fold degenerate sites to short introns

In order to disentangle genetic forces acting on four-fold degenerate sites, we compared them to short introns (see Clemente & Vogl, 2012) in several ways.

First, using chi-square tests, we compared the polarized allele frequency spectra of the two site classes, i.e. the relative frequencies of the corresponding spectra. In the case of directional selection due to CUB at four-fold degenerate sites, the allele frequency spectrum should be distinguishable from that of short introns. In particular, if GC is preferred over AT, we expect a lack of GC low-frequency and excess of high-frequency variants at four-fold degenerate sites. Furthermore, we checked for an over-representation of singletons, where we opposed singletons in both polarization directions to the rest of the polymorphism between both site classes. We expect an excess of AT singletons at four-fold degenerate sites if we assume directional selection in favour of GC.

Second, we used the data from the outgroup species to classify polymorphism in D. melanogaster into ‘melanogaster-specific’ mutations, i.e. identical allelic states in all outgroups, and ‘shared’ mutations, when allelic states differed between D. simulans and D. sechellia, but were identical between D. erecta and D. yakuba. Polarization was inaccurate when both D. simulans and D. sechellia had a different state from D. erecta and D. yakuba or when D. erecta and D. yakuba differed from each other. We thus did not use these cases in our analysis. This classification allowed us to test for a difference in the amount of melanogaster-specific and shared mutations between short introns and four-fold degenerate sites. According to our simulations (see Data S1), we expect coalescence times on average to be shorter with purifying selection than under neutrality, and thus, we should observe less shared polymorphism within four-fold degenerate sites for mutations away from preferred sites.
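The classification rules above can be written as a small decision function. This Python sketch is an illustration under stated assumptions: the function name is hypothetical, and for ‘shared’ sites the ancestral state is assumed to be the state common to D. erecta and D. yakuba, which the text does not state explicitly.

```python
def classify_site(sim, sec, ere, yak):
    """Return (category, ancestral_state) from the four outgroup bases.

    A category of None means the site is excluded because polarization is unreliable.
    """
    if ere != yak:
        return (None, None)                      # distant outgroups disagree
    if sim == sec == ere:
        return ("melanogaster-specific", ere)    # all four outgroups identical
    if sim != sec:
        return ("shared", ere)                   # assumption: ancestral state from D. erecta/D. yakuba
    return (None, None)                          # sim == sec, but both differ from ere/yak

examples = [classify_site(*bases) for bases in ("AAAA", "AGAA", "GGAA", "AAAC")]
```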

Finally, we compared the relative frequencies of substitutions between short introns and four-fold degenerate sites. For this, we compared the number of substitutions for both polarization directions (AT to GC and vice versa) relative to the corresponding number of polymorphisms to correct for the different base composition and mutation rate at four-fold degenerate sites and short introns. A significant difference of the ratio between both site classes indicates selective forces.

With short introns, we found no differences in frequencies between complementary bases (Clemente & Vogl, 2012). With four-fold degenerate sites, we checked for differences in the 4 × 4 mutation frequency matrix among the four bases with respect to both polymorphisms and substitutions using the polarized data and chi-square tests.

Influence of recombination rate and expression level on polymorphism and divergence

Recombination rates vary across the genome. Due to hitchhiking (Smith & Haigh, 1974) or background selection (Charlesworth et al., 1993, 1995), the local effective population size may be larger in regions with high compared to low recombination rates. The level of polymorphism should therefore be correlated with recombination rate, whereas the rate of divergence is affected mainly by mutation rate and only transiently by the effective population size. Moreover, the efficiency of selection is expected to be decreased in regions of low recombination (Hill & Robertson, 1966). We used the recombination rate calculator (Singh et al., 2005b; Fiston-Lavier et al., 2010) to determine the recombination rate of each gene in the data set.

The expression level among genes is different, which may affect mutation bias or the strength of selective constraints (Comeron, 2004). To check this, we associated the average and maximum expression level of genes from different male and female developmental stages (Graveley et al., 2011) to our data.

We then used linear models to determine the influence of the recombination rate and expression level on the amount of polymorphism and divergence at four-fold degenerate sites. Moreover, we split the data set into genes with zero recombination rate and all others and compared their allele frequency spectra. We found no significant differences between the data sets (see Data S1) and thus used all data for the analyses.

In general, the influence of recombination rate and expression level on polymorphism and divergence was small. The results can be seen in Data S1.

Results

We considered whole-genome polymorphism data of an African D. melanogaster population and included D. simulans, D. sechellia, D. erecta and D. yakuba as outgroups to polarize and classify the ages of mutations. We extracted four-fold degenerate sites and compared them to our results from short introns (Clemente & Vogl, 2012) throughout the following section.

Tests involving polarization

Polarization of sequence data can lead to the misidentification of ancestral and derived alleles. This generally manifests itself in a bias towards high-frequency-derived alleles. With the data reduced to a biallelic system, allele frequency spectra for classes of mutations from AT to GC and GC to AT show an excess of low-frequency variants and a lack of intermediate-frequency variants compared to neutral equilibrium expectations (math formula for both, Fig. 1a,b). A paucity of intermediate-frequency variants is also evident in the folded allele frequency spectrum (Fig. 2). Hence, the deviation from neutral equilibrium cannot be due solely to the misidentification of the ancestral state.

Figure 1.

Site-frequency spectrum of melanogaster-specific mutations (2nd chromosome). The lines represent the expected derived site frequencies under neutral equilibrium; the bars represent observed site frequencies.

Figure 2.

Folded site-frequency spectrum of melanogaster-specific mutations (2nd chromosome). The lines represent expected site frequencies under neutral equilibrium; the bars represent observed site frequencies.

We also compared the polarized allele frequency spectrum from sites with relatively low mutation rates (A to C and T to G) to the spectrum from sites with relatively high mutation rates (C to T and G to A), according to Table 1. Both spectra again show an excess of low-frequency variants and a lack of intermediate-frequency variants compared to neutral equilibrium. Moreover, we found them to be significantly different from one another (math formula). The spectrum from sites with low mutation rates, i.e. mutations with ancestral A or T sites, shows a greater excess of singletons than the spectrum from sites with high mutation rates, i.e. with ancestral C or G nucleotides. Polarization errors cannot explain this result, as they would lead to an excess of high-frequency-derived variants in the spectrum. Comparison of these spectra to those of short introns reveals a tendency towards low frequencies for the low mutating sites, whereas the high mutating sites tend to be shifted towards high frequencies. Thus, weak directional selection may be acting on high mutating sites, explaining the differences from the spectra of low mutating sites.

Table 1. Mutation frequency matrix for the 2nd chromosome. The rows represent the ancestral state when all outgroup species Drosophila simulans, Drosophila sechellia, Drosophila erecta and Drosophila yakuba are monomorphic. The columns represent the derived polymorphic (polymorphisms) or fixed (substitutions) states in Drosophila melanogaster
            No. of polymorphisms                          No. of substitutions
Ancestral   A        C       G       T        Sum        A           C           G           T           Sum      Ancestral
Four-fold degenerate sites
A           -        850     1491    1237     3578       (128 000)   806         1591        1427        3824     135 402
C           4020     -       2304    8268     14 592     3810        (289 624)   2366        6651        12 827   317 043
G           7249     1755    -       3060     12 064     6031        1812        (245 889)   2839        10 682   268 635
T           1062     1360    728     -        3150       1464        1688        721         (114 377)   3873     121 400
Sum         12 331   3965    4523    12 565   33 384     11 305      4306        4678        10 917      31 206   842 480
Short introns
A           -        196     337     415      948        (28 437)    230         490         473         1193     30 578
C           234      -       95      467      796        261         (12 346)    144         509         914      14 056
G           419      118     -       221      758        547         143         (12 565)    268         958      14 281
T           385      364     207     -        956        478         440         213         (28 073)    1131     30 160
Sum         1038     678     639     1103     3458       1286        813         847         1250        4196     89 075

We performed the same tests on the short intron sequences and found a similar deviation from neutral equilibrium, no evidence for the misidentification of ancestral states and no difference in spectra between sites with low and high mutation rates.

Mutational patterns and the low mutation rate assumption

In order to detect possible signs of selection and to test the low mutation rate assumption, we considered the polymorphism and substitution patterns among the four nucleotides at four-fold degenerate sites (Table 1). Genome-wide, the ancestral base composition contains about 70% C and G nucleotides, in contrast to the base composition in short introns (68% in favour of A and T). Assuming that patterns in short introns are not constrained by selection, this result likely reflects a mutation bias towards AT that must have been counteracted by selection towards GC at four-fold degenerate sites.

On the coding strand, the fraction of A nucleotides in the ancestral base composition exceeds the fraction of the complementary T nucleotides at four-fold degenerate sites (math formula) and, similarly, the fraction of C nucleotides exceeds the fraction of G nucleotides (math formula). This indicates strand specificity and provides evidence for selective forces on four-fold degenerate sites. Strand specificity is also reflected in the mutational pattern of polymorphisms and substitutions, where mutations between complementary base pairs are not equal (e.g. between C to A and G to T, Table 1). In contrast, we found no evidence for strand specificity in short introns.

The pattern of unpolarized AT vs. GC allele frequencies (GC-site spectrum) is very asymmetric, showing an excess of low-frequency AT and a lack of high-frequency AT variants (Fig. 3). Such an asymmetry is expected even in the absence of selection, due to the relatively high ancestral GC content at four-fold degenerate sites and the mutation bias from GC to AT. In short introns, we observed a much weaker but similar asymmetry, which might be due to a recent shift in mutation bias towards increased GC to AT mutations.

Figure 3.

Unpolarized (GC-site)-frequency spectrum of melanogaster-specific mutations (2nd chromosome). The lines represent expected site frequencies under neutral equilibrium; the bars represent observed site frequencies of C or G nucleotides, irrespective of the ancestral state.

The proportions of the matrix of polymorphic mutations among the four nucleotides differ from those in substitutions (math formula). For example, at polymorphic sites, the ratio of mutations (from GC to AT)/(from AT to GC) is about 5.1/1, whereas the ratio of the same nucleotides among substitutions is significantly reduced to about 4.0/1 (math formula, Table 2). The difference between polymorphism and substitution patterns indicates selective forces. Moreover, the unequal substitution rate between both directions shows that AT becomes fixed about four times as often as GC, increasing the AT content of the ancestral base composition. Thus, the base composition of four-fold degenerate sites is not at equilibrium. In short introns, on the other hand, the proportions of polymorphisms to substitutions are balanced among the four nucleotides, indicating the absence of selective forces. But, similar to four-fold degenerate sites, in short introns, we observed a relative excess of GC to AT substitutions, i.e. a general disequilibrium in the base composition, suggesting a change in mutation bias. The imbalance between AT and GC in polymorphisms and substitutions is much less pronounced in introns than with four-fold degenerate sites (see Table 2).

Table 2. AT vs. GC mutations (2nd chromosome). The ancestral base composition in short introns is AT-rich, opposite to the ancestral base composition of four-fold degenerate sites. Assuming the absence of selection in short introns, their base composition reflects the strength of mutation bias. Our short intron data suggest that the strength of mutation bias towards AT is currently increasing compared to that causing the ancestral base composition. We used the estimated strength of mutation bias from the polymorphism and substitution data of introns to calculate the expected ratio of substitutions at four-fold degenerate sites in the absence of selection
                                         AT/GC                   GC→AT/AT→GC
                                         Anc. base composition   Substitutions   Polymorphisms
Short introns (observed)                 2.14/1                  1.15/1          1.21/1
Four-fold deg. sites (observed)          1/2.28                  4.02/1          5.10/1
Four-fold deg. sites (expected, γ = 0)   -                       5.64/1          5.94/1
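The observed ratios in Table 2 follow directly from the counts in Table 1, as the Python sketch below illustrates. For the expected ratios at γ = 0, the sketch assumes that the per-site mutation bias is the intron GC→AT/AT→GC ratio multiplied by the intron AT/GC composition, and that this bias is then rescaled by the GC/AT composition of four-fold degenerate sites. This calculation is an assumption that happens to reproduce the tabulated values; it is not documented here as the authors' procedure. All names are hypothetical.

```python
def gc_at_ratio(matrix):
    """(GC->AT)/(AT->GC) from a 4x4 ancestral-by-derived count matrix in A, C, G, T order."""
    gc_to_at = sum(matrix[i][j] for i in (1, 2) for j in (0, 3))
    at_to_gc = sum(matrix[i][j] for i in (0, 3) for j in (1, 2))
    return gc_to_at / at_to_gc

# Table 1, 2nd chromosome; rows = ancestral, columns = derived, diagonal set to 0
ff_poly = [[0, 850, 1491, 1237], [4020, 0, 2304, 8268], [7249, 1755, 0, 3060], [1062, 1360, 728, 0]]
ff_subs = [[0, 806, 1591, 1427], [3810, 0, 2366, 6651], [6031, 1812, 0, 2839], [1464, 1688, 721, 0]]
si_poly = [[0, 196, 337, 415], [234, 0, 95, 467], [419, 118, 0, 221], [385, 364, 207, 0]]
si_subs = [[0, 230, 490, 473], [261, 0, 144, 509], [547, 143, 0, 268], [478, 440, 213, 0]]
ff_anc = {"A": 135402, "C": 317043, "G": 268635, "T": 121400}
si_anc = {"A": 30578, "C": 14056, "G": 14281, "T": 30160}

ff_gc_over_at = (ff_anc["C"] + ff_anc["G"]) / (ff_anc["A"] + ff_anc["T"])
si_at_over_gc = (si_anc["A"] + si_anc["T"]) / (si_anc["C"] + si_anc["G"])

# Assumed neutral expectation: intron-derived mutation bias rescaled by four-fold base composition
expected_subs = gc_at_ratio(si_subs) * si_at_over_gc * ff_gc_over_at
expected_poly = gc_at_ratio(si_poly) * si_at_over_gc * ff_gc_over_at
```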

Because four-fold degenerate sites are not at equilibrium, the low mutation rate assumption is essential to perform powerful tests for deviations from neutrality (see Introduction). We checked this by estimating scaled mutation rates (math formula, Watterson (1975)) from any ancestral base towards one of the three other bases. We found math formula to be less than 0.05, sufficiently low to justify the infinite sites assumption (Desai & Plotkin, 2008; Vogl & Clemente, 2012), even though the estimation of math formula assumes equilibrium. As with four-fold degenerate sites, the low mutation rate assumption holds for short introns.
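A rough check of the low mutation rate assumption can be sketched with Watterson's estimator, θ = S / (a_n L) with a_n = Σ 1/i for i = 1..n−1. The sketch assumes that S is the number of polymorphisms from a given ancestral base to a given derived base (Table 1, four-fold degenerate sites) and that L is the number of sites with that ancestral base; this choice of denominator and the function name are assumptions, not details taken from the text.

```python
def watterson_theta_per_site(segregating, sites, n):
    """Watterson's estimator per site: S / (a_n * L), with a_n = sum of 1/i for i = 1..n-1."""
    a_n = sum(1.0 / i for i in range(1, n))
    return segregating / (a_n * sites)

ancestral_sites = {"A": 135402, "C": 317043, "G": 268635, "T": 121400}
polymorphisms = {("A", "C"): 850, ("A", "G"): 1491, ("A", "T"): 1237,
                 ("C", "A"): 4020, ("C", "G"): 2304, ("C", "T"): 8268,
                 ("G", "A"): 7249, ("G", "C"): 1755, ("G", "T"): 3060,
                 ("T", "A"): 1062, ("T", "C"): 1360, ("T", "G"): 728}

theta = {pair: watterson_theta_per_site(count, ancestral_sites[pair[0]], n=6)
         for pair, count in polymorphisms.items()}
largest_pair = max(theta, key=theta.get)
largest = theta[largest_pair]
```

Under these assumptions, every direction-specific estimate stays below the 0.05 threshold.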

Estimation of selection coefficients

To infer directional selection on four-fold degenerate sites, we initially used a biallelic mutation, selection and drift equilibrium model. Because the equilibrium assumption does not hold for our data, we also use a quasi-equilibrium model and compare to spectra of short introns. We note that selection coefficients inferred from the polarized data may differ from selection coefficients inferred from unpolarized data. The analysis with the GC-site spectrum assumes equilibrium, whereas the analysis with the polarized spectra assumes only quasi-equilibrium. Thus, the former analysis is affected by the GC-rich ancestral base composition, whereas the latter is not.

The estimates from the GC-site spectrum suggest directional selection towards GC both in short introns and at four-fold degenerate sites (see Table 3). But the estimates from the polarized data suggest purifying selection in both mutation directions, i.e. mutations from GC to AT appear deleterious, as do mutations from AT to GC. This can also be due to the bias caused by only taking sites where all outgroups are identical, as this biases the data towards young mutations, or by population demography. In fact, with polarized data, selection coefficients inferred from four-fold degenerate sites and short introns are not significantly different (or rather, the inferred posterior distributions overlap broadly). No other evidence for constraint has been found for short intron sequences. Consequently, we think that the evidence from the allele frequency spectra for purifying selection in both site classes rather derives from a deviation from demographic equilibrium or a bias due to polarization.

Table 3. Estimates of the scaled selection coefficient. We estimated γ for short introns and four-fold degenerate sites from unpolarized and polarized site frequencies (chromosome 2). A positive γ indicates directional selection favouring GC, a negative γ selection favouring AT; P-values were determined with a likelihood ratio test.
                           Short introns             4-fold d. sites
                           γ        P-value          γ        P-value
Unpolarized ‘mel-spec.’    0.22     0.005            1.44     0
Unpolarized ‘all’          0.38     math formula     1.23     0
AT to GC                   −0.72    0.018            −0.87    math formula
GC to AT                   0.81     0.003            0.42     math formula

Direct comparison of four-fold degenerate sites to short introns

In the previous subsection, we compared the inferred selection coefficients between four-fold degenerate sites and short introns and found no difference for polarized data. In fact, allele frequencies for melanogaster-specific mutations from AT to GC and from GC to AT were in similar proportions (math formula and math formula, respectively) between four-fold degenerate sites and short introns (Table 4); when both mutation directions were summed, the chi-square test was also nonsignificant (math formula). This indicates the absence of directional selection at four-fold degenerate sites.

Table 4. Site frequencies of melanogaster-specific mutations in a sample of n=6 (2nd chromosome). The columns 1–5 are the site frequencies (absolute numbers) for polymorphic sites in Drosophila melanogaster. The columns 0 and 6 are the numbers of ancestral sites and substitutions, respectively. The rows refer to the direction of mutation, i.e. the polarization.
                      Derived allele frequencies
                      0          1         2       3       4       5       6
Four-fold d. sites
  AT → GC             256 802    2281      764     554     410     420     4806
  GC → AT             585 678    10 681    4485    3079    2318    2034    19 331
Short introns
  AT → GC             60 738     540       223     133     107     101     1373
  GC → AT             28 337     671       255     160     137     118     1585
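The comparison of proportions between site classes can be sketched as a chi-square test of homogeneity on the polymorphic classes 1–5 of Table 4. This is a minimal sketch, assuming a Pearson chi-square test without continuity correction; the function names are hypothetical, and the closed-form tail probability used here holds only for an even number of degrees of freedom (here 4).

```python
import math

def chi2_homogeneity(row_a, row_b):
    """Pearson chi-square statistic and degrees of freedom for a 2 x k table."""
    total_a, total_b = sum(row_a), sum(row_b)
    grand = total_a + total_b
    stat = 0.0
    for a, b in zip(row_a, row_b):
        column = a + b
        for observed, row_total in ((a, total_a), (b, total_b)):
            expected = row_total * column / grand
            stat += (observed - expected) ** 2 / expected
    return stat, len(row_a) - 1

def chi2_sf_even_df(stat, df):
    """Upper-tail probability of a chi-square variable; valid for even df only."""
    if df % 2 != 0:
        raise ValueError("closed form used here requires even df")
    half = stat / 2.0
    return math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(df // 2))

# Polymorphic classes 1-5 from Table 4 (2nd chromosome, n = 6).
four_fold = {"AT->GC": [2281, 764, 554, 410, 420],
             "GC->AT": [10681, 4485, 3079, 2318, 2034]}
introns = {"AT->GC": [540, 223, 133, 107, 101],
           "GC->AT": [671, 255, 160, 137, 118]}

results = {}
for direction in four_fold:
    stat, df = chi2_homogeneity(four_fold[direction], introns[direction])
    results[direction] = (stat, df, chi2_sf_even_df(stat, df))

# Both mutation directions summed within each site class.
summed_ff = [x + y for x, y in zip(*four_fold.values())]
summed_in = [x + y for x, y in zip(*introns.values())]
stat, df = chi2_homogeneity(summed_ff, summed_in)
results["summed"] = (stat, df, chi2_sf_even_df(stat, df))

for name, (stat, df, p) in results.items():
    print(f"{name}: chi2 = {stat:.2f}, df = {df}, P = {p:.3f}")
```

Only the proportions within each row enter the test, so differences in the absolute number of sites between the two site classes do not affect the comparison.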

We further tested the proportion of singletons among all polymorphic sites for both polarization directions separately. For melanogaster-specific mutations from AT to GC, we found no significant difference in GC singletons between four-fold degenerate sites and short introns (math formula), although relatively few GC singletons would have been expected if directional selection favoured GC. For GC to AT mutations, we found a marginally significant excess of AT singletons in short introns relative to four-fold degenerate sites (math formula), opposite to what would have been expected for directional selection favouring GC at four-fold degenerate sites. However, on the third chromosome, we found the opposite pattern: for mutations from AT to GC, we found a significant excess of GC singletons at four-fold degenerate sites (math formula), whereas for mutations from GC to AT we found no significant difference between short introns and four-fold degenerate sites (math formula). Thus, these results are inconclusive and may reflect statistical fluctuation. Alternatively, different local selection pressures at four-fold degenerate sites may explain these inconsistencies.
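The singleton comparison for the 2nd chromosome can be sketched with the counts of Table 4. This is a minimal sketch, assuming a 2 × 2 Pearson chi-square test (one degree of freedom, no continuity correction); the function name is hypothetical and the test actually used may differ in detail.

```python
import math

def singleton_test(spectrum_a, spectrum_b):
    """Compare singleton proportions among polymorphic sites of two spectra.

    Each spectrum lists the counts of derived allele frequency classes 1-5.
    Returns both singleton proportions, the chi-square statistic and its P-value.
    """
    a, b = spectrum_a[0], sum(spectrum_a[1:])   # singletons, other polymorphisms
    c, d = spectrum_b[0], sum(spectrum_b[1:])
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(stat / 2.0))  # chi-square tail for df = 1
    return a / (a + b), c / (c + d), stat, p_value

# Polymorphic classes 1-5 from Table 4 (2nd chromosome, n = 6).
four_fold = {"AT->GC": [2281, 764, 554, 410, 420],
             "GC->AT": [10681, 4485, 3079, 2318, 2034]}
introns = {"AT->GC": [540, 223, 133, 107, 101],
           "GC->AT": [671, 255, 160, 137, 118]}

tests = {d: singleton_test(four_fold[d], introns[d]) for d in four_fold}
for direction, (prop_ff, prop_in, stat, p) in tests.items():
    print(f"{direction}: four-fold {prop_ff:.3f}, introns {prop_in:.3f}, "
          f"chi2 = {stat:.2f}, P = {p:.3f}")
```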

We also compared the number of melanogaster-specific and shared mutations at four-fold degenerate sites and short introns (Table 5). We found a lack of shared polymorphisms at four-fold degenerate sites compared to introns (math formula). With purifying selection, less old polymorphism is expected than under neutrality (see Data S1). This result may thus be due to purifying selection on four-fold degenerate sites. Note that the mutation direction is not distinguished in Table 5, because the number of shared mutations in short introns is too low to provide meaningful results in this case.

Table 5. Derived allele frequencies in Drosophila melanogaster (2nd chromosome) depending on the state of Drosophila simulans and Drosophila sechellia. In the column D. simulans/D. sechellia, a/a indicates the ancestral state (as in Drosophila erecta and Drosophila yakuba), a/d,d/a indicates exactly one ancestral and one derived state in either D. simulans or D. sechellia, and d/d indicates that both states are derived in D. simulans and D. sechellia.
D. simulans/D. sechellia    Derived allele frequencies
                            0          1         2       3       4       5       6
Four-fold d. sites
  a/a                       777 890    15 950    6429    4500    3418    3087    31 206
  a/d,d/a                   40 656     455       232     196     171     183     3604
  d/d                       23 955     604       476     613     736     1603    69 460
Short introns
  a/a                       81 421     1731      668     412     322     325     4196
  a/d,d/a                   4708       62        23      25      17      54      661
  d/d                       3961       106       104     81      142     298     12 056
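The deficit of shared polymorphism can be illustrated directly from the counts of Table 5. The sketch below assumes that "melanogaster-specific" corresponds to polymorphic sites (classes 1–5) in row a/a and "shared" to polymorphic sites in rows a/d,d/a and d/d; this classification is an interpretation of the table made here and may not match the exact test that was performed.

```python
# Polymorphic classes 1-5 from Table 5 (2nd chromosome, n = 6).
table5 = {
    "four-fold": {"a/a": [15950, 6429, 4500, 3418, 3087],
                  "a/d,d/a": [455, 232, 196, 171, 183],
                  "d/d": [604, 476, 613, 736, 1603]},
    "introns": {"a/a": [1731, 668, 412, 322, 325],
                "a/d,d/a": [62, 23, 25, 17, 54],
                "d/d": [106, 104, 81, 142, 298]},
}

def specific_and_shared(rows):
    """Counts of melanogaster-specific and shared polymorphisms (assumed split)."""
    specific = sum(rows["a/a"])
    shared = sum(rows["a/d,d/a"]) + sum(rows["d/d"])
    return specific, shared

shared_fraction = {}
for site_class, rows in table5.items():
    specific, shared = specific_and_shared(rows)
    shared_fraction[site_class] = shared / (specific + shared)
    print(f"{site_class}: specific = {specific}, shared = {shared}, "
          f"shared fraction = {shared_fraction[site_class]:.3f}")

# Odds of a polymorphism being shared, four-fold sites relative to introns.
ff_spec, ff_shared = specific_and_shared(table5["four-fold"])
in_spec, in_shared = specific_and_shared(table5["introns"])
odds_ratio = (ff_shared / ff_spec) / (in_shared / in_spec)
print(f"odds ratio (four-fold vs introns): {odds_ratio:.2f}")
```

An odds ratio below one under this classification corresponds to the lack of shared polymorphism at four-fold degenerate sites described above.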

We compared substitutions at four-fold degenerate sites and short introns (Table 4). Because the base composition and mutation rate differ between the two site classes, we set the number of AT to GC substitutions in relation to AT to GC polymorphisms (and vice versa for the other direction) in each site class. For mutations in both directions (from AT to GC and from GC to AT), we observed a significant excess of substitutions in short introns (math formula and math formula, respectively). Whereas mutations from AT to GC become fixed about 1.15 times as often in short introns as at four-fold degenerate sites, the same ratio in the reverse direction is 1.38. We used a generalized linear model to show that the difference in substitution rates between short introns and four-fold degenerate sites differs significantly between the two mutation directions (I = [0.068,0.31], see eqn (1) in Clemente & Vogl (2012)). This significant difference indicates a slight directional force towards GC at four-fold degenerate sites. However, regardless of the direction, substitutions are reduced at four-fold degenerate sites compared to short introns, especially for GC to AT mutations, indicating purifying selection at four-fold degenerate sites. We note that the observed pattern is more complicated than expected under a simple directional selection, mutation and drift model. A possible explanation may be that different selective forces operate on different four-fold degenerate sites.
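The two ratios quoted above follow from the counts in Table 4, as the following sketch shows. The last step, a log ratio of ratios with a Woolf-type normal approximation for its interval, is an assumption added here as a simple stand-in for the interaction between site class and mutation direction; it is not the generalized linear model of Clemente & Vogl (2012), and its interval need not coincide with the reported one.

```python
import math

# Table 4 counts: polymorphic classes 1-5 and substitutions (class 6).
counts = {
    ("four-fold", "AT->GC"): {"poly": [2281, 764, 554, 410, 420], "subst": 4806},
    ("four-fold", "GC->AT"): {"poly": [10681, 4485, 3079, 2318, 2034], "subst": 19331},
    ("introns", "AT->GC"): {"poly": [540, 223, 133, 107, 101], "subst": 1373},
    ("introns", "GC->AT"): {"poly": [671, 255, 160, 137, 118], "subst": 1585},
}

def subst_per_poly(site_class, direction):
    """Substitutions relative to polymorphisms for one site class and direction."""
    entry = counts[(site_class, direction)]
    return entry["subst"] / sum(entry["poly"])

# How much more often mutations become fixed in introns than at four-fold sites.
ratio = {d: subst_per_poly("introns", d) / subst_per_poly("four-fold", d)
         for d in ("AT->GC", "GC->AT")}
for direction, value in ratio.items():
    print(f"{direction}: introns / four-fold = {value:.2f}")

# Illustrative interaction on the log scale (assumption, see lead-in).
log_interaction = math.log(ratio["GC->AT"] / ratio["AT->GC"])
se = math.sqrt(sum(1.0 / e["subst"] + 1.0 / sum(e["poly"]) for e in counts.values()))
interval = (log_interaction - 1.96 * se, log_interaction + 1.96 * se)
print(f"log ratio of ratios: {log_interaction:.3f}, "
      f"approx. 95% interval: [{interval[0]:.3f}, {interval[1]:.3f}]")
```

A positive log ratio of ratios corresponds to a stronger reduction of substitutions at four-fold degenerate sites for GC to AT than for AT to GC mutations.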

Comparison of allele frequency spectra among different mutation directions

The amount of genome-wide data allowed us to look for signs of selection by comparing the allele frequency spectra of four-fold degenerate sites among different mutation directions. First, we separated all different mutation directions and compared the proportions of allele frequencies, i.e. the relative frequencies of the corresponding spectra to those of short introns (because with short introns the spectra of AT to GC and GC to AT are not significantly different, we summed up the two directions there). Significant deviations were found for mutations from T to A (math formula) and from T to G (math formula). The former spectrum shows a shift towards high-frequency variants, whereas the latter spectrum is shifted towards low-frequency variants. Mutations from A to C, A to G, C to T and G to A deviated marginally significantly; A to C mutations showed an excess of singletons, whereas the other spectra were slightly shifted towards high-frequency variants compared to introns (math formula, math formula, math formula and math formula, respectively). All other mutations resulted in allele frequency spectra indistinguishable from those of short introns.

Second, we compared the proportions of allele frequency spectra of four-fold degenerate sites for all twelve possible mutation directions against each other. These 12 × 11/2 = 66 pairwise comparisons can be arranged in a matrix, where each line contains the P-values resulting from the comparisons of one particular mutation direction against all others. For complementary mutations, e.g. from A to C and from T to G (equivalent mutations on the opposite DNA strand), we found no significant differences at four-fold degenerate sites. In fact, the spectra of complementary mutations are extraordinarily similar: in eight of the twelve mutation directions (lines of the matrix), the complementary mutation was the most similar one, a significant enrichment.
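The bookkeeping behind these comparisons can be made explicit with a short sketch that enumerates the twelve mutation directions, the 66 pairwise comparisons, and the complementary and reverse partner of each direction. The function names are hypothetical and serve only this illustration.

```python
from itertools import combinations, permutations

BASES = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

# Ordered (ancestral, derived) pairs: the twelve mutation directions.
directions = list(permutations(BASES, 2))
# Unordered pairs of directions: the 12 x 11 / 2 pairwise comparisons.
pairwise = list(combinations(directions, 2))

def complementary(direction):
    """Equivalent mutation on the opposite DNA strand, e.g. A->C gives T->G."""
    ancestral, derived = direction
    return (COMPLEMENT[ancestral], COMPLEMENT[derived])

def reverse(direction):
    """Mutation in the reverse direction, e.g. A->C gives C->A."""
    ancestral, derived = direction
    return (derived, ancestral)

complement_pairs = {frozenset((d, complementary(d))) for d in directions}
reverse_pairs = {frozenset((d, reverse(d))) for d in directions}

print(len(directions), len(pairwise))
print(len(complement_pairs), len(reverse_pairs))
# Pairs that are both complementary and reverse (A<->T and C<->G).
print(sorted(tuple(sorted(p)) for p in complement_pairs & reverse_pairs))
```

Note that for mutations between A and T and between C and G, the complementary mutation coincides with the reverse mutation, so these two pairs belong to both categories.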

With respect to mutations in reverse directions, e.g. from A to C and from C to A, we would expect directional selection to lead to maximally different spectra, as one allele is selected positively and the other negatively. Thus, allele frequencies for mutations from unpreferred to preferred alleles should be shifted towards high-frequency derived alleles, whereas allele frequencies for mutations from preferred to unpreferred alleles should be shifted towards singletons. We found some significant deviations from equality of proportions between mutations in reverse directions: between A and C, A and G, and G and T (math formula and math formula, respectively). Thus, the frequency spectra of mutations in reverse directions diverge more from each other than those of complementary mutations. Nevertheless, these differences cannot easily be explained by directional selection. In the case where the spectra of both directions showed the strongest deviations (G and T), we found signs of purifying selection for mutations from T to G, but no signs of selection in the reverse direction. It thus seems that an ancestral T is favoured over G, but sites with an ancestral G have no preference if substituted by a T nucleotide. Evidence for directional selection here is therefore equivocal at best.

Discussion

We analysed genome-wide polymorphism (Langley et al., 2012) and substitution (Begun et al., 2007; Clark et al., 2007) patterns at four-fold degenerate sites in an African D. melanogaster population and its close relatives. In an attempt to disentangle the influences of mutation, selection and genetic drift, we compared them to short intron sequences (Clemente & Vogl, 2012), as these may not be constrained by selection (Halligan & Keightley, 2006; Parsch et al., 2010; Clemente & Vogl, 2012).

The biallelic equilibrium model

Using a biallelic mutation–selection–drift model (Bulmer, 1991; Akashi, 1996; McVean & Charlesworth, 1999; Vogl & Clemente, 2012) and unpolarized allele frequency spectra, we infer directional selection favouring GC alleles at four-fold degenerate sites. This result has been obtained numerous times before with smaller data sets (Akashi, 1996; McVean & Charlesworth, 1999; Vicario et al., 2007; Zeng & Charlesworth, 2010). The inferred scaled selection coefficient would be strong enough to maintain the GC-biased base composition at four-fold degenerate sites, when the mutation bias is inferred from the relatively AT-rich base composition of short introns. However, it is obvious from the biased substitution pattern that four-fold degenerate sites are not at equilibrium, but are becoming AT-rich. Thus, a selection coefficient inferred from a model that assumes equilibrium is likely biased. We inferred a weaker substitution bias in short introns, from which we deduced a shift in mutation bias towards increased GC to AT mutations. The strength of this inferred shift in mutation bias is too weak to explain the substitution pattern at four-fold degenerate sites. Consequently, CUB in favour of GC must have been relaxed to allow for the high rate of AT substitutions (discussed later).
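The link between mutation bias, scaled selection and equilibrium base composition can be sketched with the standard equilibrium expression for a biallelic mutation, selection and drift model, GC fraction = 1/(1 + κ e^(−γ)), where κ is the ratio of the GC to AT and AT to GC mutation rates. Several assumptions are made here for illustration only: γ is scaled such that the ratio of fixation probabilities is e^γ (its relation to the γ of Table 3 is not established here), short introns are treated as neutral and at equilibrium, and base compositions are taken as the row sums of Table 4, i.e. ancestral states on the 2nd chromosome.

```python
import math

def equilibrium_gc(gamma, kappa):
    """Equilibrium GC fraction; kappa = (GC->AT rate) / (AT->GC rate)."""
    return 1.0 / (1.0 + kappa * math.exp(-gamma))

def gamma_for_gc(gc_fraction, kappa):
    """Scaled selection coefficient that yields a given equilibrium GC fraction."""
    return math.log(kappa * gc_fraction / (1.0 - gc_fraction))

# Row sums of Table 4 (classes 0-6), by ancestral state.
four_fold_at = sum([256802, 2281, 764, 554, 410, 420, 4806])
four_fold_gc = sum([585678, 10681, 4485, 3079, 2318, 2034, 19331])
intron_at = sum([60738, 540, 223, 133, 107, 101, 1373])
intron_gc = sum([28337, 671, 255, 160, 137, 118, 1585])

# Assumption: intron base composition reflects mutation bias alone (gamma = 0).
kappa = intron_at / intron_gc
gc_four_fold = four_fold_gc / (four_fold_gc + four_fold_at)
gamma_needed = gamma_for_gc(gc_four_fold, kappa)

print(f"mutation bias kappa from introns: {kappa:.2f}")
print(f"ancestral GC fraction at four-fold sites: {gc_four_fold:.3f}")
print(f"gamma required to maintain it at equilibrium: {gamma_needed:.2f}")
```

Because four-fold degenerate sites are not at equilibrium, such a value describes the selection needed to maintain the ancestral base composition rather than the current selection pressure.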

Polarization of the data

A solution to the problem of nonequilibrium is to polarize the data into ancestral and derived alleles and assume quasi-equilibrium. But polarizing the data and thereby assuming low mutation rates may introduce errors (Hernandez et al., 2007a, b). To gain more confidence about the ancestral state and to eliminate fast-evolving sites, we used four outgroup sequences and conditioned on all four having the same state (as in Clemente & Vogl, 2012). The allele frequency spectra and substitution patterns then showed no evidence of misidentification of the ancestral state. Furthermore, scaled mutation rates were so low that recurrent mutations would not influence frequency spectra. We thus believe that, as with short introns, polarization of four-fold degenerate sites does not create a notable bias with respect to recurrent mutations. However, it may bias the data towards young polymorphism, because old polymorphism shared between D. melanogaster and D. simulans/D. sechellia is ignored (Clemente & Vogl, 2012).

Estimation of the selection coefficients assuming quasi-equilibrium

Considering polarized allele frequency spectra and distinguishing mutations from AT to GC and GC to AT, the strength of selection can be estimated assuming quasi-equilibrium, where the ancestral base composition determines the absolute frequencies of spectra, whereas the current selection pressure determines the proportions of spectra. Most evidence for directional selection on four-fold degenerate sites favouring GC comes from the GC-rich ancestral base composition. From the polymorphism data, we infer only weak current selection that is always directed towards the ancestral state in both mutation directions, if constant population size is assumed. Moreover, the magnitude of this selection is similar to that inferred in short introns. Because we find little evidence for selection in short introns, this might just reflect demography or a bias due to the polarization scheme.

Tests on mutational patterns and comparison of spectra assuming quasi-equilibrium

In other tests, however, evidence for purifying selection at four-fold degenerate sites is clear. The proportion of substitutions at four-fold degenerate sites is reduced compared to short introns, again in both mutation directions. This may indicate that ancestral codon usage is close to optimal, such that mutant codons ending in GC may also be selected against. Here, we additionally observe a slight directional component with a stronger reduction at sites with ancestral GC, suggesting a slight force towards GC. Within four-fold degenerate sites, the ratio of mutations (from GC to AT)/(from AT to GC) is expected to be ≈5.94/1, based on the base composition and current mutation bias from short introns and assuming no selection. We observe a ratio of ≈5.1/1 with polymorphic sites, which is even more reduced, to ≈4.0/1 with substitutions (Table 2). All these differences are significant. We note that the effects of selection should be stronger for substitutions than for polymorphisms. Thus, stronger purifying selection for mutations from GC to AT than in the reverse direction can explain the observed pattern and, obviously, constitutes a directional force towards GC. Nevertheless, comparing allele frequency spectra of four-fold degenerate sites to short introns among all mutation directions, we found little evidence for directional forces towards GC. Only mutations from A to G showed a marginally significant shift towards high-frequency variants (which is expected with directional selection favouring GC), whereas mutations from T to A, C to T and G to A supported a force towards AT. Thus, it seems unlikely that the observed imbalance between both directions is due to current directional selection in favour of GC. 
Possibly, the intensity of selection leading to the ancestral GC-bias of four-fold degenerate sites was reduced gradually and relatively recently, such that the ancestral base composition is affected strongly, the substitutions slightly and the relatively young polymorphism not at all anymore.
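The two observed ratios of GC to AT over AT to GC mutations can be recomputed from the melanogaster-specific counts at four-fold degenerate sites in Table 4. The text above cites Table 2 for these values, so using Table 4 is an assumption of this sketch; the expected ratio under neutrality is not recomputed here because it depends on estimates that are not part of this section.

```python
# Table 4, four-fold degenerate sites (2nd chromosome, n = 6):
# classes 0-6 of the derived allele frequency for each mutation direction.
at_to_gc = [256802, 2281, 764, 554, 410, 420, 4806]
gc_to_at = [585678, 10681, 4485, 3079, 2318, 2034, 19331]

# Polymorphic sites are classes 1-5; substitutions are class 6.
poly_ratio = sum(gc_to_at[1:6]) / sum(at_to_gc[1:6])
subst_ratio = gc_to_at[6] / at_to_gc[6]

print(f"polymorphisms, (GC->AT)/(AT->GC): {poly_ratio:.1f}")
print(f"substitutions, (GC->AT)/(AT->GC): {subst_ratio:.1f}")
```

The decline of the ratio from polymorphisms to substitutions is the pattern interpreted above as stronger purifying selection against GC to AT mutations.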

Furthermore, although the total proportion of polymorphisms at four-fold degenerate sites (0.039) is roughly equal to that in introns (0.038; Table 5), four-fold degenerate sites show significantly fewer ‘shared’ mutations among D. melanogaster and D. simulans/D. sechellia, suggesting shorter coalescence times. Thus, coalescence times of the presumably neutrally evolving short introns seem to be longer than those of the likely constrained four-fold degenerate sites, which fits with the expectation under purifying selection.

Explanations for patterns observed using the biallelic model

CUB in D. melanogaster has been associated with tRNA abundance (Shields et al., 1988). Genes with codons for the most abundant tRNA molecules can be translated more efficiently and accurately (Akashi, 1994; Duret & Mouchiroud, 1999; Marais & Duret, 2001; Duret, 2002; Stoletzki & Eyre-Walker, 2007). According to our results, most new mutations at four-fold degenerate sites seem to be neutral or slightly disfavoured irrespective of the ancestral state. Although GC had been preferred in Drosophila, as reflected in the GC-rich ancestral base composition of D. melanogaster, recent polymorphism and substitution patterns show little evidence for CUB with respect to AT vs. GC nucleotides. Our results support the findings of Kliman (1999) and McVean & Vieira (2001), who also reported only little evidence for CUB in D. melanogaster. The directional selective force γ is the product of the (unscaled) selection coefficient s and the effective population size N, i.e. γ = Ns. Akashi (1996) suggested a bottleneck, i.e. a reduction in the effective population size N, to explain the relaxed selective constraint and the increase in AT substitutions. Signs of such a bottleneck should also be apparent in short intron sequences. There, however, shared ancestral polymorphism between D. melanogaster and D. simulans/D. sechellia indicates a stable long-term effective population size. Thus, the selection coefficient s responsible for the strength of ancestral CUB cannot have stayed constant (Hershberg & Petrov, 2008), but must have diminished, presumably through an environmental change. D. melanogaster is a human commensal and might have adapted to the human-made environment, as indicated by the increased tolerance to ethanol compared to other Drosophila species (e.g. Thomson et al., 1991).
Because our data provide no evidence for a significant population size reduction, the shift in CUB cannot be explained by a neutral shift in the expression level of tRNA molecules during a bottleneck (Hershberg & Petrov, 2008).

Going beyond the biallelic model

Although directional selection favouring GC has diminished in D. melanogaster, other forms of directional selection are evident. The ancestral base composition and mutational patterns reveal strand specificity: the complementary bases A and T as well as C and G do not occur at the same frequencies at four-fold degenerate sites, whereas no such imbalance is observed for short introns. Strand specificity in D. melanogaster has already been reported in a study of nonfunctional fragments of a transposable element (Singh et al., 2005b). Such strand specificity is presumably the result of directional selection, as a mutation bias should have no effect on complementary nucleotides unless mutations are caused by transcription. In mammals, Hwang & Green (2004) confirmed a previously found transcription-associated substitution asymmetry, where pyrimidine transitions occur at higher rates on the transcribed DNA strand than purine transitions due to replication errors. Such a mechanism cannot easily explain the pattern observed in Drosophila, however, as strand specificity is not found in short introns that are also transcribed before being spliced out. There is a slight bias in the representation of short introns, however, as short introns are more frequently found in highly expressed genes, whereas lowly expressed genes tend to have longer coding regions (Castillo-Davis et al., 2002; Comeron, 2004). Thus, there might be a difference between the transcription level of short introns and four-fold degenerate sites. This is unlikely though, as polymorphism and divergence at four-fold degenerate sites depend only weakly on expression level and recombination rate (see Data S1). Hence, strand specificity is more likely a result of directional selection than of transcription-associated mutation processes.

In contrast to the observed strand specificity in the ancestral base composition, the proportions of the allele frequency spectra of pairs of complementary mutations were not significantly different. In fact, complementary spectra seem to be more similar to each other than expected from the other pairwise comparisons. This would suggest the absence of current directional selection between complementary mutations; the observed asymmetry of complementary nucleotides in the ancestral base composition would then be a remainder of previous directional selection. On the other hand, among the spectra of all twelve possible mutation directions, we find that some are significantly shifted towards lower and others towards higher frequencies compared to the spectra of introns, which are presumably evolving neutrally. When AT and GC are summed, however, spectra are not significantly different from those of short introns.

It is still debated if selection on CUB is a weak force (Hershberg & Petrov, 2008), i.e. if the magnitude of four times the scaled selection coefficient is about one (Ohta, 1992). Carlini & Stephan (2003) showed that even few changes in synonymous codons already have noticeable effects in Drosophila, which might indicate that selection on CUB may be strong at least in some cases. According to our results, CUB seems to be weak for any mutation direction. The opposing direction of some selective forces may, however, lead to an underestimation of the strength of selection, e.g. when spectra of different mutation directions are lumped. Hence, four-fold degenerate sites are affected by much more complex forces than previously assumed. Although evidence for purifying selection seems clear, local shifts in mutation bias or directional selection of variable direction or magnitude might increase complexity. In any case, future models to describe the sequence evolution of four-fold degenerate sites should be much more sophisticated than the biallelic mutation, selection and drift model. In a first attempt, Zeng (2010) extended the biallelic model to a model of inference of selective differences between all synonymous codons. Unfortunately, this model assumes equilibrium and is thus not appropriate for four-fold degenerate sites in D. melanogaster. Although more complex models are necessary, we consider it even more important to better understand the biological nature of CUB in relation to the genomic context, the absolute level and regulation of gene expression and the species' environment.

Conclusion

With genome-wide data from D. melanogaster and related species, we confirmed the finding of earlier studies that four-fold degenerate sites in D. melanogaster are not at equilibrium, but are becoming AT-rich. Thus, to explain polymorphism and divergence data together, we cannot apply a simple equilibrium model. Instead, we based our analyses on the assumption of quasi-equilibrium, which requires polarization of the mutation direction. In contrast to many earlier studies, we found that four-fold degenerate sites show few signs of current directional selection favouring GC over AT. Rather, most evidence for directional selection in favour of GC comes from the ancestral base composition. Thus, the directional selective force must have become weaker recently. Our data from short introns suggest that the effective population size has remained high since the split of the D. melanogaster lineage from D. simulans and D. sechellia, i.e. there is no evidence of a population bottleneck in D. melanogaster. Instead, the strength of the unscaled selection coefficient seems to have decreased, likely due to a change in the environment. Furthermore, we found evidence for purifying selection in both mutation directions (from AT to GC and from GC to AT): a reduced ratio of substitutions to polymorphisms in both directions compared to short introns and a paucity of polymorphism at four-fold degenerate sites shared with D. simulans and D. sechellia, again compared to short introns. Hence, ancestral codons are generally favoured and new mutations selected against. A more complex picture of the evolution of four-fold degenerate sites comes from data where all four nucleotides are distinguished. Strand specificity is found in the ancestral base composition and also in the more recent polymorphism and substitution patterns. 
Although comparison of the proportions of allele frequency spectra from complementary mutations shows no differences, indicating the absence of current selective forces between complementary nucleotides, spectra from other mutation directions show significant differences to the spectrum of short introns. For example, the spectrum of T to A mutations is shifted towards high-frequency variants, suggesting directional selection. On the other hand, the spectrum of T to G mutations is shifted towards singletons, indicating purifying selection. However, the summed effect of AT vs. GC spectra balances and is not significantly different from that of short introns. This suggests a more complex mutation and/or selection scheme on four-fold degenerate sites than previously assumed. We find that the biallelic equilibrium model (where AT is contrasted to GC) that has so far been used for analyses of four-fold degenerate sites is oversimplified and should be replaced by models that incorporate all four bases and do not assume equilibrium. With the low mutation rates in Drosophila and the current genome-wide data, such models can actually be analysed and tested.

Acknowledgments

We express our sincere thanks to the other members of the ‘Initiativkolleg Population Genetics’ and the ‘Doktoratskolleg Populationsgenetik’ and the members of the external advisory committee, especially Christian Schlötterer, Joachim Hermisson, Andrea Betancourt, Carolin Kosiol and Brian Charlesworth, for motivation, interesting discussions, critical reading of the manuscript and helpful suggestions. Furthermore, we thank Charles Aquadro and Michael Whitlock for discussions and suggestions. We also thank John Pool and Charles Langley for providing us with information about the data from the 50 Genomes Project. We acknowledge funding by the University of Veterinary Medicine Vienna (for the Initiativkolleg) and the FWF (for the Doktoratskolleg, W1225-B20), both headed by Christian Schlötterer. We thank the editor and two reviewers for comments that helped to improve the article.
