Natural selection can reduce the effective population size of the nonrecombining Y chromosome, whereas local adaptation of Y-linked genes can increase the population divergence and overall intraspecies polymorphism of Y-linked sequences. The plant Silene latifolia evolved a Y chromosome relatively recently, and most known X-linked genes have functional Y homologues, making the species interesting for comparisons of X- and Y-linked diversity and subdivision. Y-linked genes show higher population differentiation than X-linked genes, and this might be caused by local adaptation in Y-linked genes or by low sequence diversity. Here we attempt to distinguish between these causes by investigating DNA polymorphism and population differentiation using a larger set of Y-linked and X-linked S. latifolia genes than used previously, and show that net sequence divergence for Y-linked sequences (measured by Da, also known as δ) is low, and not consistently higher than for X-linked genes. This does not support local adaptation; instead, the higher values of differentiation measures for the Y-linked genes probably result largely from reduced total variation on the Y chromosome, which in turn reflects deterministic processes lowering the effective population sizes of evolving Y chromosomes.

Measures of population differentiation are sometimes used as a means of identifying loci under selection, comparing the proportion of diversity that is found between alleles from different populations, as opposed to within populations (measured by FST and similar measures that are more appropriate for DNA sequences, such as KST, see Hudson et al. 1992a; Charlesworth 1998). Such tests rely on associations between variants at sites under selection and closely linked neutral variants. If one or more sites are locally adapted, with one variant favored in some populations, and a different one in others, gene flow will be reduced. This is because immigrant alleles at the selected locus will tend to be eliminated when they move from a source population where the allele is favored, into environments where they are disadvantageous (Barton and Bengtsson 1986). As a consequence, functionally different alleles at the selected site(s) can maintain different frequencies between environments. Haplotypes carrying functionally different alleles should thus have longer coalescent times back to a common ancestor than in the absence of selection, and these sequences should be differentiated, even for neutral sites because the removal of poorly adapted immigrant alleles also removes closely linked neutral variants that originated from the population that is the source of the migrants (Charlesworth et al. 1997).

Local adaptation is therefore expected to lead to increased measures of interpopulation subdivision and divergence. These effects are predicted to be particularly strong in large nonrecombining genome regions because large regions are most likely to contain a gene under such selection, and to be more readily detectable than in other chromosome regions (because the effects of such selection will affect polymorphism at multiple sites across the region). This is indeed observed in some Drosophila populations for genes in nonrecombining regions (Charlesworth 1998), and in asexual aphid populations, which only rarely have opportunities for recombination (Via and West 2008).

These effects are therefore expected in Y-linked genes, because the Y chromosomes of many species include large nonrecombining regions. This might offer a way to test whether one or more Y-linked genes are locally adapted. If such adaptation has occurred, different populations will have different Y haplotypes and migration of haplotypes between populations will be impeded, unlike the situation in other genomic regions. In populations of the dioecious plant Silene latifolia, and between S. latifolia and its close relative, S. dioica, differentiation, estimated by measures such as FST, is much higher for Y-linked than for X-linked genes (Ironside and Filatov 2005; Laporte et al. 2005). These studies concluded that the cause could be either local adaptation or low sequence diversity, without attempting to distinguish between them.

High FST values, however, do not definitively indicate local selection differences. High FST can also be due to low diversity within populations (Charlesworth et al. 1997; Charlesworth 1998). The higher FST of Y-linked genes, compared with their X-linked homologues could therefore be wholly or partly due to reduced polymorphism in Y-linked genes. FST and its sequence-based analogue KST (Hudson et al. 1992a) do not estimate the divergence between populations (expressed as either mean numbers of differences between sequences from different populations, or numbers of fixed differences). For sequence data, the type of information relevant for testing the above ideas about coalescent times, FST can be estimated as [πT–πS]/πT, where πT is the total nucleotide diversity estimated from a sample of two or more populations (without regard to the sequences’ population of origin), and πS is the average number of differences between two sequences from the same population (the within-population nucleotide diversity, reviewed by Charlesworth 1998).

Even for neutral sites, measures such as FST and KST clearly do not relate to the evolutionary times since the alleles from different populations had a common ancestor. This is because they are strongly affected by the amount of polymorphism within populations: when within-population diversity (πS) is low, the differentiation assessed by [πT–πS]/πT is inevitably high (Charlesworth et al. 1997). This problem can be partly avoided by using the numerator (πT–πS), or “raw divergence,” to estimate the mean number of differences between sequences from different populations (sometimes referred to as DST). When πS is high, however, it is preferable to use “net divergence” (δ of Nei and Li 1979), also known as Da (the notation used here). Local adaptation maintained over time is predicted to lead to increased values of measures such as Da; this measure is therefore appropriate for detecting local adaptation (Charlesworth et al. 1997; see the Materials and methods, section “Justification for Da”).
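The dependence of FST on within-population diversity can be illustrated with a minimal numerical sketch. The function name and the diversity values below are hypothetical and are not taken from the study; they only show that two loci with the same raw divergence (πT–πS) can have very different FST values when πS differs.

```python
def differentiation(pi_total, pi_within):
    """Return (FST, raw divergence) for sequence data.

    FST is estimated as (piT - piS) / piT; the numerator alone is the
    "raw divergence" described in the text.
    """
    raw = pi_total - pi_within
    fst = raw / pi_total
    return fst, raw


# Hypothetical diversity values (not taken from the study): both loci have
# the same raw divergence, but the second has much lower diversity.
fst_high_div, raw_high_div = differentiation(pi_total=0.0200, pi_within=0.0190)
fst_low_div, raw_low_div = differentiation(pi_total=0.0015, pi_within=0.0005)

print(f"high-diversity locus: FST = {fst_high_div:.2f}, raw = {raw_high_div:.4f}")
print(f"low-diversity locus:  FST = {fst_low_div:.2f}, raw = {raw_low_div:.4f}")
```

With these illustrative numbers, FST rises from about 0.05 to about 0.67 although the raw divergence is unchanged, which is the reason a high FST alone cannot be taken as evidence of long isolation.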

Low within-population diversity is a plausible explanation for higher FST (or KST) values for Y-linked genes compared with their X-linked homologues. This is because much of the Y chromosome does not cross over and exchange alleles with the X (leading ultimately to its degeneration). Selection at different sites in most of the Y is thus non-independent. In addition to allowing local adaptation to cause the effects outlined above, this also allows other hitch-hiking processes to affect large genomic regions. Specifically, spread of an advantageous allele at a single Y-linked locus reduces the effective population size (Ne) of the entire chromosome, in a selective sweep, just as in the classical theoretical models of such events (Maynard Smith and Haigh 1974; Braverman et al. 1995). This can lower diversity, either in a population undergoing local adaptation involving a Y-linked gene or throughout the whole species, if an unconditionally advantageous Y-linked mutation has recently spread. The Drosophila miranda neo-Y chromosome has a very low Ne, and appears to have undergone such events (possibly multiple events, see Bachtrog 2003, 2004; Bachtrog et al. 2009).

In recently evolved sex chromosome systems, many X-linked genes will still have functional homologues on the Y chromosome, so that equilibrium neutral diversity on the Y chromosome will also be lowered by selection against deleterious mutations. Strongly deleterious mutations are rapidly eliminated within a population, reducing within-deme Ne and thus diversity (“background selection,” see Charlesworth et al. 1993). Muller's ratchet also reduces diversity because, in the absence of recombination and back mutation, finite populations accumulate deleterious mutations through the irreversible loss of the least-loaded classes, that is, the individuals carrying the fewest mutations (Muller 1964). The extent to which variability is reduced is a function of the deleterious mutation rate, the fitness effects of the mutations, and the population size (Gordo et al. 2002). Weakly deleterious mutations can also contribute, as their removal is impeded by selection acting on any variant in the entire haplotype in the nonrecombining region, as is the spread of advantageous mutations (Charlesworth 1994; Charlesworth and Charlesworth 2000; Kaiser and Charlesworth 2009). In subdivided populations, all these processes are expected to reduce diversity within local populations, and inflate measures such as FST (Charlesworth et al. 1997).

Plants are well suited for studying sex chromosome evolution, because their Y chromosomes evolved relatively recently. In white campion (S. latifolia), the Y chromosome evolved within the last 10 million years (Bergero et al. 2007). Most X-linked genes isolated so far in this species have functional alleles on the Y chromosome (Filatov 2005; Bergero et al. 2007). One can thus study diversity for multiple genes in a system in which the Y is still a large target both for deleterious mutations that could drive the hitch-hiking processes just outlined, and locally adaptive Y-linked mutations that could also increase interpopulation Y-linked divergence. In S. latifolia, Y-linked genes have much lower diversity than their X-linked homologues (Filatov et al. 2000; Ironside and Filatov 2005; Laporte et al. 2005; Qiu et al. 2010), that is, Ne is considerably lower for the Y, as predicted under hitch-hiking, and, as previously mentioned, population differentiation is significantly higher for genes on the S. latifolia Y chromosome, compared to X-linked genes.

Here we reinvestigate DNA polymorphism on the Y chromosome in S. latifolia populations by using a larger dataset than previously, including seven Y-linked and three X-linked genes. We use data on population differentiation to test whether S. latifolia populations show evidence of race or subspecies differentiation, and whether the low Ne can fully explain the apparently higher differentiation in Y-linked, compared to X-linked genes (or whether local adaptation is necessary to account for this effect). Although in our samples, KST is much higher in Y- than X-linked genes, as previously suggested, net sequence divergence (Da) between populations is low for both chromosomes, which gives no support for the local adaptation scenario.

Materials and Methods

Plant samples


A set of male S. latifolia plants was used for all of the new sequences reported here. Our sampling strategy was partly based on biogeographic studies that have used RAPDs, allozymes, and flavonoid variation, and pollen and seed morphology (Mastenbroek et al. 1983, 1984; Vellekoop et al. 1996), which suggest that European S. latifolia can be divided into nine geographic races with a major east-west subdivision (Mastenbroek and Van Brederode 1986), possibly corresponding to colonization routes following the clearance of land for agriculture. As such, our samples consist of plants grown from seeds collected from 23 populations spanning the European range of the species and any possible geographic race (Fig. 1 and Table S1). The natural range of S. latifolia also includes Siberia and the eastern Mediterranean and Black Sea, and introduced populations also occur in the USA, but these regions were not included in our study.

Figure 1.

Map of the localities where samples were collected. The names and details of the localities are given in Table S1. Samples consisted of plants grown from seeds collected in 23 populations spanning the European range of the species. In our sampling, we attempted to maximize the number of different populations included. The samples were divided into several sets detailed in Table S1.

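Table 1. DNA polymorphism in the X- and Y-linked genes in the entire sample and in the four geographic regions (see Table S3 for results from other groupings of samples into regional “populations”).

| Sample | Genes | Number of individuals | H (a) | All site types: L (b) | All site types: S (c) | All site types: π | All site types: θ | All site types: Tajima's D | Nonsynonymous sites: L (b) | Nonsynonymous sites: π | Silent sites: L (b) | Silent sites: π |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| All available sequences | All X | 37 | 33 | 2549 | 183 | 0.0180 | 0.0178 | 0.179 | 1030.5 | 0.00860 | 684.6 | 0.0364 |
| All available sequences | All Y | 26 | 16 | 11650 | 59 | 0.0014 | 0.0013 | 0.115 | 2084.5 | 0.00061 | 3904.4 | 0.0011 |
| All available sequences | Six Y genes (d) | 33 | 20 | 10639 | 75 | 0.0018 | 0.0018 | −0.018 | 1954.3 | 0.00071 | 3529.7 | 0.0015 |
| Set 1 (northern, 18 individuals) | All X | 15 | 14 | 2598 | 151 | 0.0183 | 0.0182 | 0.101 | 1030.6 | 0.00850 | 684.4 | 0.0340 |
| Set 1 (northern, 18 individuals) | All Y | 16 | 7 | 11660 | 43 | 0.0005 | 0.0011 | −2.34** | 2084.3 | 0.00024 | 3904.7 | 0.0005 |
| Set 1 (northern, 18 individuals) | Six Y genes | 17 | 7 | 10652 | 43 | 0.0005 | 0.0012 | −2.26** | 1954.0 | 0.00024 | 3531 | 0.0005 |
| Set 2 (eastern, 6 individuals) | All X | 4 | 4 | 2604 | 116 | 0.0230 | 0.0247 | −0.542 | 1030.4 | 0.01030 | 684.6 | 0.0445 |
| Set 2 (eastern, 6 individuals) | All Y | 3 | 3 | 11783 | 15 | 0.0009 | 0.0009 | n.a. | 2156.8 | 0.00031 | 3930.2 | 0.0005 |
| Set 2 (eastern, 6 individuals) | Six Y genes | 4 | 4 | 10755 | 15 | 0.0007 | 0.0008 | −0.640 | 2026.7 | 0.00025 | 3556.3 | 0.0004 |
| Set 3 (southern, 22 individuals) | All X | 18 | 15 | 2559 | 112 | 0.0158 | 0.0132 | 1.021 | 1030.3 | 0.00900 | 684.7 | 0.0374 |
| Set 3 (southern, 22 individuals) | All Y | 7 | 6 | 11774 | 44 | 0.0018 | 0.0015 | 1.175 | 2158.1 | 0.00066 | 3923.9 | 0.0020 |
| Set 3 (southern, 22 individuals) | Six Y genes | 12 | 10 | 10762 | 60 | 0.0021 | 0.0019 | 0.724 | 2028.0 | 0.00057 | 3549 | 0.0021 |
| Set 4 (northern + eastern individuals) | All X | 19 | 18 | 2598 | 66 | 0.0200 | 0.0190 | 0.334 | 1030.6 | 0.00840 | 684.4 | 0.0350 |
| Set 4 (northern + eastern individuals) | All Y | 19 | 10 | 11655 | 54 | 0.0006 | 0.0013 | −2.125* | 2084.3 | 0.00025 | 3904.7 | 0.0005 |
| Set 4 (northern + eastern individuals) | Six Y genes | 21 | 11 | 10647 | 54 | 0.0007 | 0.0014 | −2.023* | 1954.0 | 0.00024 | 3531 | 0.0005 |

  (a) Number of haplotypes.

  (b) Number of sites analyzed. For all site types, this is the length of the alignment analyzed; for silent and nonsynonymous sites, it is the number of sites of the corresponding type. The length analyzed varies according to which datasets are included.

  (c) Number of polymorphic sites.

  (d) All Y-linked genes except SlY1. The top rows (labeled “All available sequences”) summarize the results for the entire set of sequences. One of the Y-linked homologues, SlY1, had no polymorphisms in the sample of 37 individuals. The table lists the results with and without this gene (“all Y-linked genes” and “six Y-linked genes,” respectively).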


We sequenced seven Y-linked loci (SlY1, SlY3, SlY4, SlY6a, SlY7, DD44Y, and SlCypY; Delichère et al. 1999; Atanassov et al. 2001; Moore et al. 2003; Nicolas et al. 2005; Bergero et al. 2007) with a total concatenated alignment length of 13,168 bp (see Table S2). In addition, we sequenced portions of the X-linked homologues of three of the genes (SlX4, SlX7, and SlCypX) from the same sample of males as the Y-linked genes (the X-linked copies of DD44, SlXY1, and SlXY3 were sequenced from more limited plant samples). SlX6a was not included because locus-specific primers that could distinguish the two Sl6 X copies were not available (Bergero et al. 2008a). The combined alignment length of X-linked sequences was 2645 bp, and the greater diversity of these sequences (see below) provides information sufficient for comparisons with the Y-linked genes.

Table 2. Results of Hudson-Kreitman-Aguadé tests comparing DNA polymorphism between the concatenated X-linked and Y-linked genes in the entire sample and in the separate geographical regions defined in Materials and methods (Sets 1–4). The results for all tests are highly significant (P < 0.001). The tests used variants of all site types, and took into account the different effective population sizes of X- and Y-linked genes; this assumes a 1:1 sex ratio and no major difference in variance of offspring numbers per individual of the two sexes.

| Sample | Chromosome | Sample size (a) | All site types: sites (b) | All site types: polymorphic sites (c) | All site types: divergence (d) | Silent sites: sites (b) | Silent sites: polymorphic sites (c) | Silent sites: divergence (d) |
|---|---|---|---|---|---|---|---|---|
| Set 1 | X | 15 | 1717 | 94 | 60 | 684.4 | 67 | 54 |
| Set 2 | X | 4 | 1717 | 79 | 62 | 684.6 | 59 | 56 |
| Set 3 | X | 8 | 1717 | 89 | 62 | 684.7 | 68 | 55 |
| Set 4 | X | 19 | 1717 | 102 | 60 | 684.4 | 74 | 54 |

  (a) Sample size.

  (b) Number of sites analyzed. For the total, this is the length of the alignment (excluding indels). Silent sites: synonymous and noncoding. The length analyzed varies according to which datasets are included.

  (c) Number of polymorphic sites.

  (d) Divergence from the homologous region in S. vulgaris.

To obtain the sequences, genomic DNA was isolated from leaves of individual S. latifolia males using the Qiagen (Hilden, Germany) plant genomic DNA kit, or the Invitrogen (Carlsbad, CA) Plant Charge-Switch kit (using magnetic beads). For PCR amplification, we used either BioRad (Hercules, CA) Biomix Red, or Expand long template PCR (Roche; Basel, Switzerland). The PCR products were run on 1% agarose gels with standard 1 × TAE buffer (pH 8) and extracted from gels using the QIAquick gel extraction kit (Qiagen). Sequencing of gel-purified PCR products was performed directly (without cloning) on an ABI Prism 3700 automatic sequencer using ABI Prism BigDye version 3.1 terminator cycle sequencing kits (Applied Biosystems; Foster City, CA). The primers used for PCR amplification and sequencing of these genes are in Table S2. Chromatograms were visually checked, sequences corrected, and contigs assembled and aligned using Sequencher software (version 4.5, GeneCodes, Ann Arbor, MI) or ProSeq3 (Filatov 2009), with manual adjustments in some cases using ProSeq3 and Se-Al version 2.0a11 (Se-Al: Sequence Alignment Editor).

Two males were sequenced per population, to yield one Y- and one X-sequence per plant whenever possible. However, some individuals failed to amplify either the Y- or the X-sequence, so that there are fewer than 46 sequences for some loci. Some Y-linked loci consistently failed to amplify in some of the individuals, despite the use of different primer pairs, and these loci (or regions) are probably deleted in some of the individuals (Bergero et al. 2008a). The missing X chromosome sequences are due to failure to amplify. The newly obtained sequences were deposited in GenBank under accession numbers (JN394077-JN394464). Complete datasets for Y-linked loci were obtained from 26 individuals, and, for all three X-linked genes, from 37 individuals. Sequences of homologous genes from S. vulgaris were obtained for use as outgroups.

To provide further information about the haplotypes’ ages and their phylogeography, we examined insertions of Miniature Inverted-repeat Transposable Elements (MITEs) in the Y haplotypes (Bergero et al. 2008b), using a single male plant from each sampling locality. MITEs are only 100- to 500-bp long, and although, unlike other major classes of TEs, they insert preferentially in or near genes (Bureau and Wessler 1994; Feschotte et al. 2002), they often cause no major disruption of the genes or their regulation (Naito et al. 2006). The MITE insertion data, from individuals used in the present study, were from a previous study (Bergero et al. 2008b).


To compare X- and Y-linked genes (see below) we used concatenated sequences (excluding individuals with missing data); X-linked ones were generated in ProSeq3 (Filatov 2009) and Y-linked ones in BioEdit (Hall 1999). DNA polymorphism in X- and Y-linked loci was analyzed using the two standard measures of DNA diversity per nucleotide, Watterson's θ (Watterson 1984) and nucleotide diversity, π (Tajima 1993). These were calculated for all sites, and for nonsynonymous and silent (synonymous + noncoding) sites, using DnaSP (Librado and Rozas 2009). Estimates of Tajima's D (Tajima 1989) were also calculated using this program. Nucleotide diversity was also estimated within each of the four geographic regions defined below. Nucleotide diversity values were compared between X- and Y-linked genes using Hudson-Kreitman-Aguadé (HKA) tests (Hudson et al. 1987) implemented in DnaSP, taking ploidy differences into account, and using the concatenated S. latifolia sequences, together with sequences of the orthologous genes from a single S. vulgaris individual (to correct for differences in mutation rate). To test for intragenic recombination within each of the Y- and X-linked genes, we used the four-gamete test of Hudson and Kaplan (1985) to estimate RM, the minimum number of recombination events in the history of the sample.
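The three summary statistics named above can be sketched as follows. This is not the DnaSP implementation; it is a minimal illustration using the standard formulas for π, Watterson's θ, and Tajima's D (Tajima 1989), and it assumes equal-length aligned sequences without gaps or ambiguity codes. The function name and the toy alignment are hypothetical.

```python
from itertools import combinations
from math import sqrt


def diversity_stats(seqs):
    """Return (pi, theta_W, Tajima's D) for equal-length aligned sequences.

    pi and theta_W are per site. Gaps and ambiguity codes are not handled.
    """
    n, length = len(seqs), len(seqs[0])
    # S: number of segregating (polymorphic) sites
    s = sum(1 for column in zip(*seqs) if len(set(column)) > 1)
    # k: mean number of pairwise differences
    pairs = list(combinations(seqs, 2))
    k = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)

    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    pi, theta = k / length, s / (a1 * length)
    if s == 0 or n < 4:
        # D is undefined without polymorphism; its variance term is zero for n < 4
        return pi, theta, None

    # Constants from Tajima (1989)
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n ** 2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1, e2 = c1 / a1, c2 / (a1 ** 2 + a2)
    d = (k - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))
    return pi, theta, d


# Hypothetical toy alignment: four sequences, ten sites
toy = ["AAAAAAAAAA", "AAAAAAAAAT", "AAAAAAAATT", "AAAAAAATTT"]
pi, theta, tajima_d = diversity_stats(toy)
print(pi, theta, tajima_d)
```

Because the numerator of D is the difference between the two diversity estimates, D is positive when π exceeds θ and negative when an excess of rare variants makes θ exceed π.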

A Y haplotype network was estimated from single nucleotide polymorphism (SNP) variants using median-joining (Bandelt et al. 1995) with the Network software. Because sequences were missing for some plants (see above), we first constructed a network using all of the individuals with complete Y-linked gene sequence information, and then added the other individuals to the network manually, using the sequences that were available for these plants. This will underestimate the numbers of distinct haplotypes, and the numbers of differences between them. There are many fixed differences between the Y- and X-linked sequences, so we did not include the X haplotypes in this analysis, but simply inferred their position in the network, using one X sequence.


To test whether the populations in our sample are strongly structured, for instance into clear “races” (see section “Plant samples”), we used the Bayesian clustering program STRUCTURE (Pritchard et al. 2000). The program is designed to analyze multiple unlinked polymorphisms and is not applicable to Y-linked DNA polymorphism data. Thus, we used only our X-linked sequences for this analysis. These sequences are highly polymorphic (Table 1), and should detect strongly isolated clusters of populations. The posterior probability of the data, Ln P(D), given the number of clusters (populations), K, is often used to estimate the true number of clusters in the data. However, Ln P(D) does not usually form a clear peak, so we used ΔK, which is based on the second-order rate of change of Ln P(D) with respect to K and should peak at the number of clusters best describing the dataset (Evanno et al. 2005). To infer the number of clusters with the highest ΔK, we ran STRUCTURE 10 times for each of the K values from 1 to 10 and monitored convergence.
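The ΔK criterion can be sketched as follows. This is an illustration only: it uses the common simplification of taking the absolute second difference of the mean Ln P(D) across replicate runs, divided by the standard deviation of Ln P(D) at that K. The function name and the Ln P(D) values are hypothetical and are not output from our STRUCTURE runs.

```python
from statistics import mean, stdev


def delta_k(lnp_runs):
    """Compute delta-K from replicate runs.

    lnp_runs maps each K to a list of Ln P(D) values, one per replicate run.
    delta-K is undefined for the smallest and largest K tested, and for any
    K whose replicates all gave the same Ln P(D) (zero standard deviation).
    """
    means = {k: mean(values) for k, values in lnp_runs.items()}
    result = {}
    for k in sorted(lnp_runs):
        if k - 1 in means and k + 1 in means:
            sd = stdev(lnp_runs[k])
            if sd > 0:
                second_diff = abs(means[k + 1] - 2 * means[k] + means[k - 1])
                result[k] = second_diff / sd
    return result


# Hypothetical Ln P(D) values for three replicate runs at each K
runs = {
    1: [-1000, -1002, -998],
    2: [-900, -901, -899],
    3: [-890, -895, -885],
    4: [-888, -880, -896],
}
dk = delta_k(runs)
best_k = max(dk, key=dk.get)
print(dk, best_k)
```

In this toy example Ln P(D) keeps increasing slowly beyond K = 2, but ΔK peaks at K = 2, where the rate of improvement changes most sharply relative to the run-to-run variation.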

FST (Hudson et al. 1992b), KST (Hudson et al. 1992a), and Da (eq. 10.21 of Nei's 1987 textbook) were calculated using ProSeq3. Confidence intervals for these statistics were calculated using 1000 bootstrap replicates across sites, using the same program. To avoid dependence of population differentiation measures on within-population diversity (HS), Jost (2008) suggested partitioning allelic diversity (for variants classified as different alleles, such as microsatellites) using a measure that is less dependent on HS: HST = (HT – HS)/(1 – HS). However, Jost (2008) defined diversity in terms of the number and frequencies of alleles, which is not suitable for DNA sequence data, and does not relate to coalescence times (Whitlock 2011). For our sequence data, we therefore replaced HT and HS by πT and πS, respectively (the average numbers of differences between sequences), and calculated HST using a modified version of ProSeq3. The results yield the same conclusions as the other statistics, and thus we have not shown HST values.
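Bootstrapping across sites can be sketched as follows. The details of the ProSeq3 procedure are not given here, so this sketch makes assumptions that should be read as such: alignment columns are resampled with replacement, the same resampled columns are used for both populations, and a percentile interval is reported. The function names, the example statistic, and the toy alignments are hypothetical.

```python
import random


def mean_between(pop_x, pop_y):
    """Mean number of differences per site between sequences of two populations."""
    total = sum(
        sum(a != b for a, b in zip(sx, sy)) for sx in pop_x for sy in pop_y
    )
    return total / (len(pop_x) * len(pop_y) * len(pop_x[0]))


def bootstrap_sites(pop_x, pop_y, statistic, n_reps=1000, seed=1):
    """Percentile 95% interval from resampling alignment columns with replacement."""
    rng = random.Random(seed)
    n_sites = len(pop_x[0])
    values = []
    for _ in range(n_reps):
        # Draw site indices with replacement; use the same sites in both populations
        cols = [rng.randrange(n_sites) for _ in range(n_sites)]
        boot_x = ["".join(seq[c] for c in cols) for seq in pop_x]
        boot_y = ["".join(seq[c] for c in cols) for seq in pop_y]
        values.append(statistic(boot_x, boot_y))
    values.sort()
    lower = values[n_reps * 25 // 1000]
    upper = values[n_reps * 975 // 1000 - 1]
    return lower, upper


# Hypothetical toy alignments, two sequences per population
pop_x = ["AAAAAAAAAA", "AAAAAAAAAT"]
pop_y = ["AAAAATTTTT", "AAAAATTTTA"]
point = mean_between(pop_x, pop_y)
ci_low, ci_high = bootstrap_sites(pop_x, pop_y, mean_between)
print(point, ci_low, ci_high)
```

Any of the differentiation statistics could be passed as `statistic`; the between-population mean is used here only to keep the example short.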


Da ($d_A$) is defined as the number of net nucleotide substitutions between two populations (X and Y) and is estimated as $\hat{d}_A = \hat{d}_{XY} - (\hat{d}_X + \hat{d}_Y)/2$ (eq. 10.21 of Nei 1987), where $\hat{d}_{XY}$ is the average number of nucleotide substitutions between haplotypes from populations X and Y, and is estimated by $\hat{d}_{XY} = \sum_{ij} \hat{x}_i \hat{y}_j d_{ij}$ (eq. 10.20 of Nei 1987). $\hat{x}_i$ and $\hat{y}_i$ are the sample frequencies of the ith haplotype for populations X and Y, respectively, and $d_{ij}$ is the number of nucleotide substitutions between the ith haplotype from population X and the jth haplotype from population Y. $\hat{d}_X$ from Nei's equation (10.21) is the average number of nucleotide substitutions for a randomly chosen pair of haplotypes in population X and is estimated by $\hat{d}_X = \frac{n_X}{n_X - 1} \sum_{ij} \hat{x}_i \hat{x}_j d_{ij}$ (eq. 10.19 of Nei 1987), where $n_X$ is the number of sequences sampled. The average number of nucleotide substitutions ($\hat{d}_Y$) for Y can be estimated in the same way.
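The estimator of Da described above can be sketched directly from sequences. This is not the ProSeq3 code; it is a minimal illustration that assumes equal-length aligned sequences without gaps and at least two sequences per population. Averaging over all distinct pairs of sampled sequences is equivalent to the haplotype-frequency form with the n/(n – 1) correction. Function names and toy data are hypothetical.

```python
from itertools import combinations


def n_diff(a, b):
    """Number of sites at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def within(pop):
    """Mean pairwise differences within a population (distinct pairs only)."""
    pairs = list(combinations(pop, 2))
    return sum(n_diff(a, b) for a, b in pairs) / len(pairs)


def between(pop_x, pop_y):
    """Mean differences between sequences drawn from two populations."""
    total = sum(n_diff(a, b) for a in pop_x for b in pop_y)
    return total / (len(pop_x) * len(pop_y))


def net_divergence(pop_x, pop_y):
    """Da: between-population mean minus the average within-population mean."""
    return between(pop_x, pop_y) - (within(pop_x) + within(pop_y)) / 2


# Hypothetical toy alignments, two sequences per population
pop_x = ["AAAAAAAAAA", "AAAAAAAAAT"]
pop_y = ["AAAAATTTTT", "AAAAATTTTA"]
da = net_divergence(pop_x, pop_y)
print(da, da / len(pop_x[0]))  # per sequence, and per site
```

In the toy data the between-population mean is 4.5 differences and each population has a within-population mean of 1, so Da is 3.5 differences per sequence, or 0.35 per site.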

For sequence data, Da is the mean divergence between pairs of sequences, one sampled from each of two different populations, minus the within-population diversity πS, which corrects for coalescence times within the populations. This estimates the number of fixed differences between the pair of populations. As explained in the Introduction, local adaptation maintained over time is predicted to lead to increased values of measures such as Da; this measure is therefore appropriate for detecting local adaptation, and relates to the average amount of time that the sequences have been isolated (Charlesworth et al. 1997). Note that the study by Charlesworth et al. (1997) includes stochastic simulations of within-population genetic drift (and coalescent simulations of the evolutionary variance). These simulations explicitly show that if local selection acts on a nonrecombining chromosome for long enough, its effect will be seen in Da. Figure 5A and C of Charlesworth et al. (1997) show that πT–πS (similar to DST values) can be very high with low recombination, for weak and strong local selection in outcrossing populations, without background selection. In such populations, background selection leads to minor decreases in πT–πS values; see section 3(v) of Charlesworth et al. (1997). Net divergence (Da) behaves similarly (Charlesworth 1998).

The values of Da (or related quantities such as FST) will of course depend on the choice of localities sampled, and any grouping into sets of localities. For example, using each locality as a “population” tends to yield higher FST values than pooling sets of alleles from several localities to form a smaller set of (larger) regional “populations” (because such small population samples may lack variants by chance and appear uniform). Our analysis was intended to test whether the high FST previously observed for S. latifolia Y chromosomes shows evidence of being due to local adaptation, and we therefore grouped the sequences into regional “populations” similar to the populations analyzed previously (Ironside and Filatov 2005; Laporte et al. 2005). However, these groupings may not reflect geographic regions with selective differences, if these exist. Thus, using these particular groupings might fail to detect differently adapted populations. Based on information about races in the species (see above), we initially classified our samples into three sets (see Table 1): (1) northern, (2) eastern, and (3) southern. We also analyzed the northern and eastern samples combined, which we refer to as Set 4. To take a range of different possibilities into account, we also estimated subdivision measures (FST, KST, and Da) for six other groupings of our samples. All 10 groupings are defined in Table S1.



We analyzed DNA polymorphisms in seven Y-linked and three X-linked genes, using the same sample of individual male plants for all genes (see Materials and methods). The top rows of Table 1 (labeled “All available sequences”) summarize the results for the entire set of sequences. One of the Y-linked homologues, SlY1, had no polymorphisms in the sample of 37 individuals. Table 1 lists the results with and without this gene (“all Y-linked genes” and “six Y-linked genes,” respectively). Consistent with previous reports (Filatov et al. 2000, 2001; Ironside and Filatov 2005; Laporte et al. 2005; Filatov 2008; Kaiser et al. 2009; Qiu et al. 2010), the X–Y difference in nucleotide diversity is large (more than a factor of 10, see Table 1), and is highly significant by an HKA test (Table 2). Overall, the species-wide Ne for the Y is estimated to be no more than 1/20th of the X value; the within-population Y/X Ne ratio is 1/30 (Filatov 2005) without correcting for the higher Y mutation rate compared with the homologous X-linked genes (Filatov and Charlesworth 2002; Qiu et al. 2010). HKA tests found no significant differences in nucleotide diversity between the different X-linked genes, or the different Y-linked genes, respectively (not shown).

In agreement with these results for the species as a whole, the within-region nucleotide diversity is also significantly higher for X-linked, compared to Y-linked genes (Tables 1 and 2). This supports previously reported results for DD44X/Y and SlXY4 genes, using a different sample of populations (Ironside and Filatov 2005; Laporte et al. 2005), and for two further sex-linked gene pairs, using sampling similar to that used here (Qiu et al. 2010).


The estimated minimum number of recombination events was zero for each of the Y-linked genes that had any polymorphisms, and for the complete concatenated sequences, whereas recombination was detected in each X-linked gene. Figure 2 shows the Y haplotype network estimated using SNP variants. The haplotypes from northeastern populations are moderately distinct, but the major difference appears to be between northern and southern European populations, and suggests some north–south differentiation in S. latifolia, rather than a simple subsampling of southern diversity during colonization northwards. Nevertheless, several plants from northern France and southern England cluster with the southern haplotype.
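The logic behind a zero estimate of the minimum number of recombination events can be illustrated with a short sketch of the four-gamete test. This is not the DnaSP implementation: it considers only biallelic sites, flags every pair of sites at which all four gametic combinations are present, and then counts non-overlapping incompatible intervals greedily, in the spirit of Hudson and Kaplan (1985). The function name and the toy data are hypothetical.

```python
from itertools import combinations


def min_recombination_events(seqs):
    """Lower bound on recombination events from the four-gamete test."""
    # Keep biallelic segregating sites only
    cols = [c for c in zip(*seqs) if len(set(c)) == 2]
    intervals = []
    for i, j in combinations(range(len(cols)), 2):
        # Four distinct two-site haplotypes imply recombination between i and j
        # (under an infinite-sites model without recurrent mutation)
        if len(set(zip(cols[i], cols[j]))) == 4:
            intervals.append((i, j))
    # Count the largest set of non-overlapping intervals, earliest end first
    intervals.sort(key=lambda iv: iv[1])
    events, last_right = 0, -1
    for left, right in intervals:
        if left >= last_right:
            events += 1
            last_right = right
    return events


# Hypothetical toy data: three gametic types (no evidence of recombination)
no_recombination = ["AA", "AT", "TT"]
# All four gametic types present at the same pair of sites
with_recombination = ["AA", "AT", "TA", "TT"]
print(min_recombination_events(no_recombination))
print(min_recombination_events(with_recombination))
```

A sample in which no pair of sites shows all four gametic types gives a value of zero, as found for the Y-linked genes, whereas a single incompatible pair is enough to give a value of at least one.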

Figure 2.

Haplotype network of the concatenated Y-linked sequences based on SNP variants only (haplotype states with respect to MITE insertions are mapped onto the haplotypes inferred using SNPs). Haplotype numbers correspond to the plant numbers in Table S1. Haplotypes with incomplete data are indicated with dashed boxes and lines connecting them to the most similar completely genotyped haplotypes. The position of one X haplotype is indicated, but the number of differences from all Y haplotypes is large, so the branch is not shown.

We also attempted to include the presence or absence of MITE insertions (see Materials and methods) in the Y haplotypes. However, this yielded no clear network. Rather, given the absence of recombination indicated by the SNP data, we suggest that repeated instances of the same insertion or, more likely, independent excisions of an element from the same site, have occurred. Alternatively, some of the MITE insertions may be located in the pseudo-autosomal (recombining) region of the sex chromosomes. This hypothesis is not excluded by the segregation data showing Y-linkage for these variants (Bergero et al. 2008b), because a low frequency of crossovers might not be detected in the family studied (92 progeny plants). Figure 2 is therefore based on SNPs only, and the states of the haplotypes with respect to the MITEs are mapped onto the haplotypes inferred using SNPs.

Interestingly, the MITE numbers differ markedly (Fig. 2) between the haplotypes, forming three sets, roughly corresponding to sets of populations from different geographic regions (defined in the Materials and methods section and Fig. 1). Y chromosomes sampled from southern populations (plus some from the UK and France) generally had between two and five MITE insertions, considerably fewer than ones from northern populations (with 17 insertions), whereas Y chromosomes from northeastern localities had intermediate numbers of insertions (13 or 14). This suggests that MITE insertions often originated recently, and are mostly “private” variants within geographic regions, which are probably not strongly isolated, based on our X-linked sequences.


To detect local adaptation (see Materials and methods), one should ideally define local groupings of regional “populations” and then estimate Da between populations known to be locally adapted. However, there is currently no information allowing one to select populations by this criterion. The most likely candidates would be populations from the different races previously inferred in S. latifolia, but these inferences are based on morphology, physiology, or different types of markers (see Materials and methods). We therefore tested whether our sequences support these races, by running a STRUCTURE (Pritchard et al. 2000) analysis on our X-linked genes. These sequences are highly polymorphic (Table 1), and should detect strongly isolated clusters of populations. The ΔK criterion outlined in the Materials and methods (Evanno et al. 2005) suggests a most likely number of clusters, K = 4. However, the four clusters do not correspond to any clearly defined geographical origins of the samples, and cannot be interpreted as distinct races (Fig. S1). In the absence of support for the nine geographical races suggested by Mastenbroek and Van Brederode (1986), for further analyses of the possibility of local adaptation, we separated the data into three broad geographic regions (see Materials and methods) rather than these four clusters. These regions are differentiated in the numbers of MITE insertions in the Y haplotypes: our samples from southern populations have at most five MITE insertions, whereas haplotypes with larger numbers (≥13, often as many as 17) predominate in northern and northeastern Europe (Fig. 2), despite some haplotypes with small insertion numbers among the northern samples (as shown in Fig. 2, several plants from northern France and southern England have only two MITE insertions).


Although the Y sequences have much lower diversity than the X sequences, they nevertheless include 75 variants. A recent species-wide selective sweep on the Y chromosome is thus unlikely, which is consistent with the fact that the value of Tajima's D for the Y-linked sequences is not negative for the entire set of samples. Given the expected existence of different races within S. latifolia (see above), the Y diversity could, however, be partly or wholly due to the presence of variants that are “private” to particular localities or regions, for example, to certain races, but not others (as discussed above for the MITE insertions). Under a scenario of recent local adaptation involving Y-linked genes, Y haplotypes with locally advantageous variants in the selected genes could cause selective sweeps within certain regions or localities, and the effects of these sweeps should be detectable through an unexpectedly high proportion of rare variants, that is, negative Tajima's D values within the geographic region(s) affected. Table 1 shows that Tajima's D is significantly negative for Y-linked sequences from the set of northern samples (or for Set 4, which includes many of these samples). This is consistent with a selective sweep due to recent adaptation to a different environment, but could be due to a recent expansion specifically of these populations, or both processes could have occurred.
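The reasoning about rare variants can be made concrete with a minimal sketch of Tajima's D, assuming the standard formula based on the sample size n, the number of segregating sites S, and the mean number of pairwise differences π (per locus). The function names and the toy alignments are hypothetical and serve only to show the sign of D under an excess of rare versus intermediate-frequency variants.

```python
import math
from itertools import combinations

def summarize(seqs):
    """Return (n, S, pi) for a list of equal-length aligned sequences."""
    n = len(seqs)
    seg_sites = sum(1 for column in zip(*seqs) if len(set(column)) > 1)
    pairs = list(combinations(seqs, 2))
    pi = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)
    return n, seg_sites, pi

def tajimas_d(n, seg_sites, pi):
    """Tajima's D from n sequences, S segregating sites, and pi (per locus)."""
    if n < 4 or seg_sites == 0:
        raise ValueError("this sketch needs n >= 4 and S > 0")
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n**2 + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    # D contrasts pi with Watterson's estimate S / a1.
    return (pi - seg_sites / a1) / math.sqrt(variance)

# Toy samples: every variant a singleton versus every variant at 50%.
rare = ["TAAA", "ATAA", "AATA", "AAAT", "AAAA", "AAAA"]
balanced = ["TTTT"] * 3 + ["AAAA"] * 3
d_rare = tajimas_d(*summarize(rare))
d_balanced = tajimas_d(*summarize(balanced))
print(d_rare, d_balanced)
```

The singleton-only sample gives a negative D, the pattern expected after a sweep or an expansion, whereas the sample with variants at intermediate frequency gives a positive D. Significance testing (for example by coalescent simulation) is not covered by this sketch.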

For X-linked genes, the nucleotide diversity (π) values within the three sets of sampling localities (defined in the Materials and methods section and Table S1) are similar to the species-wide estimates, as might be expected if migration is not highly restricted. The estimates are similar in all of the regions, with no evident loss of diversity in northern populations (Table 1). In contrast, for Y sequences, π values are often lower for the regional samples, suggesting that Y diversity is indeed more structured than that of the X. The estimated silent site diversity is higher for the southern than the northern sample; however, the difference is not statistically significant (Table S4).

To compare subdivision in X- and Y-linked genes, using FST, KST, and Da, the SlY1 gene (which had no polymorphisms in our sequences) was excluded, and the analysis was based on a concatenation of the other genes. In line with previous findings, population differentiation for Y-linked genes, assessed by either FST or KST statistics using variants of all site types, is considerably higher than for X-linked genes, for any of a range of different groupings of sequence samples tested (Tables 3 and S5). In contrast, Da is quite low for all of the genes studied, from either the Y- or the X-chromosome (Tables 3 and S5). Da is significantly higher for the X than the Y in several of the comparisons (Sets 1 vs. 2, 2 vs. 3, 5 vs. 6, 7 vs. 8, and 9 vs. 10), whereas in one comparison (Set 1 vs. 3), and possibly also in the comparison of Sets 3 versus 4, Da is significantly higher for the Y than for the X. HST values (see Materials and methods) are very similar to Da, and are not shown.

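Table 3. Population subdivision statistics for the X- and Y-linked genes. The sets are defined in the Materials and methods.

| Comparison | Sequences | L | S | Sample size, set A (a) | Sample size, set B (a) | KST mean | KST min (b) | KST max (c) | FST mean | FST min (b) | FST max (c) | Da mean | Da min (b) | Da max (c) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Sets 1 vs. 2 | All X sequences | 2598 | 166 | 15 | 4 | 0.0172 | 0.0107 | 0.0281 | 0.0524 | 0.0061 | 0.1131 | 0.0011 | 0.0001 | 0.0025 |
| Sets 1 vs. 2 | Six Y genes | 10647 | 54 | 17 | 4 | 0.2151 | 0.1362 | 0.3444 | 0.4147 | 0.2290 | 0.5466 | 0.0004 | 0.0002 | 0.0008 |
| Sets 1 vs. 3 | All X sequences | 2549 | 172 | 15 | 18 | 0.0244 | 0.0188 | 0.0348 | 0.0132 | 0.0020 | 0.0300 | 0.0002 | 0 | 0.0005 |
| Sets 1 vs. 3 | Six Y genes | 10644 | 67 | 17 | 12 | 0.2485 | 0.2014 | 0.3216 | 0.4644 | 0.3864 | 0.5416 | 0.0012 | 0.0007 | 0.0016 |
| Sets 2 vs. 3 | All X sequences | 2551 | 158 | 4 | 18 | 0.0335 | 0.0252 | 0.0468 | 0.1676 | 0.1067 | 0.2366 | 0.0039 | 0.0023 | 0.0058 |
| Sets 2 vs. 3 | Six Y genes | 10757 | 68 | 4 | 12 | 0.1078 | 0.0858 | 0.1444 | 0.4233 | 0.3401 | 0.5047 | 0.0011 | 0.0007 | 0.0015 |
| Sets 3 vs. 4 | All X sequences | 2549 | 183 | 18 | 19 | 0.0257 | 0.0203 | 0.0349 | 0.0439 | 0.0273 | 0.0615 | 0.0008 | 0.0005 | 0.0011 |
| Sets 3 vs. 4 | Six Y genes | 10639 | 75 | 12 | 21 | 0.1837 | 0.1498 | 0.2294 | 0.4284 | 0.3567 | 0.4962 | 0.0011 | 0.0007 | 0.0015 |

(a) A refers to the set listed on the left in column 1, and B to the set listed on the right.
(b) 5th percentile value.
(c) 95th percentile value.
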


We set out to reinvestigate why S. latifolia populations show evidence of race (or subspecies) differentiation in Y-linked genes and specifically to ask whether the lower Ne for the Y-chromosome can fully explain the apparently higher differentiation in this chromosome, compared to X-linked genes. To evaluate this, we first described the partitioning of geographic variants for our European samples, then we applied a measure of “net divergence” (Da, based on existing theory) that corrects for the amount of polymorphism within populations to circumvent the drawbacks of FST (reviewed earlier).
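The contrast between a relative measure such as FST and net divergence can be illustrated with a minimal sketch. It assumes Nei's net divergence, Da = dAB − (πA + πB)/2, and the Hudson et al. (1992a) form of FST, 1 − Hw/Hb; the exact estimators used in our analyses are those given in the Materials and methods, and the function names and toy haplotypes below are hypothetical.

```python
from itertools import combinations, product

def per_site_distance(x, y):
    """Proportion of sites that differ between two aligned sequences."""
    return sum(a != b for a, b in zip(x, y)) / len(x)

def within(pop):
    """Mean pairwise per-site difference among distinct sequences of one sample."""
    pairs = list(combinations(pop, 2))
    return sum(per_site_distance(x, y) for x, y in pairs) / len(pairs)

def between(pop_a, pop_b):
    """Mean per-site difference between sequences drawn from different samples."""
    pairs = list(product(pop_a, pop_b))
    return sum(per_site_distance(x, y) for x, y in pairs) / len(pairs)

def differentiation(pop_a, pop_b):
    pi_w = (within(pop_a) + within(pop_b)) / 2.0
    d_ab = between(pop_a, pop_b)
    return {"pi_within": pi_w, "Da": d_ab - pi_w, "FST": 1.0 - pi_w / d_ab}

def hap(length, first, private_site=None):
    """Toy haplotype: all 'A', a chosen first base, and an optional private 'C'."""
    sites = ["A"] * length
    sites[0] = first
    if private_site is not None:
        sites[private_site] = "C"
    return "".join(sites)

# "X-like" toy data: one fixed difference plus many private polymorphisms.
x_a = [hap(12, "A", i) for i in (1, 2, 3, 4)]
x_b = [hap(12, "T", i) for i in (5, 6, 7, 8)]
# "Y-like" toy data: almost no variation, one variant at high frequency in B.
y_a = [hap(12, "A")] * 4
y_b = [hap(12, "T")] * 3 + [hap(12, "A")]

x_stats = differentiation(x_a, x_b)
y_stats = differentiation(y_a, y_b)
print(x_stats, y_stats)
```

In these toy samples the low-diversity "Y-like" locus has the higher FST (2/3 versus 1/3) even though its net divergence is the lower of the two (1/24 versus 1/12 per site), because FST scales the between-population difference by a much smaller within-population diversity.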


Population differentiation is considerably higher for Y-linked than X-linked genes (assessed using KST), for any of the range of different geographic group sets. This led us to ask whether the partitioning of polymorphism on sex-linked chromosomes corresponds with what is already known about geographic structure in S. latifolia.

The MITE insertions that are mapped onto the Y-haplotype network based on SNPs (shown in Fig. 2) suggest some degree of north–south isolation in S. latifolia, rather than a simple subsampling of southern diversity during colonization of the northern populations of this species, because Y haplotypes in the north and northeastern populations are distinctive (in terms of their MITE insertions). The Y haplotypes (based on the SNPs) in northeastern populations are also moderately distinctive (see Fig. 2), and again the major difference appears to be between northern and southern European populations.

These results are also consistent with the north–south geographic structure observed in the chloroplast genome, which appears to be influenced by whether S. dioica is present or absent. In a phylogeographic study by Prentice et al. (2008), mostly concerned with S. dioica but including 18 S. latifolia plants, all S. latifolia plants sampled in the north had chloroplast haplotypes that were the same as the local S. dioica haplotypes, falling into two of the three major S. dioica types, or, in the most eastern populations, similar to these. In contrast, the few plants sampled from southern European populations more rarely contain haplotypes identical to ones found in S. dioica, but their haplotypes (29 and 30) tend to be similar to haplotype 1, which predominates in northern S. dioica (Prentice et al. 2008). These results suggest that the ancestry of southern European S. latifolia may less often include recent introgression from S. dioica, which is consistent with the absence of S. dioica in Mediterranean regions (although more sampling of S. latifolia from these regions is clearly necessary). However, it is unclear whether the north–south division of the Y chromosome haplotypes is connected with these effects, or independent of them. Our chief reason for including the data on Y chromosome MITE insertions was to give additional information about selection, and we discuss these data further below. Although the Y chromosome phylogeny corresponds fairly well with the S. latifolia geographic structure based on traits that might be expected to be locally adapted, including phenotypic and biochemical characters (Mastenbroek et al. 1984), the clusters of individuals inferred from X-linked variation do not, and this is the expected pattern.
Y-linked loci do not recombine and so their phylogeny reflects the history of the Y chromosome, which (because it is inherited as a single genetic locus and because the evolutionary variance of a single locus is very high) may differ from that of the population. Even though X- and Y-linked genes both migrate via pollen and seed, and Y-linked genes have higher migration rates than X-linked genes (see above and File 1 in Supporting information), the Y is predicted to have higher differentiation than the X, due to its lower effective population size (Laporte and Charlesworth 2002). The low Y effective population size allows little variation to be maintained, and also allows genetic drift to fix neutral variants that arise locally. This can create regional “private” variants that give clear signals of the population's history. Moreover, because genes on the X recombine, immigrant haplotypes are quickly recombined with resident sequences, and cannot be clearly recognized, whereas immigrant Y haplotypes remain recognizable as long as they persist in a population. The X will therefore contain a less clear signal of population history. Similarly, differentiation between outcrossing populations, where recombination occurs, is generally low for nuclear genes (Pannell and Charlesworth 2000; Charlesworth 2006), and higher for organelle genomes, with lower effective population sizes (Newton et al. 1999, see Muir and Filatov 2007; Qiu et al. 2010 for cyto-nuclear comparisons in Silene).
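The effect of effective population size on differentiation can be illustrated with a minimal sketch based on Wright's infinite-island approximation, FST ≈ 1/(1 + 2Cm), where C is the number of gene copies per deme. This is a deliberately simplified assumption, not the model of Laporte and Charlesworth (2002): it takes a 1:1 sex ratio and equal variances in reproductive success, so that a deme of N individuals carries 1.5N X copies and 0.5N Y copies, and it treats m as a single per-copy migration rate for each chromosome. The function names and parameter values are hypothetical.

```python
def island_fst(gene_copies, m):
    """Wright's infinite-island approximation for a neutral locus."""
    return 1.0 / (1.0 + 2.0 * gene_copies * m)

def expected_fst(n_individuals, m_x, m_y):
    """Expected FST for X- and Y-linked loci in demes of n_individuals.

    Assumes a 1:1 sex ratio: 1.5 X copies and 0.5 Y copies per individual
    on average (simplifying assumption of this sketch).
    """
    return {
        "X": island_fst(1.5 * n_individuals, m_x),
        "Y": island_fst(0.5 * n_individuals, m_y),
    }

# Hypothetical parameter values, chosen only to show the direction of the effect.
neutral = expected_fst(n_individuals=100, m_x=0.05, m_y=0.05)
higher_y_migration = expected_fst(n_individuals=100, m_x=0.05, m_y=0.10)
print(neutral, higher_y_migration)
```

With equal migration rates the Y has the higher expected FST purely because it has fewer copies per deme, and in this example it remains higher even when its migration rate is doubled. Any further reduction in the Y effective size, for example through selection at linked sites, would raise its FST further without any local adaptation.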


Selective sweeps on the Y chromosome can potentially lower diversity species-wide, provided that there is enough migration (Santiago and Caballero 2005). Our results for KST for the X-linked genes indeed suggest that migration is not strongly restricted between our study populations. The finding that Y-linked haplotypes are quite variable, and show some geographic structure, thus appears inconsistent with a very recent species-wide selective sweep on the Y chromosome. If selection has acted, it is therefore most likely to be selection causing local adaptation.

Although KST (or FST) in our samples is much higher for Y- than for X-linked genes, net sequence divergence (Da) between populations is low for both chromosomes, which does not support the local adaptation scenario. If the high KST is instead due to hitch-hiking processes (reviewed in the Introduction), diversity within populations should be reduced, because these processes act within populations, and this will lead to high values of differentiation measures, as observed. Our new results add to the previous observation (see above, and review in Laporte and Charlesworth 2002) that Y-linked genes in S. latifolia do indeed have much lower nucleotide diversity than their X-linked homologues.

Da is not, of course, a general test of local adaptation. Da values for different genomic regions will depend on their mutation rates, regardless of the presence of selection, so the low Da value for the Y could reflect a low mutation rate. However, two of the three X–Y gene pairs we analyzed have higher mutation rates for the Y than for the X lineages in S. latifolia (SlXY4 and SlXY7, see Filatov and Charlesworth 2002; Filatov 2005; Qiu et al. 2010). The lower net sequence divergence for the Y is therefore conservative for assessing potential selection. A more serious problem is that local adaptation can occur without Da being high, because Da depends on the divergence time. Any local adaptation of S. latifolia to northern environments must have occurred since the end of the last glaciation (<∼20,000 years), which may not have allowed time for sequence differences to accumulate between Y haplotypes. However, the southern populations may have been somewhat isolated from the source of the northern ones before that time, so the maximum divergence time between the northern and southern populations could be longer. Our results cannot, therefore, definitively disprove local adaptation. A difference in Da is difficult to interpret without other information, but may be helpful in the search for local adaptation, when combined with other signals such as a bias in allele frequency spectra or the fixation (or near fixation) of alleles.
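The dependence of Da on divergence time and mutation rate can be illustrated with a rough calculation. The sketch assumes complete isolation since the split, for which the expected net divergence per site is approximately 2μT (μ per site per generation, T in generations). The mutation rate and generation time used below are hypothetical placeholders, not estimates for S. latifolia, and the function names are ours.

```python
def expected_net_divergence(mu, generations):
    """Expected Da per site after complete isolation: roughly 2 * mu * T."""
    return 2.0 * mu * generations

def generations_to_reach(da, mu):
    """Isolation time (in generations) needed for Da to reach a given value."""
    return da / (2.0 * mu)

# Hypothetical values for illustration only.
MU = 1e-8                # assumed mutation rate per site per generation
YEARS = 20_000           # upper bound on postglacial time from the text
GENERATION_TIME = 1.0    # assumed years per generation

da_expected = expected_net_divergence(MU, YEARS / GENERATION_TIME)
print(da_expected)
```

Under these placeholder values, 20,000 generations of complete isolation would yield an expected Da of only about 4 × 10⁻⁴ per site, and any gene flow would lower it further. This is why a low Da cannot by itself exclude recent local adaptation.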

Another possible test of whether the large difference in FST can be explained under neutrality is investigated in File 1 in Supporting information: “A potential test for local adaptation.” The test compares the ratio of estimated within-population diversity values for the X- and Y-loci with the ratio AX/AY, where A = (1 − FST)/FST. Under the infinite island model, assuming neutrality, both ratios depend only on the ratio of the genes’ effective population sizes. If local adaptation causes lower effective migration for Y chromosomes, the ratio of A values would be increased above the neutral expectation, and should thus exceed the ratio of diversity values. However, this again makes several assumptions that are probably not met: a 1:1 sex ratio, equal variances of male and female progeny numbers, and particularly, as discussed below, equal migration rates for X- and Y-linked genes. Note that a higher mutation rate for the Y than for genes on the X chromosome will lower the ratio for diversity, which could create the appearance of selection. However, this makes the test conservative.

With our results (southern vs. other populations, and using six Ys and all the X sequences), the ratio for A is 10.7, considerably less than the ratio for diversity (16.3), which does not suggest that the Y chromosomes are locally adapted. However, given that the migration rates must differ for the two types of genes used (see File 1 in Supporting information), the test might fail to detect selection. A Y migration rate higher than for X-linked genes will lower the ratio for A (for our data, a 52% higher Y migration rate yields equality of the two ratios). A higher migration rate difference is required if the Y-linked genes have a higher mutation rate, as occurs in S. latifolia (Filatov and Charlesworth 2002; Filatov 2005; Qiu et al. 2010). Estimates from US populations of S. latifolia (McCauley 1997) suggest at least 20-fold higher pollen than seed migration rates, potentially yielding a migration rate for Y-linked genes close to the maximum of threefold higher than the X rate. Thus this test is inconclusive, and adaptation on the Y chromosome must currently be regarded as uncertain, given that the much higher FST for the Y than the X chromosome can readily be explained by low within-population Y diversity. The higher values of differentiation measures for Y-linked genes probably result largely from reduced total variation on the Y chromosome, which in turn reflect deterministic processes lowering effective population sizes of evolving Y-chromosomes.
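The 52% figure quoted above follows from simple arithmetic, sketched here under the assumption that, for each chromosome, A is proportional to the product of effective population size and migration rate, whereas within-population diversity is proportional to effective population size alone (equal mutation rates). The function name is hypothetical; the two ratios are those reported in the text.

```python
def y_migration_excess_for_equality(ratio_a, ratio_diversity):
    """Fractional excess of the Y over the X migration rate that would make
    the A ratio equal to the diversity ratio under neutrality.

    Assumes A_X / A_Y = (N_X * m_X) / (N_Y * m_Y) and
    diversity ratio = N_X / N_Y, so m_Y / m_X = ratio_diversity / ratio_a.
    """
    return ratio_diversity / ratio_a - 1.0

RATIO_A = 10.7           # A_X / A_Y reported in the text
RATIO_DIVERSITY = 16.3   # X / Y within-population diversity ratio in the text
MAX_EXCESS = 2.0         # a threefold higher Y rate is a 200% excess

excess = y_migration_excess_for_equality(RATIO_A, RATIO_DIVERSITY)
print(round(excess * 100), excess < MAX_EXCESS)
```

The required excess (about 52%) is well below the maximum of a threefold higher Y migration rate, so the observed ratios are compatible with neutrality once unequal migration is allowed, which is why the test is inconclusive.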

Finally, selection among Y haplotypes might potentially be detectable using data on the geographical extent of haplotypes, together with information on haplotype ages: a wide geographic spread of a young haplotype might suggest a selective sweep. The north European cluster of Y haplotypes with large MITE numbers not found elsewhere, and comparatively low diversity may have been caused in this way (all these haplotypes differ by only one MITE insertion and three SNP mutations). The few northern samples with haplotypes resembling those from southern populations (whose haplotypes have a range of different MITE insertions, and differ by many SNPs, suggesting that they are older) could represent survivors of a pre-sweep population. However, it is difficult with currently available data to distinguish between selection and the nonselective spread of one Y haplotype across large regions of northern Europe due to a bottleneck during the migration of the species to this region.

As noted above, the Y does not show the excess of rare variants expected under either the selective sweep or bottleneck hypothesis. Previous analyses of individual loci found positive Tajima's D for Y-linked genes, but the samples used came from only a few populations, with several alleles from each. Such samples will often yield positive Tajima's D values (reviewed by De and Durrett 2007). Our present dataset, with two Y-linked sequences from each of a number of populations, is likely to show the same effect, but less strongly, and we do not now find high positive values (Table 1, see also Qiu et al. 2010). Despite the positive D value for the entire sample, D is significantly negative for the northern population samples of Y-linked alleles, when calculated for all types of polymorphisms (almost all silent site variants). This result, together with the slightly lower nucleotide diversity in the northern population samples (Table 1), might suggest a recent expansion of northern populations, or recent adaptation to a different environment (or both). A bottleneck should create a similar signal in the nuclear and organelle genomes. The chloroplast genome (with the same effective population size as the Y, assuming a 1:1 sex ratio) indeed exhibits an excess of rare variants, consistent with a selective sweep (Muir and Filatov 2007). Nevertheless, interpretations in all of these markers are complicated by introgression from S. dioica, particularly if most hybrids had a S. latifolia Y chromosome.


The use of sequence data to identify candidate genes with locally adapted alleles, as an alternative to conducting reciprocal transplant experiments, has attracted considerable interest. These methods have focused on detecting outliers from FST distributions in genome-wide diversity studies (Beaumont 2005). Unfortunately, as we have emphasized here using the Y chromosome of S. latifolia, high values of FST do not necessarily imply positive (or diversifying) selection, as high differentiation can be due to low diversity within populations. It remains difficult to determine the underlying cause of an outlier FST value; as a precaution, investigators may additionally report net divergence (Da).

Associate Editor: J. Kelly


We thank NERC for funding a collaborative project in the Oxford and Edinburgh labs, BBSRC for funding for the Edinburgh lab, and The Leverhulme Trust for funding for the Oxford lab. We are grateful to A. Harper for contributing Y-chromosome sequences for several genes and to S. Glémin for suggesting the ratio test for selection.