COMPARATIVE POPULATION GENOMICS IN COLLINSIA SISTER SPECIES REVEALS EVIDENCE FOR REDUCED EFFECTIVE POPULATION SIZE, RELAXED SELECTION, AND EVOLUTION OF BIASED GENE CONVERSION WITH AN ONGOING MATING SYSTEM SHIFT
Khaled M. Hazzouri,
Department of Ecology and Evolutionary Biology, University of Toronto, Toronto, Canada
Selfing species experience reduced effective recombination rates and effective population size, which can lead to reductions in polymorphism and the efficacy of natural selection. Here, we use Illumina transcriptome sequencing and population resequencing to test for changes in polymorphism, base composition, and selection in the selfing angiosperm Collinsia rattanii (Plantaginaceae) compared with its more outcrossing sister species Collinsia linearis. Coalescent analysis indicates intermediate species divergence (500,000–1 million years) with no ongoing gene flow, but also evidence that the C. rattanii clade remains polymorphic for floral morphology and mating system, suggesting either an ongoing shift to selfing or a potential reversal from selfing to outcrossing. We identify a significant reduction in polymorphism in C. rattanii, particularly within populations. Analysis of polymorphisms suggests an elevated ratio of unique nonsynonymous to synonymous polymorphism in C. rattanii, consistent with relaxed selection in selfing lineages. We additionally find higher linkage disequilibrium and differentiation, lower GC content at variable sites, and reduced expression of genes important in pollen production and pollinator attraction in C. rattanii compared with C. linearis. Together, our results highlight the potential for rapid shifts in the efficacy of selection, gene expression, and base composition associated with ongoing evolution of selfing.
The shift in mating system from outcrossing to selfing is one of the most prevalent evolutionary transitions in flowering plants, yet over the long term is hypothesized to be an evolutionary dead-end (Stebbins 1974; Barrett et al. 1996; Takebayashi and Morrell 2001; Igic et al. 2008). In addition to major effects on mating patterns and floral morphology (reviewed in Sicard and Lenhard 2011), mating system transitions are expected to affect the patterns of molecular polymorphism, molecular evolution, and base composition in the genome (Charlesworth and Wright 2001). Compared to outcrossing, selfing decreases the effective population size by reducing the number of independent gametes sampled for reproduction (Pollak 1987; Nordborg 2000), reduces the effective rate of recombination because of limited heterozygosity and, as a consequence, increases linkage disequilibrium (LD) among loci. Increased LD can also cause a further reduction in effective population size because of selection at linked sites (selective sweeps, background selection, and weak Hill–Robertson interference: Charlesworth et al. 1993; McVean and Charlesworth 2000; Charlesworth and Wright 2001). In addition, ecological and demographic factors associated with selfing, such as frequent extinction–recolonization, founder events, and selection on reproductive assurance, can further reduce the effective population size, and diminish within-population and species-wide genetic diversity (Baker 1955; Schoen and Brown 1991; Hamrick and Godt 1996; Pannell and Charlesworth 2000; Ingvarsson 2002; Foxe et al. 2009; Guo et al. 2009; Ness et al. 2010; Busch et al. 2011; Pettengill and Moeller 2012). Limited gene flow via pollen should also increase population subdivision in selfers compared to outcrossers (Ingvarsson 2002). 
Under an equilibrium island model of subdivision, strong differentiation among demes increases the likelihood that alleles are retained and therefore has the counteracting effect of increasing effective population size (Wright 1931). However, additional factors such as extinction and differential migration will often lead to reduced effective size with higher subdivision (Whitlock and Barton 1997). Finally, if selfing evolved recently via a severe founder event, this may lead to particularly extreme genome-wide reductions in effective population size (Foxe et al. 2009; Guo et al. 2009; Busch et al. 2011; Pettengill and Moeller 2012).
Demographic and genetic factors causing reduced effective population size in selfing populations should also reduce the efficacy of natural selection compared to outcrossing species, particularly at weakly selected sites. One consequence is that slightly deleterious mutations are more likely to be segregating and become fixed in selfing than in outcrossing species, which may increase the ratio of nonsynonymous to synonymous polymorphism and divergence (Glémin 2007). Furthermore, selection on advantageous mutations may be less efficient in selfers than in outcrossers (Slotte et al. 2010), although if advantageous mutations are predominantly recessive the predictions for beneficial mutations are less clear (Glémin 2007). Over long evolutionary timescales, the loss of diversity, accumulation of deleterious mutations and reduced scope for adaptive evolution may increase rates of extinction in selfing lineages (Lynch et al. 1995; Schultz and Lynch 1997), with important implications for long-term diversification of selfing and outcrossing lineages. Although recent macroevolutionary studies are consistent with the prediction of higher extinction rates in selfing lineages (e.g., Goldberg et al. 2010; Ferrer and Good 2012), the underlying mechanism remains unclear, and the evidence for reduced efficacy of selection in selfing species remains limited.
In addition to the effects on nonsynonymous polymorphism and divergence, selfing species are expected to experience reduced selection on synonymous codons associated with codon usage bias (Marais et al. 2004; Wright et al. 2004; Qiu et al. 2011). As the strength of selection for synonymous codon usage is weaker compared to purifying selection against deleterious nonsynonymous mutations (Zeng and Charlesworth 2009), we expect that the effects of selfing and effective population size may be stronger on the patterns of codon usage than on nonsynonymous sites (McVean and Charlesworth 2000; Qiu et al. 2011).
Conversely, high levels of homozygosity in selfing species are expected to reduce the impact of GC-biased gene conversion (BGC; Marais et al. 2004). BGC is a process that preferentially converts A/T into G/C at sites heterozygous for AT and GC (Marais 2003). The net effect of BGC is to increase the GC content of recombining DNA sequences and genomes (Glémin 2011), and this process can act as a type of selfish genetic element, reducing the efficacy of selection against deleterious mutations (Glémin 2010). Because of high homozygosity, highly selfing taxa will experience a strong reduction in the opportunity for BGC, and therefore exhibit reduced GC content (Marais et al. 2004; Wright et al. 2007). Thus, if the effect is strong enough, differences in biased gene conversion between selfing and outcrossing lineages may act to reduce the efficacy of selection in outcrossing lineages, mitigating the general predictions about mating system and selection efficacy (Glémin 2010).
There is clear evidence that selfing markedly decreases nucleotide diversity in numerous genera, including Leavenworthia (Liu et al. 1998; Busch et al. 2011), Arabidopsis (Savolainen et al. 2000; Wright et al. 2003), Capsella (Foxe et al. 2009; Guo et al. 2009), Eichhornia (Ness et al. 2010), Mimulus (Sweigart and Willis 2003), Lycopersicon (Baudry et al. 2001), Clarkia (Pettengill and Moeller 2012), several grasses (Glémin et al. 2006), and the nematode genus Caenorhabditis (Graustein et al. 2002; Cutter 2006). However, the general effect of mating system shifts on the strength and efficacy of selection is less clear. In particular, studies to date find little to no effect of selfing on the strength of selection on nonsynonymous sites or on codon usage bias, suggesting that effective population sizes in most selfing lineages may be high enough, or the proportion of slightly deleterious mutations low enough, to counteract the effects of reduced recombination rates (Wright et al. 2002; Foxe et al. 2008; Haudry et al. 2008; Escobar et al. 2010).
However, these studies focused primarily on patterns of substitution rates, and this may give a limited picture for several reasons. First, in some cases selfing may have evolved very recently relative to species divergence, meaning that the effects of selfing may be limited to a small proportion of overall divergence (Escobar et al. 2010; Qiu et al. 2011). Second, the fitness effects of nonsynonymous mutations undoubtedly range from strongly deleterious, through neutral and beneficial, which means that a simple comparison of substitution rates provides only partial information about the relaxation of selection. Because of this, studies that combine polymorphism and divergence across multiple genes are needed for quantifying the interaction between mating systems, demographic history, and the strength of selection. Furthermore, the most powerful tests will contrast patterns of polymorphism within related selfing versus outcrossing lineages, to distinguish contemporary selective forces subsequent to mating system evolution from historical selection that occurred prior to the evolution of selfing. Indeed, recent analyses of large-scale polymorphism and divergence patterns in Arabidopsis and Capsella for both amino acid mutations (Slotte et al. 2010) and codon usage bias (Qiu et al. 2011) provide more evidence for changes in selection. To more robustly assess the role of mating system on the efficacy of selection, the use of next generation sequencing can enable large-scale studies of polymorphism and divergence, even in species with little prior genomic information (Ness et al. 2011).
The self-compatible genus Collinsia (Plantaginaceae) has experienced numerous repeated shifts in flower size (Randle et al. 2009; Baldwin et al. 2011) and mating system (Kalisz et al. 2012) making it an outstanding model to investigate the causes and consequences of mating-system evolution. Replicate shifts to high selfing in Collinsia are thought to be caused primarily by selection for reproductive assurance through autonomous selfing (e.g., Kalisz et al. 2004) that is driven by local population size/mate availability (Kennedy and Elle 2008) and/or local pollination environment (Kalisz and Vogler 2003). The presence of multiple independent sister species pairs with contrasting mating systems makes Collinsia an important target to test predictions about the evolution of selfing and its genomic consequences. However, at present the availability of genomic resources for Collinsia is limited.
In this study, we investigate and contrast patterns of sequence polymorphism, base composition, and codon usage bias in two sister species of Collinsia with contrasting mating system. We use a combination of floral transcriptome sequencing of pooled population samples and Sanger sequencing of a broader population sample to compare levels of polymorphism at synonymous and nonsynonymous sites, as well as base composition and codon usage in the two sister species. In particular, we address three main questions:
To what extent is the effective population size reduced and what are the timescale and demographic history of the shift to selfing in Collinsia rattanii?
Is there evidence for relaxed purifying selection on nonsynonymous polymorphism and codon usage in C. rattanii compared with Collinsia linearis?
Has the shift to selfing led to a change in the strength of GC-biased gene conversion?
Materials and Methods
COLLINSIA SPECIES PAIR
Our sister species pair includes C. linearis, a relatively large-flowered, mixed mating species (average outcrossing rate t = 0.57; Kalisz et al. 2012) and C. rattanii, a predominantly selfing, small-flowered species (average t = 0.12; Kalisz et al. 2012). As with other sister pairs, C. linearis and C. rattanii exhibit strong range overlap and frequently co-occur (Randle et al. 2009). The geographic range of C. linearis comprises the Klamath River watershed, which runs through Oregon and northern California. In contrast, the range of the highly selfing C. rattanii is larger: it completely overlaps the range of C. linearis where the two species are often found sympatrically, and also extends into northwestern California and the north Sierra Nevada (Randle et al. 2009; Baldwin et al. 2011). These species diverged recently (estimated at approximately 1.45 Mya by Baldwin et al. 2011) and exhibit incomplete reproductive isolation upon crossing (Randle and Kalisz, unpubl. ms).
Naturally pollinated seeds of C. linearis and C. rattanii were collected from eight wild populations across the species’ ranges (Table 1). Seeds were germinated under species-specific conditions (C. linearis 2 weeks 20°C day, 14°C night, 12-h day; C. rattanii 8 weeks 12°C day, 7°C night, 10-h day). Seedlings were transplanted into 10 cm pots and given a 1-month vernalization treatment (constant 4°C). Plants were then grown to flowering under standardized conditions (23°C day/15°C night, 12-h day) at the University of Pittsburgh plant growth facility. Flower bud and/or leaf tissue was collected and used as described later.
Table 1. Locations of Collinsia linearis and Collinsia rattanii populations used for transcriptome assembly (⧫ 12 individuals pooled/population, flower buds sampled) or Sanger sequencing (all populations, N = number of individuals/population sequenced)
County, State; Latitude, Longitude (Degree, Min., Sec.)
Total RNA was extracted from a pooled sample of early flower buds collected from 12 individuals from a single population each of C. rattanii (M3; Table 1) and C. linearis (THR; Table 1) using the RNeasy Plant Mini kit (Qiagen) according to the manufacturer's instructions. Paired-end (PE) cDNA libraries were generated for each species and one lane of 108 bp PE Illumina GAII sequencing was run for each of them. This generated a total of approximately 71 million reads (7.7 Gbp) per lane passing quality filter. Sequencing and library construction were conducted at the Genome Quebec Innovation Centre at McGill University.
Details of Illumina transcriptome de novo assembly approaches and identification of cross-species orthologous transcripts are described in the Supplementary Methods. Expression analysis, including the identification of differentially expressed genes, was conducted using Cufflinks v 0.83 (Trapnell et al. 2010). We tested whether the differentially expressed transcripts tended to belong to particular functional categories using enrichment tests on the GO terms annotated by BLAST2GO (Conesa et al. 2005). Further details on the expression analysis and functional enrichment tests are in the Supplementary Methods.
Single nucleotide polymorphism (SNP) analysis of the transcriptome data was conducted by mapping transcriptome reads of both C. rattanii and C. linearis using the Burrows–Wheeler Aligner (BWA; Li and Durbin 2009) to the C. rattanii in-frame orthologous transcripts as reference. As confirmation, we also conducted analysis using the C. linearis transcripts as reference, and the quantitative results were unaffected (data not shown). We used Samtools and bcftools (Li and Durbin 2009; Li et al. 2009) to call SNPs and calculate genotype likelihoods. In-house Python scripts were used to process SNPs in the VCF file from the genotype likelihoods, using a cutoff of Q = 40 for the second-highest genotype likelihood and a depth cutoff of DP = 20. SNPs were defined as shared between species or unique to one species based on the genotype likelihood calls, and classified as synonymous or nonsynonymous based on the inferred reading frame.
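The filtering and classification step described above can be illustrated with a minimal sketch. This is not the in-house script used in the study: the `SiteCall` record layout, the function names, and the choice of inclusive (>=) cutoffs are assumptions made for illustration; only the thresholds Q = 40 and DP = 20 come from the text.

```python
from dataclasses import dataclass

MIN_QUAL = 40   # cutoff on the second-highest genotype likelihood (from the text)
MIN_DEPTH = 20  # read-depth cutoff (from the text)


@dataclass
class SiteCall:
    """Per-species summary at one reference position (hypothetical layout)."""
    depth: int             # read depth (DP)
    second_best_pl: int    # phred-scaled likelihood of the second most likely genotype
    alleles: frozenset     # alleles present in the most likely genotype


def passes_filters(call):
    # Inclusive thresholds are an assumption; the text gives only the cutoff values.
    return call.depth >= MIN_DEPTH and call.second_best_pl >= MIN_QUAL


def classify_site(sp1, sp2):
    """Classify one aligned site from the calls in species 1 and species 2.

    Returns None when either call fails the filters, otherwise one of
    'shared', 'complex', 'unique_1', 'unique_2', 'fixed', or 'invariant'.
    """
    if not (passes_filters(sp1) and passes_filters(sp2)):
        return None
    poly1 = len(sp1.alleles) > 1
    poly2 = len(sp2.alleles) > 1
    if poly1 and poly2:
        # Both species segregate; 'shared' only if they segregate the same alleles.
        return "shared" if sp1.alleles == sp2.alleles else "complex"
    if poly1:
        return "unique_1"
    if poly2:
        return "unique_2"
    # Neither species is polymorphic: fixed difference or invariant site.
    return "fixed" if sp1.alleles != sp2.alleles else "invariant"
```

Synonymous versus nonsynonymous classification would follow as a separate step that translates each codon in the inferred reading frame.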
CODON USAGE, BASE COMPOSITION, AND SUBSTITUTION ANALYSIS
Optimal codons were identified in Collinsia as described in Supplementary Methods. Substitution rates were estimated using the program CODEML implemented in the PAML 4.4 package (Yang 2007).
To compare base composition and codon preferences between species, we focused on sites that were invariant within species but divergent between species. Specifically, we counted the number of cases where one species was fixed for a GC base and the other for an AT base at each gene, and tested for differences between the two Collinsia species by means of signed-rank tests. Similarly, we determined the number of times one species had a preferred codon and the other species an unpreferred codon among synonymous substitutions. We determined this for the whole set of preferred codons, as well as for GC-ending and AT-ending codons separately using signed-rank tests. Statistical analyses were performed with R 2.13.0 (R Development Core Team 2011).
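As an illustration of the counting step for base composition, the sketch below tallies, for a single gene, the sites that are invariant within each species but fixed for a GC base in one species and an AT base in the other. The function name and the input layout (lists of aligned, equal-length sequences per species) are assumptions for illustration, not the scripts used in the study.

```python
GC = set("GC")
AT = set("AT")


def count_gc_at_fixed(seqs_sp1, seqs_sp2):
    """Count GC/AT fixed differences between two species for one gene.

    seqs_sp1, seqs_sp2: lists of aligned, equal-length sequences.
    Returns (n_sp1_GC_sp2_AT, n_sp1_AT_sp2_GC).
    """
    length = len(seqs_sp1[0])
    sp1_gc = 0
    sp2_gc = 0
    for i in range(length):
        col1 = {s[i] for s in seqs_sp1}
        col2 = {s[i] for s in seqs_sp2}
        if len(col1) != 1 or len(col2) != 1:
            continue  # variable within at least one species: skip
        (b1,) = col1
        (b2,) = col2
        if b1 in GC and b2 in AT:
            sp1_gc += 1
        elif b1 in AT and b2 in GC:
            sp2_gc += 1
    return sp1_gc, sp2_gc
```

The resulting per-gene pairs of counts are the paired observations that a signed-rank test would compare across genes; the same pattern applies to preferred versus unpreferred codons once a codon table is supplied.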
PRIMER DESIGN, DNA EXTRACTION, AND SANGER SEQUENCING
To gain a more thorough understanding of polymorphism patterns within and between populations, we also did targeted resequencing of 17 genes across multiple populations of each species. For four C. rattanii and four C. linearis populations, we sequenced 2 to 5 individuals/population (Table 1). Genomic DNA was extracted from leaf material using a Qiagen DNeasy plant kit.
Based on the orthologous genes generated from the de novo assembly of both species, and using BLASTx (Altschul et al. 1997) similarity against Mimulus guttatus, conserved exonic regions were identified by aligning Collinsia transcripts from both species to M. guttatus genomic regions. Only single-copy transcripts with a clear homolog in M. guttatus were used. Seventeen primer pairs (sequences available upon request) were designed to amplify 500 to 700 bp fragments using Primer3 (Rozen and Skaletsky 2000). A standard PCR reaction of 30 cycles with optimized annealing temperatures for each primer pair was used for all genes. Each cycle included 30 s of denaturing at 95°C, 40 s of annealing, and 1 min of extension at 72°C. Both forward and reverse strands of the amplicons were sequenced directly at the Genome Quebec Innovation Centre at McGill University, using the same primers as for the amplification.
For the 17 loci sequenced in the eight populations of Collinsia, haplotypes were inferred using PHASE 2.1 (Stephens et al. 2001; Stephens and Donnelly 2003), as implemented in DnaSP v 4.50 (Rozas et al. 2003). Diversity statistics for the 17 resequenced loci, including θW (Watterson 1975), π (Tajima 1989), and Tajima's D (Tajima 1989), were calculated for synonymous and nonsynonymous sites using a modified version of Polymorphorama (Bachtrog and Andolfatto 2006). In addition, to partition diversity, we calculated the average number of pairwise differences at synonymous sites between species, between populations within species, within populations (excluding within-individual comparisons), and within individuals, using Perl scripts written by S. Wright. Because these latter analyses include between-species divergence estimates, all pairwise comparisons were calculated using the Jukes–Cantor correction. In addition, we estimated population structure (Fst) by calculating 1−πS/πT, where πS is the average pairwise nucleotide diversity within populations and πT is the total pairwise nucleotide diversity across populations. Using custom Perl scripts, we calculated the number of synonymous and nonsynonymous unique and shared polymorphisms, and fixed differences between the two species.
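The Fst calculation above can be sketched as follows. This is a simplified illustration rather than the Perl scripts used in the study: it takes an unweighted mean of within-population diversity, uses all sites rather than synonymous sites only, and does not exclude within-individual comparisons. The function names are hypothetical.

```python
import math
from itertools import combinations


def jc_distance(seq_a, seq_b):
    """Jukes-Cantor corrected distance between two aligned sequences."""
    sites = [(a, b) for a, b in zip(seq_a, seq_b) if a in "ACGT" and b in "ACGT"]
    p = sum(a != b for a, b in sites) / len(sites)  # raw proportion of differences
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def mean_pairwise(seqs):
    """Average Jukes-Cantor distance over all pairs of sequences."""
    pairs = list(combinations(seqs, 2))
    return sum(jc_distance(a, b) for a, b in pairs) / len(pairs)


def fst(populations):
    """Fst = 1 - pi_S / pi_T for a list of populations (each a list of sequences)."""
    pi_s = sum(mean_pairwise(pop) for pop in populations) / len(populations)
    pooled = [seq for pop in populations for seq in pop]
    pi_t = mean_pairwise(pooled)
    return 1.0 - pi_s / pi_t
```

With small samples this estimator can be negative when within-population diversity exceeds the pooled diversity, and it is undefined when the pooled sample is invariant.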
We estimated intralocus recombination and LD using unphased diploid data with the LDhat software (McVean et al. 2002) implemented as a composite likelihood method (Hudson 2001) for pairs of polymorphic informative sites, and calculated the correlation between pairwise LD (r²) and physical distance between sites (d). We also estimated the population recombination parameter (ρ = 4Ner, where Ne is the effective population size, and r is the recombination rate expressed as the expected crossover events per generation per base pair between adjacent SNPs). Using LDhat, we calculated the minimum number of recombination events (Hudson and Kaplan 1985). The effect of mating system on LD was assessed using the ratio of the population recombination parameter ρ to the population mutation parameter θ (i.e., ρ/θ, where θ = 4Neμ and μ is the rate of mutation; Nordborg 2000). Using the libsequence analysis package (Thornton 2003), LD values were calculated between pairs of phased sites using the squared allele frequency correlation measure, r². We also used an R script (LDit.r) written by J. Ross-Ibarra (http://www.rilab.org/code/files/LDit.html) that uses equation 1 from Remington et al. (2001) to estimate 4Ner via a nonlinear regression and plot the decay of LD over distance. To minimize the effects of low polymorphism in downwardly biasing our estimates, we only estimated effective recombination rates and the minimum number of recombination events at loci with at least 3 nonsingleton segregating sites. Note that estimates of recombination with small numbers of segregating sites (< 10) are still likely to have a high variance, although they are not expected to generate a consistent bias between species.
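For reference, the squared allele frequency correlation between two biallelic sites can be computed from phased haplotypes as in the sketch below. This is a generic illustration of the statistic, not the libsequence implementation; the function name and input layout are assumptions.

```python
def r_squared(haplotypes, i, j):
    """Squared allele-frequency correlation between biallelic sites i and j.

    haplotypes: list of phased sequences (strings) of equal length.
    """
    n = len(haplotypes)
    ref_i = haplotypes[0][i]  # arbitrary reference allele at site i
    ref_j = haplotypes[0][j]  # arbitrary reference allele at site j
    p_i = sum(h[i] == ref_i for h in haplotypes) / n
    p_j = sum(h[j] == ref_j for h in haplotypes) / n
    p_ij = sum(h[i] == ref_i and h[j] == ref_j for h in haplotypes) / n
    d = p_ij - p_i * p_j  # coefficient of linkage disequilibrium, D
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    if denom == 0:
        raise ValueError("both sites must be polymorphic")
    return d * d / denom
```

Pairs of such values and the physical distance between sites are the inputs to a decay-of-LD regression; because r² is symmetric in the choice of reference allele, the arbitrary choice above does not affect the result.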
The 17 genes described earlier were concatenated into a supermatrix 10,158 bp long to visualize the genetic similarity of populations. A neighbor-joining (NJ) tree with 1000 bootstraps was inferred from this alignment using MEGA5 (Tamura et al. 2011) with the maximum composite likelihood method, a substitution model including both transitions and transversions, a γ distributed rate among sites and allowing for a heterogeneous pattern among lineages; when gaps or missing data occurred, pairwise deletion was allowed.
We also investigated population structure using InStruct (Gao et al. 2007), which is a population assignment tool that uses a similar algorithm to STRUCTURE (Pritchard et al. 2000) but allows for partial inbreeding. For each of our 17 phased loci, we removed singleton polymorphisms and then assigned each sequence into haplotypes for population structure analysis. Analyses were conducted allowing for admixture, varying the number of clusters from 1 to 10, with 5 chains per value of K, a burn-in of 100,000, and a total of 200,000 iterations. Using Deviance Information Criteria, the optimal number of clusters was found to be K = 5, but cluster assignment deviated from K = 4 only by placing a single C. linearis individual from one population (THR) into its own cluster. Therefore, we report only the results from K = 2 to 4 clusters. In all cases, all five replicate chains gave identical clustering assignments of individuals. Population-level selfing rate estimates are reported from the K = 4 run with the highest posterior mean likelihood value.
We used coalescent simulations to infer the timing, demographic context, and extent of gene flow, using the MIMAR package (Becquet and Przeworski 2007), which allows for recombination within loci. MIMAR fits the observed number of shared, unique, and fixed polymorphisms to a standard “Isolation with Migration” model, where populations diverge at time T, experience an instantaneous size change at divergence, and may or may not experience gene flow subsequent to divergence. Details on the prior probabilities of parameters and assessment of model fit are given in the Supplementary Methods.
The NJ tree inferred with the 17 genes for the C. linearis and C. rattanii populations (Fig. 1) confirmed the existence of two strongly supported clades (bootstrap = 100%). One clade grouped all populations of C. rattanii with the CFR population of C. linearis and a second clade grouped the populations of C. linearis except CFR. Notably, the lengths of terminal branches in the C. rattanii clade are shorter than in the C. linearis clade, consistent with fewer segregating sites and reduced nucleotide diversity in the selfing C. rattanii.
Our inferences of population structure confirm the global clustering found in the NJ tree, and also reveal a strong geographic subdivision (Fig. 2A, B). In particular, at K = 2, there is a clear separation, where the Oregon population (CFR) of C. linearis clusters with C. rattanii. At K = 3, two California populations of C. rattanii (M3 and FH7) form a distinct cluster, whereas at K = 4 the SM population clusters separately. Even at K = 4 the CFR population, which is morphologically similar to the outcrossing C. linearis clade, still clusters at the population level with C. rattanii from AG500, and increasing values of K do not distinguish these populations (data not shown). Estimates of selfing rates from the InStruct analysis are high for the M3/FH7 cluster (mean 0.808, variance 0.016) and the SM cluster (mean 0.913, variance 0.002) of C. rattanii, and lower for the AG500/CFR cluster (mean 0.4, variance 0.014) of C. rattanii/linearis and the main C. linearis cluster of BAIR/BURR/THR (mean 0.56, variance 0.025). These estimates conform reasonably well with direct estimates of selfing rates using microsatellite genotyping of maternal sibships (C. rattanii, 0.88; C. linearis, 0.43; Kalisz et al. 2012), with the combined AG500/CFR clade showing a high rate of outcrossing comparable to the main C. linearis clade.
The tree (Fig. 1) and the InStruct analyses (Fig. 2) strongly suggest that individuals from the CFR population are more closely related to those from C. rattanii than to those from C. linearis. For the analyses of species-wide diversity below, we therefore excluded the CFR population from summary statistics for C. linearis, and retained it as a separate taxon from both species.
A total of 10,158 bp of aligned coding sequences was produced from 17 loci, an average of 536 bp per locus. The three populations of C. linearis yielded 197 segregating sites and an average nucleotide diversity (θ) of 0.0190 ± 0.0033 for synonymous sites and 0.0020 ± 0.0007 for nonsynonymous sites. In contrast, the four populations of C. rattanii yielded 73 segregating sites and an average θ of 0.0069 ± 0.0010 for synonymous sites and 0.0008 ± 0.0010 for nonsynonymous sites. Average Tajima's D at synonymous sites in C. linearis overlapped zero (−0.06 ± 0.20), whereas for C. rattanii it was positive (0.79 ± 0.20; see Table S1), consistent with recent population bottlenecks and/or the effects of pooled samples in a clearly structured population (Wakeley and Lessard 2003; Stadler et al. 2009).
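For readers unfamiliar with the summary statistics reported above, the sketch below computes Watterson's θ, π, and Tajima's D from a set of aligned sequences using the standard formulas. It is a simplified illustration, not the modified Polymorphorama code used in the study: it treats all sites alike rather than separating synonymous from nonsynonymous sites, and the function name and return layout are assumptions.

```python
import math
from itertools import combinations


def summary_stats(seqs):
    """Return (theta_w_per_site, pi_per_site, tajimas_d) for aligned sequences.

    tajimas_d is None when there are no segregating sites.
    """
    n = len(seqs)
    length = len(seqs[0])
    # Number of segregating sites, S
    seg = sum(len({s[i] for s in seqs}) > 1 for i in range(length))
    # Mean number of pairwise differences, k
    pairs = list(combinations(seqs, 2))
    k = sum(sum(x != y for x, y in zip(a, b)) for a, b in pairs) / len(pairs)

    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    theta_w = seg / a1
    if seg == 0:
        return theta_w / length, k / length, None

    # Constants for the variance of (k - S/a1), following Tajima (1989)
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    d = (k - theta_w) / math.sqrt(e1 * seg + e2 * seg * (seg - 1))
    return theta_w / length, k / length, d
```

A sample dominated by rare variants gives negative D, whereas an excess of intermediate-frequency variants, as expected after a bottleneck or when pooling across structured populations, gives positive D.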
Within-population variation in C. linearis is generally high and comparable to species-wide diversity, with the exception of the divergent population CFR, which shows relatively limited synonymous diversity (Fig. 3), although larger sample sizes will be needed to confirm this. In C. rattanii, within-population variation exists in some populations, whereas others are nearly devoid of variation (Fig. 3). Consistent with the InStruct analysis showing more subdivision in C. rattanii, average between-population differentiation (Fst) is lower, at 0.14 ± 0.02 for C. linearis, compared with 0.57 ± 0.10 in C. rattanii. Both species show reduced within-individual diversity compared to among individuals within populations (Fig. 3B), consistent with the effects of partial selfing. As expected, this effect is considerably larger in C. rattanii, where population-level diversity is 2.2 times greater than the average within individuals, compared with C. linearis, where the difference is 1.4-fold. In total, five individuals showed evidence of heterozygosity in C. rattanii consistent with a history of some outcrossing (both AG500 individuals, two from FH7, and one from SM), with AG500 individuals in particular showing extensive multilocus heterozygosity.
Divergence between the species is low but considerably larger than within-species polymorphism; at the 17 Sanger sequenced loci, average pairwise differences between C. rattanii and C. linearis at synonymous sites were πlin-rat = 0.033, compared with πlin = 0.018 between C. linearis populations and πrat = 0.011 between C. rattanii populations (Fig. 3B). The estimate of between-species differences is comparable to the median Ks from 12,453 orthologous genes from the transcriptome dataset (median = 0.0359).
Furthermore, there are very few shared polymorphisms between species (Table 2), and twice as many fixed differences relative to shared polymorphisms, consistent with intermediate divergence and long-term low effective population size in C. rattanii. Consistent with the NJ tree and InStruct analyses, CFR has no fixed differences with C. rattanii, but shows fixed differences with C. linearis populations (Table 2). To investigate in detail these patterns within single populations, we also examined these values using the pooled transcriptome data. In total we identified approximately 111,000 synonymous SNPs in the pooled transcriptome dataset. Although varying expression levels across individuals may complicate these estimates, as expected we see a much larger proportion of fixed differences and a much lower proportion of shared SNPs and polymorphisms unique to C. rattanii within the single M3 population (Table 2).
Table 2. Pairwise comparisons of synonymous polymorphisms (unique, shared, and fixed differences) between Collinsia linearis and Collinsia rattanii. Note that the CFR population is treated as a separate taxon (see Figs. 1, 2). The THR/M3 contrast is based on the transcriptome resequencing of pooled population samples
Unique Taxon 1|Taxon 2
C. linearis|C. rattanii
C. linearis (THR)|C. rattanii (M3)
COALESCENT MODEL FITTING
The within- and between-species population structure shown in Figures 1 and 2 raises several possible scenarios for the evolution of selfing. In particular, it is possible that selfing evolved early, and subsequent gene flow has led to the reacquisition of an outcrossing floral morphology in the large-flowered CFR population. Alternatively, selfing evolved recently from the Northern CFR clade, or there has been independent evolution of outcrossing in this region. To investigate divergence time, test the assumption of no gene flow, and quantify population size changes, we investigated a series of coalescent models using the Bayesian estimation approach MIMAR (Becquet and Przeworski 2007). In addition, to assess the role of gene flow in causing the large-flowered CFR population to cluster with C. rattanii, we contrasted models where CFR was included as part of the C. rattanii clade with models where it was excluded (Table 3). We hypothesized that if CFR was the result of interspecific gene flow we would infer greater evidence of migration and higher migration estimates when these samples were included within the C. rattanii clade.
Table 3. Results of MIMAR analysis of models of isolation with migration, where population 2 either excludes (Collinsia rattanii) or includes (C. rattanii/CFR) the large-flowered CFR population within the C. rattanii clade. Parameter values are the modes of parameter estimates, with 90% highest posterior density intervals given in parentheses. Parameters include: N1, effective size of population 1; N2, effective size of population 2; NA, effective size of ancestral population; M, migration rates from population i to population j; T, divergence time in years
¹ Summary statistics showing significant rejection of the model based on goodness-of-fit tests. All additional summaries (seven in total) did not show significant departures from the model.
² Estimated difference in AIC from the minimum (most supported) model, where a value of 0 shows the most support, and higher values show increasingly less support.
No migration, Na
Under the full model, migration estimates were low, and the lower bounds of the 90% highest posterior density approached zero (Table 3), both including and excluding CFR. Furthermore, models with no gene flow showed considerably lower estimated AIC values, suggesting a better model fit without migration. All models inferred a strong reduction in effective population size (two- to fourfold) in C. rattanii compared with the ancestor and with C. linearis, regardless of whether CFR was included. Models that allowed the ancestral population to differ from C. linearis also suggested a twofold reduction in effective population size in C. linearis compared with the ancestral population. The combination of estimates of AIC and goodness-of-fit tests provided moderate support for this model compared with a model where C. linearis is constrained to have the same population size as the ancestral population, although for the dataset with CFR there was slightly more support for the constrained model using AIC estimates (Table 3). Estimates of divergence time under the no migration models ranged from approximately 500,000 to 852,000 years.
Goodness-of-fit tests generally showed correspondence with model estimation (Table 3). However, mean Tajima's D in C. rattanii tended to be too high under most models, with one-tailed P approximately 6% even in models that could not be rejected. This suggests recent population bottlenecks and/or population structure in the selfing lineage that were not specified in the model.
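For readers unfamiliar with the statistic, the following is a minimal sketch of Tajima's D computed from a small alignment of haplotypes, using the standard constants from Tajima (1989). The function names and the toy haplotypes are illustrative assumptions and are not taken from the analysis pipeline used here.

```python
import math
from itertools import combinations


def summarize(haplotypes):
    """Return (n, S, pi) for equal-length haplotype strings.

    S is the number of segregating sites and pi the mean number of
    pairwise differences.
    """
    n = len(haplotypes)
    length = len(haplotypes[0])
    seg_sites = sum(len({h[i] for h in haplotypes}) > 1 for i in range(length))
    pairs = list(combinations(haplotypes, 2))
    pi = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)
    return n, seg_sites, pi


def tajimas_d(n, seg_sites, pi):
    """Tajima's D from sample size, segregating sites, and pairwise diversity."""
    if n < 4:
        raise ValueError("the variance term is zero for n < 4")
    if seg_sites == 0:
        return float("nan")  # D is undefined without variation
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n ** 2 + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / math.sqrt(variance)


if __name__ == "__main__":
    # Toy data: two deeply diverged haplotype classes give a positive D,
    # the pattern expected under population structure or a recent bottleneck.
    structured = ["AA", "AA", "TT", "TT"]
    print(tajimas_d(*summarize(structured)))
```

An excess of intermediate-frequency variants raises pi relative to S and pushes D upward, which is why unmodeled structure or bottlenecks can produce values that are too high for the fitted model.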
RECOMBINATION AND THE DECAY OF LD
For the 17 loci, the composite likelihood estimates of ρ and the lower bound on the number of recombination events (Rmin) are given in Table S1. On average, recombination rates are higher in C. linearis compared to C. rattanii; Rmin varies between 0 and 8 in C. linearis compared to 0 and 1 in C. rattanii, and the average ρ estimates are higher in C. linearis (8.0 × 10⁻³/bp) compared to C. rattanii (0.6 × 10⁻³/bp). The ratio of ρ/θ is reduced in C. rattanii (0.1) compared to C. linearis (0.9), consistent with expectations from the mating-system transition from outcrossing to selfing. In the outcrossing C. linearis, LD decayed rapidly over less than 1 kbp, whereas patterns of LD in the more selfing C. rattanii decayed more slowly and suggest LD could extend to several kbp (Fig. 4).
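To make the LD measures concrete, the sketch below computes r² between two biallelic sites and applies the four-gamete test, the pairwise criterion on which lower bounds such as Rmin are built. It assumes phased haplotypes and strictly biallelic sites; the function names are hypothetical and this is not the composite likelihood method used to estimate ρ.

```python
def r_squared(site_a, site_b):
    """r^2 between two biallelic sites given as alleles per phased haplotype."""
    if len(site_a) != len(site_b):
        raise ValueError("sites must be scored on the same haplotypes")
    n = len(site_a)
    ref_a, ref_b = site_a[0], site_b[0]  # arbitrary reference alleles
    p_a = sum(x == ref_a for x in site_a) / n
    p_b = sum(y == ref_b for y in site_b) / n
    p_ab = sum(x == ref_a and y == ref_b for x, y in zip(site_a, site_b)) / n
    d = p_ab - p_a * p_b  # coefficient of linkage disequilibrium
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both sites must be polymorphic")
    return d * d / denom


def four_gametes(site_a, site_b):
    """True if all four two-locus haplotypes are present.

    Under an infinite-sites model this implies at least one recombination
    event between the two sites.
    """
    return len(set(zip(site_a, site_b))) == 4


if __name__ == "__main__":
    print(r_squared("AAGG", "CCTT"))     # complete association
    print(four_gametes("AAGG", "CTCT"))  # all four gametes observed
```

Plotting r² against the physical distance between site pairs gives a decay curve of the kind summarized in Fig. 4.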
EFFICACY OF SELECTION
The mean Tajima's D across the 17 genes in C. linearis is more negative at nonsynonymous (−0.56 ± 0.10) than synonymous sites (−0.06 ± 0.20), consistent with the action of purifying selection (Table S1). This is also apparent in C. rattanii, where average Tajima's D is less positive at nonsynonymous (0.31 ± 0.20) than synonymous sites (0.79 ± 0.20). For the 17 genes, the ratio of nonsynonymous to synonymous polymorphism at unique sites is higher in C. rattanii (0.51) than C. linearis (0.35), although this difference is not significant (2 × 2 contingency table, P > 0.05).
To evaluate this trend further, we examined SNPs in the transcriptome data of pooled within-population samples. From the Illumina data, the ratio of nonsynonymous to synonymous unique polymorphisms is significantly elevated within the C. rattanii population (0.79) compared to our population of C. linearis (0.37; 2 × 2 contingency table, P < 0.0001; Fig. 5), consistent with ongoing relaxation of selection in the more selfing species. Shared polymorphisms and fixed differences show an intermediate ratio of nonsynonymous to synonymous polymorphism.
Although this signal may reflect an increased role of drift in selfing species, it is also possible that it reflects a shift in the distribution of selection coefficients. One possibility is that the observed signal reflects relaxed selection specifically on genes involved in pollinator attraction. To assess this, we identified differentially expressed genes using our transcriptome data, reasoning that downregulated genes in C. rattanii may be enriched for those subject to relaxed selection, whereas upregulated genes should not show such an effect under this scenario.
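A 2 × 2 contingency comparison of this kind can be sketched as follows. The example uses a Pearson chi-square test without continuity correction, which is an assumption, since the exact test variant is not stated here, and the counts in the usage example are hypothetical values chosen only to reproduce ratios of 0.79 and 0.37.

```python
import math


def chi2_2x2(a, b, c, d):
    """Pearson chi-square test (1 df, no continuity correction).

    Table layout:
                  nonsynonymous  synonymous
        species 1       a            b
        species 2       c            d
    Returns (chi-square statistic, P-value).
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if min(row1, row2, col1, col2) == 0:
        raise ValueError("every row and column needs a nonzero total")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom the upper tail is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


if __name__ == "__main__":
    # Hypothetical counts, not the study's data.
    chi2, p = chi2_2x2(790, 1000, 370, 1000)
    print(f"chi2 = {chi2:.1f}, P = {p:.2e}")
```

With small counts, as in the 17-gene Sanger dataset, an exact test would usually be preferable to the chi-square approximation.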
From our set of orthologous genes, we identified 408 genes that are upregulated in C. rattanii and 286 that are downregulated relative to C. linearis (Table S2). To assess the extent to which these genes might reflect the evolution of the selfing syndrome, we performed functional enrichment tests of differentially expressed genes. Among the transcripts upregulated in C. rattanii, genes associated with catalytic activity (monooxygenase activity, GO:0004497) were significantly overrepresented (P = 2.14 × 10⁻⁵; Tables 4, S2). The downregulated transcripts in the selfing C. rattanii included a disproportionate number of genes associated with pollen development (11 transcripts, P = 1.15 × 10⁻⁴; Table 4), specifically pollen exine formation (9 transcripts, P = 7.47 × 10⁻¹¹; Table 4). It is possible that selection on pollen traits associated with the transition to selfing has resulted in changes to the development of the outer wall of the pollen in C. rattanii. The pollen exine is a highly specialized structure that functions in a diversity of pollen traits such as transport, specificity, longevity and self-incompatibility (reviewed in Edlund et al. 2004). Selection maintaining these traits may have been relaxed since the transition to a largely self-fertilizing mating system. There was also a disproportionate number of transcripts associated with flavonoid 3′,5′-hydroxylase activity (GO:0033772, 3 transcripts, P = 2.72 × 10⁻⁵; Table S2) among the genes differentially expressed by more than twofold. Flavonoids are common secondary plant metabolites that function in protection against pathogens and UV damage and are involved in attracting pollinators. Interestingly, although the species do not differ in flower color, flavonoid 3′,5′-hydroxylase is necessary for the synthesis of purple or blue colors in flowers, and evolution of this enzyme has been thought to be associated with pollination (Seitz et al. 2006). Therefore, these transcripts represent a set of candidate genes that may be associated with a loss of investment into pollinator attraction with shifting mating systems.
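Functional enrichment of a GO category in a gene list is often assessed with a one-sided hypergeometric test. The sketch below shows that calculation; it is one common choice and is assumed here rather than taken from the methods, and the argument names and the numbers in the usage example are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, sample, hits):
    """One-sided hypergeometric P-value, P(X >= hits).

    population: genes in the background set
    annotated:  background genes carrying the GO term
    sample:     genes in the differentially expressed list
    hits:       listed genes carrying the GO term
    """
    if not (0 <= annotated <= population and 0 <= sample <= population):
        raise ValueError("annotated and sample must lie within the population")
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Hypothetical numbers: 9 of 286 downregulated genes carry a term
    # annotated to 20 of 10000 background genes.
    print(hypergeom_enrichment_p(10000, 20, 286, 9))
```

Because many GO categories are tested at once, the resulting P-values would then be adjusted for multiple testing, as reflected by the FDR threshold reported with Table 4.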
Table 4. Functional enrichment tests for genes showing evidence of significant expression differences between Collinsia rattanii and Collinsia linearis. Upregulated genes are those showing higher expression in C. rattanii compared with C. linearis, whereas downregulated genes are those showing reduced expression in C. rattanii. Categories in bold are those that also show significant enrichment for genes showing greater than twofold expression differences between the samples. All listed categories show a false discovery rate (FDR) less than 0.05. For details, and for categories showing significant enrichment only in the greater than twofold expression difference tests, see Table S2.
GO Function ID
Pollen exine formation
Cellular component assembly involved in morphogenesis
Pollen wall assembly
Anatomical structure formation involved in morphogenesis
In contrast with the prediction of relaxed selection on genes important for pollinator attraction, the signal of elevated nonsynonymous polymorphism is apparent in both up- and downregulated genes, and is comparable to the global patterns (Fig. 5). Thus, although we cannot rule out a global shift in selection coefficients, these results are most consistent with the reduced efficacy of selection because of reductions in effective population size.
CODON USAGE BIAS AND BASE COMPOSITION
The set of optimal codons in M. guttatus is shown in Table S3. According to this set, the frequency of optimal codons (Fop) in Collinsia significantly increased with gene expression (r = 0.18, P < 0.0001). Moreover, this increase was not linear but exponential: optimal codons were overrepresented in the highest-expressed genes (Fig. S1, Table 5). However, gene expression is also slightly but significantly correlated with GC content (r = 0.09, P < 0.0001) and GC3 (r = −0.03, P < 0.0001). Note that GC3 in particular shows a very weak correlation with gene expression, consistent with the fact that many of the inferred preferred codons are AT-ending (Table S3). This means that tests for changes in biased gene conversion (BGC) between species should be reasonably independent of tests for changes in selection on codon usage.
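As a concrete reading of the Fop statistic, the sketch below counts optimal codons as a fraction of all codons that have synonymous alternatives, so methionine, tryptophan, and stop codons are excluded. The optimal codon set passed in the example is a hypothetical placeholder, since Table S3 is not reproduced here, and the function name is illustrative.

```python
NON_DEGENERATE = {"ATG", "TGG"}   # Met and Trp have no synonymous codons
STOP_CODONS = {"TAA", "TAG", "TGA"}


def fop(cds, optimal_codons):
    """Frequency of optimal codons in an in-frame coding sequence.

    Codons with ambiguous bases, stop codons, and codons for Met and Trp
    are skipped. Returns NaN if no informative codon is found.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    n_optimal = 0
    n_total = 0
    for i in range(0, usable, 3):
        codon = cds[i:i + 3]
        if set(codon) - set("ACGT"):
            continue
        if codon in NON_DEGENERATE or codon in STOP_CODONS:
            continue
        n_total += 1
        if codon in optimal_codons:
            n_optimal += 1
    return n_optimal / n_total if n_total else float("nan")


if __name__ == "__main__":
    # Hypothetical optimal set containing a single alanine codon.
    print(fop("ATGGCTGCCGCTTAA", {"GCT"}))
```

Computing this value per gene and correlating it with expression level gives the kind of relationship reported above.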
Table 5. Synonymous differences in base composition and codon usage between Collinsia linearis (Cl) and Collinsia rattanii (Cr). Values represent total number of sites. P-values obtained from signed-rank tests. P = preferred codons; U = unpreferred codons
¹Cases where C. linearis has a GC or preferred base and C. rattanii has an AT or unpreferred base.
²Cases where C. rattanii has a GC or preferred base and C. linearis has an AT or unpreferred base.
GC bases at divergent synonymous sites
2.20 × 10⁻¹⁶
P vs. U (all codons)
P vs. U (AT-ending codons)
2.67 × 10⁻¹⁴
P vs. U (GC-ending codons)
3.72 × 10⁻¹⁰
GC bases at divergent synonymous sites
P vs. U (all codons)
P vs. U (AT-ending codons)
P vs. U (GC-ending codons)
GC bases at divergent synonymous sites
3.42 × 10⁻⁶
P vs. U (all codons)
P vs. U (AT-ending codons)
P vs. U (GC-ending codons)
To examine evidence for shifts in base composition and codon usage, we compared base composition at variable sites among orthologous codons and codon usage in codons where we identified a between-species difference at synonymous sites, excluding sites that are polymorphic within species. We find significantly higher GC content among variable synonymous sites in C. linearis than in C. rattanii (Table 5), consistent with C. rattanii experiencing a shift towards greater AT richness. This was observed in the whole dataset and in genes with low and high expression (Table 5), suggesting it is unrelated to changes in selection on codon bias. At orthologous codons, we see an overall enrichment of cases where C. rattanii has a preferred codon whereas C. linearis has an unpreferred codon (Table 5). However, this is not apparent in highly expressed genes, and overall it is clear that GC-ending preferred codons show an enrichment in C. linearis, whereas AT-ending preferred codons are enriched in C. rattanii (Table 5). Overall, the patterns are consistent with a shift in base composition that appears unrelated to selection on codon usage.
The CFR population was identified as C. linearis based on its flower morphology. However, the tree presented in Figure 1 suggests that CFR clusters more closely with C. rattanii than C. linearis, consistent with results from Baldwin et al. (2011) showing that C. linearis populations north of the Klamath River are closely related to C. rattanii populations. Furthermore, even at K = 4 with InStruct analysis, this population is still identified as belonging to the same population as the AG500 population of C. rattanii, with no evidence that it represents an admixed population between species (Fig. 2). In addition, MIMAR analyses provide no evidence for historical gene flow between California C. linearis and C. rattanii, regardless of whether the CFR population is included in the C. rattanii clade (Table 3). Taken together, these results suggest that the CFR population is most appropriately placed within C. rattanii.
We envision three possibilities that could explain the geographic distributions and the genetic patterns that we found. The first requires that both C. linearis and C. rattanii expanded their range north from California, and in this northern region the species experienced significant mixing, such that in the north, the larger flowered “C. linearis” have become genetically indistinguishable from C. rattanii. Our InStruct and MIMAR results do not provide support for this model, as we have no evidence for gene flow from the California C. linearis into either C. rattanii or CFR. The second possibility is that C. linearis expanded northward and diverged from the southern clade and that C. rattanii has recently evolved selfing from the northern clade. In this scenario, C. rattanii is polymorphic for mating system, and the spread of a selfing variant is ongoing, including dispersal and subsequent subdivision into the southern edge of the range. The third possibility is that C. rattanii diverged from the southern C. linearis and spread north from California into S. Oregon, where selection for larger flower size or increased outcrossing resulted in the formation or maintenance of larger-flowered outcrossing variants of C. rattanii that are becoming reproductively isolated from the ancestral selfing C. rattanii. This isolation may be far from complete, because hand pollinations between large- and small-flowered populations from Oregon produced viable F1 and F2 progeny (Randle and Kalisz, unpubl. data), and there appear to be few prezygotic barriers to hybridization (geographic, elevational, flowering time, or pollinator preference) other than mating system, suggesting that gene flow may be common when large- and small-flowered populations in Oregon are sympatric (Randle et al. 2009).
Of these, the two latter possibilities seem equally likely based on our analysis and earlier analysis by Baldwin et al. (2011). The evolution of large-flowered, outcrossing morphs from small-flowered selfing morphs has also been proposed for other sister species pairs within Collinsia: the C. parviflora and C. grandiflora clade and the C. sparsiflora clade (Baldwin et al. 2011). Although in general reversals have rarely been documented, they are potentially more likely in self-compatible species like Collinsia. Given these possibilities, it is clear that fine-scale sampling including areas in Oregon and in the transition zone between the northern and southern groupings (Klamath River area) combined with genome-wide data are required to confirm the relationship of the Oregon C. rattanii clade to the California C. rattanii populations. This sampling would also allow us to fully assess if introgression between large- and small-flowered populations occurs, as we suspect it could within Oregon populations. Such detailed information would allow us to distinguish between scenarios. InStruct analyses and coalescent modeling from in-depth sampling of sympatric and allopatric populations of these two species (Randle et al. 2009) could further inform the likelihood of introgression. Larger-scale genome-wide genotyping of these samples will allow for the application of powerful new approaches for inferring population splitting and migration (Pickrell and Pritchard 2012), and mapping morphological and mating-system shifts onto population trees may reveal if the majority of shifts are transitions between selfing and outcrossing, or if selfing populations exhibit low population viability.
Population genetic theory predicts a twofold reduction in levels of neutral variation between outcrossing and complete self-fertilization. Accordingly, the decrease of neutral diversity is predicted to be less pronounced in partially self-fertilizing species (Pollak 1987; Nordborg and Donnelly 1997; Charlesworth 2003), such as those studied here. If estimated outcrossing rates are close to long-term equilibrium, we can estimate the expected reduction in diversity using the ‘effective’ population mutation rate under selfing, θs = θ/(1 + F), where θ is the population mutation rate under complete outcrossing, and F is the inbreeding coefficient (Nordborg 2000). Using this, the expected proportion of diversity in C. rattanii compared with C. linearis under an all-else-equal assumption is 71%. Yet, in our study, we see a considerably stronger decrease in diversity in C. rattanii compared to C. linearis, with a ratio of θsyn values of 0.36, suggesting a greater than twofold reduction in diversity above expectation. Within populations, there is a further reduction to 0.18. Thus, in agreement with previous results, our study suggests that the effect of mating system on homozygosity alone cannot explain the loss of diversity observed in C. rattanii and that other factors have contributed. This result highlights that even in partially selfing species there are important demographic and/or hitchhiking effects associated with the shift to selfing influencing patterns of polymorphism. Although the sample sizes are small, the extreme loss of within-population diversity across the 17 loci in two populations of C. rattanii (SM, FH7) suggests severe founder events in these populations and/or strong hitchhiking effects.
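The expectation θs = θ/(1 + F) can be worked through numerically. The sketch below assumes the standard equilibrium relation F = s/(2 − s) between the selfing rate s and the inbreeding coefficient, which is not stated in this section, and the selfing rates in the usage example are hypothetical rather than the estimates for these species.

```python
def inbreeding_coefficient(selfing_rate):
    """Equilibrium inbreeding coefficient, F = s / (2 - s) (assumed relation)."""
    if not 0.0 <= selfing_rate <= 1.0:
        raise ValueError("selfing rate must lie between 0 and 1")
    return selfing_rate / (2.0 - selfing_rate)


def expected_diversity_ratio(f_selfer, f_outcrosser=0.0):
    """Expected theta_s ratio (selfer / outcrosser), all else equal.

    Follows from theta_s = theta / (1 + F) applied to both species with
    the same underlying theta.
    """
    return (1.0 + f_outcrosser) / (1.0 + f_selfer)


if __name__ == "__main__":
    # Complete selfing against complete outcrossing recovers the
    # twofold reduction predicted by theory.
    print(expected_diversity_ratio(inbreeding_coefficient(1.0)))

    # Hypothetical selfing rates for a partially selfing pair.
    f_high = inbreeding_coefficient(0.7)
    f_low = inbreeding_coefficient(0.2)
    print(expected_diversity_ratio(f_high, f_low))

    # Observed ratio relative to the expectation quoted in the text.
    print(0.36 / 0.71)
```

The last line shows that the observed diversity ratio is roughly half of what the change in inbreeding alone would predict, which is the shortfall attributed above to demographic and hitchhiking effects.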
The shift in mating system and reduction in diversity are expected to result in increased between-population differentiation, consistent with our results (Fst-C. rattanii = 0.80; Fst-C. linearis = 0.14), although this may be partly confounded by a wider geographic range in C. rattanii combined with clear geographic clustering (Fig. 2). A positive average Tajima's D, as observed in our data, is consistent with recent population bottlenecks playing an important role in reducing effective population size in C. rattanii, and our goodness-of-fit tests of the MIMAR results suggest the reduction in effective population size at speciation is not sufficient to explain the data (Table 3). However, other factors such as the pooled sampling scheme in strongly subdivided populations likely also contribute to the positive Tajima's D (Wakeley and Lessard 2003; Städler et al. 2009). Thus, it is difficult to reject the possibility that C. rattanii is at a stable population equilibrium with strong population structure and low effective population size.
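To illustrate how strong structure translates into a high Fst, the sketch below uses the sequence-based estimator of Hudson et al. (1992), Fst = 1 − Hw/Hb, where Hw and Hb are the mean pairwise differences within and between populations. Which estimator produced the values reported here is not stated in this section, so this choice is an assumption, and the toy haplotypes are invented for illustration.

```python
from itertools import combinations, product


def _differences(a, b):
    """Number of mismatching positions between two aligned haplotypes."""
    return sum(x != y for x, y in zip(a, b))


def _mean_within(pop):
    pairs = list(combinations(pop, 2))
    return sum(_differences(a, b) for a, b in pairs) / len(pairs)


def hudson_fst(pop1, pop2):
    """Fst = 1 - Hw / Hb for two populations of aligned haplotypes."""
    if len(pop1) < 2 or len(pop2) < 2:
        raise ValueError("each population needs at least two haplotypes")
    h_within = (_mean_within(pop1) + _mean_within(pop2)) / 2.0
    between_pairs = list(product(pop1, pop2))
    h_between = sum(_differences(a, b) for a, b in between_pairs) / len(between_pairs)
    if h_between == 0:
        raise ValueError("no differences between populations")
    return 1.0 - h_within / h_between


if __name__ == "__main__":
    # Toy data: little variation within populations, near-fixed
    # differences between them.
    print(hudson_fst(["AAAA", "AAAT"], ["TTTT", "TTTA"]))
```

When most variation is partitioned between rather than within populations, as in the toy data, the estimate approaches 1.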
Our whole transcriptome and Sanger data both show an excess of nonsynonymous relative to synonymous unique polymorphism in C. rattanii compared to C. linearis. This may be because of a reduced efficacy of selection in eliminating slightly deleterious mutations in the more selfing species. Thus, these results are consistent with the possibility that selfers experience increased extinction risk because of deleterious mutation accumulation (Lynch et al. 1995), although it is important to keep in mind that accumulation of deleterious mutations does not necessarily equate to higher extinction risk. In particular, if deleterious mutations are primarily affecting relative fitness (soft selection) rather than absolute fitness (hard selection), there will not necessarily be a population size reduction toward extinction with more deleterious mutations (Agrawal and Whitlock 2012). The observation of a significant excess of nonsynonymous polymorphisms that are unique to C. rattanii highlights that relaxed selection is an ongoing process, and is not simply associated with a founder event during speciation. Our results contrast with other studies focused on divergence data (Wright et al. 2002; Haudry et al. 2008; Escobar et al. 2010), which have generally shown little evidence for relaxed selection in selfers. This difference likely results in part from the fact that by focusing on lineage-specific polymorphism patterns in a recently derived species pair, we are able to capture and quantify contemporary selection pressures. With comparisons based solely on substitution rates, selfing may often have arisen recently relative to species divergence, making it difficult to quantify and contrast selection pressures since the mating-system shift. Our study and other recent studies focused on polymorphism patterns (Slotte et al. 2010; Qiu et al. 2011) highlight that analysis of contemporary selection with polymorphism data can provide greater power to detect changes in the strength of selection.
An alternative explanation for the higher proportion of nonsynonymous polymorphism in C. rattanii could be that this reflects a shift in the strength of selection, potentially because of global relaxation of selection on pollinator attraction. Our gene expression analysis indicates a general reduction in expression of genes important in pollen production, flavonoid production and morphogenesis (Tables 4, S2), potentially reflecting the reduced investment in pollinator attraction. These genes show no evidence for greater relaxed selection than upregulated genes (Fig. 5), providing no support for the hypothesis that these patterns reflect relaxed selection on pollinator attraction. However, the changes in expression could reflect adaptive expression changes rather than simply relaxed constraint, making these candidate genes for playing a role in adaptive floral evolution. It is also possible that some of the expression differences we see are because of differential sampling of tissues in flower buds because of between-species differences in tissue allocation, rather than tissue-specific shifts in expression. Thus, although our results are consistent with the hypothesis of a reduced efficacy of selection, it remains possible that the patterns could also reflect a relaxation of selection because of population expansion or other factors associated with mating-system shifts. Larger-scale whole genome data fit to demographic and selection models will be important for teasing apart the role of shifts in the efficacy vs. the strength of selection across the genome.
Our analyses of codon usage show a positive and significant correlation in the frequency of optimal codons (Fop) with gene expression in both C. linearis and C. rattanii. Preferred codons are likely to be preferentially used in highly expressed genes to increase translational efficiency and/or accuracy (Kanaya et al. 2001). Furthermore, the correlation between gene expression and Fop is greater than that observed between gene expression and GC or GC3, suggesting the action of natural selection on codon usage. However, we did not find evidence for a consistent reduction in the number of preferred orthologous codons in C. rattanii (Table 5). It is possible that larger-scale codon-specific quantification of selection on codon usage could reveal stronger evidence for changes in selection, as shown by Qiu et al. (2011) in analyses of polymorphism in Arabidopsis and Capsella. Alternatively, such analysis could suggest that selection on codon usage is too weak in both species for a detectable effect of mating system. However, we do see a consistent pattern suggesting a shift in base composition between the species, indicating more AT-ending synonymous codons in C. rattanii (Table 5). This pattern is consistent with the prediction of relaxed biased gene conversion in the more highly selfing species, leading to reduced GC content in the species with greater levels of homozygosity.
Together, our data provide evidence for rapid shifts in mating system that result in important changes in the efficacy of selection, in gene expression and in base composition once selfing evolves. Importantly, our work also suggests a potential for reversibility or rapid change in the evolution of selfing and outcrossing, which our future work will explore in greater depth.
There is supporting information accompanying this article, and sequences have been submitted to GenBank SRA (Accession numbers: SRP017038 and SRP017039) and GenBank NR (Accession numbers: KC420688-KC421071).
S. I. Wright thanks the Natural Sciences and Engineering Research Council of Canada for funding. J. S. Escobar was supported by a University of Toronto Department of Ecology and Evolutionary Biology Postdoctoral fellowship. S. Kalisz and A. M. Randle thank the National Science Foundation for awards DEB 0324764 and DEB 0709638 that supported this research; C. Kohn, N. Brouwer, and especially E. York for growing plants and greenhouse assistance; and A. Simoes Correa for help with maps and figures. The authors thank the Genome Quebec Innovation Centre at McGill University for sequencing facilities. The authors thank S. C. H. Barrett, K. Bomblies and three anonymous reviewers for their comments and suggestions.