Speciation is the process by which reproductively isolated lineages arise, and is one of the fundamental means by which the diversity of life increases. Whereas numerous studies have documented an association between ecological divergence and reproductive isolation, relatively little is known about the role of natural selection in genome divergence during the process of speciation. Here, we use genome-wide DNA sequences and Bayesian models to test the hypothesis that loci under divergent selection between two butterfly species (Lycaeides idas and L. melissa) also affect fitness in an admixed population. Locus-specific measures of genetic differentiation between L. idas and L. melissa and genomic introgression in hybrids varied across the genome. The most differentiated genetic regions were characterized by elevated L. idas ancestry in the admixed population, which occurs in L. idas-like habitat, consistent with the hypothesis that local adaptation contributes to speciation. Moreover, locus-specific measures of genetic differentiation (a metric of divergent selection) were positively associated with extreme genomic introgression (a metric of hybrid fitness). Interestingly, concordance of differentiation and introgression was only partial. We discuss multiple, complementary explanations for this partial concordance.
Speciation is a fundamental evolutionary process that occurs when genetic differentiation leads to reproductive isolation (e.g., a reduced propensity to mate or reduced fitness of immigrants or hybrid offspring) between divergent lineages. As evidenced by empirical data, species boundaries are porous rather than absolute, particularly during the early stages of divergence, and reproductive isolation is a property of genetic loci and not a property of populations or lineages (Harrison and Rand 1989; Wu 2001). Studies of population divergence and speciation with gene flow have consistently documented heterogeneity across the genome in the extent of genetic differentiation between lineages and introgression in hybrid zones (Dowling and Hoeh 1991; Rieseberg et al. 1999; Turner et al. 2005; Hohenlohe et al. 2010; Lawniczak et al. 2010). This genome-wide heterogeneity in genetic differentiation and introgression can arise from variation among loci in their contribution to fitness in parental and hybrid populations (Payseur et al. 2004; Harr 2006; Nosil et al. 2009a). Individual genes responsible for local adaptation and reproductive isolation have been identified (e.g., Mihola et al. 2009; Tang and Presgraves 2009; Barr and Fishman 2010; Nosil and Schluter 2011). Nonetheless, and despite considerable efforts and progress, we are only beginning to understand the process of genomic divergence during speciation and the importance of divergent selection and local adaptation for the evolution of reproductive isolation.
Theory and data demonstrate that reproductive isolation is entwined with divergent selection and local adaptation (Mayr 1963; Endler 1977; Dodd 1989; Nosil et al. 2002; Rundle and Nosil 2005), as originally envisioned by Darwin (1859). Considering speciation with gene flow, divergent selection is a direct cause of reproductive isolation if immigrants or their hybrid offspring have low fitness. Similarly, divergent selection can cause adaptive divergence in habitat use, phenology, or mating signals and reduce the frequency of interspecific matings (Jiggins et al. 2001; Nosil et al. 2002; Nosil 2004). When species diverge in geographic isolation and in the absence of gene flow, divergent selection and reproductive isolation upon secondary contact have the potential to be partially decoupled. For example, reproductive isolation upon secondary contact in geographically abutting or admixed populations could be due to mechanisms other than historical divergent selection, if the environment or genetic composition of populations differed between periods of allopatric divergence and secondary contact. Selection pressures are certainly known to vary in time (Grant and Grant 2002; Siepielski et al. 2009), and studies contrasting introgression among multiple hybrid zones provide evidence that the effect of a locus on hybrid fitness can be environment-dependent (Nolte et al. 2009; Teeter et al. 2010).
Similarly, genetic differences that cause reproductive isolation upon secondary contact can evolve through genetic hitchhiking or as a byproduct of nonadaptive forces, such as genetic drift or gene conversion (Gavrilets et al. 1998; Fierst and Hansen 2010). For example, although Dobzhansky–Muller (DM) incompatibilities evolve most rapidly if they are a pleiotropic byproduct of loci experiencing divergent selection, DM incompatibilities can also readily evolve through tight linkage with selected loci or in the absence of selection (Turelli et al. 2001; Gavrilets 2003). For instance, the proliferation of transposable elements can cause hybrid inviability and can readily occur within geographically isolated nonadmixed populations unfacilitated by selection (Hurst and Werren 2001; Lynch 2007). Likewise, reciprocal loss or deactivation of duplicated genes can occur without selection and cause inviability in hybrid offspring (Lynch and Force 2000; Mizuta et al. 2010).
Even if the same suite of loci are affected by selection in geographically isolated nonadmixed populations and admixed populations, the genomic consequences of selection could differ. Specifically, because admixture generates linkage disequilibrium, the size of the genetic region affected by selection in an admixed population could be much larger than the size of the genetic region affected by divergent selection between geographically isolated nonadmixed populations. In other words, differences in linkage disequilibrium could alter the genomic response to natural selection of the same loci in admixed and nonadmixed populations. This effect would be intensified if divergent selection causes soft selective sweeps rather than hard selective sweeps, as the genetic footprint of selection would be further reduced (Hermisson and Pennings 2005).
Consequently, whether loci under divergent selection between geographically isolated nonadmixed populations also experience selection or contribute to reproductive isolation in hybrid zones or admixed populations is an open question that must be addressed empirically. This is an interesting question because selection in hybrids is a direct measure of a major component of reproductive isolation and comparing selection in geographically isolated nonadmixed and admixed populations will increase our understanding of (1) divergent selection’s importance for the evolution of reproductive isolation, (2) geographic variation in the genetic basis of reproductive isolation, and (3) the effect of admixture linkage disequilibrium on the genomic consequences of selection. Indirect, population genetic metrics associated with selection in geographically isolated nonadmixed and admixed populations can be used to address these issues.
Genetic differentiation and introgression are affected by selection and various stochastic processes (Buerkle et al. 2011; Gompert et al. 2012). Thus, there is not a simple correspondence between selection and measures of genetic differentiation (i.e., measures of allele frequency differences) or introgression. Nonetheless, because divergent selection reduces genetic diversity and increases genetic differentiation for the selected locus and linked loci (i.e., genetic hitchhiking; Maynard-Smith and Haigh 1974; Gillespie 2000), divergently selected loci are more likely to reside in highly differentiated regions of the genome (Beaumont and Nichols 1996; Beaumont and Balding 2004). We define introgression as the movement of alleles from one gene pool into another through hybridization and repeated backcrossing. Introgression can be measured by quantifying the movement of alleles across the geographic range of hybridizing lineages (hereafter, geographic introgression; e.g., Barton and Hewitt 1985; Szymura and Barton 1986; Porter et al. 1997) or between different genomic backgrounds (hereafter, genomic introgression; e.g., Anderson 1949; Rieseberg et al. 1999; Lexer et al. 2007; Gompert and Buerkle 2009). These two forms of introgression can, but need not, be coincident (Teeter et al. 2010). First, considering geographic introgression, relative to genetic regions that do not affect fitness, the geographic range of introgression should be reduced for genetic regions that harbor variants that decrease fitness in hybrids (i.e., steeper geographic clines with a rapid, exponential decay of foreign allele frequencies away from the hybrid zone center; Barton 1983; Barton and Hewitt 1985; Szymura and Barton 1986, 1991). 
Similarly, considering genomic introgression within an admixed population, genetic regions associated with reduced hybrid fitness should have a low frequency of the negatively selected alleles from one parental species or have alleles from each parental species confined primarily to alternative genomic backgrounds (Rieseberg et al. 1999; Gompert et al. 2012). Thus, loci that contribute to isolation through fitness variation in hybrids are more likely to reside in regions of the genome with specific, extreme patterns of geographic or genomic introgression.
In this article, we analyze genome-wide DNA sequence data with novel Bayesian models to quantify and compare genetic differentiation between Lycaeides idas and L. melissa (Lepidoptera: Lycaenidae) and genomic introgression in a hybrid zone. Lycaeides idas and L. melissa are among five nominal Lycaeides species that occur in North America and likely diverged from a common Eurasian ancestor within the last 2.4 million years (Gompert et al. 2006a, 2008a; Vila et al. 2011; Forister et al. 2011). Lycaeides idas and L. melissa differ in male genitalic morphology, wing pattern, host plant use, phenology, and behavior (Scott 1986; Fordyce et al. 2002; Lucas et al. 2008; Gompert et al. 2010b). Diversification of North American Lycaeides likely occurred during periods of Pleistocene glacial advance. However, many of these lineages have hybridized in regions of secondary contact (Gompert et al. 2006a,b, 2010a,b). In the current study, we focus on a hybrid zone between L. idas and L. melissa in northwestern Wyoming (USA), specifically in Jackson Hole and the Gros Ventre mountains (Nabokov 1949; Gompert et al. 2010b). Secondary contact in this geographic region occurred recently, as much of this area was glaciated until 14,000 years before present (Harris et al. 1997). Henceforth, we refer to the admixed population in Wyoming as Jackson Hole Lycaeides. The habitat occupied by Jackson Hole Lycaeides is similar to the habitat occupied by L. idas (relatively mesic forest and montane environments), and differs from the habitat occupied by L. melissa (xeric sites, often associated with agricultural fields; Gompert et al. 2010b). Lycaeides populations have a patchy, discontinuous distribution that is tied to the distribution of their host plants. Jackson Hole Lycaeides feed on Astragalus miser, which is the same host plant used by many L. idas populations in the central Rocky Mountains (Scott 1986; Gompert et al. 2010b). Conversely, nearby L. melissa populations feed either on feral Medicago sativa or A. bisulcatus. The Jackson Hole Lycaeides populations are not directly connected with parental populations and there is only very limited gene flow between Jackson Hole and parental populations (Gompert et al. 2010b).
Herein, we use DNA sequence data from thousands of loci and Bayesian models to document considerable variation across the genome in the magnitude of genetic differentiation between L. idas and L. melissa and genomic introgression in Jackson Hole Lycaeides. We then test the hypothesis that divergent selection between geographically disjunct L. idas and L. melissa populations predicts selection and the consequences of selection in admixed Jackson Hole Lycaeides. We evaluate two specific predictions that would support this hypothesis: (1) because L. idas and Jackson Hole Lycaeides occupy similar habitats and use the same host plant, L. idas alleles at highly differentiated loci that could be associated with adaptation to habitat or host plant will be found at disproportionately high frequency in Jackson Hole Lycaeides, and (2) locus-specific measures of genetic differentiation (a metric of divergent selection) will be positively correlated with extreme genomic introgression (a metric of selection in hybrids). The first prediction is clearly supported by our results, whereas support for the second is mixed and less conclusive. For example, we document concordant patterns of differentiation and genomic introgression, particularly for highly divergent regions of the genome. However, we also document weakly differentiated loci with exceptional patterns of introgression and highly differentiated loci with patterns of introgression not different from the genome average. We discuss several possible causes for this discordance.
SEQUENCE DATA COLLECTION AND ASSEMBLY
We generated DNA sequence data from 116 L. idas (five localities), 76 L. melissa (three localities), and 186 Jackson Hole Lycaeides (five localities; Fig. 1, Table 1). We collected butterflies within U.S. National Parks (NP) in accordance with NP Service permits: Yellowstone NP (YELL-2008-SCI-5682) and Grand Teton NP (GRTE-2008-SCI-0024). We isolated and purified DNA from each of the sampled butterflies from approximately 10 mg of thoracic tissue using Qiagen’s DNeasy 96 Blood and Tissue Kit (Cat. No. 69581; Qiagen Inc., Valencia, CA). We generated reduced genomic complexity libraries for each individual using a restriction fragment-based procedure (van Orsouw et al. 2007; Gompert et al. 2010a; Andolfatto et al. 2011). We labeled fragments from each butterfly with 10 base pair (bp) individual identification sequences (i.e., barcodes). DNA sequencing of the reduced genomic complexity libraries was performed by the National Center for Genome Research (Santa Fe, NM) using the Illumina GAII platform. We used SeqMan NGen 3.0.4 (DNASTAR) to perform a de novo assembly for a subset of the sequences (12 million) and generate a reference sequence. We then assembled the full sequence dataset (110 million sequences) to the reference using SeqMan xng 18.104.22.168 (DNASTAR). We used custom Perl scripts in conjunction with samtools and bcftools (Li et al. 2009) to identify variant sites in the assembled sequence data and to determine the number of reads supporting each alternative nucleotide state for each individual and locus. We identified 119,677 variable sites using stringent criteria. The mean number of sequences per variable site per individual was 2.21. Because sequence coverage was low, we did not attempt to call genotypes, but rather incorporated genotype uncertainty in all analyses (see below). A detailed description of sequence data collection, assembly, and variant calling is included in the Supporting information.
Table 1. Population sample information (JH = Jackson Hole Lycaeides; N= sample size).
King’s Hill, MT
Garnet Peak, MT
Bunsen Peak, WY
Trout Lake, WY
Hayden Valley, WY
Mt. Randolf, WY
Upper Slide Lake, WY
Teton Science School, WY
Blacktail Butte, WY
Bull Creek, WY
GENETIC VARIATION, LINKAGE DISEQUILIBRIUM, AND POPULATION STRUCTURE
We used a Bayesian model to estimate allele frequencies for each of 119,677 variable nucleotides based on the observed sequence data (see Population allele frequency model in the Supporting information). The model incorporates uncertainty in genotypic state arising from low-coverage next-generation sequence data, but is otherwise equivalent to standard methods used to estimate allele frequencies (e.g., Gillespie 2004). In other words, we treated the genotype at each locus and the population allele frequency as unknown model parameters, which we estimated from the DNA sequence data. We estimated posterior probability distributions for genotypic state and allele frequencies separately for each of the 13 sampled localities (populations), L. idas, L. melissa, Jackson Hole Lycaeides, and all sampled Lycaeides. We obtained posterior parameter estimates for the model parameters (allele frequency and genotypic state) using Markov chain Monte Carlo (MCMC). Each analysis consisted of a single chain iterated for 20,000 steps (we recorded samples every fourth step).
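As a minimal sketch of how genotype uncertainty can be carried forward from low-coverage reads, the Python below computes posterior genotype probabilities at a single biallelic site from read counts. It is not the model described in the Supporting information: the binomial read-sampling likelihood, the Hardy–Weinberg prior, the per-read error rate `err`, and the function name are all illustrative assumptions.

```python
from math import comb

def genotype_posterior(n_ref, n_alt, p_alt, err=0.01):
    """Posterior P(g | reads) for g = 0, 1, 2 copies of the alternative allele.

    Assumptions (not from the paper): reads are binomial draws, the prior on
    genotypes is Hardy-Weinberg given allele frequency p_alt, and err is a
    symmetric per-read sequencing error rate.
    """
    n = n_ref + n_alt
    prior = [(1 - p_alt) ** 2, 2 * p_alt * (1 - p_alt), p_alt ** 2]
    # Probability that a single read shows the alternative allele, by genotype.
    read_alt = [err, 0.5, 1 - err]
    like = [comb(n, n_alt) * q ** n_alt * (1 - q) ** n_ref for q in read_alt]
    joint = [pr * lk for pr, lk in zip(prior, like)]
    total = sum(joint)
    return [j / total for j in joint]

# Two reads, one of each allele: the heterozygote dominates the posterior.
post = genotype_posterior(n_ref=1, n_alt=1, p_alt=0.3)
# With no reads, the posterior equals the Hardy-Weinberg prior.
no_data = genotype_posterior(n_ref=0, n_alt=0, p_alt=0.3)
```

In a full analysis the allele frequency would itself be an unknown updated jointly with the genotypes (for example, by alternating the two updates within an MCMC chain); here `p_alt` is held fixed to keep the sketch short.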
We estimated Burrows’ composite measure of Hardy–Weinberg and linkage disequilibrium (Δ) between pairs of variable sites (Weir 1979). This metric does not require phased data or assume Hardy–Weinberg equilibrium, but instead provides a composite measure of intralocus and interlocus disequilibria estimated directly from the genotype frequencies (Weir 1979). We used a Monte Carlo procedure to estimate Δ while accounting for incomplete knowledge of genotypes. Specifically, for each population and taxon, we iteratively sampled genotypic states for each individual from the posterior probability distribution of genotypic states (as estimated above) and calculated Δ̂ii′ for each locus pair (i.e., locus i and locus i′). We iterated this procedure 100 times for each locus pair and used the mean value of Δ̂ii′ as an estimate of Δii′. We calculated Δ for the subset of 17,693 variable sites with minor allele frequencies greater than 0.1. We discarded low-frequency variants for this analysis as they are less informative regarding deviations from linkage equilibrium. Similarly, we did not use these low-frequency variants for the analyses of genetic differentiation and genomic introgression described below, as they provide little information about ancestry and genomic introgression. We summarized the distribution of the absolute value of Δ̂ (i.e., |Δ̂|) separately for all pairs of variable sites (∼1.57 × 10⁸ pairs) and pairs of variable sites associated with the same DNA fragment (2986 pairs or ∼0.002% of all locus pairs). Finally, for each admixed population, we determined the proportion of locus pairs with Δ̂ > 0 (we defined coupling and repulsion alleles so admixture generates positive values of Δ). An excess of locus pairs with Δ̂ > 0 would constitute evidence of admixture linkage disequilibrium in Jackson Hole Lycaeides. We used C to implement this Monte Carlo estimation procedure.
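The Monte Carlo procedure can be sketched as follows. This Python illustration is not the C implementation used for the analyses; it assumes that the composite disequilibrium can be computed as half the covariance of allele counts at the two loci (without a small-sample correction) and that genotypes at the two loci are sampled independently from their posteriors. The function names are hypothetical.

```python
import random

def composite_ld(g1, g2):
    """Composite disequilibrium from genotype counts (0, 1, 2) at two loci.

    Uses Delta = cov(g1, g2) / 2 with no small-sample correction;
    gametic phase is not required.
    """
    n = len(g1)
    m1 = sum(g1) / n
    m2 = sum(g2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(g1, g2)) / n
    return cov / 2

def monte_carlo_ld(post1, post2, n_iter=100, rng=None):
    """Average composite LD over genotypes sampled from per-individual posteriors.

    post1 and post2 are lists of [P(g=0), P(g=1), P(g=2)], one per individual.
    """
    rng = rng or random.Random(0)
    total = 0.0
    for _ in range(n_iter):
        g1 = [rng.choices([0, 1, 2], weights=p)[0] for p in post1]
        g2 = [rng.choices([0, 1, 2], weights=p)[0] for p in post2]
        total += composite_ld(g1, g2)
    return total / n_iter
```

Averaging over sampled genotypes, rather than plugging in the most probable genotype, keeps the estimate from being driven by individuals whose genotypes are poorly resolved by a handful of reads.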
We used principal component analysis (PCA) to summarize population genetic structure. Specifically, we used the estimated genotypic state probabilities for two of three genotypes (the heterozygous genotype and one homozygous genotype) at each locus as variables for PCA (119,677 × 2 = 239,354 variables total). We performed the PCA in R using the prcomp function after centering, but not scaling, the genotype probabilities. In other words, we used the covariance matrix rather than the correlation matrix for PCA. We also described population genetic structure based on estimates of genome-level FST between pairs of populations (see the Supporting information, F-model).
GENETIC DIFFERENTIATION ACROSS THE GENOME
We used two methods to quantify genome-wide genetic differentiation between L. idas and L. melissa: (1) we estimated FST using a hierarchical Bayesian implementation of the F-model, and (2) we calculated GST directly from allele frequency estimates (both methods are described in the Supporting information). The hierarchical Bayesian F-model treats FST as an evolutionary parameter and allows information sharing among loci, whereas GST is a simple summary of allele frequencies for each locus (Holsinger and Weir 2009). The F-model provides a metric of genetic differentiation that is analogous to FST under several evolutionary models (Balding and Nichols 1995; Nicholson et al. 2002; Falush et al. 2003). Here, we equate this metric of differentiation with FST. Related models for FST outlier analysis without genotype uncertainty have been proposed and are in wide use (Beaumont and Balding 2004; Foll and Gaggiotti 2008; Guo et al. 2009).
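To illustrate what a summary-statistic measure of differentiation looks like in practice, the sketch below computes GST for a single biallelic locus from two allele frequency estimates. It assumes the standard heterozygosity-based form GST = (HT − HS)/HT with equally weighted populations; the exact estimator used for the analyses is specified in the Supporting information and may differ. The function name is illustrative.

```python
def gst(p1, p2):
    """G_ST for one biallelic locus and two equally weighted populations."""
    # Mean expected heterozygosity within populations.
    h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    # Expected heterozygosity of the pooled (mean) allele frequency.
    p_bar = (p1 + p2) / 2
    h_t = 2 * p_bar * (1 - p_bar)
    # A locus that is monomorphic overall carries no differentiation signal.
    return 0.0 if h_t == 0 else (h_t - h_s) / h_t

same = gst(0.5, 0.5)      # identical frequencies give no differentiation
fixed = gst(0.0, 1.0)     # a fixed difference gives complete differentiation
partial = gst(0.2, 0.8)
```

Because each locus is summarized on its own, with no sharing of information among loci, estimates of this kind are expected to be noisier than model-based estimates.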
We obtained Bayesian estimates of FST using 17,693 loci with a global minor allele frequency ≥ 0.1. We excluded less variable loci from this and all subsequent analyses as they provide little information about genetic divergence or ancestry. We estimated posterior probabilities of FST for each locus and a metric of FST for the genome using MCMC. Additionally, we designated outlier loci with higher than expected levels of genetic differentiation between L. idas and L. melissa given the estimated genome-wide FST distribution. In other words, we concurrently estimated FST for individual loci and model parameters describing the expected distribution of FST across the genome from the DNA sequence data. We then used the expected distribution of FST across the genome to designate outlier loci. This general procedure for designating outlier loci is described by Gompert and Buerkle (2011a; 2011b) and in the Supporting information (see F-model). FST outlier loci are more likely to have been influenced by divergent selection than nonoutlier loci (Beaumont and Nichols 1996; Nosil et al. 2009b; Gompert and Buerkle 2011b). We ran two independent chains for 25,000 steps each. Samples were recorded every step following a 1000 step burn-in. We combined the output of the two chains after inspecting the MCMC output to assess convergence to the stationary distribution.
We measured genetic differentiation between pairs of populations to determine whether FST outlier loci between L. idas and L. melissa were also highly differentiated between individual pairs of populations. We quantified genome-wide, population-level genetic differentiation using both FST and GST, as described for the species-level analyses. We designated outlier loci for each population comparison and quantified the number of outlier loci shared among individual pairs of L. idas, L. melissa, and Jackson Hole Lycaeides populations, and individual pairs of heterospecific populations. We also determined the number of FST outliers between L. idas and L. melissa that were also outliers between individual L. idas and L. melissa populations.
We measured genomic introgression of L. idas and L. melissa genetic regions in admixed Jackson Hole Lycaeides. We were not concerned with the geography of introgression (i.e., geographic introgression), but rather with the movement of genetic material from one genomic background to another within a geographic region (i.e., genomic introgression; Rieseberg et al. 1999; Lexer et al. 2007; Gompert and Buerkle 2009). We quantified locus-specific genomic introgression using the Bayesian genomic cline model on the basis of two locus-specific genomic cline parameters (Gompert and Buerkle 2011a). These cline parameters specify the probability that an individual with hybrid index H=h inherited a gene copy at locus I=i from L. idas (denoted φ; the probability of L. melissa ancestry is 1 −φ). The base probability of L. idas ancestry for a locus is equal to an individual’s hybrid index. The genomic cline center parameter, α, specifies an increase (positive values of α) or decrease (negative values of α) in the probability of L. idas ancestry for a locus relative to the base expectation. The genomic cline rate parameter, β, specifies an increase (positive values) or decrease (negative values) in the rate of transition from low to high probability of L. idas ancestry as a function of hybrid index (Fig. S1; Gompert and Buerkle 2011a). The parameter β is a measure of the average ancestry-based pairwise linkage disequilibrium between a marker locus and all other marker loci. More formally,
$$\Phi_{ih} = h + 2\left(h - h^{2}\right)\left[\alpha_i + \beta_i\left(2h - 1\right)\right],$$

where φih is given by a simple transformation of Φih to ensure 0 ≤ φ ≤ 1 and that φ is a monotonically increasing function of hybrid index (Gompert and Buerkle 2011a). Simulations have demonstrated that selection against specific hybrid genotypes (i.e., locus-specific reproductive isolation), whether arising from single locus (underdominance) or multilocus (DM) incompatibilities, affects α and β, but the effect of selection on α is often more pronounced, particularly if dispersal from parental populations is limited (Gompert and Buerkle 2011a; Gompert et al. 2012). Specifically, underdominance and epistatic incompatibilities cause extreme positive or negative genomic cline center parameters (α) and high, positive genomic cline rate parameters (β). The same simulation studies demonstrate that selection favoring a locally favored homozygous genotype will affect α, but has little to no effect on β (Gompert and Buerkle 2011a; Gompert et al. 2012).
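A minimal sketch of how α and β modify the probability of L. idas ancestry is given below. It assumes that the cline function takes the form Φ = h + 2(h − h²)[α + β(2h − 1)] and that the transformation to the unit interval can be approximated by simple clipping; the actual transformation also enforces monotonicity and is described by Gompert and Buerkle (2011a). The function name is illustrative.

```python
def cline_phi(h, alpha, beta):
    """Probability of L. idas ancestry at a locus, given hybrid index h."""
    raw = h + 2 * (h - h ** 2) * (alpha + beta * (2 * h - 1))
    # Assumption: approximate the transformation to [0, 1] by clipping.
    return min(1.0, max(0.0, raw))

# With alpha = beta = 0 the locus follows the genome-wide expectation.
baseline = cline_phi(0.7, 0.0, 0.0)
# Positive alpha raises, and negative alpha lowers, L. idas ancestry.
shifted_up = cline_phi(0.7, 0.5, 0.0)
shifted_down = cline_phi(0.7, -0.5, 0.0)
```

Under this form, α shifts the ancestry probability up or down at every intermediate hybrid index, whereas β has no effect at h = 0.5 and changes how steeply ancestry rises on either side of it.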
We used a modified implementation of the Bayesian genomic cline model to quantify genome-wide variation in introgression in Jackson Hole Lycaeides (see Genomic cline model with genotype uncertainty in the Supporting information). The modified model incorporates uncertainty in genotypic state inherent in next-generation sequence data, but is otherwise identical to the model described by Gompert and Buerkle (2011a). We estimated marginal posterior probability distributions for hybrid indexes and cline parameters (α and β) using MCMC. We ran five independent chains for 50,000 steps each and recorded samples from the posterior distribution every step following a 30,000 step burn-in. We combined the output of the five chains after inspecting the MCMC output to assess convergence to the stationary distribution.
The following comparative analyses, which test the hypothesis that locus-specific genetic differentiation between L. idas and L. melissa predicts locus-specific introgression in Jackson Hole Lycaeides, consider only the genomic cline parameter α, as we did not detect meaningful variation for β (see Results). Jackson Hole Lycaeides inherited the majority of their genome from L. idas (see Results) and use the same host plant and occupy similar habitats to nearby L. idas populations (Scott 1986; Gompert et al. 2010b). Thus, we predict that, for loci that are highly differentiated between L. idas and L. melissa, selection in Jackson Hole Lycaeides should favor L. idas alleles more frequently than L. melissa alleles. To test this hypothesis, we asked whether L. idas × L. melissa FST outlier loci were more likely to have elevated L. idas ancestry than expected by chance. By elevated L. idas ancestry, we simply mean that the point estimate of α (i.e., α̂) was greater than zero. Because genomic cline parameter α represents a deviation from the ancestry probability predicted solely from hybrid index and because the α parameters are constrained to sum to zero, we used the expectation that 50% of outlier loci should have α̂ > 0 as a null hypothesis. We tested for a significant deviation from this expectation based on a binomial probability distribution with pL.idas = 0.5. Additionally, we obtained Bayesian posterior estimates of the probability that FST outlier loci had α̂ > 0 by specifying a binomial likelihood for the number of FST outlier loci with α̂ > 0 and an uninformative beta prior on pL.idas. We repeated these analyses with all 17,693 loci with estimates of FST and α for comparison.
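The two tests can be sketched in a few lines of Python. The exact two-sided binomial test below relies on the symmetry of the null distribution at p = 0.5, and the posterior mean assumes a beta(1, 1) prior, which is one common choice of uninformative prior and is an assumption of this sketch rather than a detail taken from the analysis. The function names are illustrative.

```python
from math import comb

def binom_two_sided_p(k, n):
    """Exact two-sided binomial P-value under the symmetric null p = 0.5."""
    extreme = max(k, n - k)
    tail = sum(comb(n, i) for i in range(extreme, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def beta_posterior_mean(k, n, a=1.0, b=1.0):
    """Posterior mean of p given k successes in n trials and a beta(a, b) prior."""
    return (k + a) / (n + a + b)

# Hypothetical counts: 15 of 20 loci with a positive point estimate.
p_value = binom_two_sided_p(15, 20)
post_mean = beta_posterior_mean(15, 20)
```

With a beta prior the posterior is again a beta distribution, beta(k + a, n − k + b), so credible intervals follow directly from its quantiles.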
We were also interested in whether there was a general correlation between the genetic regions affected by selection in the geographically disjunct parental populations and Jackson Hole Lycaeides. A correlation between locus-specific FST and the absolute value of locus-specific genomic cline center parameter, α, could provide support for this hypothesis. However, estimates of α could be affected by the ancestry information contained in the sequence data for each locus (i.e., the correspondence between what population an allele was inherited from and the allelic state). Ancestry information is dependent on the allele frequency differential between parental species and thus should be positively correlated with locus-specific FST. Therefore, in addition to calculating the correlation between locus-specific FST and the absolute value of the genomic cline center parameter (|α|) for the Lycaeides sequence data, we estimated the same correlation for simulated datasets. We first analyzed 100 × 4 replicate datasets simulated using four different sets of demographic conditions to ask whether a correlation between FST and |α| might arise in the absence of selection. Each dataset included 204 biallelic loci. Ideally, we would examine simulated datasets with the same demographic history as the Jackson Hole Lycaeides. However, given that this history is not known and would be quite difficult to infer with precision, we instead chose to explore a range of demographic conditions. We simulated 10 additional datasets with a large number of loci (15,010) to determine how the correlation between FST and |α| for the simulated and observed data was affected by the minimum ancestry information (i.e., minimum value of FST) required for a locus to be included in the analysis. The conditions we used to simulate these 10 datasets gave a range of estimated values for FST and α that were very similar to the range of estimated values from the Lycaeides data. 
Full details regarding the simulation procedures and analysis of simulated datasets are included as Supporting information (see Admixture simulations).
We sequenced 110.2 million, 108 bp DNA fragments from 116 L. idas (five localities), 76 L. melissa (three localities), and 186 Jackson Hole Lycaeides (five localities; Fig. 1, Table 1). These data contained 119,677 variable nucleotides distributed among 52,239 short (∼92 bp long) contigs. Estimates of Δ, which is a joint metric of deviations from Hardy–Weinberg and linkage equilibrium, were quite low (Table 2). Specifically, the median value of |Δ̂| for each population and each taxon was less than 0.01. The distribution of pairwise disequilibria was similar in L. idas, L. melissa, and Jackson Hole Lycaeides populations. Estimated values of |Δ̂| for physically linked variable sites that occurred on the same DNA fragment (i.e., fewer than 100 bp apart) were two to three times greater than for other loci (mean [across all populations] median value of |Δ̂|: linked loci = 0.0155, all loci = 0.0062; Table 2). The proportion of locus pairs with Δ̂ > 0 in each admixed population was 0.50. PCA of estimated multilocus genotypes and estimates of genome-level FST demonstrated genetic differentiation between L. idas and L. melissa populations (Fig. 2; FST = 0.074, 95% credible interval [CI]: 0.072–0.075). These analyses indicated that Jackson Hole Lycaeides possess an intermediate gene pool relative to L. idas and L. melissa, but are more genetically similar to L. idas (L. idas × Jackson Hole FST = 0.017, 95% CI: 0.016–0.018; L. melissa × Jackson Hole FST = 0.047, 95% CI: 0.046–0.048). Ordination of pairwise population genetic differences (measured by FST) and morphometric analyses gave similar results (Figs. S2–S4, Table S1). Bayesian estimates of hybrid index for individual Jackson Hole Lycaeides ranged from 0.63 to 0.78 (Fig. S5; hybrid index measures the proportion of an individual’s genome that was inherited from L. idas).
Table 2. Quantiles of the empirical distribution of estimated pairwise composite linkage disequilibria (|Δ̂|) for Lycaeides populations and taxa (combined populations). “Linked loci” refers to pairs of variable sites from the same DNA fragment. Population IDs are from Table 1.
GENETIC DIFFERENTIATION AND GENOMIC INTROGRESSION
We detected considerable interlocus variation in genetic differentiation between L. idas and L. melissa (Fig. 3A). Locus-specific estimates of FST between L. idas and L. melissa ranged from 0.059 to 0.298 (mean = 0.074). FST estimates for 876 loci (out of 17,693) exceeded 0.1. Eighty loci exceeded the 0.95 quantile of the expected genome-wide distribution of FST and were classified as statistical outlier loci (21 loci exceeded the 0.99 quantile of the expected genome-wide distribution and represent extreme outlier loci; Fig. 3A). This means that the probability of the FST value for these loci conditional on the estimated variation in FST across the genome was less than 0.05 (outlier loci) or 0.01 (extreme outlier loci). FST outlier loci are likely enriched for genetic regions tightly linked to loci experiencing divergent selection between L. idas and L. melissa (Beaumont and Nichols 1996; Beaumont and Balding 2004; Nosil et al. 2009a). All 80 FST outlier loci between L. idas and L. melissa were also classified as outlier loci between at least two pairs of L. idas and L. melissa populations (Table S2). Moreover, all loci classified as outliers between at least 10 of the 15 individual pairs of L. idas by L. melissa populations were also classified as outliers between L. idas and L. melissa. Estimates of GST were highly correlated with estimates of FST (r = 0.979, P < 2.2 × 10⁻¹⁶), but were considerably more variable (mean = 0.054, SD = 0.079, minimum < 0.001, quantile = 0.198, maximum = 0.938; Fig. S6).
Genomic introgression in Jackson Hole Lycaeides varied across the genome (Figs. 3B, 4, S7). Genomic cline parameter α was particularly variable, with a minimum of α=− 1.79 and a maximum of α= 1.03. Considering an individual with h= 0.7, the probability of L. idas ancestry for a locus with α=− 1.79 is ∼ 0, whereas the probability of L. idas ancestry for a locus with α= 1.03 is ∼ 1. We detected excess L. idas ancestry (i.e., the lower bound of the 95% CI for α was greater than zero) for 1791 loci (10.1% of the loci) and excess L. melissa ancestry (i.e., the upper bound of the 95% CI for α was less than zero) for 1583 loci (8.9% of the loci; Fig. 4). By excess ancestry, we mean that the 95% CI for α did not include zero. Thus, genomic introgression for approximately 19% of the sampled loci differed from the genome average predicted by hybrid index. In general, extreme values of α are expected for loci that reside in genetic regions affected by selection in Jackson Hole Lycaeides. Genomic cline parameter β was less variable (min =− 0.298, max = 0.230). The 95% CI for β encompassed zero for all loci.
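The worked example for an individual with h = 0.7 can be reproduced with a short sketch. It assumes that the cline function of the Bayesian genomic cline model takes the form φ = h + 2(h − h²)(α + β(2h − 1)), truncated to the interval [0, 1]; that functional form is not spelled out in this section, and the function name is hypothetical.

```python
def ancestry_probability(h, alpha, beta=0.0):
    """Locus-specific probability of L. idas ancestry for hybrid index h.

    Assumed cline function (not stated in this section):
        phi = h + 2 * (h - h^2) * (alpha + beta * (2h - 1)),
    truncated so that the result is a valid probability.
    """
    if not 0.0 <= h <= 1.0:
        raise ValueError("hybrid index must lie in [0, 1]")
    phi = h + 2.0 * (h - h * h) * (alpha + beta * (2.0 * h - 1.0))
    return min(1.0, max(0.0, phi))


# Extreme cline centers reported in the text, evaluated at h = 0.7.
for alpha in (-1.79, 0.0, 1.03):
    print(alpha, ancestry_probability(0.7, alpha))
```

Under this assumed form, α = 0 returns the hybrid index itself, so positive α shifts the probability of L. idas ancestry above the genome-wide expectation and negative α shifts it below.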
Estimates of genomic cline center (defined as the median of the posterior distribution for α) for 72 of the 80 FST outlier loci were greater than zero. This represents significantly more outlier loci with an elevated probability of L. idas ancestry (i.e., cline center greater than zero) than expected by chance (binomial prob. test, P= 5.4 × 10−14; Bayesian estimate of the binomial probability, pL.idas= 0.89, 95% CI: 0.81 − 0.95). We obtained similar results for the 21 extreme outlier loci. Point estimates of genomic cline center for 19 of these loci were greater than zero, which is indicative of an elevated probability of L. idas ancestry for extreme outlier loci (binomial prob. test, P= 0.0002; Bayesian estimate of the binomial probability, pL.idas= 0.87, 95% CI: 0.71 − 0.97). Conversely, the cline center was greater than zero for 8546 of the 17,693 loci (i.e., all loci, not just FST outlier loci). This result is also inconsistent with binomial expectation (binomial prob. test, P= 6.4 × 10−6). However, the probability of a locus with cline center greater than zero was much closer to 0.5 (Bayesian estimate of the binomial probability, pL.idas= 0.483, 95% CI: 0.476 − 0.490) and indicative of a slight excess of loci with cline center less than zero (i.e., elevated L. melissa ancestry rather than elevated L. idas ancestry).
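The binomial comparisons above can be checked with a minimal sketch that uses only exact integer arithmetic. It assumes a two-sided exact test against a null probability of 0.5 and, for the Bayesian point estimate, a uniform Beta(1, 1) prior summarized by the posterior mean; neither the prior nor the summary statistic is stated in this section, and the function names are hypothetical.

```python
def two_sided_binomial_p(k, n):
    """Exact two-sided P-value for k successes in n trials when p = 0.5.

    With p = 0.5 the null distribution is symmetric, so the two-sided
    P-value is twice the smaller tail probability (capped at 1).
    """
    tail = min(k, n - k)
    term, total = 1, 0              # term holds C(n, i), updated iteratively
    for i in range(tail + 1):
        total += term
        term = term * (n - i) // (i + 1)
    return min(1.0, 2 * total / 2 ** n)


def posterior_mean(k, n):
    """Posterior mean of the binomial probability under a Beta(1, 1) prior."""
    return (k + 1) / (n + 2)


# Counts of loci with cline center greater than zero, as given in the text.
for k, n in ((72, 80), (19, 21), (8546, 17693)):
    print(k, n, two_sided_binomial_p(k, n), round(posterior_mean(k, n), 3))
```

For 72 of 80 and 19 of 21 loci this gives two-sided P-values of roughly 5.4 × 10−14 and 2.2 × 10−4, respectively, and posterior means of about 0.89, 0.87, and 0.483 for the three comparisons.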
Locus-specific measures of genetic differentiation between nonadmixed populations were correlated with locus-specific estimates of genomic cline center in Jackson Hole Lycaeides (Fig. 5). Specifically, we detected a positive correlation between FST and the absolute value of cline center (r= 0.152, P < 2.2 × 10−16). Our analysis of datasets simulated under a variety of demographic conditions indicates that the observed correlation could easily be explained in the absence of selection and could simply be due to variation in the ancestry-information content of loci (see the Supporting information: Admixture simulations and Fig. S8). However, for the simulated datasets, the correlation between FST and the absolute value of cline center approached zero when only loci with FST greater than 0.08 or 0.1 were analyzed (Fig. 6). Conversely, for the Lycaeides data, the correlation increased when loci with low estimates of FST were removed from the analysis. When only highly differentiated loci were used to calculate the correlation, results for individual simulated datasets were erratic because few loci met the criterion for inclusion; however, the mean correlation coefficient across 10 simulated datasets remained near zero (Fig. 6). The correlation coefficient for the Lycaeides data was positive regardless of the loci included.
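The thresholded comparison described above can be sketched as follows. The code computes the Pearson correlation between FST and the absolute value of cline center after discarding loci at or below a minimum FST. The function names and the toy values are hypothetical and serve only to show the calculation; they are not data from this study or from the simulations.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def r_above_threshold(fst, cline_center, min_fst=0.0):
    """Correlation of FST with |cline center| for loci with FST > min_fst."""
    kept = [(f, abs(a)) for f, a in zip(fst, cline_center) if f > min_fst]
    if len(kept) < 3:
        raise ValueError("too few loci pass the FST threshold")
    xs, ys = zip(*kept)
    return pearson_r(xs, ys)


# Toy values (hypothetical): one FST and one cline center per locus.
toy_fst = [0.06, 0.07, 0.09, 0.11, 0.15, 0.20]
toy_center = [0.10, -0.20, 0.05, -0.30, 0.50, -0.80]
for threshold in (0.0, 0.08, 0.1):
    print(threshold, r_above_threshold(toy_fst, toy_center, threshold))
```

Taking the absolute value treats excess L. idas and excess L. melissa ancestry alike, so the statistic measures how extreme introgression is rather than its direction. Raising the threshold shrinks the set of loci, which is why estimates for individual datasets become erratic at high thresholds.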
Patterns of genetic differentiation and genomic introgression in Lycaeides are consistent with the hypothesis that genetic regions experiencing divergent selection between geographically disjunct L. idas and L. melissa populations also affect hybrid fitness in Jackson Hole Lycaeides. As predicted, Jackson Hole Lycaeides had an elevated probability of L. idas ancestry at a far greater proportion of FST outlier loci than expected by chance. This result is not an artifact of the overall genomic composition of Jackson Hole Lycaeides, as we measured the probability of locus-specific ancestry relative to the genomic admixture proportion or hybrid index, but rather is best explained by selection. There are two primary reasons that selection might favor elevated L. idas ancestry at highly differentiated loci. First, a subset of highly differentiated loci between L. idas and L. melissa could have evolved by divergent selection in response to differences in habitat and host plant, and these same loci could affect habitat or host plant associated fitness in Jackson Hole Lycaeides. Alleles inherited from L. idas would likely be favored at these loci, because L. idas and Jackson Hole Lycaeides populations occupy similar habitat and feed on the same larval host plant. This explanation is consistent with an important role for habitat or host plant associated adaptation during the speciation process, as has been suggested for numerous phytophagous insects (Rand and Harrison 1989; Via et al. 2000; Nosil et al. 2002; McBride and Singer 2010) and other species (Szymura and Barton 1986; Rundle et al. 2000; Schluter and Conte 2009). Elevated L. idas ancestry at highly differentiated loci could also be explained by gene interactions, such as DM incompatibilities, in which alleles are disproportionately favored in particular genomic backgrounds. Because Jackson Hole Lycaeides have inherited a greater proportion of their genome from L. idas than L. melissa populations, selection would likely favor L. idas ancestry at such loci. We would need several admixed populations occupying different habitats or with different overall genomic compositions to distinguish between these two explanations.
As predicted, we detected a positive association between locus-specific measures of genetic differentiation and genomic introgression. Specifically, locus-specific estimates of FST were correlated with the absolute value of locus-specific estimates of genomic cline center in Jackson Hole Lycaeides. This positive association could reflect consistent selection between parental populations and in the admixed population and suggests that further research to investigate the relationship between selection in parental lineages and hybrids is warranted. However, the association could also simply reflect variation in the ancestry-information content of individual loci. Analysis of datasets simulated under a model of neutral admixture and a variety of demographic conditions certainly supports the possibility that such a correlation could arise as an artifact of variation in ancestry information. However, the correlation between FST and the absolute value of cline center in Lycaeides is greater when only loci with FST > 0.1 are examined, whereas the correlation disappears in a series of 10 simulated datasets under comparable conditions. This contrast between neutral simulations and the empirical Lycaeides data suggests that some fraction of the loci under divergent selection between L. idas and L. melissa also affect hybrid fitness in Jackson Hole Lycaeides. This conclusion should be robust to different demographic histories as long as they generate roughly the same genome-wide distribution of FST and α, but further simulation studies might lead to additional insights.
Reproductive isolation requires differences in allele frequencies at the causal loci, but allele frequency differences at linked marker loci are much more likely if isolation evolved by selection rather than drift. Thus, as hybrid fitness is a component of reproductive isolation, the association between FST, which measures allele frequency differences, and genomic cline center is consistent with the hypothesis that genetic differentiation caused by divergent selection in allopatry contributes to reproductive isolation and affects hybrid fitness in Jackson Hole Lycaeides. Additional empirical evidence from a variety of taxonomic groups links reproductive isolation and divergent selection; this includes comparative studies of isolation and ecological divergence (Bolnick et al. 2006; Funk et al. 2006), experimental evolution studies where replicate populations are exposed to different environmental conditions (Dodd 1989), and molecular population genetic characterization of known speciation genes (Tang and Presgraves 2009; Nosil and Schluter 2011). The current results suggest that multiple loci that affect hybrid fitness might have differentiated between L. idas and L. melissa as a consequence of divergent selection. Future research combining population genomics and experimental crosses to study the association between historical divergent selection and hybrid fitness at many loci across the genome will likely refine this conclusion.
Although our results suggest that a subset of the loci under divergent selection between L. idas and L. melissa also affect hybrid fitness in Jackson Hole Lycaeides, genetic differentiation and genomic introgression were discordant for a substantial portion of the genome. The locus-specific probability of L. idas ancestry in Jackson Hole Lycaeides for many loci that were markedly differentiated between L. idas and L. melissa was not different from the genome-wide average probability of L. idas ancestry (i.e., hybrid index). Conversely, many loci were only weakly differentiated between L. idas and L. melissa, but introgressed disproportionately relative to the genome-wide average. As stated previously, there are numerous reasons why locus-specific measures of genetic differentiation might not predict locus-specific measures of genomic introgression. For example, the subset of weakly differentiated loci with high positive or negative estimates of genomic cline center could correspond to loci involved in DM incompatibilities that differentiated either by stochastic processes in allopatry (e.g., Gavrilets et al. 1998), or by selection on standing genetic variation. Although we find little evidence of elevated linkage disequilibrium presently in Jackson Hole Lycaeides, admixture linkage disequilibrium is likely to have existed in the past (Fig. S10). If this were the case, these marker loci might have been in linkage disequilibrium with DM incompatibility loci in Jackson Hole Lycaeides, but not in L. idas and L. melissa populations. Clearly, these possible explanations are not mutually exclusive, and our results are consistent with theoretical and empirical results suggesting DM incompatibilities are important for reproductive isolation (e.g., Turelli and Orr 2000; Orr and Turelli 2001; Gavrilets 2003; Brideau et al. 2006; Moyle and Nakazato 2009). Finally, although the habitat, host plant, and genomic composition of Jackson Hole Lycaeides are more similar to that of L. idas than L. melissa, differences certainly exist. Such ecological and genomic differences might alter selection and contribute to the documented discordance in genetic differentiation and genomic introgression.
Importantly, our results demonstrate that both genetic differentiation and genomic introgression vary considerably across the genome in Lycaeides. These results are consistent with patterns of heterogeneous genetic differentiation during speciation reported for other taxa (Harr 2006; Egan et al. 2008; Nosil et al. 2008; Hohenlohe et al. 2010), but are based on considerably more loci than are generally available for taxa that are not genetic model organisms. Genome-wide heterogeneity in genetic differentiation or introgression is often interpreted as evidence that selection shapes variation across the genome (Nolte et al. 2009; Nosil et al. 2009a). This inference is strengthened in Lycaeides because many species-level outlier loci were consistently differentiated between individual pairs of L. idas and L. melissa populations. This consistency is expected if genetic differentiation is the result of selection operating similarly in all populations within each species, but would not be expected if genetic differentiation at these loci was caused by idiosyncratic genetic drift (or selection) operating differently in different conspecific populations (Nosil et al. 2009a). However, recent common ancestry among conspecific populations could also account for this pattern. Despite current genetic differentiation among conspecific populations, this possibility cannot be excluded for L. idas or L. melissa. Moreover, the potential for stochastic processes to generate variation in genetic differentiation or introgression across the genome has not been fully characterized (Buerkle et al. 2011). Finally, the genomic locations of the loci analyzed in this study are currently unknown. Knowledge of the linkage relationships and ancestry block sizes would lead to additional insights on the distribution of differentiation and introgression across the genome (e.g., Ungerer et al. 1998; Buerkle and Rieseberg 2008; Lawniczak et al. 2010). 
We are currently constructing genetic and physical maps for this purpose.
The utility of the Bayesian genomic cline model is affected by linkage disequilibrium and the distribution of hybrid indexes in the admixed population(s). As we discuss in the following section, our results indicate little linkage disequilibrium and a rather narrow distribution of hybrid indexes in Jackson Hole Lycaeides. These are not ideal conditions for the application of this analytical framework and are likely responsible for the uninformative estimates of genomic cline rate parameter β. It is unclear how our results might differ in a different admixed population with increased linkage disequilibrium and a broader range of hybrid indexes. Nonetheless, we were able to generate informative estimates of α that provide intuitive measures of genomic introgression.
HYBRIDIZATION, INTROGRESSION, AND JACKSON HOLE LYCAEIDES
Linkage disequilibrium in Jackson Hole Lycaeides was remarkably low for an admixed population, except between a small subset of loci within the short (<100 bp) DNA fragments, and was similar to linkage disequilibrium in L. idas and L. melissa populations. This is consistent with previous estimates of linkage disequilibrium in Jackson Hole Lycaeides based on six microsatellite and single nucleotide polymorphism (SNP) loci (Gompert et al. 2010b). Moreover, we found no evidence of admixture linkage disequilibrium as positive and negative estimates of Δ were observed with nearly equal frequency. In fact, because most loci had similar allele frequencies in L. idas and L. melissa, admixture is not expected to generate very high levels of linkage disequilibrium (Fig. S10). Likewise, the range of hybrid indexes for Jackson Hole Lycaeides was rather narrow (0.63–0.78). Previous estimates of admixture proportions based on AFLP loci analyzed with the admixture model implemented in structure suggested the genomic composition of Jackson Hole Lycaeides was more variable than our current results indicate; both sets of results show that hybrids are generally more L. idas-like in genomic composition. However, this increased variability from AFLPs was coupled with increased uncertainty in admixture proportion estimates (Gompert et al. 2010b). Moreover, the AFLP dataset included populations at the southern and northern ends of the hybrid zone, whereas our current study focused on populations from the geographic center of the hybrid zone (Gompert et al. 2010b). Together these differences likely account for the discrepancy in hybrid indexes. Regardless, the observed low levels of linkage disequilibrium and narrow range of hybrid indexes in Jackson Hole Lycaeides suggest that current gene flow from L. idas and L. melissa populations to the center of the hybrid zone is minimal to nonexistent. 
Although the formation of the Jackson Hole Lycaeides populations clearly required a period of gene flow and admixture in the past, they presently appear to be evolving with little ongoing influence of nearby parental populations. Genome-wide variation in estimates of genomic cline parameter α and thus the locus-specific probability of L. idas ancestry suggest that a subset of loci might have fixed for chromosomal blocks inherited from L. idas or L. melissa, but that much of the genome still contains segregating variation from both species. The extent of genome stabilization and size of parental chromosomal blocks can be informative about demographic and evolutionary processes affecting admixed populations (Ungerer et al. 1998; Buerkle and Rieseberg 2008), but presently we are limited in our ability to assess these quantities in the absence of mapped genetic loci.
The admixed Jackson Hole Lycaeides populations offer an interesting contrast to the homoploid hybrid Lycaeides species in the alpine region of the Sierra Nevada of western North America (Gompert et al. 2006a). The alpine species was formed following hybridization between L. anna (formerly L. idas anna) and L. melissa. Unlike the Jackson Hole Lycaeides, the alpine hybrid species occupies a different and extreme habitat relative to L. anna and L. melissa and uses a different, alpine-endemic host plant (Gompert et al. 2006a). Moreover, the alpine hybrid species has novel traits that are adaptive in the alpine habitat (Fordyce and Nice 2003; Gompert et al. 2006a). The alpine hybrid species has a mosaic genome with alleles inherited from both L. anna and L. melissa, but has evolved novel derived alleles not shared with either parental species. Conversely, Jackson Hole Lycaeides occur in similar habitat to nearby L. idas populations and use the same host plant. Moreover, Jackson Hole Lycaeides are largely segregating for parental alleles. These differences could simply reflect a more recent origin for Jackson Hole Lycaeides relative to the alpine species in the Sierra Nevada. However, colonization of a novel habitat and host plant in the Sierra Nevada, rather than simple geographic isolation, has likely contributed to these different outcomes of hybridization. Additional geographically isolated hybrid lineages exist in the Warner Mountains and White Mountains of western North America (Gompert et al. 2008b, 2010a), and perhaps in the San Juan Range in southern Colorado and the Wasatch Range in Utah (Nabokov 1943, 1949). Contrasting the genomic outcomes of secondary contact and admixture in each of these ranges has the potential to provide important insights into how selection affects speciation.
Our results provide convincing evidence that multiple loci affected by divergent selection between geographically disjunct L. idas and L. melissa populations also affect fitness in admixed Jackson Hole Lycaeides. The idea that divergent selection drives the evolution of reproductive isolation is pervasive, and a signature of divergent selection associated with individual speciation genes has been detected previously (Darwin 1859; Endler 1977; Turelli et al. 2001; Barbash et al. 2003; Tang and Presgraves 2009). In some instances, divergent selection can be equated with reproductive isolation; however, the two can also be decoupled. This study shows that even when selection in parental and hybrid lineages is potentially decoupled, we see an association between divergent selection in allopatry and hybrid fitness (a component of reproductive isolation) at many loci across the genome. Interestingly, some highly differentiated genetic regions do not appear to be associated with hybrid fitness, whereas genomic introgression of some weakly differentiated genetic regions is affected by selection in hybrids. This indicates that selection associated with species barriers might vary by environment or genomic context and that the genetic basis and evolution of reproductive isolation could be quite complex.
Associate Editor: N. Barton
This manuscript was improved by comments from C. Lexer, P. Nosil, T. Parchman, two anonymous reviewers, and Editor N. Barton. We thank the following for donating Lycaeides specimens or help in the field: M. Diaz, R. Lund, M. Moore, P. Opler, M. Pfeifer, M. Palmer, C. Schmidt, and M. Spurrier. This research was facilitated by the UW-NPS field station in Grand Teton National Park and the research staff at Yellowstone and Grand Teton National Parks. This research was funded by the National Science Foundation (DDIG-1011173 to ZG, NSF EPSCoR WySTEP summer fellowship to LKL, IOS-1021873 and DEB-1050355 to CCN, DEB-0614223 and DEB-1050947 to JAF, DEB-1020509 and DEB-1050726 to MLF, and DBI-0701757 and DEB-1050149 to CAB).