Mining transcriptome sequences towards identifying adaptive single nucleotide polymorphisms in lake whitefish species pairs (Coregonus spp. Salmonidae)


  • The authors are broadly interested in the nature of genetic changes that are associated with speciation. This study is part of Sébastien Renaut’s doctoral research, which aims to study the genomic bases of adaptive divergence in the context of a recent ongoing speciation event in lake whitefish. Arne Nolte is interested in the diversity of fishes and understanding the role that environmental and intrinsic factors play in evolution. Louis Bernatchez’s research focuses on understanding the patterns and processes of molecular and organismal evolution as well as their significance to conservation.

Sébastien Renaut, Fax: +1 418 656 717; E-mail:


Next-generation sequencing allows the discovery of large numbers of single nucleotide polymorphisms (SNPs) in species where little genomic information was previously available. Here, we assembled, de novo, over 130 Mb of non-normalized cDNA using 454 pyrosequencing data from dwarf and normal lake whitefish and backcross hybrids. Our main goals were to gather a large data set of SNP markers, document their distribution within coding regions, evaluate the effect of species divergence on allele frequencies and combine results with previous genomic studies to identify candidate genes underlying the adaptive divergence of lake whitefish. We identified 6094 putative SNPs in 2674 contigs (mean size: 576 bp, range: 101–6116) and 1540 synonymous and 1734 non-synonymous mutations for a genome-wide non-synonymous to synonymous substitution rate ratio (pN/pS) of 0.37. As expected based on the young age (<15 000 years) of the whitefish species pairs, the overall level of divergence between them was relatively weak. Yet, 89 SNPs showed pronounced allele frequency differences between sympatric normal and dwarf whitefish. Among these, SNPs in genes annotated to energy metabolic functions were the most abundant and this, in addition to previous experimental data at the gene expression and phenotypic level, brings compelling evidence that genes involved in energy metabolism are prime candidates explaining the adaptive divergence of lake whitefish species pairs. Finally, we unexpectedly identified 44 contigs annotated to transposable elements and these were predominantly composed of backcross hybrid sequences. This indicates an elevated activity of transposable elements, which could potentially contribute to the reduced fitness of hybrids previously documented.


Next-generation sequencing technologies are rapidly transforming the fields of ecology, evolution and genetics (Rokas & Abbot 2009). This avalanche of data promises to answer experimental inquiries ranging from ancient DNA sequencing (Miller et al. 2008), sequence variant discovery (Vera et al. 2008) and microbial ecology (Dinsdale et al. 2008) to gene expression analysis (Torres et al. 2007; Lipson et al. 2009). High throughput pyrosequencing developed by 454 Life Sciences (Margulies et al. 2005) is of particular interest in ecology and evolution primarily because it yields longer sequencing reads than any other method (up to 600 bp), which allows more accurate de novo sequence assemblies often required for non-model organisms. The recent explosion of second- and third-generation sequencing (Branton et al. 2008; Shendure & Ji 2008; Metzker 2009) has led some researchers to believe that many technical approaches (e.g. Sanger sequencing, DNA microarrays), which were themselves revolutionary a decade or two ago, may already be obsolete today (Ledford 2008). Nevertheless, to unleash their full potential, these methods will require careful experimental design, consideration of the techniques’ limitations and, finally, innovative bioinformatics approaches to process and extract relevant information (Ellegren 2008; Rokas & Abbot 2009).

One of the primary goals of high throughput sequencing projects is to reveal sequence variation such as copy number variants, insertion-deletions (indels) or single nucleotide polymorphisms (SNPs) by sequencing pools of genetically heterogeneous individuals (Barbazuk et al. 2007; Vera et al. 2008; Wiedmann et al. 2008). SNPs are rapidly becoming popular genetic markers in ecology and evolution (Schlötterer 2004; Moen et al. 2008; Namroud et al. 2008). Their main attraction is that, contrary to most amplified fragment length polymorphism (AFLP) markers, they can potentially be directly linked to candidate genes of known function and interest. Moreover, as opposed to microsatellites, which may have complex mutation patterns, their genotyping can be highly automated at moderate cost (Schlötterer 2004; Ehrich et al. 2005; Shen et al. 2005; Van Tassell et al. 2008). Lastly, unlike AFLP and microsatellite data, SNP data can easily be standardized across laboratories. Nevertheless, despite their abundance and genotyping automation, SNP marker development may involve several validation steps. Problems with successful SNP locus amplification, low-frequency polymorphisms or gene duplicates render the identification of reliable markers a non-trivial, potentially labour-intensive task (Fredman et al. 2004; Hayes et al. 2007; Namroud et al. 2008).

Identifying sequence variants in transcribed regions of the genome is of primary interest when attempting to characterize the effects of selection on protein evolution. Sequence polymorphisms within a gene have different impacts depending on their exact genomic location (intron, exon, untranslated region). Mutations within coding regions are especially insightful as their effect on amino acid composition, and therefore protein functionality, can be easily assessed. Similar to dN/dS ratios, the rate of accumulation of non-synonymous polymorphism (pN) scaled by the rate of synonymous polymorphism (pS) provides a glimpse of the selective forces driving the evolution of a protein-coding sequence. Thus, genes with a high pN/pS ratio (i.e. >1) are likely to be evolving under the influence of positive selection (McDonald & Kreitman 1991; Axelsson et al. 2008; Ellegren 2008). Furthermore, if this is associated with phenotypically distinct populations, either through de novo mutations or sorting of standing genetic variation, such genes may represent candidates potentially involved in an adaptive divergence event.

Lake whitefish species pairs represent excellent model species to study the early onset of reproductive isolation and its effect on genomic divergence (Lu & Bernatchez 1998; Bernatchez 2004; Rogers & Bernatchez 2006; Nolte et al. 2009; Renaut et al. 2009). Geographic isolation during the Pleistocene caused genetic divergence between whitefish populations inhabiting distinct glacial refugia but without distinctive phenotypic divergence between glacial races in allopatry (Bernatchez & Dodson 1990, 1991). Secondary contact of these evolutionary lineages subsequently occurred 12 000 years bp and has led to the parallel evolution of two morphologically and ecologically divergent sympatric whitefish species in several lakes of northeastern North America: benthic Normal and limnetic Dwarf whitefish (Bernatchez & Dodson 1990, 1991; Pigeon et al. 1997). As expected from a recent divergence event, the overall level of genetic differentiation between species pairs is relatively weak (Bernatchez et al. 1999; Campbell & Bernatchez 2004) and hybrids can be found in nature (Lu et al. 2001; Falush et al. 2007). At the same time, it has been shown that intrinsic (genetic) and extrinsic (ecological) post-zygotic isolation mechanisms lead to a fitness decrease in hybrids (Lu & Bernatchez 1998; Rogers & Bernatchez 2006; Whiteley et al. 2009) and this is partially caused by gene deregulation (Renaut et al. 2009).

Genome scan studies using anonymous AFLP markers as well as markers linked to quantitative trait loci (QTLs) suggest that a small proportion of the whitefish genome (∼1–2%) might be under the effect of directional selection in the process of adaptive population divergence (Campbell & Bernatchez 2004; Rogers & Bernatchez 2005, 2007). Identifying such key islands of genomic divergence and isolation (sensu Wu 2001) and, more specifically, candidate genes showing evidence of reduced gene flow may represent a daunting task, yet it offers invaluable information for pinpointing the causative variation responsible for reproductive isolation and speciation (Wu & Ting 2004; Turner et al. 2005; Schluter 2009). Our ongoing research programme on the ecological functional genomics of whitefish adaptive divergence and speciation combines gene mapping and genome scans to identify more precisely the genomic regions evolving under the effect of divergent selection in dwarf and normal whitefish. To this end, we sequenced the transcriptome of sympatric dwarf and normal lake whitefish and of backcross hybrids with four specific objectives: first, to gather a large data set of candidate SNP markers; second, to examine the distribution of these markers within coding regions; third, to evaluate the effect of species divergence on allele frequencies; and fourth, as an a posteriori objective, to evaluate rates of transposon activity among normal, dwarf and hybrid whitefish. Our ultimate goal is to link all this information to previous genomic studies in this system (QTL, eQTL, genome scan and gene expression) in an attempt to establish functional and causal links between genotype, phenotype and natural selection, which represents one of the main challenges of the 21st century in evolutionary biology (Schluter 2009).

Materials and methods

Sample preparation

RNA samples were isolated separately from 24 individuals and three different tissue types (white muscle, brain, liver), in order to get a diversified representation of genotypes and expressed genes (Table 1). All RNA samples came from previous gene expression studies and had been kept at −80 °C until thawed for this experiment. As such, fish rearing conditions, euthanasia procedure and RNA extraction protocols are described in detail in St-Cyr et al. (2008) for Pools D and N (liver tissue), Derome et al. (2008) (Pool BC: muscle tissue) and Whiteley et al. (2008) (Pool BC: brain tissue). Pools D and N represent sympatric dwarf and normal whitefish from Cliff Lake, respectively. BC whitefish represent backcross hybrids involving dwarf whitefish from Témiscouata Lake and normal whitefish from Aylmer Lake that were previously used in gene and QTL mapping projects (Rogers & Bernatchez 2007; Rogers et al. 2007). In short, total RNA was extracted separately for each individual using the TRIzol Reagent protocol (Invitrogen). Following extraction, all samples were further cleaned by ultrafiltration using Microcon (Millipore) spin columns. Samples were quantified using the Experion™ RNA StdSens Analysis Kit (Bio-Rad). Total RNA was stored in pure water supplemented with Superase-In™ RNase Inhibitor (Ambion) and kept at −80 °C.

Table 1.   Samples used for sequencing and data obtained from 454 GS-FLX pyrosequencing runs

| Pool | Lineage | Tissue type | Number of individuals | Quantity sequenced | Number of reads | Length (mean/median)* |
|---|---|---|---|---|---|---|
| D | Cliff Lake Dwarf | Liver† | 8 | 0.75 plate | 183365 | 194/214 |
| N | Cliff Lake Normal | Liver† | 8 | 0.75 plate | 210703 | 191/209 |
| BC | [(Aylmer Lake normal × Témiscouata Lake dwarf) × Aylmer Lake normal] | Muscle‡ | 4 | 0.75 plate | 238409 | 195/216 |

*Length in nucleotides of read after primers and sample-specific tags were removed.
†N and D samples originally used by St-Cyr et al. (2008).
‡Muscle tissue was previously used by Derome et al. (2008).
§Brain tissue was previously used by Whiteley et al. (2008).

Enrichment for polyA mRNA was conducted using the MicroPoly(A)Purist™ Kit (Ambion). Approximately 100 ng of full-length complementary DNA was synthesized from each polyA mRNA sample following the SMART™ PCR cDNA Synthesis Protocol (Clontech). All cDNA samples (3–8 ng) were PCR amplified using the Advantage 2 PCR Kit (Clontech) and modified SMART™ primers (5′-AAGCAGTGGTATCAACGCAGAGT-3′), which comprised five extra nucleotides at the 5′ end to serve as an individual-specific tag. PCR conditions were as follows: initial denaturation for 1 min at 95 °C, followed by 17–20 cycles depending on the sample (1 cycle: 15 s at 95 °C, 30 s at 65 °C, 6 min at 68 °C). Following amplification, all samples were quantified using the Quant-iT Picogreen dsDNA Assay Kit (Invitrogen) and three separate pools with equal DNA quantities were prepared: Pools D and N each consisted of RNA extracted from the liver of eight individuals (St-Cyr et al. 2008), whereas Pool BC consisted of four white muscle (Derome et al. 2008) and four brain (Whiteley et al. 2008) tissue samples from backcross hybrids. Approximately 5 μg of double-stranded cDNA from each of the three cDNA pools was sequenced (0.75 run per pool) on a Roche GS-FLX DNA Sequencer using methods previously described (Margulies et al. 2005) at the Genome Quebec Innovation Center (McGill University, Montreal, Canada).

Contig assemblies

Initial quality filtering of whitefish 454 sequences was performed using Roche proprietary analysis software Newbler (Margulies et al. 2005). Base calling was performed using PyroBayes, which produces more confident base calls than the native 454 base-calling programme (Quinlan et al. 2008). Prior to assembling all sequences, primer and sample-specific tag sequences were removed from the data set using a custom-made Perl script. CLC Genomics Workbench 3.1 (CLC Bio) was used to assemble sequences de novo (similarity 0.97, overlap 0.5). We performed several test assemblies, based on parameters from recent transcriptome-sequencing studies (Barbazuk et al. 2007, >0.95 similarity index; Vera et al. 2008, >0.80; Zhao et al. 2009, >0.96), and found that a similarity criterion that is too low (below 0.9) leads to the assembly of dissimilar sequences, riddled with paralogous sequence variants (PSVs) instead of true SNPs (data not shown). On the other hand, a highly restrictive criterion (above 0.98) discards too many sequences from the assembly (data not shown). Allowing for 3% mismatch was deemed a reasonable estimate based on the relatively low whitefish polymorphism previously observed (1.4 SNPs/kb, Whiteley et al. 2008) and the average pyrosequencing error (∼0.5%, Margulies et al. 2005). Note also that our threshold should prevent the assembly of duplicated (paralogous) regions that trace back to an ancient salmonid genome duplication (25–100 Ma, Allendorf et al. 1975), as the latter would be expected to have 6–25% sequence divergence, based on a conservative estimate of ∼0.25% nuclear sequence divergence per million years.
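The reasoning behind the 0.97 similarity criterion can be checked with a few lines of arithmetic. The sketch below is illustrative only: it uses the divergence rate, duplication age range and threshold quoted in the text, and the function and variable names are introduced here for convenience.

```python
# Back-of-the-envelope check of the paralogue-divergence argument.
RATE_PERCENT_PER_MYR = 0.25        # nuclear divergence rate quoted in the text
DUPLICATION_AGE_MYR = (25, 100)    # age range of the salmonid genome duplication (Ma)
SIMILARITY_THRESHOLD = 0.97        # similarity criterion used for the assembly


def expected_divergence_percent(age_myr, rate=RATE_PERCENT_PER_MYR):
    """Expected sequence divergence (%) between duplicates of a given age."""
    return age_myr * rate


# Mismatch tolerated by the assembler, in percent (rounded to avoid float noise).
allowed_mismatch_percent = round((1 - SIMILARITY_THRESHOLD) * 100, 6)
low, high = (expected_divergence_percent(age) for age in DUPLICATION_AGE_MYR)

print(f"Allowed mismatch: {allowed_mismatch_percent}%")
print(f"Expected paralogue divergence: {low}% to {high}%")
print("Paralogues excluded by threshold:", low > allowed_mismatch_percent)
```

Under these figures, even the youngest duplicates are expected to differ by about twice the mismatch that the assembly tolerates.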

Consensus sequences were matched (BLAST; Altschul et al. 1997) against a publicly available set of 32 000 salmonid cDNAs (cGRASP database) in BioEdit (Hall 1999) (blastn e-value <1e-50). This 32 000 cDNA database had been previously assembled from more than 700 000 expressed sequence tag (EST) sequences obtained from a variety of cDNA libraries. Hence, it should comprise the majority of all cDNAs expressed at least in Atlantic salmon, a salmonid closely related to lake whitefish (von Schalburg et al. 2008). The mitochondrial genome of the European lake whitefish (Coregonus lavaretus) (Miya & Nishida 2000) was also used to verify the mitochondrial origin of candidate genes. Functional categories (gene ontology biological functions) for genes of interest were identified with either the information provided by the cGRASP database or searches on and

SNP discovery and functional characterization of polymorphism

Assembled contigs were screened for SNPs using CLC Genomics Workbench 3.1 under the following criteria: minimum SNP coverage of 6X and minimum frequency of the least frequent allele of 20%; the remaining parameters were left at their defaults. The analysis of SNP frequencies between normal and dwarf whitefish as well as other statistical tests were performed in R (v. 2.8.1; The R Foundation for Statistical Computing®, 2009, 3-900051-07-0). Specifically, allele frequencies were analysed to identify SNPs that showed significantly divergent allele frequencies between normal and dwarf whitefish (minimum SNP coverage of 4X in each of normal and dwarf; Fisher’s exact test corrected for multiple hypothesis testing by calculating Q-values from the distribution of P-values; Storey 2002). We then arbitrarily defined strongly divergent SNPs as markers for which the frequency of an allele differed by more than 0.5 between populations (this index has a maximum value of 1) and for which the Q-value was <0.05.
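The filtering logic described above can be sketched in a few lines. This is a minimal illustration, not the authors' R code: the read counts are hypothetical, the helper names are introduced here, and the Benjamini-Hochberg adjustment is swapped in as a simple stand-in for Storey's Q-value procedure.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact P-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x copies of allele 1 in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more probable than the observed one.
    total = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P-values (a stand-in for Storey Q-values)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def allele_divergence(dwarf_counts, normal_counts):
    """|f(allele 1, dwarf) - f(allele 1, normal)| from (allele 1, allele 2) read counts."""
    f_dwarf = dwarf_counts[0] / sum(dwarf_counts)
    f_normal = normal_counts[0] / sum(normal_counts)
    return abs(f_dwarf - f_normal)


# Hypothetical read counts (allele 1, allele 2) per pool; not data from the study.
snps = {
    "snp_a": ((16, 4), (0, 6)),
    "snp_b": ((10, 10), (9, 11)),
    "snp_c": ((30, 2), (5, 25)),
}
p_values = [fisher_exact_two_sided(*dwarf, *normal) for dwarf, normal in snps.values()]
q_values = benjamini_hochberg(p_values)

# Strongly divergent: adjusted value < 0.05 and allele frequency difference > 0.5.
strongly_divergent = [
    name
    for (name, (dwarf, normal)), q in zip(snps.items(), q_values)
    if q < 0.05 and allele_divergence(dwarf, normal) > 0.5
]
print(strongly_divergent)
```

Each SNP is treated as a 2 × 2 table of read counts (pool by allele), so the test compares allele proportions between pools rather than genotypes of individual fish.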

Open reading frames (ORFs) for each assembled contig were produced using the program getorf in EMBOSS (European Molecular Biology Open Software Suite, Rice et al. 2000). The longest open-ended ORF (minimum length of 200 nucleotides) was kept as the most probable translated region of the gene. Lastly, maximum likelihood was used to estimate the ratio of non-synonymous SNPs per non-synonymous site to synonymous SNPs per synonymous site (pN/pS) using PAML 4.2 (runmode = 0, CodonFreq = 2, model = 2; Yang 2007).
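The idea of keeping the longest open-ended ORF can be illustrated with a small sketch. It is not the getorf implementation: it simply takes the longest stop-free stretch across the six reading frames, treating the contig ends as open, and the function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string (non-ACGT characters are left as-is)."""
    return seq.translate(_COMPLEMENT)[::-1]


def longest_open_orf(seq, min_length=200):
    """Longest stop-free, in-frame stretch over six frames, or None if too short."""
    best = ""
    seq = seq.upper()
    for strand in (seq, reverse_complement(seq)):
        for frame in range(3):
            codons = [strand[i:i + 3] for i in range(frame, len(strand) - 2, 3)]
            current = []
            # A sentinel stop codon flushes the final, open-ended stretch.
            for codon in codons + ["TAA"]:
                if codon in STOP_CODONS:
                    candidate = "".join(current)
                    if len(candidate) > len(best):
                        best = candidate
                    current = []
                else:
                    current.append(codon)
    return best if len(best) >= min_length else None


# Toy contig; a real contig would be hundreds of nucleotides long.
orf = longest_open_orf("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", min_length=9)
print(orf, len(orf))
```

Because contigs assembled from transcript fragments often lack a start or stop codon, an open-ended definition avoids discarding coding sequence that runs off either end.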

Comparison with previous gene expression, QTL and eQTL studies

We used data from previous lake whitefish gene expression (Derome et al. 2006; St-Cyr et al. 2008; Nolte et al. 2009; Renaut et al. 2009), QTL and genome scans (Rogers & Bernatchez 2007) as well as eQTL mapping (Derome et al. 2008; Whiteley et al. 2008) studies to match their gene annotation with genes identified in this study. We provide a legend at the bottom of Table 3 as a summary of the different studies and the rationale for why they were considered as genes of particular interest.

Table 3.   Functional annotation (gene ontology biological functions) of ranked contigs with the highest rate of single nucleotide polymorphisms per kilobase (SNPs/kb >20 or 2%)

| Gene product* | Functional groups | SNPs/kb | Match to previous studies† |
|---|---|---|---|
| 60S ribosomal protein L22 | Translation (GO:0006412) | 44.8 | |
| 40S ribosomal protein S5 | Translation (GO:0006412) | 39.4 | 10 |
| Nucleolar RNA helicase 2 | mRNA splicing (GO:0000398) | 39.6 | 10 |
| Sequestosome-1 | Regulation of I-kappaB kinase/NF-kappaB cascade (GO:0043122) | 38.1 | |
| Ubiquitin | Positive regulation of transcription (GO:0045941) | 37.8 | 1,5,6,10,11 |
| Tubulin alpha chain | Mitotic spindle organization and biogenesis (GO:0007052) | 37.7 | 10,11 |
| 60S ribosomal protein L7 | Translation (GO:0006412) | 35.3 | |
| Vacuolar ATP synthase catalytic subunit A | Proton transport (GO:0015992) | 34.9 | |
| Tubulin alpha chain | Mitotic spindle organization and biogenesis (GO:0007052) | 33.5 | 10,11 |
| Transposable element Tc1 transposase | Transposition, DNA-mediated (GO:0006313) | 31.5 | |
| Transposable element Tc1 transposase | Transposition, DNA-mediated (GO:0006313) | 29.8 | |
| Collagen alpha-2(I) chain precursor | Skin development (GO:0030199) | 27.4 | |
| Retinol dehydrogenase 3 | Metabolism (GO:0008152) | 26.1 | |
| Transcription factor PU.1 | Negative regulation of transcription from RNA polymerase II promoter (GO:0000122) | 26 | |
| 60S ribosomal protein L27a | Translation (GO:0006412) | 25.6 | |
| Similar to Calsequestrin | Calcium ion binding (GO:0005509) | 25.5 | |
| Transposable element Tcb1 transposase | Transposition, DNA-mediated (GO:0006313) | 25.4 | |
| 60S ribosomal protein L5 | Translation (GO:0006412) | 25.2 | 6,7,10 |
| Proteasome subunit beta type-7 precursor | Ubiquitin-dependent protein catabolic process (GO:0006511) | 24.6 | |
| Ubiquitin carboxyl-terminal hydrolase 28 | Ubiquitin-dependent protein catabolic process (GO:0006511) | 24.5 | |
| Ubiquitin-like protein FUBI | Translation (GO:0006412) | 23.9 | |
| 60S ribosomal protein L17 | Translation (GO:0006412) | 23.5 | |
| Thimet oligopeptidase | Proteolysis (GO:0006508) | 23.3 | |
| NADH dehydrogenase iron–sulphur protein 2 | Response to oxidative stress (GO:0006979) | 23.1 | |
| Probable RNA-directed DNA polymerase from transposon BS | Transposition, DNA-mediated (GO:0006313) | 22.8 | 6,8 |
| Zinc finger protein ZIC 2 | Cell differentiation (GO:0030154) | 22.2 | |
| Transposable element Tcb1 transposase | Transposition, DNA-mediated (GO:0006313) | 22.1 | |
| Acetyl-CoA acetyltransferase, cytosolic | Metabolic process (GO:0008152) | 22 | |
| Heterogeneous nuclear ribonucleoprotein G | mRNA processing (GO:0006397) | 21.9 | 6 |
| Fibrinogen beta chain precursor | Blood coagulation (GO:0007596) | 21.9 | |
| Protein SEC13 homolog | Protein transport (GO:0015031) | 21.6 | |
| Oncorhynchus kisutch 5S ribosomal RNA gene | Translation (GO:0006412) | 21.5 | |
| Cold-inducible RNA-binding protein | Response to cold (GO:0009409) | 21.3 | |
| Histidyl-tRNA synthetase, cytoplasmic | Translation (GO:0006412) | 20.9 | |
| Tubulin alpha chain | Mitotic spindle organization and biogenesis (GO:0007052) | 20.9 | 10,11 |
| Transposable element Tcb2 transposase | Transposition, DNA-mediated (GO:0006313) | 20.7 | |
| 14-3-3 protein beta/alpha | Ras protein signal transduction (GO:0007265) | 20.5 | |
| Stathmin | Mitotic spindle organization (GO:0007052) | 20.3 | 6 |
| THO complex subunit 4 | mRNA transport (GO:0051028) | 20.3 | |
| Tubulin alpha chain | Mitotic spindle organization and biogenesis (GO:0007052) | 2 | 10,11 |
| Schistosoma japonicum SJCHGC04625 protein | Unknown | 38.9 | |
| 14-3-3 protein zeta | Unknown | 38.5 | |
| Protein DJ-1 | Unknown | 29.3 | |

*Note that several contigs may correspond to the same gene annotation. These may be either splice variants of the same gene or different paralogues of that gene.
†Match to previous studies that either showed differential expression between dwarf, normal or hybrid whitefish, or mapped to eQTL:
1: Parallel non-directional change in gene expression between dwarf and normal natural whitefish (white muscle, adults; Derome et al. 2006).
2: Parallel directional change in gene expression between dwarf and normal natural whitefish (white muscle, adults; Derome et al. 2006).
3: Parallel non-directional change in gene expression between dwarf and normal natural and controlled environment populations (liver, adults; St-Cyr et al. 2008).
4: Parallel directional change in gene expression between dwarf and normal natural and controlled environment populations (liver, adults; St-Cyr et al. 2008).
5: Parallel directional change in gene expression between dwarf and normal natural populations (liver, adults; St-Cyr et al. 2008).
6: Change in gene expression between dwarf and normal controlled environment populations (whole fish, juveniles; Nolte et al. 2009).
7: Change in gene expression between dwarf and normal controlled environment populations (white muscle, adults; Derome et al. 2008).
8: Change in gene expression between dwarf and normal controlled environment populations (whole fish, embryos; Nolte et al. 2009).
9: Highly transgressive gene in hybrid whitefish (whole fish, juveniles; Renaut et al. 2009).
10: eQTL (white muscle, adults; Derome et al. 2008).
11: eQTL (brain tissue, adults; Whiteley et al. 2008).

SNP validation

A subset of polymorphic loci (31) was validated using matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF MS) assays (Sequenom) at the Genome Quebec Innovation Center in order to test whether these markers were likely to be true SNPs rather than PSVs. Twenty-nine fish from a lake containing a single panmictic population of normal whitefish, Lake Aylmer (45°54′N, 71°20′W), were genotyped. Deviation from Hardy–Weinberg equilibrium (chi-squared test corrected for multiple hypothesis testing, Q-value; Storey 2002) and the inbreeding coefficient Fis, which contrasts expected (He) and observed (Ho) heterozygosity [Fis = (He − Ho)/He], were calculated in R.
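For a single biallelic locus, the Hardy–Weinberg test and Fis can be computed directly from genotype counts. The sketch below is a minimal illustration with hypothetical genotype counts for 29 fish; it is not the authors' R code, and it omits the multiple-testing correction applied across loci.

```python
from math import erfc, sqrt


def hwe_summary(n_aa, n_ab, n_bb):
    """Chi-squared HWE test (1 d.f.) and Fis for one biallelic SNP."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)   # frequency of allele A
    q = 1 - p
    h_obs = n_ab / n                  # observed heterozygosity (Ho)
    h_exp = 2 * p * q                 # expected heterozygosity (He)
    observed = (n_aa, n_ab, n_bb)
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-squared distribution with 1 degree of freedom.
    p_value = erfc(sqrt(chi2 / 2))
    f_is = (h_exp - h_obs) / h_exp if h_exp > 0 else float("nan")
    return {"Ho": h_obs, "He": h_exp, "Fis": f_is, "chi2": chi2, "P": p_value}


# Hypothetical genotype counts (AA, AB, BB) for 29 genotyped fish.
print(hwe_summary(10, 9, 10))
```

A paralogous sequence variant mistaken for a SNP would typically show a strong excess of apparent heterozygotes, that is, a markedly negative Fis and a significant deviation from Hardy–Weinberg proportions.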


Sequencing, contig assembly and annotation

A total of 632 000 sequences with a median length of 212 nucleotides per sequence, totalling ∼130 megabases, were obtained from sequencing the separate D, N and BC pools of cDNA (0.75 GS-FLX sequencing run per pool; Fig. 1, Table 2; NCBI Sequence Read Archive SRA 009800). Using a similarity criterion of 0.97, we assembled, de novo, 428 068 of the 632 000 sequences (68%) into 2674 separate contigs (Table 2), meaning that 32% of all sequences were left as unassembled singletons. Shorter reads were harder to assemble and were usually discarded (Fig. 1). Mean contig length was 576 bp, with the smallest contig having a length of 101 bp and the longest 6116 bp. Coverage was also highly variable (1.3X–4140X) because the cDNA sequences were not normalized, as another goal of this research will be to document differential gene transcription between dwarf and normal whitefish from this same data set (Jeukens, J., Bernatchez, L., in prep.). All consensus sequences were matched to the list of 32 000 cDNAs from salmonids and good hits (blastn e-value <1e-50) were obtained for 59% (1577) of them.

Figure 1.

 Frequency distribution of the total number of reads (blue) and assembled ones (yellow).

Table 2.   Summary statistics of assembled contigs

| Statistic | Value |
|---|---|
| Number of sequences assembled | 428 068 (68% of total) |
| Number of contigs* | 2674 |
| Contigs: mean length | 576 |
| Contigs: number of SNPs | 6094 |
| Contigs: mean SNP/kb (min–max) | 3.4 (0–44.8) |
| Contigs: mean coverage (min–max) | 8.9X (1.3X–4140X) |
| Base substitutions: A–G | 1930 (31.7%) |
| Base substitutions: C–T | 1867 (30.6%) |
| Base substitutions: A–T | 599 (9.8%) |
| Base substitutions: A–C | 658 (10.8%) |
| Base substitutions: C–G | 344 (5.6%) |
| Base substitutions: T–G | 696 (11.4%) |
| Number of ORFs† | 1904 |
| ORFs: mean length | 482 |
| ORFs: number of SNPs | 3274 |
| ORFs: pN/pS | 0.37 (0.0028/0.0075) |

*Similarity criterion: 0.97. Minimum overlap: 0.5.
†Minimum length set for accepting open reading frame (ORF): 200 nucleotides.
pS, number of synonymous SNPs per synonymous site; pN, number of non-synonymous SNPs per non-synonymous site.

SNP discovery and functional characterization

Of the 6094 putative SNPs identified among all 2674 contigs, the proportions of transition substitutions were A/G, 31.7%, and C/T, 30.6%, compared with transversions A/C, 10.8%; G/T, 11.4%; A/T, 9.8%; and C/G, 5.6% (Table 2). This corresponds to a transition:transversion ratio of 1.65:1. The mean number of SNPs per kilobase was 3.4. A total of 70 contigs out of 2674 (2.6%) had a very high polymorphism rate (>20 SNPs/kb). These were involved in several functional classes, mostly mRNA translation and processing (11 hits), DNA transposition (6 hits) and mitotic spindle organization and biogenesis (5 hits), yet only the last two categories were significantly overrepresented compared with the observed frequencies of functional groups among all assembled contigs (Fisher’s exact test, P < 0.05; Table 3).
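The transition:transversion ratio follows directly from the base-substitution counts listed in Table 2. The short sketch below reproduces that arithmetic; the dictionary layout and variable names are introduced here for illustration.

```python
# Base-substitution counts as listed in Table 2.
substitutions = {
    "A-G": 1930,
    "C-T": 1867,
    "A-T": 599,
    "A-C": 658,
    "C-G": 344,
    "T-G": 696,
}
# Transitions exchange purine for purine (A-G) or pyrimidine for pyrimidine (C-T).
TRANSITIONS = {"A-G", "C-T"}

ts = sum(n for pair, n in substitutions.items() if pair in TRANSITIONS)
tv = sum(n for pair, n in substitutions.items() if pair not in TRANSITIONS)
ratio = ts / tv

print(f"transitions={ts}, transversions={tv}, ratio={ratio:.2f}:1")
```

With four possible transversions for every two transitions, a ratio well above 0.5 reflects the usual mutational bias towards transitions.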

A total of 1904 predicted ORFs with a mean length of 482 bp were identified. These contained 3274 polymorphic sites, of which 1540 were synonymous and 1734 non-synonymous. There were 2.8 SNPs per 1000 non-synonymous sites and 7.5 SNPs per 1000 synonymous sites, for a genome-wide non-synonymous to synonymous substitution rate ratio of 0.37 (pS = 0.0075, pN = 0.0028; Fig. 2, Table 2). Twenty-nine contigs had a pN/pS ratio >1, suggestive of positive selection, and these were involved in several biological functions, most notably mRNA translation and processing (7 hits). Yet, none of the biological functions was significantly overrepresented compared with all contigs assembled (Fisher’s exact test, P > 0.05; Table 4).
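As a quick check, the genome-wide ratio can be recovered from the two per-site rates reported above. The helper below is a hypothetical convenience function, not part of PAML, and the threshold comment simply restates the interpretation used in the text.

```python
def pn_ps_ratio(pn, ps):
    """Ratio of non-synonymous to synonymous polymorphism per site."""
    if ps == 0:
        raise ValueError("pS is zero; the ratio is undefined")
    return pn / ps


# Per-site rates reported in the text: 2.8 and 7.5 SNPs per 1000 sites.
genome_wide = pn_ps_ratio(2.8 / 1000, 7.5 / 1000)
print(f"pN/pS = {genome_wide:.2f}")

# A contig is flagged as a candidate for positive selection when pN/pS > 1.
print("Candidate for positive selection:", genome_wide > 1)
```

A value well below 1 is what would be expected if most non-synonymous variants are removed by purifying selection.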

Figure 2.

 Non-synonymous mutations per non-synonymous sites compared to synonymous mutations per synonymous sites. Dashed line is the null expectation if mutations were randomly distributed (pN = pS). Solid line is the slope of experimental data (overall average pN for all contigs/overall average pS for all contigs = 0.37).

Table 4.   Contigs with the highest ratio of non-synonymous SNPs per non-synonymous site (pN)/synonymous SNPs per synonymous site (pS)

| Gene product* | Functional groups | pN/pS | Match to previous studies† |
|---|---|---|---|
| 40S ribosomal protein S5 | Translation (GO:0006412) | 4.35 | 10 |
| Glutamine synthetase | Response to glucose stimulus (GO:0009749) | 2.98 | |
| 14 kDa apolipoprotein | G-protein coupled receptor protein signalling pathway (GO:0007186) | 2.28 | |
| Basement membrane-specific heparan sulphate proteoglycan | Cell adhesion (GO:0007155) | 2.24 | |
| Betaine–homocysteine S-methyltransferase 1 | Methionine biosynthetic process (GO:0009086) | 2.17 | 3,6 |
| 60S ribosomal protein L5 | Translation (GO:0006412) | 2.06 | 6,7,10 |
| Aldehyde dehydrogenase, mitochondrial precursor | Carbohydrate metabolic process (GO:0005975) | 2 | |
| Complement C3-1 | G-protein coupled receptor protein signalling pathway (GO:0007186) | 1.81 | |
| Keratin, type I cytoskeletal 13 | Epidermis development (GO:0008544) | 1.75 | 6 |
| Tubulin alpha chain | Mitotic spindle organization (GO:0007052) | 1.73 | 10,11 |
| 40S ribosomal protein S16 | Translation (GO:0006412) | 1.62 | 11 |
| 40S ribosomal protein S13 | Translation (GO:0006412) | 1.59 | |
| Ornithine decarboxylase antizyme 1 | Polyamine metabolic process (GO:0006595) | 1.52 | 6,11 |
| 40S ribosomal protein S8 | Translation (GO:0006412) | 1.44 | 11 |
| Stathmin | Mitotic spindle organization (GO:0007052) | 1.28 | 11 |
| Creatine kinase M-type | Phosphocreatine biosynthetic process (GO:0046314) | 1.27 | 1,6,9,11 |
| Transposable element Tc1 transposase | Transposition, DNA-mediated (GO:0006313) | 1.21 | |
| Heterogeneous nuclear ribonucleoprotein G | mRNA processing (GO:0006397) | 1.21 | 6 |
| Beta-2-glycoprotein 1 precursor | Heparin binding (GO:0008201) | 1.21 | |
| Unknown | Protein amino acid phosphorylation (GO:0006468) | 1.2 | |
| ATP-binding cassette sub-family F member 1 | Translation (GO:0006412) | 1.15 | |
| Nucleolar RNA helicase 2 | RNA processing (GO:0006396) | 1.01 | 10 |
| Similar to fatty acid desaturase domain family, member 6 | Unknown | 1.41 | |

*Note that several contigs may correspond to the same gene annotation.
†See legend in Table 3.

SNP frequencies between dwarf and normal whitefish

We analysed a subset of 1504 SNPs that met our criterion for inferring allele frequencies (see Materials and methods). Although most SNPs showed little divergence (Fig. 3), 190 SNPs had significantly divergent allele frequencies (Q-value <0.05) and 89 of these were strongly divergent between normal and dwarf whitefish (above 0.5 in Fig. 3; Table 5). These 89 SNPs represented 46 different contigs and several biological functions. Of particular interest among these, seven mitochondrial genes (e-value <1e-50: cytochrome c oxidase subunits 1, 2 and 3; NADH dehydrogenase subunits 1, 4 and 5; and cytochrome b) and seven nuclear genes (cytochrome b-c1 complex subunit 6, ATP synthase subunit d, malate dehydrogenase, glyceraldehyde-3-phosphate dehydrogenase, creatine kinase, succinyl-CoA ligase and angiopoietin-related protein 3 precursor) were all involved in energy metabolic pathways.

Figure 3.

 Frequency distribution of allelic frequency differences between normal and dwarf whitefish. SNPs with an allele divergence value above 0.5 (yellow) and a Q-value <0.05 were considered highly divergent single nucleotide polymorphism (SNP) markers. Allele divergence value = absolute value of [frequency(allele1Dwarf) − frequency(allele1Normal)]. Note that 1504 SNPs from 387 different contigs were used to draw this distribution.

Table 5.   Single nucleotide polymorphism markers with significant divergent allelic frequencies between sympatric normal and dwarf whitefish
| Description† | Functional category‡ | Allele 1 (D)§ | Allele 1 (N)§ | Total number of sequences (D, N)¶ | abs[ƒ(a1D) − ƒ(a1N)]†† | Match to previous studies‡‡ |
| --- | --- | --- | --- | --- | --- | --- |
| Angiopoietin-related protein 3 precursor | Fatty acid metabolic process (GO:0006631) | 1 | 0.2 | 9, 5 | 0.8* | |
| ATP synthase subunit d, mitochondrial | ATP synthesis (GO:0015986) | 0.71 | 0.06 | 7, 16 | 0.65* | |
| Creatine kinase M-type | (2) Phosphocreatine biosynthetic process (GO:0046314) | 0.77 | 0.21 | 43, 67 | 0.56** | 1,6,9,10 |
| Creatine kinase M-type | (13) Phosphocreatine biosynthetic process (GO:0046314) | 0.73 | 0.11 | 391, 383 | 0.62** | 1,6,9,10 |
| Cytochrome b | Electron transport chain (GO:0022900) | 1 | 0.06 | 25, 31 | 0.94*** | |
| Cytochrome b-c1 complex subunit 6, mitochondrial precursor | Electron transport chain (GO:0022900) | 0.6 | 0.09 | 43, 35 | 0.51*** | |
| Cytochrome c oxidase subunit 1 | Oxidation reduction (GO:0055114) | 0.98 | 0.18 | 184, 186 | 0.8*** | 6,7 |
| Cytochrome c oxidase subunit 2 | Oxidation reduction (GO:0055114) | 0.99 | 0.1 | 102, 97 | 0.89*** | 6,7 |
| Cytochrome c oxidase subunit 3 | (3) Oxidation reduction (GO:0055114) | 0.99 | 0.22 | 159, 157 | 0.77*** | 1,3,6,7,10 |
| Glyceraldehyde-3-phosphate dehydrogenase | Glycolysis (GO:0006094) | 0.8 | 0 | 20, 6 | 0.8** | 1,2,4,5,7,10 |
| Malate dehydrogenase, cytoplasmic | Tricarboxylic acid cycle (GO:0006099) | 0.6 | 0 | 15, 13 | 0.6** | 4 |
| Malate dehydrogenase, cytoplasmic | Tricarboxylic acid cycle (GO:0006099) | 0.67 | 0.14 | 24, 14 | 0.53* | 4 |
| NADH dehydrogenase subunit 1 | Electron transport chain (GO:0022900) | 1 | 0.17 | 37, 35 | 0.83*** | 6 |
| NADH dehydrogenase subunit 1 | Electron transport chain (GO:0022900) | 0.99 | 0.16 | 97, 153 | 0.83*** | 6 |
| NADH dehydrogenase subunit 4 | Electron transport chain (GO:0022900) | 1 | 0.18 | 24, 17 | 0.82*** | 6,7 |
| NADH dehydrogenase subunit 5 | Electron transport chain (GO:0022900) | 1 | 0 | 6, 7 | 1** | |
| Succinyl-CoA ligase, mitochondrial precursor | Tricarboxylic acid cycle (GO:0006099) | 0.68 | 0.15 | 28, 27 | 0.53** | |
| 40S ribosomal protein S9 | Translation (GO:0006412) | 0.9 | 0.1 | 29, 20 | 0.8*** | 6,10 |
| 60S acidic ribosomal protein P2 | Translation (GO:0006412) | 0.95 | 0.23 | 19, 13 | 0.72*** | 11 |
| 60S ribosomal protein L27a | (2) Translation (GO:0006412) | 1 | 0.23 | 26, 26 | 0.77** | |
| 60S ribosomal protein L39 | Translation (GO:0006412) | 0.75 | 0.07 | 8, 28 | 0.68** | 6 |
| Actin, cytoplasmic 1 | (2) Cytoskeleton (GO:0005856) | 0.76 | 0.09 | 85, 125 | 0.67** | 6 |
| Alpha-1-antitrypsin precursor | Blood coagulation (GO:0007596) | 0.97 | 0.45 | 29, 33 | 0.52** | |
| C-type lectin domain family 4 member E | Immune response (GO:0006955) | 0.78 | 0.25 | 9, 63 | 0.53* | |
| Coagulation factor X precursor | (3) Unknown | 0.76 | 0.19 | 81, 90 | 0.57* | |
| Coagulation factor X precursor | Unknown | 0.83 | 0.29 | 23, 24 | 0.54** | |
| Complement C5 precursor | Complement activation, alternative pathway (GO:0006957) | 1 | 0.17 | 12, 6 | 0.83** | |
| Complement factor H precursor | (5) Complement activation, alternative pathway (GO:0006957) | 0.82 | 0.21 | 553, 516 | 0.61** | |
| Ferritin, heavy subunit | (2) Regulation of transcription (GO:0045892) | 0.82 | 0.28 | 61, 53 | 0.54** | 5,11 |
| Fibrinogen beta chain precursor | (14) Platelet activation (GO:0030168) | 0.82 | 0.03 | 170, 163 | 0.79** | |
| Fibrinogen beta chain precursor | Platelet activation (GO:0030168) | 0.7 | 0.05 | 23, 39 | 0.65*** | |
| Fibrinogen beta chain precursor | (5) Platelet activation (GO:0030168) | 0.67 | 0.1 | 139, 192 | 0.57** | |
| Fibronectin precursor | Cell adhesion (GO:0007155) | 1 | 0 | 8, 4 | 1* | |
| Heat shock protein HSP 90-beta | Regulation of nitric oxide biosynthetic process (GO:0045429) | 0.8 | 0.19 | 10, 31 | 0.61* | |
| Haemopexin precursor | Cellular iron homeostasis (GO:0006879) | 0.98 | 0.46 | 206, 371 | 0.52*** | |
| Metallothionein mRNA | Unknown | 0.64 | 0 | 11, 24 | 0.64*** | 6 |
| Nucleolar RNA helicase 2 | Nuclear mRNA splicing (GO:0000398) | 0.92 | 0.4 | 12, 48 | 0.52* | 10 |
| Selenoprotein Pa precursor | Response to oxidative stress (GO:0006979) | 0.69 | 0.02 | 29, 50 | 0.67*** | |
| Subunit of Ca2+-dependent complex | (2) Unknown | 0.87 | 0.12 | 52, 19 | 0.75** | |
| Unknown | (2) Unknown | 0.99 | 0.1 | 149, 80 | 0.89*** | |
| Unknown | (2) Unknown | 0.61 | 0 | 38, 59 | 0.61*** | |
| Unknown | (2) Unknown | 0.77 | 0.19 | 43, 33 | 0.58* | |

†Note that several contigs may correspond to the same gene annotation.

‡Numbers in parentheses indicate that several SNPs within contig were divergent and the subsequent allelic frequencies and Q-value are an average for these SNPs.

§Frequency of the most common dwarf allele and frequency of its corresponding normal allele.

¶Number of sequences from dwarf and normal fish used to calculate allele frequencies.

††abs[ƒ(a1D) − ƒ(a1N)] = absolute value of [frequency(allele1Dwarf) − frequency(allele1Normal)]. *Q < 0.05, **Q < 0.01, ***Q < 0.001: Probability value of Fisher’s exact test corrected for multiple hypothesis testing (Q-value) calculated for each SNP as the total number of sites identified to each allele in normal and dwarf.

‡‡See legend in Table 3.

Comparison with previous studies

Twelve contigs identified as highly polymorphic (i.e. above 20 SNPs/kb, Table 3) matched genes previously identified as candidates in different gene expression studies, and the expression of most of those genes had been previously linked to a specific genomic region (eQTL). Two of these (60S ribosomal protein L5, ubiquitin) were also identified as differentially expressed between normal and dwarf in several independent studies (Table 3). Thirteen contigs with a high pN/pS ratio matched genes previously identified in different gene expression studies, and again the expression of those genes has been linked to an eQTL. Three of those (60S ribosomal protein L5, ornithine decarboxylase antizyme 1 and creatine kinase) were also identified as differentially expressed between normal and dwarf in several independent studies (Table 4).

Eighteen contigs containing at least one SNP with highly divergent allelic frequencies between normal and dwarf were annotated to genes previously identified as potential candidates based on expression studies (Table 5). Among these, genes related to energy metabolism (cytochrome c oxidase subunits 1, 2 and 3; NADH dehydrogenase subunits 1, 4 and 5; cytochrome b; cytochrome b-c1 complex subunit 6; ATP synthase subunit d; malate dehydrogenase; glyceraldehyde-3-phosphate dehydrogenase; creatine kinase; succinyl-CoA ligase and angiopoietin-related protein 3 precursor) are of particular interest as candidates underlying adaptive divergence between dwarf and normal whitefish since they consistently showed differential expression in independent studies.

High rate of transposition in hybrids

Given that we identified many highly polymorphic contigs annotated to DNA transposition (Table 3), these were further investigated. Forty-four contigs matching six different DNA transposon and retrotransposon elements (blastn e-value <1e-50, Table 6) were detected. These contigs were also, on average, four times more polymorphic than the rest of the assembly (10.8 SNPs/kb compared to 3.4 overall, t-test, P < 0.0001). As sequencing was performed on non-normalized cDNA, the number of reads per population may be used as a proxy for gene expression (Torres et al. 2007; Ledford 2008). A total of 4600 sequences assembled into these 44 contigs and, invariably, there was a strong bias: 70% of the sequences in these contigs came from backcross hybrids, whereas only 38% of the sequences in the whole data set were from backcross hybrids (chi-squared test, P < 1e-16).
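The comparison between the observed and expected share of backcross sequences can be sketched as a two-category chi-squared goodness-of-fit test. The Python example below is a minimal illustration, not the original analysis script: the function name is hypothetical, and the observed counts are approximations derived from the rounded figures in the text (70% of 4600 sequences).

```python
from math import erfc, sqrt


def chi_square_two_categories(observed_a, observed_b, expected_prop_a):
    """Chi-squared goodness-of-fit test for two categories (1 d.f.)."""
    total = observed_a + observed_b
    expected_a = total * expected_prop_a
    expected_b = total - expected_a
    stat = ((observed_a - expected_a) ** 2 / expected_a
            + (observed_b - expected_b) ** 2 / expected_b)
    # For 1 d.f. the upper-tail probability equals erfc(sqrt(stat / 2)).
    p_value = erfc(sqrt(stat / 2))
    return stat, p_value


# Approximate counts: about 70% of 4600 transposon-annotated sequences
# came from backcross hybrids, against an expected proportion of 38%.
backcross = 3220
normal_or_dwarf = 4600 - backcross
stat, p = chi_square_two_categories(backcross, normal_or_dwarf, 0.38)
print(f"chi2 = {stat:.1f}, P = {p:.3g}")
```

With a deviation this large the P-value falls below the smallest value a double-precision float can represent, which is consistent with reporting it only as an upper bound.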

Table 6.   Expression (total number of sequences) annotated to transposon elements in normal and dwarf whitefish as well as backcross hybrids
| Gene product | No. contigs† | Total number of sequences | | |
| --- | --- | --- | --- | --- |
| Transposable element Tc1 transposase | 16 | 133 | 584** | 157 |
| Transposable element Tcb1 transposase | 12 | 231 | 1142** | 298 |
| Transposable element Tcb2 transposase | 6 | 96 | 740** | 151 |
| Non-LTR retrotransposon | 4 | 52 | 320** | 104 |
| PREDICTED: similar to transposase (Strongylocentrotus purpuratus) | 1 | 1 | 9* | 2 |
| Probable RNA-directed DNA polymerase from transposon BS | 6 | 68 | 394** | 90 |

†Several assembled contigs were annotated (e-value <1e-50) to the same gene product.

*P = 0.08, **P < 1e-16. Chi-squared tests based on the expected proportion of sequences (in whole assembly, 62% of all sequences are either normal or dwarf, 38% are backcross).

SNP validation

Twenty-nine individual fish were genotyped for a subset of 31 polymorphic SNPs within a single lake (Lake Aylmer). Six markers deviated significantly from expected Hardy–Weinberg frequencies due to heterozygous excess (Q < 0.05, Fig. 4). SNPs genotyped came from contigs with polymorphism ranging from 1.4 to 38 SNPs/kb and there was no apparent correlation between amount of polymorphism and Fis estimates (Pearson’s correlation coefficient = −0.08, P = 0.69).
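To make the validation test concrete, the sketch below computes Fis and a chi-squared test of Hardy–Weinberg proportions (1 d.f.) for one biallelic locus from genotype counts. It assumes Fis is estimated as 1 − Hobs/Hexp without small-sample correction, which may differ from the estimator actually used; the function name and the example counts are hypothetical.

```python
from math import erfc, sqrt


def hwe_test(n_aa, n_ab, n_bb):
    """Return (Fis, chi2, P) for one biallelic, polymorphic locus."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)      # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    fis = 1 - n_ab / expected[1]         # negative when heterozygotes are in excess
    p_value = erfc(sqrt(chi2 / 2))       # chi-squared upper tail, 1 d.f.
    return fis, chi2, p_value


# Hypothetical locus where all 29 fish appear heterozygous, the pattern
# expected if the two "alleles" are fixed differences between duplicated copies.
fis, chi2, p_value = hwe_test(0, 29, 0)
print(f"Fis = {fis:.2f}, chi2 = {chi2:.1f}, P = {p_value:.2e}")
```

A strongly negative Fis with a significant test, as in this example, is the signature of heterozygote excess discussed for the six deviating markers.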

Figure 4.

 Single nucleotide polymorphism (SNP) validation for 29 individuals originating from a single lake (Lake Aylmer) and genotyped for 31 polymorphic markers. SNPs were ranked according to Fis values (y-axis, left side). Deviations from expected Hardy–Weinberg frequencies (chi-square test, 1 d.f., Q-value <0.05) are shown on the y-axis (right side).


Sequencing a total of two and one-quarter runs on the 454 GS-FLX system yielded 632 000 reads with a mean length of 193 nucleotides once primers and sample-specific tags were removed. The fact that we obtained ∼30% fewer sequencing reads than what would be theoretically expected (400 000 sequences/run) is, at least in part, due to the nature of the DNA itself. First, cDNA sizes, which are quite variable, render the shearing process prior to sequencing more difficult. Second, mature cDNA usually contains large polyA stretches that are harder to sequence and cause many reads to be rejected due to poor quality or very short lengths (Gary Levesque, McGill University, pers. comm.). Nevertheless, we obtained over 130 mb of sequencing reads, which, compared to Sanger sequencing technology, required several orders of magnitude less time and money. As expected, the amount of sequences assembled is strongly dependent not only on the read length (i.e. shorter reads are harder to assemble, Fig. 1) but also on the stringency of the assembly performed. Here, by using a similarity criterion of 0.97 (see Materials and methods for rationale behind using 0.97), 68% of all reads were assembled into 2674 different contigs.

SNP discovery, validation and functional characterization

We identified over 6000 putative SNPs. If all substitutions were equally likely, a 1:2 transition (ts) to transversion (tv) ratio would be expected, as there are twice as many possible transversions as transitions. In reality, a biased ts:tv ratio is thought to be a universal characteristic of the nucleotide composition landscape (Lynch 2007). At the same time, some authors (e.g. Keller et al. 2007) have recently suggested that a biased ts:tv ratio may be a sampling artefact, as conclusions are based upon experimental data from a few model species (Lynch 2007). Here, in lake whitefish, a ratio strongly biased towards transitions (1.65:1) was identified, supporting the view that this trend is ubiquitous at least among vertebrates.
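The 1:2 expectation follows from counting substitution types: two of the six possible unordered base pairs are transitions (A/G and C/T) and four are transversions. A short Python sketch, with hypothetical function names, shows how SNPs can be classified and the ratio computed.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(allele1, allele2):
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    pair = {allele1.upper(), allele2.upper()}
    if len(pair) != 2 or not pair <= PURINES | PYRIMIDINES:
        raise ValueError(f"not a valid biallelic SNP: {allele1}/{allele2}")
    return pair <= PURINES or pair <= PYRIMIDINES


def ts_tv_ratio(snps):
    """Transition to transversion ratio for a list of (allele1, allele2)."""
    ts = sum(is_transition(a, b) for a, b in snps)
    tv = len(snps) - ts
    return ts / tv if tv else float("inf")


# All six possible substitution types, each seen once: 2 ts and 4 tv.
all_types = [("A", "G"), ("C", "T"), ("A", "C"), ("A", "T"), ("G", "C"), ("G", "T")]
print(ts_tv_ratio(all_types))
```

Applied to an observed SNP list, a value well above 0.5 indicates the transition bias described in the text.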

Determining an exact number of sequence polymorphisms largely depends on the stringency of the assembly and the criteria used to define a true SNP (i.e. coverage and minimum frequency of SNP). Using fairly stringent criteria (minimum similarity: 0.97; minimum coverage of SNP: 6X; and minimum frequency of the least frequent allele: 20%) reduced the amount of false positives. Nevertheless, as salmonids underwent an ancient whole genome duplication event and given that over 50% of their genome is still considered duplicated (Allendorf et al. 1975), we cannot refute the possibility that a significant proportion of putative SNPs may actually be PSVs. For example, in Atlantic salmon, 19% of polymorphic SNPs predicted to be of high quality showed heterozygous excess most likely due to genome duplication (Hayes et al. 2007). This is a problem inherent to SNP markers even in well-characterized species, including humans. Fredman et al. (2004) showed, using fully homozygous cell masses, that only 50% of sequence variants (i.e. putative SNPs) in duplicated regions of the human genome are true SNPs. In fact, our own SNP validation assay revealed that 19% (6/31) of the genotyped loci significantly deviated from expected Hardy–Weinberg frequencies because of heterozygous excess (Fig. 4), a tell-tale sign that these SNPs may be variants between duplicated regions of the genome (Fredman et al. 2004). Although this may be true, several alternative explanations may also be responsible for this pattern: small sample size, heterozygote advantage, frequency-dependent selection or presence of null alleles. Finally, as we address in the last section of the Discussion and Conclusion, although a single SNP only provides circumstantial evidence of its importance in the adaptive divergence of lake whitefish, we strongly emphasize (as suggested by others; cf. Vasemagi & Primmer 2005; Stinchcombe & Hoekstra 2008) that combining experimental evidence targeting different biological levels (e.g. variation at the DNA, gene expression and phenotypic levels) represents the best strategy towards deciphering the genetic basis of evolutionary change. Nonetheless, we recognize that a large data set of SNP markers identified using high-throughput methods probably needs to be validated by alternative methods before being used in further studies as true, experimentally confirmed, genetic markers. Until fully homozygous lines or haploid individuals can be produced, it will be difficult to truly disentangle the effect of gene duplication and genomic divergence.
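The coverage and allele-frequency criteria can be expressed as a simple filter on the bases observed at one aligned position. The sketch below is one possible reading of those criteria, not the pipeline actually used: it assumes that "coverage" means the total number of reads at the site and that the "least frequent allele" is the second most common base; the function name is hypothetical.

```python
from collections import Counter

VALID_BASES = set("ACGT")


def is_putative_snp(bases, min_coverage=6, min_minor_freq=0.20):
    """Apply the depth and minor-allele-frequency thresholds to one site.

    `bases` is an iterable of the base calls from every read covering the
    position; anything other than A, C, G or T is ignored.
    """
    counts = Counter(b.upper() for b in bases if b.upper() in VALID_BASES)
    depth = sum(counts.values())
    if depth < min_coverage or len(counts) < 2:
        return False
    # Second most common base is treated as the minor allele (assumption).
    minor_count = counts.most_common(2)[1][1]
    return minor_count / depth >= min_minor_freq


print(is_putative_snp("AAAATT"))   # depth 6, minor allele in 2 of 6 reads
print(is_putative_snp("AAAAAT"))   # depth 6, minor allele in 1 of 6 reads
```

Such a filter removes most isolated sequencing errors but cannot distinguish a true SNP from a PSV, because fixed differences between duplicated copies also produce two well-supported bases at a site.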

Several functional categories were identified among the list of highly polymorphic contigs. Namely, ribosomal proteins (mRNA translation), tubulin (mitotic spindle organization) and transposable elements (DNA transposition) are all part of multigenic families found in numerous copies throughout the genome. Such genes are probably particularly prone to biases due to PSVs and therefore putative SNPs for these should be used with vigilance. At the same time, based on our genotyping results, we did not find any significant correlation between Fis (as a potential indication of PSVs) and polymorphism rate (P = 0.69).

Nucleotide substitution effect on predicted ORFs

Identifying 1904 predicted ORFs allowed us to estimate a transcriptome-wide non-synonymous to synonymous substitution rate ratio (pN/pS) of 0.37. As such, on average, the pN/pS ratio per gene is much lower than the ratio of one expected if mutations were randomly distributed (Fig. 2). This is generally interpreted as indicative of the effect of purifying selection against deleterious amino acid altering changes. Alternatively, ORFs with an elevated pN/pS ratio (e.g. above 1) may indicate genes evolving under the effect of positive selection. Here, 29 contigs had a pN/pS ratio above 1 and were involved in several biological functions. These may constitute candidates under the effect of natural selection responsible for the adaptive divergence of lake whitefish. However, three caveats of such an analytical approach must be mentioned. First, by definition, ORFs represent ‘potential’ regions of the genome translated into a protein and therefore do not necessarily code for the actual polypeptide chain. Second, as the number of polymorphic sites per base pair is relatively low, only 13% of all contigs detected had an ORF and a pN and pS value above zero. Lastly, with few mutations per gene, ratios can vary drastically if one or a few polymorphic sites are misidentified. As such, although this type of information may be useful to look at general transcriptome-wide trends or in combination with other experimental evidence, inferring the effect of selection on single candidate genes solely from the distribution of synonymous and non-synonymous mutations must be done with caution.
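The sensitivity of the ratio to a single misidentified site can be illustrated numerically. The sketch below assumes the usual per-site definition, in which the number of non-synonymous (or synonymous) polymorphisms is divided by the number of non-synonymous (or synonymous) sites in the ORF. The function name and all numbers in the example are hypothetical.

```python
def pn_ps(nonsyn_snps, syn_snps, nonsyn_sites, syn_sites):
    """Per-site non-synonymous to synonymous polymorphism ratio for one ORF."""
    if syn_snps == 0:
        raise ValueError("pS is zero; the ratio is undefined for this ORF")
    pn = nonsyn_snps / nonsyn_sites
    ps = syn_snps / syn_sites
    return pn / ps


# Hypothetical 900 bp ORF with 690 non-synonymous and 210 synonymous sites.
baseline = pn_ps(nonsyn_snps=2, syn_snps=3, nonsyn_sites=690, syn_sites=210)
# Same ORF if one extra non-synonymous site is called by mistake.
one_miscall = pn_ps(nonsyn_snps=3, syn_snps=3, nonsyn_sites=690, syn_sites=210)
print(f"baseline = {baseline:.2f}, with one miscalled SNP = {one_miscall:.2f}")
```

In this example a single additional call raises the ratio by half, which is why per-gene values based on a handful of SNPs are best read alongside other lines of evidence.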

Differences between normal and dwarf whitefish

As expected based on the young age (<15 000 years) of the whitefish species pairs, the overall level of divergence between them was relatively weak. In fact, out of 1504 SNPs, only 89, coming from a maximum of 45 different genes (Table 5), had significant, highly divergent allelic frequencies between normal and dwarf populations. These SNPs represent 6% of all SNPs for which we had enough sequence information to perform this analysis and are good candidates for genomic islands of early divergence. In fact, 6% is comparable to what genome scan studies of young species pairs have found when looking for genetic loci with divergent allele frequencies (5–10%, reviewed in Nosil et al. 2009). For example, Turner et al. (2005) identified only three genomic regions, encompassing a maximum of 67 genes, showing evidence of reduced gene flow in African malaria mosquitoes (Anopheles gambiae), a system characterized by strong assortative mating. In lake whitefish, using anonymous AFLP markers, previous genome scan studies have suggested that as little as 1.2% of the genome (which may still represent several hundred genes) might be under the effect of directional selection during the adaptive divergence of lake whitefish (Campbell & Bernatchez 2004; Rogers & Bernatchez 2005).

Furthermore, the proportion of divergent SNPs identified in this study may represent an overestimate due to several factors. First, SNP frequencies were estimated from sequences from a maximum of eight dwarf and eight normal individuals that were available. Given the relatively small number of individuals and allele copies, depicted differences should thus be interpreted with caution. Nevertheless, this analytical approach represents a necessary preliminary step towards identifying potential candidate SNPs. Second, as transcribed cDNA was sequenced, it is conceivable that normal and dwarf heterozygous individuals may overexpress a different allele and thus show divergent cDNA allelic patterns despite sharing a common genotype. At this point, it is difficult to clearly distinguish the two alternatives. Yet, both mechanisms point out relevant genetic differences between populations (i.e. differential allele specific expression or true genotypic differences) and we are currently conducting experiments to investigate how these transcriptome allelic frequencies are correlated to genotypic frequencies (Renaut S., Bernatchez L. unpublished).

Increased rate of transposition in hybrids

Aside from sequence or gene expression divergence, a broad variety of mechanisms related to the maintenance of chromatin integrity may be involved in causing hybrid dysfunctions and possibly reproductive isolation (Fontdevila 2005; Michalak 2009). In fact, during her pioneering work on transposable elements, Barbara McClintock was the first to suggest that hybridization in plants might activate dormant transposons and result in genome restructuring (McClintock 1984). Since then, several studies have shown that transposition rates in plant hybrids can increase by several orders of magnitude (Shan et al. 2005; Ungerer et al. 2006). In animals, however, contrasting results and limited direct evidence have cast doubt on the role of transposable elements in speciation processes (Coyne 1989; Labrador et al. 1999; Coyne & Orr 2004). Here, extensive sequencing data provide compelling evidence of an important increase in transposon activity in hybrids, which may be a consequence of the partial incompatibility of normal and dwarf genomes reported in previous studies (Rogers & Bernatchez 2006, 2007). Contigs annotated to transposable elements were also, on average, four times more polymorphic than the rest of the assembly. Transposons are, by nature, highly duplicated, and therefore the high polymorphism rate probably reflects the fact that several duplicated copies are activated. Lastly, as cDNA from liver tissue in normal and dwarf and from white muscle and brain in the backcross was sequenced, the elevated activity of transposons could also be a tissue-specific effect. Nonetheless, it would be peculiar, and unheard of in the literature, for transposable elements to be more active in muscle and brain than in liver tissues.

Comparison with previous studies

The integration of results from this study with previous analyses of gene expression, QTL and genome scans in whitefish significantly adds to our understanding of the genetic basis of the adaptive divergence of sympatric dwarf and normal whitefish in several ways. First, previous gene expression studies (Derome et al. 2006; St-Cyr et al. 2008; Jeukens et al. 2009; Nolte et al. 2009; Renaut et al. 2009) combined with physiological data (Trudel et al. 2001) have provided ample evidence that changes in the expression of genes involved in energetic metabolism pathways are largely responsible for the adaptation to distinct whitefish benthic (normal) and limnetic (dwarf) niches. Nevertheless, these studies lacked the empirical evidence linking expression differences to actual genotypic divergence for the same genes. Whiteley et al. (2008) addressed this question by combining eQTL information with Fst outlier loci obtained from genome scan studies (Campbell & Bernatchez 2004; Rogers & Bernatchez 2007) to identify genes under the influence of divergent selection. However, they provided only indirect evidence as eQTLs may correspond to the location of the gene itself (cis), or the location of another gene regulating its expression (trans). Our study brings a more direct link between genetic divergence (reduced gene flow) and gene expression divergence. The most salient finding is that 14 genes involved in energy metabolism (both mitochondrial and nuclear) showed pronounced allele frequency differences in this study and were also identified in several previous gene expression studies as differentially expressed in parallel between normal and dwarf whitefish. Namely, very similar allele frequencies observed for mitochondrial SNPs provide confidence that this signal is not a sampling or statistical artefact given that all mitochondrial genes are in full linkage disequilibrium. 
In addition, previous studies investigating mitochondrial divergence between lake whitefish populations showed that normal and dwarf from the same lake (Cliff Lake) are predominantly associated with distinct mitochondrial lineages from independent glacial refuge origins (Bernatchez & Dodson 1990; Lu et al. 2001). Consequently, although genetic variation and differentiation may have arisen in allopatry during the Pleistocene glaciation, its sorting and maintenance in sympatry during the last 15 000 years appears to be promoted by natural selection. Corroborating this claim is the fact that, in the absence of selection against hybrids, gene flow has been shown to homogenize recently diverged limnetic and benthic three-spined stickleback species pairs in <10 years (Taylor et al. 2006). Therefore, the whole mitochondrial genome, due to its non-recombining nature, is probably under strong selective constraints and we hypothesize that, in conjunction with the maintenance of pronounced allelic divergence at nuclear genes also involved in energy metabolism, it may confer different metabolic efficiencies involved in the adaptive divergence of dwarf and normal whitefish. Consequently, breakdown or mis-regulation of mitochondrial bioenergetics functions in hybrids could play an important role in the speciation process of dwarf and normal whitefish, as revealed recently in other systems (Ellison & Burton 2008; Gershoni et al. 2009).

That metabolic genes associated with the mitochondrial machinery are the underlying targets of selection leading to the adaptive divergence of lake whitefish is further supported by one of the main findings from Whiteley et al. (2008). Namely, their combined eQTL–Fst outlier approach indicated that an eQTL for cytochrome c oxidase (subunit VI) was linked to an Fst outlier locus in three independent lakes inhabited by sympatric normal and dwarf whitefish populations. Through ongoing candidate gene mapping efforts, SNP markers should also help to elucidate the genomic architecture of expression regulation (cis vs. trans regulation) for such candidate genes and strengthen the association between genotype (SNPs from candidate genes) and phenotype (QTLs).

In addition, several contigs with functions unrelated to energy metabolism were matched to previous findings. For example, the 60S ribosomal L5 gene involved in mRNA translation, which was identified as highly polymorphic and potentially evolving under the effect of positive selection (pN/pS ratio = 2.06), had been previously linked to parallel gene expression differences between normal and dwarf whitefish in both wild adults (Derome et al. 2008) and juveniles reared in the laboratory (Nolte et al. 2009). Also, ubiquitin, a conserved regulatory protein, was highly polymorphic, previously showed parallel gene expression differences between normal and dwarf in wild adult whitefish (Derome et al. 2006; St-Cyr et al. 2008) and laboratory-reared juveniles (Nolte et al. 2009), and was associated with an eQTL in white muscle (Derome et al. 2008) and brain tissue (Whiteley et al. 2008). These genes represent examples of additional candidates for divergent selection, which could be either physically linked to other candidate genes or be selected due to strong epistatic interactions with metabolic genes.


Next-generation sequencing technologies are already revolutionizing the way science is done in ecology and evolution. Here, sequencing the transcriptome of two incipient species of lake whitefish and backcross hybrids allowed us to gather a large data set of putative SNP markers, analyse their distribution among genes, highlight an apparent increased activity of transposons in hybrids and identify potential targets of divergent selection. Mitochondrial and nuclear genes involved in energy metabolism emerge as prime candidates underlying the adaptive divergence of sympatric species of lake whitefish. Thorough investigations using genome scans in natural populations as well as candidate gene mapping will make it possible to test this hypothesis. The rationale of our research programme on the adaptive divergence of lake whitefish is that integrating results targeting different functional and biological levels (e.g. variation at the DNA, gene expression and phenotypic levels) represents the best strategy towards deciphering the genetic basis of evolutionary change and diversification driven by natural selection.


We would like to thank J. Laroche, E. Normandeau and C. Sauvage for help with the bioinformatics, as well as J. Jeukens and N. Derome for insightful suggestions on earlier versions of the manuscript. This project was funded by a Natural Science and Engineering Research Council of Canada (NSERC) and Canadian Research Chair in Genomics and Conservation of Aquatic Resources to LB, a NSERC postgraduate scholarship to SR and a postdoctoral research stipend from the German Research Foundation to AN. This study is a contribution to the research programme of Québec Océan.

Conflicts of interest

The authors have no conflict of interest to declare and note that the sponsors of the issue had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.