Ecological speciation and genome scans
The population genomics approach has been widely used in recent years as a strategy in the study of the genetic basis of local adaptation and ecological speciation (Nosil et al., 2009; Storz, 2005). ‘Next-generation’ sequencing technologies can potentially increase the power of genome scans, ultimately through re-sequencing of whole genomes. Turner et al. (2010) showed how genes involved in local adaptation in Arabidopsis lyrata can be detected in this way when a reference genome is available. However, one of the advantages of genome scans is that they can be applied to organisms with well-characterised ecology that lack genomic resources. In such cases, EST-based genome scans are more informative, because they focus on coding regions (Bonin, 2008). Recent studies have shown how 454 pyrosequencing of the transcriptomes of divergent populations can be used to detect candidate genes for adaptation (Coregonus spp., Renaut et al., 2010; Rhagoletis pomonella, Schwarz et al., 2009). Neither study used simulations to determine the expected distribution of SNP frequency differences, which we consider important because of both population genetic effects and the sampling process during sequencing.
Our study was based on approximately 60 Mb of sequence (compared to the approximately 500 Mb that can be obtained with the recent GS-FLX Titanium Series), but we were able to perform a de novo assembly and SNP detection, and this procedure provided a large number of potentially informative markers. The two assemblers performed quite differently with the settings we used, the more stringent NGen assembly providing around three times more contigs but with fewer reads per contig than for Newbler. For the genome scan analysis, we were primarily interested in contigs with high coverage and the Newbler assembly was preferable in this respect. Although the lower stringency increases the risk of including reads from paralogous loci, there is no obvious reason why this should create false-positive outliers. Despite lower coverage per SNP on average, the NGen assembly did allow the identification of some additional outliers. We expect this sensitivity to differences between assembly packages and settings to be lower for data sets with more and longer reads.
Indexing of cDNA from individuals before 454 sequencing provides an alternative to pooling but it significantly increases costs and restricts the number of individuals that can be analysed. Where the primary question of interest is to detect frequency differences between populations, pooling of large samples of individuals may continue to be the preferred strategy (see Futschik & Schlötterer, in press). Our simulations (Fig. 2) show that variation in RNA amounts among individuals is not a serious problem but indicate that the number of reads per SNP should, ideally, be higher than in our current data set. This will be easily achieved with the current sequencing technology.
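The sampling argument above can be illustrated with a minimal sketch. It is not the simulation behind Fig. 2: it assumes both pools share the same true allele frequency, ignores population genetic effects entirely, and models unequal RNA contributions with a gamma distribution. All function names and parameter values here (`simulate_pool_freq`, `null_differences`, `rna_cv` and so on) are hypothetical choices made for illustration.

```python
import random


def simulate_pool_freq(p, n_individuals, n_reads, rna_cv, rng):
    """Allele frequency estimated from the reads of one pooled cDNA sample."""
    # Diploid genotypes: copies of the focal allele per individual (0, 1 or 2).
    genotypes = [(rng.random() < p) + (rng.random() < p)
                 for _ in range(n_individuals)]
    # Assumed model of unequal RNA amounts: gamma weights with mean 1 and
    # coefficient of variation rna_cv (rna_cv = 0 gives equal contributions).
    if rna_cv > 0:
        shape = 1.0 / rna_cv ** 2
        weights = [rng.gammavariate(shape, 1.0 / shape)
                   for _ in range(n_individuals)]
    else:
        weights = [1.0] * n_individuals
    # Fraction of focal-allele transcripts actually present in the pool.
    pool_freq = sum(w * g / 2 for w, g in zip(weights, genotypes)) / sum(weights)
    # Reads covering the SNP are a binomial sample from the pool.
    hits = sum(rng.random() < pool_freq for _ in range(n_reads))
    return hits / n_reads


def null_differences(p, n_individuals, n_reads, rna_cv, n_sims, seed=1):
    """Absolute frequency differences between two pools with no true difference."""
    rng = random.Random(seed)
    return [abs(simulate_pool_freq(p, n_individuals, n_reads, rna_cv, rng)
                - simulate_pool_freq(p, n_individuals, n_reads, rna_cv, rng))
            for _ in range(n_sims)]


def quantile(values, q):
    """Simple empirical quantile (no interpolation)."""
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, int(q * len(ordered)))]


if __name__ == "__main__":
    for reads in (10, 30, 100):
        diffs = null_differences(p=0.5, n_individuals=30, n_reads=reads,
                                 rna_cv=0.5, n_sims=2000)
        print(reads, "reads per SNP: 95% of null differences below",
              round(quantile(diffs, 0.95), 3))
```

Under these assumptions, the upper tail of the null distribution narrows as the number of reads per SNP increases, which is why higher coverage makes a given observed frequency difference more informative.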
Normalization is another potential source of variation in SNP frequencies. However, as argued earlier, it is unlikely to generate marked frequency differences between pools. It has the advantage of evening out the representation of different transcripts and so increasing the number with sufficient coverage for analysis but the disadvantage of removing information about variation in levels of expression among transcripts, within pools. However, transcripts with large expression differences between pools may still be differentially represented in the 454 reads. Therefore, we have identified those transcripts with the greatest differences in representation as targets for future study but do not make any further inferences about them.
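One way to rank transcripts by difference in representation between pools is an exact test on read counts. The sketch below uses a two-sided Fisher exact test as a stand-in, which is an assumption rather than the procedure applied in this study. The function name and the read counts are hypothetical. Because the libraries were normalized, any p-value from such a test is best treated as a ranking score rather than as evidence of differential expression.

```python
from math import comb


def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows are the two pools; columns are reads assigned to the contig of
    interest and reads assigned to all other contigs.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x contig reads falling in pool 1.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    total = sum(prob(x) for x in range(lo, hi + 1)
                if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


if __name__ == "__main__":
    # Hypothetical counts: 40 of 100,000 reads in one pool, 5 of 100,000 in the other.
    print(fisher_two_sided(40, 99960, 5, 99995))
```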
SNPs detected using 454 sequencing of pooled cDNA should be validated by direct genotyping from genomic DNA of individuals. Sequencing errors may generate false SNPs, and allele frequencies may not be estimated accurately from pools, as illustrated by comparisons between read frequencies and direct genotyping (e.g. Van Tassell et al., 2008). In our study, sequencing errors may contribute to the total number of SNPs detected but are very unlikely to generate outliers with strong frequency differences between pools. Our test for outliers incorporates the major sources of variation in SNP frequencies, although clearly not all sources. Nevertheless, the outliers detected will include false positives (and also exclude many loci that are, in fact, influenced by selection) for these experimental reasons as well as because of the limitations of the demographic model discussed in the following paragraphs. Therefore, we emphasize that the objective of this 454 genome scan approach, as with any such methodology, is to identify candidate loci for further study rather than to provide strong evidence for selection influencing individual loci.
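The contrast between false SNPs and false outliers can be made concrete with a simple binomial calculation. This is a sketch under stated assumptions: errors are independent across reads, all errors at a site produce the same alternative base, and the error rate and read numbers are hypothetical values, not estimates from our data.

```python
from math import comb


def prob_false_snp(n_reads, error_rate, min_minor_reads):
    """P(at least min_minor_reads erroneous reads at a monomorphic site)."""
    # Binomial probability of seeing fewer than min_minor_reads errors.
    below = sum(comb(n_reads, k)
                * error_rate ** k
                * (1 - error_rate) ** (n_reads - k)
                for k in range(min_minor_reads))
    return 1.0 - below


if __name__ == "__main__":
    # Hypothetical values: 20 reads per site and a 1% per-base error rate.
    for k in (1, 2, 5, 10):
        print("at least", k, "error reads:", prob_false_snp(20, 0.01, k))
```

Under these assumptions, a single erroneous read is fairly likely at any given site, so errors can add to the SNP count. The chance that errors alone supply the many alternative reads needed for a strong frequency difference between pools falls off very rapidly.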
Divergent ecotypes of L. saxatilis provide an example of local adaptation and partial reproductive isolation (see Johannesson et al., 2010 for a discussion of repeated evolution of ecotypes, Rolán-Alvarez, 2007 for a review, Sadedin et al., 2009 for a model). Previous genome scans with AFLPs have detected a small proportion of markers, 2–5% outlier loci, with greater than neutral genetic differentiation (Galindo et al., 2009; Wilding et al., 2001). Our EST-based genome scan showed a similar pattern, with a small proportion of the markers showing greater than neutral differentiation (outlier loci), around 7% when applying a multiple test correction. A recent review by Nosil et al. (2009) on divergent selection and the extent of genomic divergence during population differentiation and/or speciation points out that the percentage of the genome apparently affected by divergent selection (% of outlier loci) lies in the range of 5–10% in most of the studies reviewed (Nosil et al., 2009 and references therein). However, they remark that the results of different genome scans should be compared with caution because of variation among the analyses (number of populations, type of molecular markers, methodology and level of significance). Population structure is also an important variable to take into account (Excoffier et al., 2009). It is clear that false positives (and false negatives) are a feature of all genome scan studies (Hermisson, 2009). Therefore, the percentage of outliers detected in a genome scan is not the most important outcome, and individual outliers should be treated as candidates, in need of further investigation into the effects of natural selection. In studies of ecological speciation, a genome scan is one of the first steps towards the genetics of divergent adaptation, rather than an end in itself.
Outlier loci may be the direct targets of selection, they may be regions tightly linked to selected loci or they may be false positives (Nosil et al., 2009). An advantage of using ESTs is that a proportion of loci can be functionally annotated, and this may reveal loci likely to be associated with traits under selection. However, the annotation step may be problematic for 454 data from nonmodel species. In this study, we were able to annotate only 16% of the contigs with outlier SNPs (30 of 187 contigs, Newbler). This is for several reasons: contigs are often short, they may contain primarily untranslated regions, the Mollusca are not well represented in sequence databases and the molluscan sequences that are available are not themselves well annotated.
A potential confounding factor in using cDNA is the possibility of allele-specific expression (e.g. Pant et al., 2006). As pointed out by Renaut et al. (2010), if different alleles are preferentially expressed in the populations being compared, false outliers may be generated. This would actually represent an interesting form of divergence and should be considered in follow-up studies of EST outliers.
In the case of the ecotypes of L. saxatilis, we expected, a priori, that the outlier loci should be involved in, or linked to genes involved in, local adaptation and assortative mating (Butlin et al., 2008; Johannesson et al., 2010). Shell size represents an important variable in local adaptation (wave exposure and crab predation) and mate choice, while shell shape and the foot muscle that attaches the snail to the substrate also play an important role in withstanding wave exposure (see Rolán-Alvarez, 2007 for a review). Thus, attention should be directed towards genes related to shell formation, muscle physiology and energy metabolism. Because the females used in our study carried developing embryos in their brood pouches and we used all tissues in our RNA extractions, there is the potential for genes involved in all of these processes to be included in our data. Here, we briefly discuss outliers that stand out as promising targets for future investigation on the basis of their annotation.
Skeletal matrix proteins
Shells in L. saxatilis are composed of calcite and aragonite (Taylor & Reid, 1990), with an associated organic matrix that is thought to be involved in shell formation and so influences the properties of the shell (e.g. size, shape) (Gunthorpe et al., 1990). Some of the SGoF outliers showed matches with skeletal matrix proteins including lithostathine, mucin and dermatopontin. Mucin was also detected in the differential representation analysis (see Tables S5 and S6). Lithostathine is a C-type lectin-like protein, which plays an important role in calcium carbonate biomineralization in a wide variety of organisms (Matsubara et al., 2008 and references therein). Mucins are heavily glycosylated proteins, and there is evidence that they have a role in molluscan shell calcification (Marin et al., 2000). The fact that mucin also stands out in the differential representation analysis could be because members of the mucin gene family are differentially expressed between the ecotypes. Dermatopontin is considered a major component of the shell matrix proteins in molluscs (Marxen et al., 2003).
Skeletal muscle proteins
The muscular foot size of L. saxatilis is associated with the level of wave exposure (Grahame & Mill, 1986). Contigs with matches to myosin and titin were SGoF outliers, and titin was also differentially represented in the two pools. Twitchin is a titin-like protein in molluscan smooth muscle, and it is involved in the ‘catch’ contraction in molluscs, a unique energy-saving contraction (reviewed in Funabara et al., 2005). Molluscan catch muscle can maintain tension for a long time with little energy consumption, and twitchin interacts with myosin in this contraction (Funabara et al., 2001).
Energetic metabolism is likely to vary with environmental stressors (e.g. temperature, anoxia, wave action) that differ between tidal levels. Outlier loci (SGoF) matched two genes involved in energetic metabolism, arginine kinase (ARK) and NADH dehydrogenase (NADH-dh), and both were detected as outliers in both assemblies (Newbler and NGen). ARK regulates the availability of ATP in cells involved in metabolic work (Morrison, 1973). ARK has shown evidence for local adaptation in previous studies of Littorina spp.: L. fabalis (Johannesson & Mikhailova, 2004; Tatarenkov & Johannesson, 1994), L. obtusata (Schmidt et al., 2007) and L. saxatilis (Martínez-Fernández et al., 2008). The hypothesis in these studies is that certain ARK alleles provide a faster supply of ATP for contracting the foot muscle in wave-exposed habitats. NADH dehydrogenase subunits are encoded by the mitochondrial genome, and they are involved in the respiratory chain. Natural selection may affect genes within the mtDNA (Ballard & Whitlock, 2004 and references therein). It is also possible that these contigs are false positives caused by transcribed nuclear pseudogenes derived from mtDNA (Bensasson et al., 2001). However, previous studies of littorinid mtDNA have not detected nuclear copies (Wilding et al., 2000).
In eukaryotes, reverse transcriptases are responsible for the movement of transposable elements (TEs). TEs account for a major fraction of eukaryotic genomes (Feschotte et al., 2002 and references therein), and they are able to modify gene expression and promote genome evolution (reviewed in Gogvadze & Buzdin, 2009). As the genome of Littorina is large (estimates range from 0.81 to 1.47 pg, 0.79–1.35 × 10^9 bp, http://www.genomesize.com/index.php), it is likely to contain many TEs and it is not surprising that reverse transcriptase transcripts occur in our cDNA pools. They may occur as outliers because different gene copies are combined within contigs.
Repetitive elements such as TEs have already been described in L. saxatilis (see Wood et al., 2008). In a previous study (Wilding et al., 2001), an AFLP genome scan discovered outlier loci between H and M ecotypes of L. saxatilis, two of which (E10 and E12) were associated with putative TE insertions (Wood et al., 2008). At the same time, BAC clones containing these outlier loci and another two BAC clones containing neutral AFLP loci (A30 and A37) were sequenced and found to contain additional partial copies of these inserted sequences (Wood et al., 2008). In our study, 21 contigs with outlier SNPs matched sequences within these BACs, in most cases matching sequences in several BACs at the same time. This result suggests that these repetitive elements must be very widespread across the Littorina genome because the sequenced BACs represent such a small proportion of the total. It is difficult to interpret their ‘outlier’ status because the expectation against which SNP frequency divergence was compared is not appropriate for repetitive elements. Nevertheless, it is intriguing that repetitive elements appear to differ between ecotypes. In some cases, these elements might play a role in gene regulation, and they could represent a mechanism for adaptation to changing environments through genetic novelty (reviewed in Gogvadze & Buzdin, 2009). However, the number of studies of TEs remains small relative to their ubiquity and abundance in eukaryote genomes, and more studies are needed to address their role in adaptation. Next generation sequencing will play an important role in this respect.
We found several contigs within our data set that have a bacterial origin, which was not unexpected because we used whole snails, including the digestive gland. For example, Vera et al. (2008) also found microbial sequences in their butterfly data set, including an intracellular parasite. In our case, these microbial sequences are interesting because they reflect characteristics of the environment. Littorina saxatilis feeds on the epilithic biofilm of diatoms, cyanobacteria and bacteria (Norton et al., 1990). Microgradients of light, moisture, temperature and wave exposure are common in the rocky intertidal habitats, and this determines the distribution, diversity and abundance of algae and invertebrates (Menge & Branch, 2001). Thus, we might expect the resources available to differ between ecotypes. Some of the contigs identified showed marked differences either in SNP frequency or in abundance of reads between the ecotypes, and most of these had Blastn matches with cyanobacteria (see Table S7). Our results suggest that the microbial biofilm may differ between tidal levels, and this provides another, previously unrecognized, habitat axis that may contribute to the divergent selection that operates on Littorina populations, as well as on other marine gastropods.
Sequences matching the bacterivorous ciliate C. glaucoma were also identified within the contigs with reads present only in the H ecotype. Ciliates can be observed inside the brood pouch of female L. saxatilis, in contact with the developing embryos (personal observation), and are likely to be this species. We do not know whether they have negative effects on reproductive success or what proportion of each ecotype is infested. These two questions should be addressed in future studies.