Sequencing goes 454 and takes large-scale genomics into the wild


Hans Ellegren, Fax: +46 18 4716460


Sometimes, science takes a big leap forward. This is often due to new technology that allows the study of questions previously difficult or even impossible to address. An example of this is provided in this issue (Vera et al. 2008) by the first large-scale attempt toward genome sequencing of an ecologically important model, based on the new ‘454-sequencing technology’. Using this new technology, the protein-coding sequences of the Glanville fritillary butterfly genome have now been largely characterized.

In the mid 20th century, the advent of molecular biology allowed rapid genetic advances. Subsequently, the 1990s saw the rise of the genomic era when it became technically feasible and realistic to uncover the complete genetic code of living organisms, initially of small bacterial genomes but more recently of eukaryotic species as well. DNA sequencing has been a cornerstone of molecular biology, genetics, genomics and, needless to say, the field of research covered by this journal, molecular ecology. However, despite the technological advances witnessed in genomics during the last decades, sequencing is still generally performed using the same chain-termination methodology developed by Fred Sanger et al. in 1977. Alternative concepts have been suggested, such as the early base-cleavage ‘Maxam and Gilbert method’ (Maxam & Gilbert 1977), sequencing by hybridization (Southern 1996) and, more recently, pyrosequencing (Ronaghi et al. 1998); however, none of these have come to replace dideoxy sequencing as the method of choice. Protocols and instrumentation needed for Sanger sequencing have been optimized over the years and have made genome sequencing possible. Yet, targeting a new eukaryotic genome with traditional sequencing technology represents such a huge undertaking that in practice, it has essentially been confined to model organisms and to a few laboratories with extraordinary resources and instrumentation.

Fortunately, a major leap in DNA sequencing is now finally being made by the introduction of new concepts for ultra-high throughput sequencing. One of these is the so-called ‘454-sequencing technology’, which is actually based on the above-mentioned pyrosequencing principle (Margulies et al. 2005). The major increase in throughput, however, is accomplished by massive parallelization through emulsion polymerase chain reaction (PCR) and sequence analysis on fibre optic chips. A schematic illustration of the different steps in 454-sequencing is given in Fig. 1; the reader should refer to this figure for technical details of the process.

Figure 1.

A schematic illustration of the different steps in 454-sequencing. (1) Short amplicons, fragmented DNA or cDNA are used as starting template for further processes. (2) 5′- and 3′-end specific adaptors are ligated to single-strand fragments, creating a library for PCR amplification. (3) Fragments are immobilized through a biotin tag on one of the adaptors, which binds to streptavidin-coated beads. Each bead will come to carry just one fragment. Beads are then emulsified in a water-in-oil mixture. (4) Each droplet in the emulsion contains the necessary ingredients for PCR and thereby forms a microreactor for amplification. Massively parallel amplification is carried out in the emulsion. Beads, with amplified fragments bound to them, are released from the oil and are loaded onto a fibre optic chip, a picotiter plate, for sequencing. Only one bead will fit in each ≈ 44 µm well. (5) Pyrosequencing takes place by a sequential flow of sequencing reagents across the plate. When a complementary nucleotide is added to a particular template in an extension reaction, a light signal is generated. (6) The final result is a pyrogram in which the height of each signal is proportional to the number of adjacent nucleotides that are identical. In each cycle of the sequential addition of the four different nucleotides, no signal is seen when noncomplementary nucleotides are added. (7) Description of the chemical reactions leading to a light signal. Illustration: Ola Lundström. The pyrogram was kindly provided by Anders Götherström.
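Steps (5) and (6) of the figure can be illustrated with a minimal Python sketch of how a pyrogram might be read back into a sequence. This is an illustration only, not the instrument's base-calling software: the function name `decode_flowgram`, the flow order `TACG` and the simple rounding rule are assumptions made for the example, and the signal values are invented.

```python
def decode_flowgram(signals, flow_order="TACG"):
    """Convert per-flow light signals into the sequence of the synthesized strand.

    signals    -- one normalized intensity per nucleotide flow (hypothetical data)
    flow_order -- order in which nucleotides are flowed; "TACG" is an assumption
    """
    bases = []
    for flow_index, signal in enumerate(signals):
        nucleotide = flow_order[flow_index % len(flow_order)]
        # Signal height is taken as proportional to the number of identical
        # adjacent nucleotides incorporated; round to the nearest whole number.
        run_length = max(0, int(signal + 0.5))
        # A signal near zero means the flowed nucleotide was not complementary.
        bases.append(nucleotide * run_length)
    return "".join(bases)


# Two cycles of four flows each, with invented intensities.
example_signals = [1.02, 0.05, 2.10, 0.00, 0.10, 0.97, 0.00, 3.05]
print(decode_flowgram(example_signals))  # prints TCCAGGG
```

The sketch also hints at why long homopolymer runs are harder to call: the difference between a signal of 7 and 8 is proportionally much smaller than that between 1 and 2, so simple rounding becomes less reliable as runs get longer.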

The Glanville fritillary butterfly (Melitaea cinxia) has long been a focus for studies of ecology and population biology, particularly metapopulation dynamics in fragmented landscapes, in the work led by Ilkka Hanski in Helsinki (e.g. Saccheri et al. 1998; Ehrlich & Hanski 2004). Recently, genomic approaches have been used to dissect and understand the genetics and mechanisms of variation in phenotypic traits in this species. For example, Hanski and colleagues have used a candidate gene approach to show that variation in the phosphoglucose-isomerase locus (pgi; second step of glycolysis) is related to flight ability and the growth rate of local populations (Haag et al. 2005; Hanski & Saccheri 2006). However, in order to begin identifying additional genes using top-down approaches such as microarray analyses, genomic resources were needed from the protein-coding part of the genome. For this reason, an international collaboration with Jim Marden's lab at Pennsylvania State University was formed.

Vera et al. (2008) used complementary DNA (cDNA) prepared from larvae and pupae, and from adults, to make two runs of 454-sequencing. The obvious rationale for this design was that sequencing is targeted to the coding regions of the genome, just as in expressed sequence tag (EST) sequencing using conventional methodology (Bouck & Vision 2007). Coding regions are candidates for functional variation and, although there is increasing support for the hypothesis that phenotypic variability is governed to a large extent by regulation of gene expression, concentrating sequencing on coding regions has the benefit of allowing construction of microarrays for expression analysis.

The two runs generated a total of approximately half a million high-quality reads, averaging 110 bp in length. When assembled, i.e. when bioinformatic tools were used to identify homology among reads to stitch them together, 48 000 contigs formed by two or more overlapping reads were identified, as were 60 000 singletons. However, this should not be taken to mean that butterflies have > 100 000 different genes. For one thing, many of the singletons, and perhaps also the contigs, are likely to originate from the same gene, although it was not possible in all cases to assemble them into a continuous sequence. Also, it is well known that many sequences found in cDNA/EST sequencing do not represent protein-coding genes; for example, they may be noncoding RNAs. The authors used all unique sequences in BLAST searches against other genomes, particularly the draft sequence of the silkworm Bombyx mori genome, and found some 9000 hits. As B. mori has been estimated to have about 18 000 genes, and assuming that this is representative for Lepidoptera, this may suggest that at least half of all genes in the Melitaea cinxia genome had been sequenced at least in part.
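The back-of-the-envelope reasoning in this paragraph can be written out as a short Python sketch. All numbers are the rounded figures quoted above; treating each B. mori hit as one distinct gene is a simplifying assumption of the sketch, and the variable names are invented for illustration.

```python
# Rounded figures quoted in the text.
reads = 500_000                 # high-quality reads from the two runs
mean_read_length_bp = 110       # average read length
contigs = 48_000                # assemblies of two or more overlapping reads
singletons = 60_000             # reads that could not be assembled
bombyx_hits = 9_000             # hits against the B. mori draft genome
bombyx_gene_estimate = 18_000   # estimated gene number in B. mori

# Total amount of raw sequence generated, in megabases.
total_sequence_mb = reads * mean_read_length_bp / 1e6

# Number of unique sequences; an upper bound on, not an estimate of, gene number.
unique_sequences = contigs + singletons

# Fraction of genes tagged, ASSUMING one hit corresponds to one distinct gene
# and that the B. mori gene count is representative of Lepidoptera.
fraction_genes_tagged = bombyx_hits / bombyx_gene_estimate

print(f"{total_sequence_mb:.0f} Mb of sequence")
print(f"{unique_sequences} unique sequences")
print(f"{fraction_genes_tagged:.0%} of genes tagged")
```

The gap between 108 000 unique sequences and roughly 18 000 expected genes is the quantitative form of the caveat above: many contigs and singletons must derive from the same gene or from noncoding transcripts.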

What are the benefits of all this information?

  1. It gives direct access to sequence information necessary for candidate-gene approaches or for identification of candidate genes in regions showing signals of linkage to phenotypic traits.
  2. Since pools of individuals were used for sequencing and given the depth of coverage obtained, a large number of sequence variants were detected. About seven segregating sites per 1000 bp (note that this rate is highly dependent on the number of individuals screened) were seen using stringent criteria for polymorphism detection. This will provide a valuable resource of genetic markers for QTL or association mapping and population genetic analysis.
  3. With the availability of large-scale sequence information from the transcriptome (the part of the genome that is transcribed into RNA), species-specific microarrays can be constructed. Vera and colleagues (2008) constructed an oligonucleotide array (60-mers) and performed a small expression pilot study, providing proof-of-principle for the feasibility of the approach.
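The remark in the second point, that the density of segregating sites depends on the number of individuals screened, can be made concrete with Watterson's classical estimator, in which the expected number of segregating sites scales with the harmonic coefficient a_n = 1 + 1/2 + ... + 1/(n − 1) for a sample of n chromosomes. The Python sketch below is not an analysis from Vera et al. (2008): it assumes the standard neutral model, the function names are invented, and the sample sizes are hypothetical because the effective number of chromosomes represented in pooled 454 reads is not given here.

```python
def harmonic_coefficient(n_chromosomes):
    """a_n = sum of 1/i for i = 1 .. n-1, the sample-size term in Watterson's estimator."""
    return sum(1.0 / i for i in range(1, n_chromosomes))


def watterson_theta_per_site(segregating_sites, n_chromosomes, length_bp):
    """Per-site diversity estimate: theta_W = S / (a_n * L)."""
    return segregating_sites / (harmonic_coefficient(n_chromosomes) * length_bp)


def expected_segregating_sites(theta_per_site, n_chromosomes, length_bp):
    """Expected number of segregating sites for a given theta under neutrality."""
    return theta_per_site * harmonic_coefficient(n_chromosomes) * length_bp


# The same observation (7 segregating sites per 1000 bp) implies different
# underlying diversity depending on the HYPOTHETICAL number of chromosomes sampled.
for n in (2, 10, 50):
    theta = watterson_theta_per_site(7, n, 1000)
    print(f"n = {n:2d}: theta_W per site = {theta:.4f}")
```

Because a_n grows with n, the same underlying diversity yields more segregating sites in a larger sample, which is why the quoted rate cannot be compared across studies without accounting for sample size.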

Admittedly, the new study contains limited ecological information. It also ignores the potential for molecular evolutionary analysis using a comparative genomic approach in which sequences of M. cinxia would be aligned to orthologs in B. mori, Drosophila melanogaster and other insects, and surveyed for signs of selection. The strength of the study is that it offers a resource for further analyses of adaptation and population processes in an ecologically important model; analyses that otherwise would have been more or less impossible to do. A specific example of this was the surprising discovery of over 400 genes from microsporidia, intracellular fungal parasites, which were subsequently added to the microarray. Microsporidia are commonly found in insects and are known to affect insect physiology and population-level dynamics. More generally, it may be expected that this study will stimulate others to take the step into large-scale genomics, paving the way for a new generation of molecular ecological research.

The 454-system (manufactured by 454 Life Sciences, recently incorporated with Roche Applied Science) is not alone in the market for new ultra-high throughput technologies. In fact, two other technologies are commercially available that actually generate more sequence information per run than 454-sequencing, but at the price of shorter read lengths. These are Solexa sequencing (Illumina; Bennett 2004; Bentley 2006), which is a sequencing-by-synthesis concept, and the SOLiD system, which is based on sequencing-by-ligation (Applied Biosystems; Shendure et al. 2005). Read lengths may in these cases be as short as a few tens of bp, basically necessitating access to a reference genome for comparison. However, there is potential for generating up to 1 Gb of DNA sequence or more per run. Moreover, manufacturers of all three systems suggest that read lengths will improve substantially in the near future. For example, since the completion of the butterfly study presented in this issue, a new version of the 454-system (GS FLX) has been launched that generates 250 bp reads, and longer reads are promised by next year. Undoubtedly, these new technologies will impact the study of molecular ecology in ways that would have been difficult to imagine just a few years ago.

Hans Ellegren's laboratory is doing research at the interface of genomics and evolutionary biology.