The study of plant biology in the 21st century is, and will continue to be, vastly different from that in the 20th century. One driver for this has been the use of genomics methods to reveal the genetic blueprints for not one but dozens of plant species, as well as resolving genome differences in thousands of individuals at the population level. Genomics technology has advanced substantially since publication of the first plant genome sequence, that of Arabidopsis thaliana, in 2000. Plant genomics researchers have readily embraced new algorithms, technologies and approaches to generate genome, transcriptome and epigenome datasets for model and crop species that have permitted deep inferences into plant biology. Challenges in sequencing any genome include ploidy, heterozygosity and paralogy, all of which are amplified in plant genomes compared to animal genomes due to their large genome sizes, high repetitive sequence content, and rampant whole-genome or segmental duplication. The ability to generate de novo transcriptome assemblies provides an alternative approach that bypasses these complex genomes and accesses the gene space of these recalcitrant species. The field of genomics is driven by technological improvements in sequencing platforms; however, software and algorithm development has lagged behind reductions in sequencing costs and improvements in throughput and quality. It is anticipated that sequencing platforms will continue to improve the length and quality of output, and that the complementary algorithms and bioinformatic software needed to handle large, repetitive genomes will improve in parallel. The future is bright for an exponential improvement in our understanding of plant biology.
The basis of all biological life is the genetic code. Thus, access to the primary DNA sequence, i.e. the genome, and how genes are encoded within the genome, has become a fundamental resource in biology. Although the pace of genome sequencing in plant biology lags behind that in microbial and mammalian systems (Table 1), application of genomics and the associated data, expertise and hypotheses are rampant among sub-disciplines of plant science, including agronomy, biochemistry, forestry, genetics, horticulture, pathology and systematics. In addition to de novo sequencing of a genome, sequencing technologies and associated bioinformatic and computational processes allow determination of the transcriptome and the epigenome (modified DNA and chromatin state), as well as subsets of the genome such as the complement of exons (the exome) or the regulatory regions (the regulome).
This review provides a historical perspective on the science of genomics in plant biology (Table 1) and highlights use of the rapidly evolving next-generation sequencing technologies in plant biology. Challenges in genomics, especially for plants, are discussed to raise awareness within the greater plant community regarding the current limitations in the field. New computational approaches for handling the size of sequence datasets are discussed to familiarize readers with these resources. References to sequencing technologies are not meant to be comprehensive, but aim to direct readers to at least a single application of the technology in plants.
Historical Perspective on Plant Genome Sequencing: The Early Years with the Sanger Platform
The early 1990s marked the start of genomics, with the development of automated sequencing methods that used dideoxy chain termination with fluorescent molecules, also known as Sanger sequencing. This technology enabled the first large-scale gene discovery effort via sequencing, i.e. expressed sequence tags (ESTs) (Adams et al., 1993). The single-end sequencing of cDNA clones (Figure 1a) permitted discovery of genes from discrete tissues of interest. This approach, while initially controversial, was embraced by plant biologists, and included sequencing of ESTs from the model species Arabidopsis thaliana (Newman et al., 1994). Sequences of ESTs for 733 species of plants (23 390 105 total plant ESTs; 33% of dbEST) are now available in the National Center for Biotechnology Information database of ESTs (dbEST) (http://www.ncbi.nlm.nih.gov/dbEST/dbEST_summary.html), demonstrating the interest in and utility of single-pass sequences for transcribed genes in plants.
The first de novo sequenced genome of a free-living organism was that of the bacterium Haemophilus influenzae (Fleischmann et al., 1995) (Table 1). This paradigm-changing accomplishment was achieved using Sanger-based sequencing technology, and demonstrated that whole-genome shotgun sequencing (WGS), in which the genome is fragmented randomly, cloned into a plasmid vector (Table 2), paired-end sequenced (Figure 1b), and assembled from the reads using computational algorithms, was feasible. This was quickly followed by a second bacterial WGS effort with Mycoplasma genitalium (Fraser et al., 1995), demonstrating that this approach could be repeated and was not just an isolated accomplishment. Although use of WGS was successful for small microbial genomes, i.e. of the order of several megabases (Mb), application of this method to larger eukaryotes was not possible due to assembly challenges. Thus, early eukaryotic genome projects in the 1990s entailed fragmentation of the genome into large segments cloned as cosmids, bacterial artificial chromosomes (BACs) and yeast artificial chromosomes (YACs) that were subjected to shotgun sequencing on a per clone basis. The ordered set of clones and their consensus sequences were then stitched together into a pseudomolecule to represent the chromosome(s) (e.g. Lin et al., 1999).
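The shotgun assembly step can be illustrated with a deliberately minimal sketch: a greedy overlap strategy that repeatedly merges the pair of reads sharing the longest exact suffix-prefix overlap. Real assemblers of the era were far more sophisticated (handling sequencing errors, repeats and paired-end constraints), so the function names and the min_len cutoff here are illustrative only.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the read pair with the largest overlap
    until no pair overlaps by at least min_len bases."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # no remaining overlaps: leave contigs unmerged
        merged = reads[i] + reads[j][n:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads
```

On three toy reads drawn from one fragment, `greedy_assemble(["ATTAGACC", "GACCTGCC", "CTGCCG"])` reconstructs the single contig "ATTAGACCTGCCG"; repeats longer than the reads defeat exactly this logic, which is why repetitive genomes resist shotgun assembly.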
Table 2. DNA amplification methods in sequencing platforms

DNA amplification method | Description | Platform(s)
Plasmid replication in Escherichia coli | Exploits the bacterial capability to replicate DNA in vivo | Sanger
Emulsion PCR | Generates micro-PCR reactors by immersing water droplets containing PCR reagents in oil | 454 Roche, SoLiD™, Ion Torrent
Bridge (cluster) amplification | Amplifies DNA directly on the flow cell in local clusters | Illumina
The prototype plant genome, Arabidopsis thaliana
Due to its suitability as a model species for plant research and its small genome, A. thaliana was the first plant species targeted for de novo genome sequencing. The project commenced in 1996 with formation of the Arabidopsis Genome Initiative, a consortium of international laboratories that utilized a BAC-by-BAC approach with Sanger sequencing to complete the genome, culminating in a landmark publication in 2000 (Arabidopsis Genome Initiative 2000) (Tables 1 and 3). The sequenced genome (119 Mb) is considered the gold standard for plant genomes due to the high quality and finished nature of the sequence. It is of interest to note that even the A. thaliana genome is not a complete genome as gaps exist in the sequence (ftp://ftp.arabidopsis.org/home/tair/Sequences/whole_chromosomes/tair9_Assembly_gaps.gff), and experimental determination estimates a genome size of 157 Mb (Bennett et al., 2003). While the majority of these gaps are in highly repetitive sequences that are recalcitrant to all known sequencing and assembly methods, e.g. centromere and ribosomal DNA (rDNA), there are a limited number of gaps in the euchromatic regions, and updates to the sequence are an active component of curation (http://arabidopsis.org).
Using the approach taken by the Arabidopsis Genome Initiative, the genome of rice (Oryza sativa) was also sequenced by the International Rice Genome Sequencing Project using a BAC-by-BAC approach, yielding the only other plant genome sequence that is of ‘finished quality’ (International Rice Genome Sequencing Project 2005) (Table 3). As with the A. thaliana genome, the O. sativa genome is nearly complete, with gaps at most of the centromeres and a limited number of gaps in the euchromatic arms.
Whole-genome shotgun sequencing using the Sanger platform
The throughput of the Sanger platform was greatly improved by the development of capillary sequencing, enabling simultaneous sequencing of 96 reactions. When the higher throughput was coupled with improved assembly algorithms and computational power, direct de novo sequencing and assembly of whole eukaryotic genomes became possible. This method eliminated the time-consuming and expensive step of identifying, shotgun sequencing and assembling an ordered set of cosmid/BAC/YAC clones. The effectiveness of Sanger WGS for large eukaryotic genomes was first demonstrated in 2000 for Drosophila melanogaster, opening up a new era of genomics (Adams et al., 2000). This revised approach for eukaryotic WGS was embraced by the plant community, and multiple genomes have since been sequenced using this approach (Table 3) (Goff et al., 2002; Yu et al., 2002; Tuskan et al., 2006; Ming et al., 2008; Paterson et al., 2009; International Brachypodium Initiative 2010; Schmutz et al., 2010). However, none of these genomes are ‘finished’, and there are gaps and errors in the assemblies, as the process of ‘finishing’ entails inspection and experimental resolution of inconsistencies, which is a time-consuming, laborious and expensive task. To date, A. thaliana and rice remain the only finished dicot and monocot genomes, respectively.
Reduced representation sequencing
Although WGS was successful for a range of plant genomes, plant species with genomes >1 Gb or highly repetitive genomes, such as maize (Zea mays) and hexaploid wheat (Triticum aestivum), are recalcitrant to this approach. Reduced representation sequencing, in which the total size of target DNA to sequence is decreased, provides an economical and feasible approach to obtain a portion of the genome sequence of large, repetitive plant genomes. Two methods have been shown to generate robust representations of genic regions: methylation filtration and C0t selection. In methylation filtration, the methylation of gene-rich regions is exploited by cloning plant genomic DNA into a bacterial host that limits replication of hyper-methylated DNA (Rabinowicz et al., 1999). Thus, transformants are enriched in genic regions of the genome. In C0t selection, renaturation kinetics are exploited to enrich for fractions of single-copy DNA, as the low-copy-number sequences, i.e. the genes, anneal more slowly than high-copy-number sequences such as repetitive sequences (Britten and Davidson, 1976). These methods were successful for the maize genome, yielding a reduction in the percentage of repetitive sequences from 68% in unfiltered sequences to 33 and 14% for methylation filtration and C0t, respectively (Whitelaw et al., 2003).
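The C0t principle follows from ideal second-order reassociation kinetics, C/C0 = 1/(1 + k·C0t), where the rate constant k scales with copy number (Britten and Davidson, 1976). A sketch with purely illustrative rate constants shows why repeats leave the single-stranded fraction long before single-copy genes do:

```python
def ss_fraction(k, c0t):
    """Fraction of DNA still single-stranded under ideal second-order
    reassociation kinetics: C/C0 = 1 / (1 + k * C0t).
    k is an effective rate constant proportional to copy number
    (the values used below are illustrative, not measured)."""
    return 1.0 / (1.0 + k * c0t)

# At a moderate C0t, a high-copy repeat family (large k) has mostly
# reannealed, while single-copy genes (small k) remain single-stranded
# and can be collected for sequencing.
repeat_ss = ss_fraction(k=100.0, c0t=1.0)      # ~1% still single-stranded
single_copy_ss = ss_fraction(k=0.01, c0t=1.0)  # ~99% still single-stranded
```

Harvesting the single-stranded fraction at that point therefore enriches for genes, consistent with the drop in repetitive content from 68% to 14% reported for maize C0t libraries.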
Advent of Next-generation Sequencing Platforms
The first non-Sanger platform available for sequencing on a genome scale was hybridization-based re-sequencing. One commercial platform, developed by Affymetrix (subsequently Perlegen), involves the synthesis of oligonucleotides on a solid surface using photolithography (Chee et al., 1996). A total of eight oligonucleotides are used to interrogate a single base (four oligonucleotides for each strand). The DNA from the query genome is then hybridized to the array, and the intensity of hybridization to each probe is quantified. The intensity is then used to deduce the identity of the interrogated base. Probes are synthesized to represent the whole genome or part of it, leading to genome-scale sequence for the query accessions. The Perlegen-based re-sequencing method has been used to survey 20 A. thaliana accessions (Clark et al., 2007) as well as 20 rice accessions (McNally et al., 2009). These initial forays into determinations of genome diversity were highly informative, resulting in identification of single nucleotide polymorphisms, genes with high diversity between accessions, and genes with limited polymorphisms, thereby providing insight into evolution at a genome level. The customized and expensive nature of the Perlegen arrays, coupled with the emergence of high-throughput next-generation sequencing platforms, has limited the use of hybridization-based re-sequencing to these few plant studies.
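The base-deduction step in hybridization re-sequencing can be caricatured as choosing the brightest of the four allele-specific probes, with a no-call when the signal is ambiguous. This is a sketch only: the min_ratio threshold is invented, and the actual Perlegen caller used a statistical model over all eight probes (both strands) rather than a simple ratio test.

```python
def call_base(intensities, min_ratio=2.0):
    """Call the interrogated base from four probe intensities.

    intensities: dict mapping 'A','C','G','T' to a hybridization signal.
    The call is the brightest probe, reported only if it exceeds the
    runner-up by min_ratio (an illustrative threshold); otherwise 'N'.
    """
    ranked = sorted(intensities, key=intensities.get, reverse=True)
    best, second = ranked[0], ranked[1]
    if intensities[second] == 0 or intensities[best] / intensities[second] >= min_ratio:
        return best
    return 'N'
```

A clean signal yields a confident call, while cross-hybridization (two comparably bright probes, common in repetitive or diverged regions) yields a no-call, which is one reason array re-sequencing undercalls polymorphic regions.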
Roche 454 platform
In 2005, a high-throughput pyrosequencing method involving sequencing by synthesis was reported (Margulies et al., 2005). This platform, created by 454 Life Sciences Co. and later acquired by Roche (http://www.roche.com), utilizes emulsion PCR, a technique in which a single DNA fragment is clonally amplified by PCR within a water-in-oil microreactor (Table 2). Sequencing occurs on a flow cell with picoliter wells, in which addition of a nucleotide to the growing strand by DNA polymerase results in release of a pyrophosphate. The pyrophosphate is then used in a coupled reaction with ATP sulfurylase, luciferase and luciferin that emits light, which is captured by a CCD camera. Apyrase is used to degrade unused nucleotides and ATP before the next deoxyribonucleotide triphosphate is added. This platform has evolved with respect to throughput and read length. Reads may be single-end (Figure 1a), or fragments may be circularized, ligated and selected to generate mate-pair sequences to provide scaffolding information (Figure 1c). The 2012 upgrade of the platform to the FLX+ generates 1 million reads with read lengths of up to 1000 bp (http://my454.com/products/gs-flx-system/index.asp), providing a replacement for the Sanger platform for longer reads, which are valuable for deciphering repetitive sequences and accurate reconstruction of transcript isoforms.
SoLiD™ platform
Another high-throughput method is that of SoLiD™ (Applied Biosystems, http://www.appliedbiosystems.com), which utilizes emulsion PCR to amplify the templates (Table 2) and sequencing by ligation (McKernan et al., 2009). In sequencing by ligation, iterative rounds of ligation of the template to fluorescently labeled di-base probes are performed on a flow cell. Following detection, a portion of the probe is cleaved, and sequential ligation, detection and cleavage are performed. The ligation cycle is repeated until the desired read length is achieved, then the primer is removed and an offset primer (n − 1 to n − 4) is hybridized, and the ligation cycles repeated. In this platform, the two-base encoding, in which every base is interrogated twice in color space, provides a high-quality sequence, which is a major advantage of this platform. Throughput of the SoLiD™ platform as reported in early 2012 is 7–20 Gb per day, with a maximum read length of 75 bp (http://www.appliedbiosystems.com/absite/us/en/home/applications-technologies/solid-next-generation-sequencing/next-generation-systems.html). The SoLiD™ platform has been used for transcriptome analysis (Autran et al., 2011), de novo genome sequencing (Shulaev et al., 2011) and re-sequencing projects (Ashelford et al., 2011). As with the Illumina platform (see below), increasing the lengths of the reads would be a major improvement in the platform that would enhance determination of transcript isoform structure and haplotype structure and de novo genome sequencing in complex heterozygous and polyploid genomes, all of which are challenging with short reads.
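The color-space idea can be made concrete. One common way to express the standard SoLiD di-base code is that each color equals the XOR of the two adjacent bases' 2-bit values (A=0, C=1, G=2, T=3), so a read can be decoded base by base once its known first base anchors the chain. A minimal sketch:

```python
BITS = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
BASES = 'ACGT'

def decode_colorspace(first_base, colors):
    """Decode a color-space read given its known first base.

    Each color (0-3) encodes a di-base transition; under the standard
    code the color is the XOR of the two bases' 2-bit values, so the
    next base is recovered as previous_base XOR color."""
    seq = [first_base]
    for c in colors:
        seq.append(BASES[BITS[seq[-1]] ^ c])
    return ''.join(seq)
```

For example, decode_colorspace('A', [3, 1, 0, 3]) yields 'ATGGC'. The chained decoding also shows the encoding's diagnostic power: a single miscalled color shifts every downstream base, whereas a true SNP changes exactly two adjacent colors, allowing errors and polymorphisms to be distinguished.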
The majority of the plant genomes sequenced to date using next-generation sequencing platforms (Table 3) have used a hybrid assembly approach, i.e. more than one platform was used to generate sequences for the WGS assembly. A key limitation in implementation of next-generation sequencing platforms is the assembly of short reads generated by the Illumina and SoLiD™ platforms. The challenges are due to the highly repetitive nature of plant genomes, the lack of robust de novo assembly programs for next-generation sequences, and the need for mate-pair or paired-end sequences from large insert fragments to facilitate scaffolding of the underlying contigs (Figure 1b,c). As a consequence, a number of next-generation plant genome sequencing efforts also included Sanger-derived end sequences from larger insert fosmid and BAC clones to drive the assembly process. One genome for which this approach has been used is cucumber (Huang et al., 2009a), in which 3.9× coverage by Sanger-derived reads and 69.3× coverage by Illumina reads were used to assemble 243 Mb of the 367 Mb genome. Similarly, a combination of Sanger, Illumina and Roche 454 reads was used for de novo assembly of 727 Mb of the approximately 850 Mb genome of potato (Solanum tuberosum) (Potato Genome Sequencing Consortium 2011). In contrast, with simpler genomes such as Thellungiella parvula (Dassanayake et al., 2011), sequencing was performed solely using next-generation platforms (Illumina and Roche 454), with the assembled ‘meta-contigs’ (total 1496) representing 137.1 Mb of the estimated 160 Mb genome.
Transcriptomics goes deeper
An early application of high-throughput sequencing methods to transcriptomes was in massively parallel signature sequencing, in which short (16–20 nucleotide) sequence tags of mRNAs were generated. This technology was short-lived and not widely used in plants due to cost limitations. However, widely used expression atlases for A. thaliana and rice are available (Meyers et al., 2004a; Nobuta et al., 2007). Using next-generation sequencing platforms, transcriptomes can now be sampled broadly and deeply. The low cost of next-generation sequencing, coupled with the ability for de novo assembly of the transcriptome and quantification of transcript abundances (see below), has provided insight into diverse plant species, including those with medicinal properties (Hao da et al., 2011; Tang et al., 2011) and insectivorous plants (Srivastava et al., 2011).
In humans, for whom 8 315 272 ESTs (primarily Sanger-derived) are available in the National Center for Biotechnology Information database of ESTs (release 090111; http://www.ncbi.nlm.nih.gov/dbEST/dbEST_summary.html) and for whom each gene has been hand-curated, whole-transcriptome sequencing (RNA-seq) of a selection of organs, cell culture lines and individuals showed that nearly every multi-exon gene in the human genome is alternatively spliced (Wang et al., 2008). The ability to deeply sequence a transcriptome using next-generation sequencing platforms has increased the number of alternative isoforms identified, even in well-curated genomes such as A. thaliana (Filichkin et al., 2010), and revealed the presence of fusion transcripts in which mRNAs from non-contiguous loci, even on different chromosomes, are spliced to make a novel transcript (Zhang et al., 2010). The throughput of the Illumina and SoLiD™ platforms compared to the Roche 454 platform favors use of these two platforms for deep or quantitative assessments of the transcriptome. However, the near 1000 bp read lengths on the FLX+ platform will allow resolution of transcript isoforms, which is limited with short-read platforms.
Computational programs for transcriptome analyses are maturing rapidly. Mapping of short reads to reference genomes is relatively robust, with programs available and under active improvement. One suite of well-documented and utilized software comprises the Bowtie (Langmead et al., 2009b), TopHat (Trapnell et al., 2009) and Cufflinks (Trapnell et al., 2010) programs. Collectively, these programs provide alignment to the genome in a quality-aware mode, alignment of reads across splice sites, and statistical assessment of transcript abundances. Assembly programs that can address the depth and complexity of the transcriptome are just emerging. As the RNA-Seq coverage of the transcriptome is dependent on gene expression levels, the optimal k-mer length for assembly of highly expressed transcripts is longer than for less abundant transcripts. Transcriptome assembly programs that utilize the multiple k-mer approach assemble the reads multiple times over a range of k-mer lengths, and merge the resulting assemblies. This method allows improved assembly of transcripts across the range of transcript abundances. Transcriptome assemblers such as Trans-ABySS (Robertson et al., 2010) and Oases (http://www.ebi.ac.uk/~zerbino/oases/) include pipelines that automate the assembly of a transcriptome using the multi k-mer method.
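The merge stage of a multi-k-mer pipeline can be caricatured as pooling the contigs produced at each k and discarding those wholly contained in a longer contig. Trans-ABySS and Oases do considerably more (graph-aware merging, isoform handling, inexact redundancy removal), so treat this as a sketch of the idea only:

```python
def merge_multi_k(assemblies):
    """Merge contig sets from assemblies run at different k-mer lengths.

    assemblies: list of contig lists, one per k value.
    Keeps the longest contigs and drops any contig that appears
    verbatim inside a longer kept contig (a deliberate simplification
    of real multi-k merging, which tolerates mismatches)."""
    contigs = sorted({c for contig_set in assemblies for c in contig_set},
                     key=lambda c: (-len(c), c))
    kept = []
    for c in contigs:
        if not any(c in longer for longer in kept):
            kept.append(c)
    return kept
```

Here a large-k assembly contributes long contigs for highly expressed transcripts, while a small-k assembly rescues fragments of rare transcripts; the merge keeps the long contig and the non-redundant rare-transcript fragment while dropping duplicates.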
More than just four bases
A simplistic and naïve view of a genome is that it comprises four nucleotides organized into genes, with regulatory sequences that control transcription. The reality is that eukaryotic DNA is embedded in chromatin, which is dynamic in nature and whose modification affects transcription. Next-generation sequencing methods with single-base resolution have begun to characterize the epigenome, i.e. modifications to DNA and its associated chromatin that do not alter the underlying DNA sequence. These include DNA methylation, histone modification, nucleosome density and chromatin state. Examination of the A. thaliana epigenome, which was surveyed at the single-base level using the Illumina platform, revealed that DNA methylation was not evenly dispersed throughout the genome or the genes, and that the location and abundance of small RNA targets were positively associated with cytosine methylation (Lister et al., 2008).
Eukaryotic genomes are also highly ordered, with regions of the chromosomes differentiated into euchromatin and heterochromatin. The ratio of heterochromatin to euchromatin varies between species, with a tendency for small genomes such as A. thaliana (Arabidopsis Genome Initiative 2000) and Brachypodium distachyon (International Brachypodium Initiative 2010) to have a lower ratio compared to related species such as Brassica rapa (Brassica rapa Genome Sequencing Project Consortium 2011) and Sorghum bicolor (Paterson et al., 2009), respectively. Even within the highly repetitive sequences of heterochromatin such as the centromeres, next-generation sequencing methods have revealed variation in the degeneracy of the repetitive sequences (Zhang et al., 2008), the methylation status of the repetitive sequences (Zhang et al., 2008), and chromatin binding proteins (Yan et al., 2008). It is anticipated that next-generation sequencing will continue to accelerate our understanding of the dynamic nature of eukaryotic genomes and how nucleic acids and proteins interact to generate transcripts.
Turning the power of genome sequencing to genetics
Quantity of sequence is not rate-limiting with next-generation sequencing platforms. As a consequence, sequencing is being adapted to a wide range of uses other than de novo sequencing of genomes and transcriptomes. One of these is genetics. Genome sequencing is starting to replace a wide range of other molecular markers for genetic map development. Indeed, genotyping by sequencing is emerging as a high-throughput, high-resolution and inexpensive method to genotype populations (Figure 2). Two approaches have been used in genotyping by sequencing for subsequent use in linkage map construction and loci identification. In large genomes, or for efforts in which low marker resolution is sufficient, a reduced-representation approach can be used that permits barcoding (or multiplexing) of multiple individuals from a population within a single sequencing lane, as individuals can be de-multiplexed bioinformatically post-sequencing. For maize, this has involved restriction enzyme digestion of genomic DNA, ligation of individual DNA samples to barcoded adaptors, pooling of 48 or 96 individual ligations (individuals), PCR amplification of the pooled ligation events, and sequencing of the pool on a single lane of an Illumina sequencer (Elshire et al., 2011). Clearly, bioinformatics is critical to this method, as the sequence reads need to be de-multiplexed, mapped to a reference genome, and polymorphisms annotated prior to construction of a linkage map. However, the ‘wet lab’ costs are quite minimal, and, once a bioinformatics pipeline has been established, it can be applied to a virtually unlimited set of individuals and populations. A second, high-resolution approach was demonstrated in a simple but elegant study that identified recombination breakpoints in a rice recombinant inbred line population (Huang et al., 2009b).
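The de-multiplexing step is conceptually simple: bin each read by its barcode prefix and trim the barcode. The barcodes and sample names below are invented for illustration, and real pipelines additionally handle sequencing errors within the barcode, quality filtering and the restriction-site remnant:

```python
def demultiplex(reads, barcodes):
    """Assign each read to a sample by exact barcode prefix match
    and trim the barcode off; unmatched reads go to 'undetermined'.

    barcodes: dict mapping barcode sequence -> sample name."""
    bins = {sample: [] for sample in barcodes.values()}
    bins['undetermined'] = []
    for read in reads:
        for bc, sample in barcodes.items():
            if read.startswith(bc):
                bins[sample].append(read[len(bc):])
                break
        else:
            bins['undetermined'].append(read)
    return bins
```

After binning, each sample's trimmed reads are mapped to the reference genome and polymorphisms are called per individual, exactly as the pooled-lane design requires.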
In that study, high-quality single nucleotide polymorphisms were first called between the two parents, and then the entire genome, rather than discrete loci as in a reduced-representation approach, was sequenced across an F11 rice recombinant inbred line population. Using a WGS approach with barcoded adaptors, each of the 150 rice recombinant inbred lines was sequenced to 0.02× coverage, and genotypes were called using a sliding-window approach to reduce the impact of sequencing errors. This strategy permitted robust genotyping of the population with limited sequence depth, and was successful in identifying quantitative trait loci involved in plant height.
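The sliding-window idea can be sketched as a majority vote over per-SNP parental assignments: any single miscalled SNP is outvoted by its neighbors, so breakpoints are placed where the window consensus flips. The window size and the 'H' tie label below are illustrative, not the published parameters:

```python
def window_genotypes(calls, size=15):
    """Call a genotype per sliding window by majority vote.

    calls: string of per-SNP parental assignments along a chromosome
           ('A' = parent 1 allele, 'B' = parent 2 allele).
    Voting over a window buffers isolated sequencing errors; an exact
    tie returns 'H' (ambiguous) in this simplified sketch."""
    out = []
    for i in range(len(calls) - size + 1):
        window = calls[i:i + size]
        a, b = window.count('A'), window.count('B')
        out.append('A' if a > b else 'B' if b > a else 'H')
    return out
```

On a toy chromosome, window_genotypes("AAAABBBB", 3) places the A-to-B transition cleanly, and a single erroneous 'B' inside an 'A' block is absorbed by the vote; this is why 0.02× coverage per line suffices for breakpoint mapping.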
Another application of genome sequencing in genetics is re-sequencing of mutants to identify the causal mutation (Figure 3). In A. thaliana, the causal mutation in an unknown gene of an ethyl methanesulfonate mutant that exhibits slow growth and reduced pigmentation (light-green leaves) was identified using next-generation sequencing (Schneeberger et al., 2009). Using a bulk-segregant approach, in which DNA from 500 F2 individuals with the mutant phenotype was pooled, 22× sequence coverage of the genome was used to quantify allele frequency and identify a small interval associated with the mutation that, when manually scanned, contained the causal mutation. Thus, previous genetic approaches such as genetic mapping and positional cloning can be replaced by more efficient, faster approaches that utilize the power of genome sequencing to identify genes and alleles of interest.
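The bulk-segregant logic reduces to an allele-frequency scan: in a pool of mutant-phenotype F2 plants, SNPs linked to the causal lesion approach a frequency of 1.0 while unlinked SNPs segregate near 0.5. A toy sketch with invented positions, read counts and cutoff:

```python
def mutant_allele_frequencies(snps):
    """Per-SNP frequency of the mutant-parent allele in a pooled bulk.

    snps: list of (position, mutant_allele_reads, total_reads) tuples
    (the numbers used in the example below are invented)."""
    return [(pos, mut / total) for pos, mut, total in snps]

def candidate_interval(freqs, cutoff=0.9):
    """Positions whose mutant-allele frequency exceeds an illustrative
    cutoff; in a real scan these cluster around the causal mutation."""
    return [pos for pos, f in freqs if f >= cutoff]
```

Positions 200 and 300 in the test below emerge as the linked interval, mimicking how the 22× bulk-segregant coverage narrowed the search to a small region that was then scanned manually for the causal change.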
Limitations of Next-generation Sequencing with Respect to Plant Genomes
Not only have improvements in the various platforms such as Roche 454, Illumina and SoLiD™ significantly increased throughput, they have also reduced the error rate and increased read length, making these methods economically feasible for a wide range of plant species. However, plant genomes present unique challenges for next-generation sequencing platforms; in particular, their repetitive nature complicates reliable assembly of the complete genome. This is due to the high copy number and proliferative nature of transposable elements within a large number of plant genomes. As a consequence, these sequences are challenging to assemble even with the best assembly algorithms. For example, in the maize genome, which was sequenced using a BAC-by-BAC approach, 85% of the genome comprises transposable elements (Schnable et al., 2009), making success of a WGS approach with short read-based platforms unlikely. The frequency of whole-genome, segmental and tandem duplication in plant genomes (e.g. Arabidopsis Genome Initiative 2000; Paterson et al., 2004; Tuskan et al., 2006) also creates assembly issues with paralogous families, as sequence identity among paralogs is high for recent duplications. Thus, while next-generation sequencing platforms are able to sequence and resolve a large proportion of the genome, all WGS projects to date lack full representation of the genome. Quality assessments of the portion of the genome absent from the assembly revealed, as expected, that the majority of the unassembled genome comprises repetitive sequences (e.g. Huang et al., 2009a; Paterson et al., 2009; Potato Genome Sequencing Consortium 2011).
Some scientists may be interested in the order of the genes and assembled sequence on a linkage group. As no WGS approach is able to assemble a super-scaffold that is equivalent in size to a chromosome or chromosome arm, a dedicated effort to link the genome sequence to the genetic map may be required. This can be accomplished using a suite of known genetic markers (for which the sequence is available) and/or by initiating a separate mapping effort in which markers are generated from the genome sequence and used in a segregating population to anchor, order and orient the scaffolds to the genetic map. This approach was used in the case of cucumber (Huang et al., 2009a), and was successful in anchoring 73% of the assembly to the genetic map. Another approach is to utilize the sequence of a syntenic relative to facilitate anchoring of the sequence to the genetic map, such as for potato (Potato Genome Sequencing Consortium 2011), for which the genome sequence of tomato (Solanum lycopersicum) was utilized.
Fingerprint contigs of BAC clones, in which BAC DNA is digested with restriction enzymes and the overlapping fragment patterns are computationally analyzed to generate contiguous sets of BAC clones (Soderlund et al., 1997), were instrumental in early genome projects such as those for rice, maize and A. thaliana (Arabidopsis Genome Initiative 2000; International Rice Genome Sequencing Project 2005; Schnable et al., 2009). Whole-genome profiling (van Oeveren et al., 2011), in which sequence-based tags of restriction enzyme sites of BAC clones rather than restriction enzyme fragment patterns are used to assemble BACs into physical contigs, provides a high-resolution physical map resource for assembly of large genomes. Use of physical maps, such as optical maps in which DNA fibers are digested with a restriction enzyme, the fragments are sized, and a genome-scale restriction map is constructed, has proven useful in confirming and correcting errors in a genome assembly (Zhou et al., 2009; Young et al., 2011).
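The computational half of an optical-map comparison is an in-silico digest: predict the ordered fragment sizes that a restriction enzyme would produce from the assembled sequence, then compare them (within sizing error) to the fragments observed on the optical map. A minimal sketch, simplified to cut at the start of each recognition site rather than at the enzyme's true cut position:

```python
def in_silico_digest(seq, site):
    """Ordered fragment sizes from cutting seq at every occurrence of
    a restriction recognition site (cut position simplified to the
    start of the site; real enzymes cut within or near it)."""
    cuts, start = [], 0
    while True:
        i = seq.find(site, start)
        if i == -1:
            break
        cuts.append(i)
        start = i + 1
    sizes, prev = [], 0
    for c in cuts:
        sizes.append(c - prev)
        prev = c
    sizes.append(len(seq) - prev)
    return [s for s in sizes if s > 0]
```

A scaffold whose predicted fragment ladder disagrees with the observed optical map at one locus flags a likely misassembly there, which is how optical maps confirm and correct assemblies.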
Another major complication for plant genomes is that not all plant species are homozygous inbred diploids. While it is feasible to sequence heterozygous organisms such as humans (Venter et al., 2001) using WGS, a higher degree of heterozygosity, such as that found in some plant species, including out-crossed and vegetatively propagated species such as potato, can impede WGS assembly. To alleviate this problem for the genome of grapevine (Vitis vinifera), a highly heterozygous diploid species, a strain that had been highly inbred was used, thereby reducing the extent of heterozygosity (Jaillon et al., 2007). Use of longer reads can improve the ability to assemble separate haplotypes within a genome. This may be achieved in the future using long-read platforms such as Sanger, Roche 454 and potentially Pacific Biosciences (see below). An alternative for short-read platforms such as SoLiD™ and Illumina is to generate libraries of small fragments, such as 250 bp, and then perform paired-end sequencing such that the reads overlap, e.g. by 150 bp, thereby generating a consensus sequence that spans 250 bp and a single haplotype.
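The overlapping paired-end trick can be sketched as a suffix-prefix merge of the two reads into one longer consensus. This sketch assumes an exact-match overlap and that the second read has already been reverse-complemented into the first read's orientation; real merging tools additionally tolerate mismatches and weigh base qualities:

```python
def merge_overlapping_pair(r1, r2, min_overlap=10):
    """Merge a read pair sequenced from a short fragment so that their
    3' ends overlap, yielding one consensus spanning the fragment.

    Assumes r2 is already reverse-complemented into r1's orientation.
    Returns None when no exact overlap of at least min_overlap bases
    (an illustrative threshold) is found."""
    for n in range(min(len(r1), len(r2)), min_overlap - 1, -1):
        if r1[-n:] == r2[:n]:
            return r1 + r2[n:]
    return None
```

Because the merged sequence derives from a single DNA fragment, it reports a single haplotype over the full fragment length, which is the property that aids assembly of heterozygous genomes.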
Ploidy is a substantial challenge in de novo sequencing and assembly of plant genomes, and results are dependent on whether the species is an autopolyploid or an allopolyploid, as well as the age of the polyploidization event, as these affect the degree of difference between the homologous and homeologous genomes. To date, all attempts to sequence polyploids have relied on either a reduction in ploidy or physical separation of the chromosomes. For example, the overwhelming majority of potato lines grown are tetraploid, but even the initial attempts to sequence a heterozygous diploid potato genotype (RH89-039-16) were challenging due to the high degree of heterozygosity between the homologous chromosomes (Potato Genome Sequencing Consortium 2011). As a consequence, the Potato Genome Sequencing Consortium utilized a unique potato genotype, a doubled monoploid (DM1-3 516 R44) that was homozygous for a single set of the 12 chromosomes, to generate the reference potato genome. Mapping of the RH89-039-16 haplotypes to the DM1-3 516 R44 reference genome revealed that the two haplotypes within RH89-039-16 were more diverged from each other than from the single haplotype in DM1-3 516 R44. Cultivated strawberry (Fragaria × ananassa) is an allo-octoploid derived from up to four different progenitor species, and therefore the diploid species Fragaria vesca (woodland strawberry) was sequenced in order to bypass the difficulties of sequencing multiple genomes (Shulaev et al., 2011). For hexaploid wheat, which has a 17 Gb genome, efforts by the Wheat Genome Initiative (http://www.wheatgenome.org/) have focused on flow cytometry separation of individual or groups of homeologous chromosomes (Paux et al., 2008) and construction of physical BAC maps to initiate sequencing. A parallel approach, similar to that for strawberry, is to work with the diploid progenitors of hexaploid wheat (Akhunov et al., 2005), which simplifies the genome and provides insight into the evolution of a hexaploid genome.
The next-generation sequencing platforms described above are also referred to as ‘second-generation platforms’, as they constitute the second phase of technologies after Sanger sequencing. In the last year, newer next-generation sequencing platforms have emerged that are referred to as ‘third-generation platforms’, as they represent a further paradigm shift in sequencing technology compared with the Sanger platform and the Illumina, SOLiD™ and Roche 454 platforms. Two such platforms are now commercially available: the Pacific Biosciences PacBio RS (Pacific Biosciences, http://www.pacificbiosciences.com) and the Ion Torrent Personal Genome Machine (Life Technologies, http://www.iontorrent.com).
The Pacific Biosciences PacBio platform measures the enzymatic activity of a single DNA polymerase enzyme in real-time using zero-mode waveguides (Eid et al., 2009). High-throughput sequencing is possible due to the massive and parallel nature of the single-molecule real-time (SMRT) sequencing, which permits tens of thousands of zero-mode waveguides to be monitored in parallel during a 30 min run. The key advantage of the PacBio platform is the read length. A recent microbial genome sequencing study (Rasko et al., 2011) reported a mean read length of approximately 2000 bp, with reads in the 95th percentile longer than 5000 bp. The ability to obtain sequence at defined intervals along a long DNA template is a powerful approach to resolve assembly challenges. The PacBio RS platform has only recently been launched, and initial reports on single-pass long-read accuracy ranged from 84.2 to 97.8% (Rasko et al., 2011). Thus, while applications for the PacBio RS are evident in microbial genome re-sequencing projects and in scaffolding contigs generated using higher-accuracy short-read platforms, its use in de novo sequencing and assembly of large genomes has not yet been demonstrated.
The Ion Torrent platform exploits the generation of hydrogen ions during the polymerase reaction to detect DNA synthesis on an ion chip. Using native dNTPs, the synthesis reaction alters the pH, which is detected by a sensor (Rothberg et al., 2011). The Ion Torrent platform has been used to re-sequence three microbial genomes and a human genome (Rothberg et al., 2011). For the microbial genomes, high coverage of the genomes was obtained (96.80–99.99%), with high read accuracy (99.569%) in the first 50 bases (Rothberg et al., 2011). Reads of 100 bp are routine, with maximum read lengths of 200 bp. Currently, only 20–30% of the sensors generate reads that can be mapped, but there are up to 11 million sensors on the ion chip. One appealing feature of the Ion Torrent platform is its cost; the Ion Torrent Personal Genome Machine costs <US$80 000, compared to >US$500 000 for the Roche 454, Illumina, SOLiD™ and Pacific Biosciences platforms. Run times for the Ion Torrent platform (2 h) are comparable to that for the Pacific Biosciences platform (30 min), and much less than the run times for the Roche 454 (up to 10 h), SOLiD™ (up to 7 days) and Illumina (up to 11 days) platforms. Thus, it is financially feasible for a single investigator to acquire an Ion Torrent Personal Genome Machine, further expanding access to genome sequencing infrastructure for biologists.
Computational Infrastructure Challenges with Next-generation Sequencing
The success of next-generation sequencing in biology is unquestioned. However, access to these technologies is not without challenges at multiple levels. The discussion above focused primarily on technical challenges encountered in generating the sequence data, all of which are being overcome at a relatively fast pace. However, the ability to generate large sequence datasets from an unlimited number of plant species can cause data overload at the laboratory, institutional and community scales. Indeed, the infrastructure costs for data storage, processing and handling are typically more than the costs of generating the sequence. Coupled with the promise of even more throughput in the coming years via the third-generation sequencing platforms, data storage and handling issues will continue to grow.
Major changes in how computation is performed on large genome datasets have occurred over the last decade. In the Sanger sequencing era, genome assembly was usually performed at a genome center using a computational infrastructure called a compute cluster that required extensive investment in CPUs and data storage. However, in the last decade, there have been numerous advances in computer technology and bioinformatic tools that now put access to compute resources within the fiscal and practical reach of a single investigator. Below, we aim to provide a brief explanation of new computing approaches that have the potential to ‘democratize’ genome data analysis in the same way that the next-generation sequencing platforms have enabled a wide range of biologists to design and generate their own sequence datasets.
Compute and grid clusters
In a compute cluster, the computers are networked to each other using fast networking technologies such as gigabit Ethernet or Infiniband. Storage, a critical issue with genome sequence datasets and the associated bioinformatic data analyses, is local to the compute cluster. Redundant disk arrays are used for optimal data access performance and to maintain data integrity. Jobs are submitted and managed on the compute cluster using software such as Sun Grid Engine (now Oracle Grid Engine, http://wikis.sun.com/display/GridEngine/Home) or Condor (http://www.cs.wisc.edu/condor/). High-speed networking enables parallel computing solutions that distribute the jobs across the nodes of the cluster. Several genomics applications now support use of parallel environments, including genome assemblers such as ABySS (Simpson et al., 2009) and Velvet (Zerbino and Birney, 2008), genome annotation pipelines such as MAKER (Cantarel et al., 2008), and sequence aligners such as mpiBLAST (http://www.mpiblast.org/). The power, cooling and system administration requirements for running a compute cluster can place it out of reach for a laboratory or small department. However, access to a compute cluster that is maintained by a dedicated staff is available at most institutions. For groups that wish to run their own compute cluster, cluster management software is available, such as the Rocks cluster distribution (http://www.rocksclusters.org/), which automates many of the tasks related to systems administration of the compute cluster.
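The job-distribution pattern these schedulers support — many independent tasks scattered across workers, with results gathered at the end — can be sketched on a single machine using only the Python standard library. This is a toy stand-in, not a cluster submission script: `run_job` merely does placeholder arithmetic where a real pipeline would invoke, for example, an aligner on one batch of reads via the scheduler.

```python
from concurrent.futures import ThreadPoolExecutor

def run_job(chunk_id):
    # Stand-in for one independent batch job (e.g. aligning one file
    # of reads); here it just performs placeholder arithmetic.
    return chunk_id, sum(range(100))

# Scatter eight independent jobs across four workers, then gather results.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(pool.map(run_job, range(8)))

print(sorted(results))  # all eight job ids completed
```

On a real cluster, each call to `run_job` would instead be a job submitted to Grid Engine or Condor, and the gather step would collect output files from shared storage.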
Since the early 2000s, grid computing infrastructures, such as National LambdaRail (http://www.nlr.net/), have become available. In grid computing, the computational resources are geographically distributed and linked together using the Internet or next-generation research networks. Grid computing resources available to researchers include the TeraGrid (Beckman, 2005) and the Open Science Grid (http://www.opensciencegrid.org/). Some disadvantages of grid computing for genome informatics include the need to upload large datasets (such as databases and sequence read files) to the grid, the need to install genomics programs at each grid site, and connectivity issues. Data access and storage on the grid can also be complex due to the heterogeneous data storage resources and file systems available at each grid site, slow data transfer speeds between grid sites, limited storage space quotas, and limited or non-existent data back-up options.
Cloud computing has recently come to the forefront of high-performance computing infrastructure for bioinformatics (Stein, 2010). Using a technology called virtualization, the cloud computing infrastructure allows the user to create a virtual compute cluster through ‘instances’ of virtual machines on host servers. A virtual machine is a file that contains an operating system, programs and possibly data that can be ‘booted’ on a host and run as an independent computer. Similar functionality for desktop computer users is provided by programs such as VMWare (http://www.vmware.com) and Virtualbox (https://www.virtualbox.org/). A popular cloud infrastructure platform is the Elastic Compute Cloud (EC2) provided by Amazon (https://aws.amazon.com/ec2/). The EC2 offers several tiers of virtual machine ‘instances’ at several price points, from micro instances to compute cluster instances. Storage is provided by the Simple Storage Service (S3) or by attaching an Elastic Block Store (EBS) volume to a virtual machine instance. The local storage of an instance is not persistent, and is suited only for temporary results; in contrast, EBS volumes and S3 storage persist beyond the life of an instance, but attract a fee. Although importing data is currently free, users are charged for exporting data from the EC2.
Cloud infrastructure software that emulates the Amazon EC2 platform, such as Eucalyptus (http://open.eucalyptus.com/), has been developed, thereby allowing creation of private clouds. This has the benefits of running custom virtual machine images, dynamically scaling the cluster size, and retaining control over security and data storage. The compatibility of the cloud infrastructure and virtual machine images also allows for hybrid clouds, i.e. private and commercial/community-based, in the event that additional computing power is needed. Means to make cloud computing infrastructure available to researchers include the FutureGrid (https://portal.futuregrid.org/), a next-generation extension to the TeraGrid.
The creation of virtual machine images requires programming or system administration knowledge. To alleviate this bottleneck, several virtual machine images for genome informatics are available. CloVR contains an entire bacterial assembly/annotation pipeline and a metagenomic sequencing pipeline (Angiuoli et al., 2011), whereas CloudBioLinux (http://cloudbiolinux.org/) contains a number of genome sequencing, assembly and annotation programs. Building on these prebuilt machine images, integrated systems such as Galaxy CloudMan (Afgan et al., 2010) can create an entire compute cluster in the cloud, allowing users with no computational experience to run a bioinformatics pipeline. The iPlant Collaborative provides computational infrastructure to support plant biology (http://www.iplantcollaborative.org/), including Atmosphere, a cloud infrastructure developed to support computational efforts in plant biology.
The cloud, when combined with frameworks that distribute computational tasks across many nodes, allows new approaches to processing the large amounts of data produced by next-generation sequencing. Early examples of this approach include CloudBurst, an aligner used to map next-generation reads (Schatz, 2009), Crossbow, a pipeline used to map next-generation sequencing reads onto a reference genome and call single nucleotide polymorphisms (Langmead et al., 2009a), Myrna, which is a differential gene expression pipeline (Langmead et al., 2010), and Contrail, a distributed genome assembly program for next-generation reads (http://contrail-bio.sf.net).
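The MapReduce pattern that tools such as Crossbow and Contrail build on — a map phase emitting key/value pairs, a shuffle grouping pairs by key, and a reduce phase aggregating each group — can be illustrated with a minimal single-machine k-mer counting sketch. The reads and k-mer size here are toy values; on a Hadoop cluster the map and reduce phases would run in parallel across many cloud nodes.

```python
from collections import defaultdict

def map_phase(read, k=4):
    """Map: emit a (k-mer, 1) pair for each k-mer in one sequencing read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k], 1

def reduce_phase(pairs):
    """Reduce: sum the counts for each k-mer key (on a real cluster, the
    shuffle step routes all pairs sharing a key to the same node)."""
    counts = defaultdict(int)
    for kmer, n in pairs:
        counts[kmer] += n
    return dict(counts)

reads = ["ACGTACGT", "CGTACGTA"]  # toy reads
pairs = [pair for read in reads for pair in map_phase(read)]
kmer_counts = reduce_phase(pairs)
print(kmer_counts["ACGT"])  # -> 3
```

The appeal of the pattern for genomics is that the map phase is embarrassingly parallel over reads, so throughput scales with the number of nodes rented from the cloud.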
Other Components of a Genomics Project: Stuff you Really Need and Want but Don’t Always Get
Poor or inaccurate documentation of the meta-data associated with genome and transcriptome sequences occurs frequently. It has been the standard for over 18 years (http://www.insdc.org/index.html) for sequence data to be archived at one of the three global repositories – the National Center for Biotechnology Information (NCBI), the European Molecular Biology Laboratory (EMBL) and the DNA Databank of Japan (DDBJ) – that conform to the International Nucleotide Sequence Database Collaboration Policy. However, the recent announcement by the National Center for Biotechnology Information that it intended to close the Sequence Read Archive (Genome Biology Editorial Team 2011), the database for next-generation sequence reads, reflects fiscal and logistical challenges in data handling and storage at the international scientific community scale. While the Sequence Read Archive closure was subsequently retracted, the announcement suggests that there will be limits (at some point) in the paradigm of centralized and permanent archiving of large-scale sequence datasets.
Another major issue with next-generation sequencing is the lack of documentation of quality throughout the assembly. Certainly there are errors at the base and assembly level in these unfinished whole-genome shotgun (WGS) projects, but the lack of established reporting requirements and quality metrics means that errors are waiting to be discovered by the end user. Thus, the phrase ‘caveat utilitor’ (‘user beware’) should be fully subscribed to by biologists using genome-scale datasets. Furthermore, the community should discuss and establish minimum standards for releasing and publishing a plant genome, including (i) the coverage of the genome assembly versus the empirically determined genome size, (ii) documentation of which sequences are missing from the genome assembly (e.g. repetitive sequences), (iii) provision of a set of defined assembly metrics, such as N50 contig/scaffold/super-scaffold size, N90 contig/scaffold/super-scaffold size and sequence coverage, (iv) documentation of gene and marker coverage in the assembly, and (v) documentation of assembly quality with respect to sequencing errors, mis-assembly and functional utility (e.g. gene length coverage). For example, a genome assembly would have high utility with an N50 contig size >30 kbp, N50 scaffold size >250 kbp, N50 super-scaffold size >1 Mbp, greater than 90% of the genes represented (as measured by EST alignments), and >90% coverage of full-length cDNAs. Although the percentage of the genome present in the assembly will vary greatly due to wide ranges in repetitive sequence content among plant species, this should be documented by comparing the assembly with empirically derived genome size estimates from flow cytometry, physical maps and/or optical maps, coupled with bioinformatic analyses. The consensus sequence accuracy is also critical to downstream functional analyses, and sequence coverage and mis-assembly assessments should reveal minimal error rates.
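Metrics such as N50 and N90 are straightforward to compute from a list of contig (or scaffold) lengths, which makes their routine omission from genome releases all the more unfortunate. A minimal sketch, using hypothetical contig lengths: the N-metric is the length L such that sequences of length ≥ L together span at least the given fraction of the total assembly.

```python
def n_metric(lengths, fraction=0.5):
    """Return the length L such that contigs of length >= L together
    span at least `fraction` of the total assembly (N50 by default)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= fraction * total:
            return length
    return 0

contig_lengths = [80, 70, 50, 40, 30, 20, 10]  # hypothetical lengths in kbp
print(n_metric(contig_lengths))        # N50 -> 70
print(n_metric(contig_lengths, 0.9))   # N90 -> 30
```

Note that N50 alone can be inflated by aggressive scaffolding, which is why the standards proposed above pair it with coverage and mis-assembly documentation.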
Although the community has developed and adopted archiving of sequence data, the ability to sequence populations and associate these with phenotypes results in the creation of large-scale phenotype datasets and another data access and storage issue. Currently, there are limited public archives for deposition of phenotype data, which, coupled with the lack of full disclosure of underlying phenotype data in publications, undermines advancement of the overall scientific community. Part of this is due to a lack of rigor in the review and publication process, whereby reviewers and editors do not demand provision of phenotype data in a format that can be used in analyses by the whole community. However, this could be addressed through centralized, public repositories for plant phenomic data, in which deposition of phenotype data occurs concomitantly with publication, analogous to the deposition of nucleic acid sequences into the National Center for Biotechnology Information database upon publication. Another concern that has arisen in the last few years of the genomics revolution is that, while sequence data are being made available, the underlying germplasm is either not available or is subject to significant movement barriers due to quarantine and intellectual property issues. Thus, while the reported sequence is interesting, the expectation that the act of publishing obliges the authors to provide the biological material to the broader scientific community is decaying.
Going Forward: What’s Next?
Few, if any, of us could have envisaged the pace at which sequencing technologies, computing infrastructure and computational biology would progress since publication of the H. influenzae genome in 1995. Now, we have access to the ‘instructions’ for how plants grow, reproduce and adapt through the DNA sequence. Although work remains on how to read the instructions embedded in the DNA, major insights into basic biological processes have been made. As with molecular biology in the early 1980s, which was restricted to a small number of laboratories and is now a standard, routine and robust technology in almost all plant biology labs, genomics has progressed from a technology used at a few elite centers to being available at most universities and institutions. A consequence of the dissemination of genome sequencing infrastructure to a wider group of scientists is that there is increased breadth and depth in the biological questions being addressed by genomics. With infrastructure costs decreasing substantially, such that individual labs can equip themselves with their own sequencing instrument, there will be further innovation in genomic questions, applications and approaches. However, problems with data handling and data volume are already evident. Certainly, new computing and bioinformatic approaches will advance; however, the sheer amount of data storage and handling required for the ever-growing output of sequencing platforms will continue to be a major challenge for all genomicists. Given the continuing advances in information technology, these technical challenges will be overcome, and we can focus on biology, not technology, in our quest for a complete understanding of the biology of plants.
We thank Kevin Childs, Elsa Gongora, Candy Hansey and Lina Quesada for helpful comments and discussion. We also thank Lina Quesada for assistance with the figures.