1. Introduction
DNA sequencing is a fundamental approach to deciphering a broad range of biological phenomena at the molecular level. Although the development of DNA sequencing technologies has a rich and diverse history (Hutchison, 2007; Shendure and Ji, 2008), the dideoxy sequencing method established by Sanger, Nicklen and Coulson (1977) has dominated the industry and research community for almost three decades and currently remains the gold standard for decoding DNA sequences.
Many biochemical and technical improvements have endowed modern Sanger sequencers with very low error rates (~0.001%), relatively long read-length (up to ~1000 bp), and high-throughput and robust performance. However, the cost of Sanger-based approaches for large sequencing projects remains high (on the order of $0.50 per kilobase). This limitation incentivized the National Human Genome Research Institute (NHGRI) of the National Institutes of Health (NIH) to initiate a funding program in 2004 to develop novel technologies that will enable extremely low-cost, high quality DNA sequencing. With a stated goal of reducing the cost of sequencing mammalian-sized genomes by four orders of magnitude to approximately $1000 per genome in 10 years (http://grants.nih.gov/grants/guide/rfa-files/RFA-HG-09-011.html), the initiative has propelled the development and commercialization of novel next-generation sequencing (NGS) technologies.
In 2005, 454 Life Sciences (now part of Roche Applied Science, http://454.com/about-454/index.asp) developed and commercialized its Genome Sequencer (GS) system for ultra-high throughput DNA sequencing (Margulies et al., 2005). As the first commercialized NGS technology, the Roche/454 sequencing system provided a compelling case study for the establishment of a novel, revolutionary technology that enables scientists to carry out massively parallel DNA sequencing reactions at a relatively small cost and at a much faster speed than conventional Sanger technologies. For example, the sequencing of the first human genome in the Human Genome Project (Collins et al., 2004; Lander et al., 2001) with automated Sanger technology took some 13 years at a cost of about $2.7 billion. In contrast, sequencing a human genome with the Roche/454 sequencer took only five months and approximately $1.5 million (Wheeler et al., 2008). This great stride was enabled by ultra-high throughput, simplified in vitro sample preparation, and miniaturization of sequencing chemical reactions (Rothberg and Leamon, 2008).
Illumina, Life Technologies, and Helicos BioSciences quickly followed Roche's lead and launched their own NGS platforms. Concurrently, new DNA sequencing technologies have been advancing at other companies and research institutions, such as Pacific Biosciences (http://www.pacificbiosciences.com), VisiGen Biotechnologies (http://visigenbio.com), Sequenom (http://www.sequenom.com), Complete Genomics (http://www.completegenomics.com), and the Center for Computational Genetics at Harvard Medical School (http://arep.med.harvard.edu/gmc).
Although they differ in sequencing chemistry and technical details, all NGS platforms share a similar technical strategy: miniaturization of individual sequencing chemical reactions to boost sequencing throughput (Metzker, 2010). The miniaturization of sequencing reactions, coupled with other technical breakthroughs, such as overcoming the bottlenecks of library preparation and template preparation (Rothberg and Leamon, 2008), enables millions of simultaneous individual sequencing reactions. Only a single fragment of DNA is sequenced in each miniaturized chemical reaction, but millions of them are spatially arranged so that individual reactions are isolated from one another and distinctly detected by laser scanning or other approaches. The results are prodigious volumes of short-read sequence data and unprecedented detail and resolution of sequence complexity, with consequential challenges in storing, managing, analyzing, and interpreting such a wealth of data.
In this chapter, we first describe the fundamental principles of four commercially available NGS platforms, that is, Roche/454 GS FLX, Illumina Genome Analyzer IIx, Life Technologies SOLiD, and Helicos HeliScope. We then discuss the general difficulties to be overcome in the analyses of NGS data. Next, we outline some main applications of NGS technologies and, finally, we compare the analysis of a toxicogenomics study using NGS data with that using microarray data.
2. Next-generation Sequencing Technologies
2.1. Roche/454 Pyrosequencing
The Roche/454 GS FLX system (Figure 1(a)) relies on pyrophosphate detection (Nyren, Pettersson and Uhlen, 1993) and emulsion PCR (Tawfik and Griffiths, 1998). A library of DNA templates is prepared by a highly efficient in vitro DNA amplification method known as emulsion PCR (Figure 1(b)), where sheared DNA fragments are ligated to specific oligonucleotide adapters, resulting in each DNA fragment binding to a fragment-carrying bead. The beads are then captured in separate emulsion droplets that function as amplification reactors to produce some 10 million clonal copies of the DNA template that are needed for sufficient light signal intensities (Fuller et al., 2009). On completion of the emulsion PCR amplification, the emulsion is disrupted, the beads containing clonally amplified template DNAs are enriched, and the beads are then separated again by limiting dilution and deposited into individual picotiter-plate wells. The picotiter-plates serve as sequencing reactors in which individual enzymatic sequencing reactions occur without interference from adjacent wells. Visible light emitted from the subsequent pyrosequencing reactions (Ronaghi et al., 1996) is detected by an imaging charge-coupled device (CCD) that is bonded to a fiber-optic bundle. During each cycle of a pyrosequencing reaction, a single species of unlabeled nucleotide is supplied to all beads on the chip, so that the complementary strand of DNA is sequentially synthesized. With the incorporation of each base in the growing chain, an inorganic pyrophosphate group is released and converted to ATP by sulfurylase. The ATP molecule is then used by luciferase to convert luciferin to oxyluciferin, producing a light pulse (Figure 1(c)). Detecting the light emissions, together with knowing the identity of the nucleotide supplied in each step, allows the incorporated base to be determined. Through a series of such pyrosequencing reaction cycles, the sequence of the DNA template carried by individual beads is determined.
Because there is no terminating moiety preventing multiple consecutive incorporations in a given pyrosequencing reaction cycle, the length of homopolymers in sequence reads must be inferred from light signal intensity, with a higher intensity corresponding to more repeats. The error rate of calling consecutive repeats increases when the length of the homopolymers is greater than 3–4 repeating bases. Thus, the main error type for the Roche/454 system is insertions and deletions (indels), rather than substitutions (Shendure and Ji, 2008).
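The relationship between light signal intensity and homopolymer length can be illustrated with a minimal sketch. It assumes an idealized, already normalized flowgram in which a signal of about 1.0 corresponds to one incorporated base; the flow order TACG, the function name, and the example signal values are illustrative assumptions and do not represent the Roche/454 base-calling software.

```python
def call_bases(flow_signals, flow_order="TACG"):
    """Convert normalized flow signals into a base sequence.

    Each signal is rounded to the nearest non-negative integer, which is
    taken as the homopolymer length for the nucleotide supplied in that flow.
    The flow order and normalization are assumptions made for illustration.
    """
    sequence = []
    for i, signal in enumerate(flow_signals):
        base = flow_order[i % len(flow_order)]   # nucleotide supplied in this flow
        run_length = max(0, int(signal + 0.5))   # round half up, never negative
        sequence.append(base * run_length)
    return "".join(sequence)


# Hypothetical signals: the value 3.45 lies almost halfway between 3 and 4.
signals = [1.02, 0.05, 2.10, 0.96, 0.03, 3.45, 0.00, 1.10]
print(call_bases(signals))
```

In this toy input, the signal of 3.45 is called as three adenines, but a small amount of noise would turn it into four. This is the sense in which longer homopolymers are prone to insertion and deletion errors rather than substitutions.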
Compared to other NGS platforms, the strength of the Roche/454 system is its longer sequence reads. The Roche/454 GS FLX, with its newest chemistry, termed GS FLX Titanium series reagents, can generate more than one million individual sequence reads with read length over 400 base pairs over a 10 h time span (http://www.454.com/products-solutions/system-benefits.asp). Although its per-base cost is much higher than that of other NGS platforms (e.g., Life Technologies/SOLiD and Illumina/Genome Analyzer IIx), the Roche/454 system is best suited for certain applications such as de novo sequencing of new genomes, for which long read length is critical for de novo genome assembly.
2.2. Illumina Sequencing Technology
The Illumina GA system is the first short read sequencing platform and currently dominates the NGS market (Metzker, 2010). The first GA system was launched by Solexa in 2006, which was subsequently acquired by Illumina in early 2007 (http://www.illumina.com/technology/sequencing_technology.ilmn).
The Illumina GA system (Figure 2) uses an array technique to achieve cloning-free DNA amplification. Reversible terminator chemistry is the defining characteristic that provides massively parallel sequencing of millions of DNA fragments at low cost. DNA samples are randomly sheared into fragments that are then end-repaired to generate 5′-phosphorylated blunt ends. The Klenow fragment of DNA polymerase is then used to attach a single A base to the 3′ end of the DNA fragments, which prepares the DNA fragments for ligation to oligonucleotide adapters (Figure 2(a)). After ligation to adapters at both ends, the DNA fragments are denatured, and single-stranded DNA fragments are attached to reaction chambers that are optically transparent solid surfaces called flow cells. To obtain sufficient light signal intensity for reliable detection, attached DNA fragments are extended and amplified by bridge PCR amplification (Figure 2(b)). The bridge PCR amplification can create an ultra-high density sequencing flow cell containing hundreds of millions of clusters, each of which contains some 1000 copies of the same DNA template. These templates are finally sequenced through the sequencing-by-synthesis technique that applies reversible terminators with removable fluorescent dyes.
For sequencing and DNA synthesis, reaction mixtures comprising primers, DNA polymerase, and four reversible terminator nucleotides, each labeled with a different fluorescent dye, are supplied to the flow cell. In each sequencing cycle, a specific terminator is incorporated according to sequence complementarity in each template DNA strand in a clonal cluster. After incorporation, the identity (base calling) and the position of the specifically incorporated terminator on the flow cell are determined according to the fluorescent dye emission, and the signal is recorded using a CCD camera. In the following cycle, the reversible terminator is unblocked and the fluorescent dye label is removed from the base, so that a new nucleotide can be incorporated and a new base can be detected using the same strategy. This repetitive sequencing-by-synthesis process takes about 2.5 days to generate 50 million reads per flow cell, with a read-length of some 36 bases. The overall sequencing output of the Illumina GA system is more than one billion base pairs (1 Gb) per analytical run (Bentley et al., 2008).
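The throughput figures quoted above can be checked with simple arithmetic. The following sketch only multiplies the numbers given in the text (50 million reads, 36 bases per read, 2.5 days per run); the function name is a hypothetical helper introduced for illustration.

```python
def run_output(reads_per_flow_cell, read_length_bases, run_days):
    """Return (gigabases per run, gigabases per day) for one flow cell."""
    total_bases = reads_per_flow_cell * read_length_bases
    total_gb = total_bases / 1e9          # 1 Gb = one billion bases
    return total_gb, total_gb / run_days


total_gb, gb_per_day = run_output(50_000_000, 36, 2.5)
print(f"{total_gb:.2f} Gb per run, {gb_per_day:.2f} Gb per day")
```

The product is 1.8 Gb per run, which agrees with the stated output of more than one billion base pairs per analytical run.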
The upgraded GA II is capable of sequencing both single-read and paired-end (sequencing both ends of the template molecules) libraries, generating some 1.5 Gb of output per day, corresponding to 80–100 million reads per flow cell. Moreover, Illumina's newest platform, the HiSeq 2000, extends throughput up to 200 Gb per run and two billion 100+ bp paired-end reads.
In a given cycle of sequencing, any modified nucleotide could be incorporated with a decreased or an increased efficiency, resulting in under- or over-incorporation, a heterogeneous mixture of synthesis lengths, and concomitant degradation of signal purity and precision. In addition, chemical cleavage of terminating moieties and fluorescent dye labels may be incomplete. Therefore, Illumina's sequencing strategy generates much shorter reads, and its most common error type is substitutions (Shendure and Ji, 2008). The base-call error rate increases with read length due to dephasing noise (Dohm et al., 2008). In addition, an underrepresentation of AT-rich and GC-rich regions (Dohm et al., 2008; Hillier et al., 2008; Metzker, 2010) has been observed.
2.3. Life Technologies/SOLiD
The SOLiD (Supported Oligonucleotide Ligation and Detection) system is a short-read sequencing platform relying on ligation chemistry. This platform was developed by Life Technologies based on the strategies described by Shendure et al. (2005) and Mckernan et al. (2008).
Library construction for the SOLiD system is similar to Roche/454 technology, in which DNA is stochastically sheared into fragments that are subsequently ligated to oligonucleotide adapters (Figure 3(a)), attached to beads, and clonally amplified by emulsion PCR. After denaturing templates, the template-carrying beads are enriched to separate desired beads from undesired ones. The templates on the selected beads are then 3′ modified for the purpose of covalent attachment to the slide. Then, 3′ modified beads are deposited onto a derivatized-glass flow cell surface to generate a dense, disordered array (Figure 3(b)). Sequencing reactions are started by hybridizing a primer oligonucleotide complementary to the adapter at the adapter-template junction (Figure 3(c)). Unlike the Roche/454 sequencing approach, the sequencing-by-synthesis in the SOLiD system is driven by a DNA ligase rather than a DNA polymerase. Briefly, in the ligation chemistry, a mixture of partially degenerate oligonucleotide octamers is competitively hybridized to the DNA fragments as probes, and a universal primer is oriented to provide a 5′ phosphate group for ligation. The specificity of the probe ligated to a primer is determined by the 4th and 5th bases of the probe that are complementary to the template, and the identities (base callings) of the 4th and 5th bases of probes are characterized by one of four fluorescent labels at the end of the octamer, so that the interrogation of the 4th base and 5th base is achieved. After ligation, the ligated octamer oligonucleotides are cleaved off after the fifth base and the fluorescent label is removed, so that the next hybridization and ligation cycle can proceed. In this way, bases 4 and 5 in the template are determined in the first cycle, bases 9 and 10 in the second cycle, and so on. The ligation-sequencing can also be carried out in the same way with another primer offset by one base in the adapter, so that bases 3 and 4, 8 and 9, and so on, in the template can be determined (Figure 3(d)). After five such rounds of ligation cycles, each base is interrogated twice with two different fluorescent labels, resulting in a significantly reduced base-call error rate (http://www3.appliedbiosystems.com/AB_Home/applicationstechnologies/SOLiDSystemSequencing/OverviewofSOLiDSequencingChemistry).
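The interrogation pattern described above can be sketched as follows. The position numbering follows the text (bases 4 and 5, then 9 and 10, with each further primer shifted by one base into the adapter); treating position 1 as the first template base, the read length of 25, and the function name are assumptions made only for illustration.

```python
from collections import Counter


def interrogated_positions(offset, read_length, step=5, probe_bases=(4, 5)):
    """Template positions (1-based) read in one round with a given primer offset.

    offset: number of bases the primer is shifted into the adapter (0 to 4).
    step: bases advanced per ligation cycle after probe cleavage.
    """
    positions = []
    cycle = 0
    while True:
        pair = [cycle * step + b - offset for b in probe_bases]
        if min(pair) > read_length:
            break
        # Positions below 1 fall in the adapter and are not template bases.
        positions.extend(p for p in pair if 1 <= p <= read_length)
        cycle += 1
    return positions


read_length = 25
coverage = Counter()
for offset in range(5):                      # five rounds, five primer offsets
    coverage.update(interrogated_positions(offset, read_length))

print(interrogated_positions(0, 10))         # first primer
print(interrogated_positions(1, 10))         # primer offset by one base
print(all(coverage[p] == 2 for p in range(1, read_length + 1)))
```

Under these assumptions, every template position is covered by exactly two of the five rounds, which is the basis for interrogating each base twice.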
The current version, SOLiD 3 Plus system, is capable of using both fragments and mate-paired libraries, and of generating in one run over 60 Gb sequence data and one billion reads with read-length up to 50 bases.
By using ligation-based sequencing-by-synthesis, the SOLiD system mitigates homopolymeric sequencing error. The dominant error type is substitutions. Furthermore, according to the manufacturer (http://www3.appliedbiosystems.com/AB_Home/applicationstechnologies/SOLiDSystemSequencing/SOLiD-Accuracy), an overall accuracy of 99.94% can be achieved by using the two-base encoding system, which can recognize and eliminate two-thirds of measurement errors.
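A minimal sketch can show why two-base encoding helps to separate measurement errors from true variants. It uses the commonly described scheme in which each of the 16 dinucleotides maps to one of four colors, expressed here as the XOR of 2-bit base codes; the numeric color labels, the function names, and the toy sequences are illustrative assumptions rather than the manufacturer's implementation.

```python
BASES = "ACGT"  # 2-bit codes: A=0, C=1, G=2, T=3


def encode(seq):
    """Translate a base sequence into colors, one per adjacent base pair."""
    codes = [BASES.index(b) for b in seq]
    return [a ^ b for a, b in zip(codes, codes[1:])]


def decode(first_base, colors):
    """Rebuild the base sequence from a known first base and a color list."""
    code = BASES.index(first_base)
    out = [first_base]
    for color in colors:
        code ^= color
        out.append(BASES[code])
    return "".join(out)


reference = "ATGGCA"
variant = "ATGTCA"                      # one true substitution (G to T)
ref_colors = encode(reference)
var_colors = encode(variant)
changed = [i for i, (r, v) in enumerate(zip(ref_colors, var_colors)) if r != v]

print(ref_colors, var_colors, changed)
print(decode("A", ref_colors))
```

In this scheme a true single-base substitution changes two adjacent colors, whereas an isolated color mismatch against the reference is more likely a measurement error, which is how such errors can be recognized and discarded.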
2.4. Helicos HeliScope Genetic Analysis System
The HeliScope Genetic Analysis System, developed by Helicos BioSciences (http://www.helicosbio.com) in 2007, is the first commercialized single-molecule DNA sequencer. It is based on the True Single Molecule Sequencing (tSMS) technology, which stems from the work of Braslavsky et al. (2003) and relies on the cyclic interrogation of a dense array of sequencing features. By directly sequencing single molecules of DNA or RNA without requiring clonal amplification as other systems do, Helicos' tSMS technology significantly increases the speed and decreases the cost of sequencing.
In the HeliScope system (Figure 4), a DNA library is constructed by random fragmentation of DNA samples and 3′ end polyadenylation of DNA fragments with the adenosine terminal transferase (Figure 4(a)). Denatured poly-A fragments are captured on a flow cell surface by hybridization to surface-tethered poly-T oligomers to yield a disordered array of primed single-molecule sequencing templates. In each cycle of sequencing (Figure 4(b)), DNA polymerase and one of four fluorescently labeled nucleotides are supplied to the flow cell. The template-dependent incorporation of a single dye-labeled nucleotide is imaged with a CCD camera to make a base call. After dye-label cleavage and washing, the next cycle of nucleotide extension and imaging is performed. Each sequencing cycle consists of the successive addition of polymerase and a different type of dye-labeled nucleotide. The total number of sequencing cycles performed ranges from 25 to 55, resulting in read-lengths from 25 to 55 bases. The HeliScope instrument is currently capable of imaging billions of single molecules per run and producing over 1 Gb of usable sequence data per day.
Similar to the Roche/454 platform, the HeliScope system is asynchronous, meaning that some DNA strands will fall behind or ahead of others in a sequence-dependent manner, and some DNA templates simply fail to incorporate by chance on a given cycle; therefore, base substitution errors are likely to occur. However, the substitution error rate is quite low (0.01–1% with one pass and 0.001% with two passes). On the other hand, there are no terminating moieties present on the labeled nucleotides, so homopolymers could be problematic. Helicos has since developed a Virtual Terminator technology to correct the homopolymer errors, increasing sequencing accuracy (Bowers et al., 2009). In general, as a result of incorporation of unlabeled bases, deletion is the dominant error type in the HeliScope system. The deletion error rate is 2–7% with one pass and 0.2–1% with two passes (Harris et al., 2008).
There are important differences among the aforementioned NGS technologies in terms of costs, advantages, limitations, and practical aspects of use for specific applications. For example, the Illumina and the Life Technologies platforms are particularly well suited for variant discovery by resequencing the human genome (Metzker, 2010), where a reference genome is available. The Roche/454 sequencer may be preferable for de novo sequencing due to its longer read-length. The Helicos platform is well suited for RNA-Seq that relies on tag counting (Wang, Gerstein and Snyder, 2009) or direct RNA sequencing (Ozsolak et al., 2009). Table 1 provides a summary of the characteristics of the NGS platforms from the four manufacturers mentioned above.
3. Analyses of NGS Data
3.1. Overview of NGS Data Analyses
The new massively parallel sequencing technologies promise to refine and advance science across many fields. Moreover, the now tractable costs bring these powerful systems within reach of a growing number of laboratories, thereby broadly accelerating science. However, the realization of many of these promises is predicated on progress in overcoming obstacles in handling massive datasets and in developing tools to check and assure sequence quality, conduct sequence alignment and assembly, and biologically interpret and draw inferences from the data. NGS experiments generate immense volumes of short-read sequence data (Voelkerding, Dames and Durtschi, 2009) (Table 1). Data acquisition for such volumes is itself problematic, requiring an infrastructure with high-bandwidth pipelines between computationally intensive processes.
Translating such volumes of short-read data to biological results can be described as requiring three analysis stages, as depicted in Figure 5. In the first stage, images from NGS sequencers are analyzed and converted into sequence reads using the manufacturer's base-calling system. The reads are filtered and aligned in the second stage. Depending on the intended biological application as well as considerations of cost, labor intensity, and time requirement, the alignment can be done by de novo assembly or by mapping to a reference sequence that can be a complete genome, subsets of a genome (e.g., expressed genes and individual chromosomes of interest), a transcriptome, or an exome. In the third and final stage, mapped and unmapped reads can be used to answer specific biological questions, such as the profiling of expression of genes, exons or isoforms; the discovery of novel transcripts, genes, splice variants, or single nucleotide polymorphisms (SNPs); and the detection of transcription factors, methylation status, and histone modifications.
3.2. NGS Quality Control
As NGS technologies are applied to a rapidly expanding range of biological, biomedical, and clinical problems, NGS quality control, which includes data quality, reliability, reproducibility, and biological relevance, becomes more and more important because of the inherent, relatively high error rate in raw sequence data. It is preferable to establish an early consensus on standardized benchmarks for sequencing quality metrics (Editorial, 2008) to avoid future dilemmas when comparing data from different NGS platforms, such as those that arose for microarray platforms over the past few years (Shi et al., 2006; Shi et al., 2008). The third phase of the MicroArray Quality Control (MAQC) project (Shi et al., 2006), also called sequencing quality control (SEQC), is such an endeavor, aimed at assessing the technical performance of NGS platforms. The SEQC project plans to generate benchmark datasets with reference samples and evaluate advantages and limitations of various NGS platforms and bioinformatics strategies in RNA and DNA sequencing.
3.3. Bioinformatics Tools for NGS Data Analyses
Currently, a number of bioinformatics tools are available for analyzing NGS data (Table 2) that can be grouped in four general categories: (i) base calling and polymorphism detection, (ii) alignment of reads to a reference, (iii) de novo assembly, and (iv) genome browsing and annotation. However, these current tools have some limitations, and many challenges and questions remain. Efficient data analysis pipelines are still needed for many applications and the relative advantages and limitations of existing tools need to be objectively evaluated.
For base calling, most researchers simply use the calls generated with the data-pipeline software provided by manufacturers, but alternative approaches implementing more advanced statistical methodologies are also being developed. For example, Erlich et al. (2008) created an Alta-Cyclic approach that uses machine learning to reduce noise factors, substantially improving the number of accurate reads. Rougemont et al. (2008) proposed an algorithm using model-based clustering and probability theory to improve base-call quality by identifying and removing ambiguous bases from read ends. However, these improvements must be evaluated for cost effectiveness given the need for substantial investment to handle large volumes of raw image data (Voelkerding, Dames and Durtschi, 2009).
Proper alignment is mandatory to render NGS data biologically meaningful. Because of the short read-length, relatively high error rate in base calling, and a huge volume of data, alignment of data from NGS platforms is much more difficult than that from Sanger sequencing platforms (Trapnell and Salzberg, 2009). One limitation of aligning and assembling of reads is that a large portion of reads cannot be uniquely aligned to a reference when sequence reads are too short and the reference is too complex (Voelkerding, Dames and Durtschi, 2009). In addition, the chance of unique alignment or assembly is reduced not only by the presence of repeat sequences in complex genomes, but also by shared homologies within closely related gene families and pseudogenes (Voelkerding, Dames and Durtschi, 2009).
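The effect of read length on unique alignment can be illustrated with a small sketch. It considers only exact, error-free matches against a toy reference that contains one repeated 7-base element; the reference string and the function name are invented for illustration and do not correspond to any real genome or alignment program.

```python
from collections import Counter


def unique_fraction(reference, read_length):
    """Fraction of all possible error-free reads that occur exactly once."""
    reads = [reference[i:i + read_length]
             for i in range(len(reference) - read_length + 1)]
    occurrences = Counter(reads)
    unique = sum(1 for read in reads if occurrences[read] == 1)
    return unique / len(reads)


# Toy reference with the element GATTACA present twice.
reference = "GATTACA" + "CCGT" + "GATTACA" + "TTGC"

for length in (4, 6, 8):
    print(length, round(unique_fraction(reference, length), 2))
```

Reads shorter than the repeat can fall entirely inside it and therefore match both copies, while reads longer than the repeat always include flanking sequence that distinguishes the two copies. Real genomes add sequencing errors, far longer repeats, and paralogous gene families, so this sketch understates the difficulty.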
Conventional alignment solutions like BLAST (http://blast.ncbi.nlm.nih.gov) (Altschul et al., 1990) and BLAT (http://genome.ucsc.edu/cgi-bin/hgBlat) (Kent, 2002) are efficient at aligning long reads such as those generated by Sanger sequencing, but are inadequate for handling NGS short reads. Recently, a variety of sequence alignment algorithms and software packages have been developed specifically for processing a large number of short reads. Table 2 provides an overview of such programs. The algorithms implemented in these software packages vary with the applications, but they include sequence alignment, de novo assembly, alignment viewing, and variant discovery. However, the state of the art in short-read alignment and assembly still requires a trade-off between ideal alignment accuracy and computational efficiency.
4. Applications of NGS Technologies
Over the past five years, the NGS technologies have markedly accelerated multiple research areas, making feasible experiments that previously were not affordable or even technically feasible. Novel fields and applications in biology, life sciences and biomedicine are becoming reality. In this section, we describe some major applications of NGS.
4.1. De novo Sequencing or Resequencing of Genomes
The ultra-high throughput and low cost of NGS technologies have made sequencing numerous whole genomes tractable. NGS platforms have been used for de novo sequencing of many bacterial genomes (Chaisson and Pevzner, 2008; Margulies et al., 2005), viral genomes (Harris et al., 2008), and the giant panda genome (Li et al., 2010a), and for resequencing human genomes at dramatically increased speed and decreased cost (Li et al., 2010a; Lin et al., 2008; Pushkarev, Neff and Quake, 2009; Wheeler et al., 2008). These applications have demonstrated the power of NGS technologies for de novo sequencing or sequencing of personal genomes, which will be critical in moving toward personalized genomics and medicine.
4.2. Target Genomic Resequencing
Resequencing of genomic sub-regions or gene sets is fundamental in basic and clinical research seeking causative and predisposition mutations within populations (Dahl et al., 2007; Ding et al., 2008; Okou et al., 2007). The target resequencing strategy involves comparative analysis of candidate genes or genomic sub-regions from two groups of people with different phenotypes, and requires a high level of accuracy to identify low frequency causative SNPs and structural variants/mutations of diseases that are implicated by linkage studies and whole-genome wide association studies (Porreca et al., 2007; Yeager et al., 2008). Traditional capillary electrophoresis methods provide the highest accuracy and are the best suited for analyzing a limited set of amplicons in a large number of patient samples. However, this is burdensome in cost and labor for investigating a large number of genes or large sub-regions. In contrast, NGS technologies are highly advantageous in terms of both cost and labor, as evidenced by numerous recent studies (Albert et al., 2007; Chou et al., 2010; Hodges et al., 2007; Li et al., 2009b; Okou et al., 2007).
4.3. Chromatin Immunoprecipitation Followed by Sequencing (ChIP-Seq)
ChIP-Seq is a strategy that combines the ChIP technique (chromatin immunoprecipitation, used to determine the location of DNA binding sites for proteins) with NGS technologies to directly sequence DNA fragments and thereby interrogate DNA-protein interactions; it was an early application of NGS (Barski et al., 2007; Johnson et al., 2007; Mikkelsen et al., 2007). By directly sequencing DNA fragments that interact with proteins, ChIP-Seq provides substantially better data than the microarray-based ChIP-chip method, which is the most commonly used approach for genome-wide profiling of DNA-binding proteins, histone modifications or nucleosomes (Park, 2009). Compared to ChIP-chip, ChIP-Seq has higher resolution, fewer artifacts, greater coverage, and a larger dynamic range. ChIP-Seq can also be used to identify the cistrome of DNA-associated proteins and precisely map global binding sites for any protein of interest (Kaufmann et al., 2010; Ouyang, Zhou and Wong, 2009; Visel et al., 2009).
4.4. Next-generation RNA Sequencing (RNA-Seq)
Applying NGS technologies to sequence RNA or complementary DNA (cDNA) reverse transcribed from the RNAs offers an alternative methodology for high-throughput transcriptome analysis (Marioni et al., 2008; Wang, Gerstein and Snyder, 2009; Wilhelm et al., 2010). In a typical RNA-Seq experiment, RNAs or cDNAs are first directly sequenced with NGS technologies; and then the sequence reads are mapped to a reference genome to construct a whole-genome transcriptome map (Wang, Gerstein and Snyder, 2009); finally, the transcripts (genes of interest) are characterized (e.g., alternative splicing) and quantified (Wang, Gerstein and Snyder, 2009).
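The quantification step of such an experiment can be sketched as counting mapped reads per annotated gene. The gene names, coordinates, and read positions below are invented for illustration, and the simple rule of assigning a read by its start coordinate is an assumption; it ignores strand, splicing, and reads that map to multiple locations.

```python
def count_reads_per_gene(read_positions, genes):
    """Count mapped reads whose start coordinate falls inside each gene.

    read_positions: iterable of (chromosome, start) for mapped reads.
    genes: dict of gene name -> (chromosome, start, end), inclusive bounds.
    """
    counts = {name: 0 for name in genes}
    for chrom, pos in read_positions:
        for name, (gene_chrom, start, end) in genes.items():
            if chrom == gene_chrom and start <= pos <= end:
                counts[name] += 1
    return counts


# Hypothetical annotation and mapped reads.
genes = {"geneA": ("chr1", 100, 500), "geneB": ("chr1", 800, 1200)}
reads = [("chr1", 150), ("chr1", 480), ("chr1", 900),
         ("chr2", 150), ("chr1", 600)]

read_counts = count_reads_per_gene(reads, genes)
print(read_counts)
```

Reads that fall outside every annotated interval, such as the last two in this toy input, are the ones that may point to novel transcripts when they accumulate in unannotated regions.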
Thanks to the deep coverage and base-level resolution provided by next-generation sequencing instruments, RNA-Seq provides researchers with efficient ways to measure transcriptome data experimentally, allowing them to determine how different alleles of a gene are expressed, detect post-transcriptional mutations, and identify gene fusions.
By directly sequencing the entire transcriptome without prior knowledge of transcribed regions and at deep coverage and base level resolution, RNA-Seq is revolutionary in its abilities to provide precision in measuring transcriptome data (Li et al., 2010a; Marioni et al., 2008). The far higher resolution improves discovery of novel transcripts, differential allele expression, alternative splice variants, post-transcriptional mutations and isoforms compared with more conventional Sanger sequencing and microarray-based approaches (Chepelev et al., 2009; Hittinger et al., 2010; Jiang and Wong, 2009; Perkins et al., 2009; Richard et al., 2010; Sultan et al., 2008; Tang et al., 2009; Trapnell, Pachter and Salzberg, 2009; Wilhelm et al., 2010). Recent studies (Guttman et al., 2009; Li et al., 2009a; Pan et al., 2008; Porreca et al., 2007; Wang et al., 2008) that used RNA-Seq to characterize the RNA populations have provided more complicated pictures of RNA regulation and expression, through alternative splicing, alternative polyadenylation, and RNA editing. These findings have expanded our traditional view of the extent and complexity of gene expression (Licatalosi and Darnell, 2010), and advanced our understanding of mechanisms of RNA expression regulation in both eukaryotic (Jacquier, 2009) and prokaryotic (Sorek and Cossart, 2010) genomes.
4.5. Comparison Between RNA-Seq and Microarrays
To evaluate the technical performance of NGS technologies on quantifying the expression level of transcripts, we recently used data generated from a rat toxicogenomics study to compare the performance of NGS (Illumina Genome Analyzer II) with a microarray-based approach (Affymetrix Rat Genome 230 2.0 arrays) to detect differentially expressed genes (DEGs) (Su et al., 2010). The RNA samples were the same as those used in the MAQC-I (Shi et al., 2006) validation study, for which the microarray data already existed (Guo et al., 2006). Eight RNA samples, four treatment and four control, were collected from the kidneys of rats treated/or not-treated (controls) with 10 mg kg
5. Future Perspective
NGS technologies are substantially impacting basic genomics research, and many more and far-reaching impacts are anticipated. Over the past few years, NGS technologies have demonstrated their immense potential for enabling scientific advancement in an ever-increasing diversity of biological and medical research areas (Sultan et al., 2008). In the next several years, NGS technologies are anticipated to transition into broader areas including disease etiology, new drug development, clinical-diagnostics, personalized medicine and nutrition, as well as toxicogenomics. Requisite to continuing successful transition will be further sequencing cost reduction, improved read accuracy, more streamlined sample preparation, and perhaps more importantly, computer-based analytics for data acquisition, management, validation, analyses, and biological interpretation.
6. Disclaimer
This article is not an official guidance or policy statement of the U.S. Food and Drug Administration (FDA). No official support or endorsement by the FDA is intended or should be inferred.








