Standard Article

Next-Generation Sequencing: A Revolutionary Tool for Toxicogenomics

Systems Toxicology

Image Analysis, Sequencing and Systems Modeling

  1. Zhenqiang Su1,
  2. Baitang Ning2,
  3. Hong Fang1,
  4. Huixiao Hong2,
  5. Roger Perkins2,
  6. Weida Tong2,
  7. Leming Shi2

Published Online: 15 SEP 2011

DOI: 10.1002/9780470744307.gat232

General, Applied and Systems Toxicology

How to Cite

Su, Z., Ning, B., Fang, H., Hong, H., Perkins, R., Tong, W. and Shi, L. 2011. Next-Generation Sequencing: A Revolutionary Tool for Toxicogenomics. General, Applied and Systems Toxicology.

Author Information

  1. 1

    US Food and Drug Administration, Z-Tech, National Center for Toxicological Research, Jefferson, AR, USA

  2. 2

    US Food and Drug Administration, Division of Systems Biology, National Center for Toxicological Research, Jefferson, AR, USA

Publication History

  1. Published Online: 15 SEP 2011

1 Introduction

  1. Top of page
  2. Introduction
  3. Next-generation Sequencing Technologies
  4. Analyses of NGS Data
  5. Applications of NGS Technologies
  6. Future Perspective
  7. Disclaimer
  8. Related Articles
  9. References

DNA sequencing is a fundamental approach to deciphering a broad range of biological phenomena at the molecular level. Although the development of DNA sequencing technologies has a rich and diverse history (Hutchison, 2007; Shendure and Ji, 2008), the dideoxy sequencing method established by Sanger, Nicklen and Coulson (1977) has dominated the industry and research community for almost three decades and currently remains the gold standard for decoding DNA sequences.

Many biochemical and technical improvements have endowed modern Sanger sequencers with very low error rates (∼0.001%), relatively long read-lengths (up to ∼1000 bp), and high-throughput, robust performance. However, Sanger-based approaches remain expensive for large sequencing projects (on the order of $0.50 per kilobase). This limitation prompted the National Human Genome Research Institute (NHGRI) of the National Institutes of Health (NIH) to initiate a funding program in 2004 to develop novel technologies that would enable extremely low-cost, high-quality DNA sequencing. With a stated goal of reducing the cost of sequencing mammalian-sized genomes by four orders of magnitude, to approximately $1000 per genome within 10 years, the initiative has propelled the development and commercialization of novel next-generation sequencing (NGS) technologies.

In 2005, 454 Life Sciences (now part of Roche Applied Science) developed and commercialized its Genome Sequencer (GS) system for ultra-high throughput DNA sequencing (Margulies et al., 2005). As the first commercialized NGS technology, the Roche/454 sequencing system provided a compelling case study for the establishment of a novel, revolutionary technology that enables scientists to carry out massively parallel DNA sequencing reactions at a relatively small cost and at a much faster speed than conventional Sanger technologies. For example, the sequencing of the first human genome in the Human Genome Project (Collins et al., 2004; Lander et al., 2001) with automated Sanger technology took some 13 years at a cost of about $2.7 billion. In contrast, sequencing a human genome with the Roche/454 sequencer took only five months and approximately $1.5 million (Wheeler et al., 2008). This great stride was enabled by ultra-high throughput, simplified in vitro sample preparation, and miniaturization of sequencing chemical reactions (Rothberg and Leamon, 2008).

Illumina, Life Technologies, and Helicos BioSciences quickly followed Roche's lead and launched their own NGS platforms. Concurrently, new DNA sequencing technologies have been advancing at other companies and research institutions, such as Pacific Biosciences, VisiGen Biotechnologies, Sequenom, Complete Genomics, and the Center for Computational Genetics at Harvard Medical School ( edu/gmc).

Although they differ in sequencing chemistry and technical details, all NGS platforms share a similar technical strategy: miniaturization of individual sequencing chemical reactions to boost sequencing throughput (Metzker, 2010). The miniaturization of sequencing reactions, coupled with other technical breakthroughs such as overcoming the bottlenecks of library preparation and template preparation (Rothberg and Leamon, 2008), enables millions of simultaneous individual sequencing reactions. Only a single fragment of DNA is sequenced in each miniaturized chemical reaction, but millions of reactions are spatially arranged so that they are isolated from one another and individually detected by laser scanning or other approaches. The results are prodigious volumes of short-read sequence data and unprecedented detail and resolution of sequence complexity, with consequent challenges in storing, managing, analyzing, and interpreting such a wealth of data.

In this chapter, we first describe the fundamental principles of four commercially available NGS platforms, that is, Roche/454 GS FLX, Illumina Genome Analyzer IIx, Life Technologies SOLiD, and Helicos HeliScope. We then discuss the general difficulties to be overcome in the analyses of NGS data. Next, we outline some main applications of NGS technologies and, finally, we compare the analysis of a toxicogenomics study using NGS data with that using microarray data.

2 Next-generation Sequencing Technologies

2.1 Roche/454 Pyrosequencing

The Roche/454 GS FLX system (Figure 1 (1)) relies on pyrophosphate detection (Nyren, Pettersson and Uhlen, 1993) and emulsion PCR (Tawfik and Griffiths, 1998). A library of DNA templates is prepared by a highly efficient in vitro DNA amplification method known as emulsion PCR (Figure 1 (2)): sheared DNA fragments are ligated to specific oligonucleotide adapters, and each DNA fragment is bound to its own bead. The beads are then captured in separate emulsion droplets that function as amplification reactors to produce some 10 million clonal copies of the DNA template, which are needed for sufficient light signal intensity (Fuller et al., 2009). On completion of the emulsion PCR amplification, the emulsion is disrupted, the beads containing clonally amplified template DNA are enriched, and the beads are separated by limiting dilution and deposited into individual picotiter-plate wells. The picotiter plates serve as sequencing reactors in which individual enzymatic sequencing reactions occur without interference from adjacent wells. Visible light emitted from the subsequent pyrosequencing reactions (Ronaghi et al., 1996) is detected by an imaging charge-coupled device (CCD) that is bonded to a fiber-optic bundle. During each cycle of a pyrosequencing reaction, a single species of unlabeled nucleotide is supplied to all beads on the chip, so that the complementary strand of DNA is synthesized sequentially. With the incorporation of each base in the growing chain, an inorganic pyrophosphate group is released and converted to ATP by sulfurylase. The ATP is then used by luciferase to convert luciferin to oxyluciferin, producing a light pulse (Figure 1 (3)). Detecting the light emission, together with knowing which nucleotide was supplied in each step, allows the incorporated base to be determined. Through a series of such pyrosequencing reaction cycles, the sequence of the DNA template carried by each bead is determined.

Figure 1.

    Overview of the Roche/454 GS FLX system workflow:

  1. fragmentation of DNA and ligation to Roche specific adapters (denoted as blue A and orange B);

  2. clonal amplification by emulsion PCR; and

  3. real time sequencing-by-synthesis.

Because there is no terminating moiety preventing multiple consecutive incorporations in a given pyrosequencing reaction cycle, the length of a homopolymer in a sequence read must be inferred from the light signal intensity, with a higher intensity corresponding to more repeated bases. The error rate of calling consecutive repeats increases when the homopolymer is longer than 3–4 bases. Thus, the main error type for the Roche/454 system is insertions and deletions (indels) rather than substitutions (Shendure and Ji, 2008).
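The dependence of homopolymer calls on signal intensity can be illustrated with a small simulation. The following is a minimal sketch under simplifying assumptions, not the manufacturer's base-calling algorithm: the flow order `TACG`, the function names, and the 13% proportional signal bias are all hypothetical, and the input sequence is taken to be the strand being synthesized.

```python
from itertools import cycle, islice

def flowgram(synthesized, flow_order="TACG", n_flows=16):
    """Return the ideal light signal for each nucleotide flow.

    Each flow offers one nucleotide species. The signal equals the number of
    consecutive bases incorporated in that flow (0 if the next base differs).
    """
    signals = []
    pos = 0
    for base in islice(cycle(flow_order), n_flows):
        run = 0
        while pos < len(synthesized) and synthesized[pos] == base:
            run += 1
            pos += 1
        signals.append((base, float(run)))
    return signals

def call_bases(signals):
    """Infer the sequence by rounding each signal to a homopolymer length."""
    return "".join(base * int(round(s)) for base, s in signals)

ideal = flowgram("TTACGGGGA")
# Hypothetical proportional bias: the absolute error grows with run length.
noisy = [(base, s * 1.13) for base, s in ideal]

print(call_bases(ideal))
print(call_bases(noisy))
```

With the ideal signals the sequence is recovered exactly. With the biased signals the single bases and the two-base run are still called correctly, but the four-base G run (signal 4.52) rounds to five bases, which is an insertion error of the kind described above.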

Compared to other NGS platforms, the strength of the Roche/454 system is its longer sequence reads. The Roche/454 GS FLX, with its newest chemistry, termed GS FLX Titanium series reagents, can generate more than one million individual sequence reads with read-lengths over 400 base pairs in a 10 h run. Although its per-base cost is much higher than that of other NGS platforms (e.g., Life Technologies/SOLiD and Illumina/Genome Analyzer IIx), the Roche/454 system is best suited for applications such as de novo sequencing of new genomes, for which long read-length is critical for genome assembly.

2.2 Illumina Sequencing Technology

The Illumina GA system is the first short-read sequencing platform and currently dominates the NGS market (Metzker, 2010). The first GA system was launched in 2006 by Solexa, which was subsequently acquired by Illumina in early 2007.

The Illumina GA system (Figure 2) uses an array technique to achieve cloning-free DNA amplification. Reversible terminator chemistry is the defining characteristic that provides massively parallel sequencing of millions of DNA fragments at low cost. DNA samples are randomly sheared into fragments that are then end-repaired to generate 5′-phosphorylated blunt ends. The Klenow fragment of DNA polymerase is then used to attach a single “A” base to the 3′ end of the DNA fragments, which prepares the DNA fragments for ligation to oligonucleotide adapters (Figure 2 (1)). After ligation to adapters at both ends, the DNA fragments are denatured, and single-stranded DNA fragments are attached to reaction chambers with optically transparent solid surfaces, called flow cells. To obtain sufficient light signal intensity for reliable detection, the attached DNA fragments are extended and amplified by bridge PCR amplification (Figure 2 (2)). Bridge PCR amplification can create an ultra-high density sequencing flow cell containing hundreds of millions of clusters, each of which contains some 1000 copies of the same DNA template. These templates are finally sequenced through the sequencing-by-synthesis technique, which applies reversible terminators with removable fluorescent dyes.

Figure 2.

    The Illumina Genome Analyzer sequencing strategy:

  1. DNA templates are prepared by fragmentation of DNA, end repair, and ligation to adapters;

  2. the templates are amplified by bridge PCR, in which single-stranded fragments are randomly bound to the inside surface of the flow cell channels and are amplified to generate clusters with some 1000 copies of the same DNA template; and

  3. the DNA templates are sequenced with the reversible terminator-based sequencing-by-synthesis technique.

For sequencing, a reaction mixture comprising primers, DNA polymerase, and four reversible terminator nucleotides, each labeled with a different fluorescent dye, is supplied to the flow cell. In each sequencing cycle, a specific terminator is incorporated according to sequence complementarity in each template DNA strand in a clonal cluster. After incorporation, the identity (base calling) and the position of the incorporated terminator on the flow cell are determined from the fluorescent dye emission, and the signal is recorded using a CCD camera. In the following cycle, the reversible terminator is unblocked and the fluorescent dye label is removed from the base, so that a new nucleotide can be incorporated and a new base can be detected using the same strategy. This repetitive sequencing-by-synthesis process takes about 2.5 days to generate 50 million reads per flow cell, with a read-length of some 36 bases. The overall sequencing output of the Illumina GA system is more than one billion base pairs (1 Gb) per analytical run (Bentley et al., 2008).
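The relationship between read count, read-length, and run output is simple arithmetic. The helper below is only an illustrative sketch (the function name and the `ends` parameter are assumptions): it multiplies the number of reads by the read-length and by the number of ends sequenced.

```python
def run_output_gb(n_reads, read_length, ends=1):
    """Total sequenced bases per run in gigabases (1 Gb = 1e9 bases)."""
    return n_reads * read_length * ends / 1e9

# 50 million single reads of 36 bases, as quoted in the text
single_read = run_output_gb(50e6, 36)

# 250 million templates sequenced from both ends at 100 bases each
paired_end = run_output_gb(250e6, 100, ends=2)

print(single_read, paired_end)
```

The first call gives 1.8 Gb, consistent with the statement that a run yields more than 1 Gb. The second gives 50 Gb, which matches the upper end of the 2 × 100 data volume listed for the Genome Analyzer IIx in Table 1.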

The upgraded GA II is capable of sequencing both single-read and paired-end (sequencing both ends of the template molecules) libraries, generating some 1.5 Gb of output per day, corresponding to 80–100 million reads per flow cell. Moreover, Illumina's newest instrument, the HiSeq 2000, extends throughput up to 200 Gb per run and two billion paired-end reads of 100 bp or more.

In a given cycle of sequencing, a modified nucleotide may be incorporated with decreased or increased efficiency, resulting in under- or over-incorporation, a heterogeneous mixture of synthesis lengths, and a concomitant degradation of signal purity and precision. In addition, chemical cleavage of terminating moieties and fluorescent dye labels may be incomplete. Therefore, Illumina's sequencing strategy generates much shorter reads, and its most common error type is substitutions (Shendure and Ji, 2008). The base-call error rate increases with read-length due to “dephasing noise” (Dohm et al., 2008). In addition, an underrepresentation of AT-rich and GC-rich regions (Dohm et al., 2008; Hillier et al., 2008; Metzker, 2010) has been observed.
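The loss of signal purity with read-length can be illustrated with a toy model of a single cluster. This is a sketch under stated assumptions and not the model used by any base caller: in each cycle a strand fails to extend with probability `p_lag`, extends by two bases with probability `p_lead`, and otherwise extends by exactly one base. Both probabilities are hypothetical values chosen for illustration.

```python
def in_phase_fraction(n_cycles, p_lag=0.01, p_lead=0.0):
    """Fraction of strands whose extension length equals the cycle number.

    Returns one value per cycle; lower values mean a less pure signal.
    """
    dist = {0: 1.0}  # extension length -> fraction of strands in the cluster
    history = []
    steps = ((0, p_lag), (1, 1.0 - p_lag - p_lead), (2, p_lead))
    for cycle_number in range(1, n_cycles + 1):
        nxt = {}
        for length, frac in dist.items():
            for step, prob in steps:
                if prob > 0:
                    nxt[length + step] = nxt.get(length + step, 0.0) + frac * prob
        dist = nxt
        history.append(dist.get(cycle_number, 0.0))
    return history

purity = in_phase_fraction(36, p_lag=0.01)
print(round(purity[0], 3), round(purity[-1], 3))
```

With a 1% lag per cycle and no lead, the in-phase fraction after n cycles is 0.99 to the power n, so roughly 70% of the strands in a cluster remain in phase by cycle 36. The remaining strands emit the signal of neighboring bases, which is one way to picture why substitution errors accumulate toward the end of a read.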

2.3 Life Technologies/SOLiD

The SOLiD (Supported Oligonucleotide Ligation and Detection) system is a short-read sequencing platform relying on ligation chemistry. This platform was developed by Life Technologies based on the strategies described by Shendure et al. (2005) and Mckernan et al. (2008).

Library construction for the SOLiD system is similar to that of the Roche/454 technology: DNA is stochastically sheared into fragments that are subsequently ligated to oligonucleotide adapters (Figure 3 (1)), attached to beads, and clonally amplified by emulsion PCR. After the templates are denatured, the template-carrying beads are enriched to separate them from undesired beads. The templates on the selected beads are then 3′ modified for covalent attachment to the slide, and the 3′ modified beads are deposited onto a derivatized glass flow cell surface to generate a dense, disordered array (Figure 3 (2)). Sequencing reactions are started by hybridizing a primer oligonucleotide complementary to the adapter at the adapter-template junction (Figure 3 (3)). Unlike the Roche/454 sequencing approach, sequencing in the SOLiD system is driven by a DNA ligase rather than a DNA polymerase. Briefly, in the ligation chemistry, a mixture of partially degenerate oligonucleotide octamers is competitively hybridized to the DNA fragments as probes, and a universal primer is oriented to provide a 5′ phosphate group for ligation. The specificity of the probe ligated to a primer is determined by the 4th and 5th bases of the probe, which are complementary to the template, and the identities (base calls) of these two bases are encoded by one of four fluorescent labels at the end of the octamer, so that the 4th and 5th bases are interrogated. After ligation, the ligated octamer is cleaved after the fifth base and the fluorescent label is removed, so that the next hybridization and ligation cycle can proceed. In this way, bases 4 and 5 in the template are determined in the first cycle, bases 9 and 10 in the second cycle, and so on. The ligation sequencing can then be repeated with another primer offset by one base in the adapter, so that bases 3 and 4, 8 and 9, …, in the template are determined (Figure 3 (4)). Over five such rounds of primer reset, each base is interrogated twice with two different fluorescent labels, resulting in a significantly reduced base-call error rate ( SystemSequencing/OverviewofSOLiDSequencingChemistry).

Figure 3.

    Outline of the Life Technologies SOLiD sequencing:

  1. preparation of libraries by ligation of two different DNA adapters, P1 and P2, to the 5′-and 3′-ends of DNA fragments, respectively;

  2. emulsion PCR is performed, and DNA templates on enriched beads undergo a 3′ modification, and are deposited onto a flow cell to generate a dense, disordered array;

  3. sequencing by ligation: (I) primers hybridize to the P1 adapter on template-carrying beads; a set of four fluorescently labeled di-base probes compete for ligation to the primers. Specificity of di-base probes is achieved by interrogating every 1st and 2nd bases in each ligation reaction. Each ligation is followed by fluorescence detection, after which (II) the fluorescent label is cleaved and (III–IV) multiple cycles of ligation, detection and cleavage are performed, with the number of cycles determining the eventual read length. Following a series of ligation cycles, the extension product is removed and (V) the template is reset with a primer complementary to the n−1 position for the second round of ligation cycles; and

  4. for each DNA template, a total of five rounds of primer reset are needed, through which virtually each base is interrogated in two independent ligation reactions by two different primers.

The current version, the SOLiD 3 Plus system, can use both fragment and mate-paired libraries and can generate in one run over 60 Gb of sequence data and one billion reads with read-lengths up to 50 bases.

By using ligation-based sequencing, the SOLiD system mitigates homopolymeric sequencing errors. The dominant error type is substitutions. Furthermore, according to the manufacturer (http://www3.appliedbi…), an overall accuracy of 99.94% can be achieved by using the two-base encoding system, which can recognize and eliminate two-thirds of measurement errors.
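The error-checking property of two-base encoding can be sketched in a few lines. The color assignment used below (the XOR of 2-bit base codes with A=0, C=1, G=2, T=3) is the commonly described SOLiD color scheme and is stated here as an assumption, since the chapter does not spell it out; the function names are illustrative only.

```python
BASES = "ACGT"  # index of each base is its 2-bit code: A=0, C=1, G=2, T=3

def encode(seq):
    """Translate a base sequence into its first base plus one color per di-base."""
    colors = [BASES.index(a) ^ BASES.index(b) for a, b in zip(seq, seq[1:])]
    return seq[0], colors

def decode(first_base, colors):
    """Rebuild the base sequence from the known first base and the colors."""
    seq = [first_base]
    for color in colors:
        seq.append(BASES[BASES.index(seq[-1]) ^ color])
    return "".join(seq)

def color_differences(colors_a, colors_b):
    """Positions at which two color strings disagree."""
    return [i for i, (x, y) in enumerate(zip(colors_a, colors_b)) if x != y]

first, ref_colors = encode("ATGGCA")   # reference
_, snp_colors = encode("ATGACA")       # same sequence with one true base change

print(ref_colors, snp_colors, color_differences(ref_colors, snp_colors))
print(decode(first, [3, 1, 2, 3, 1]))  # reference colors with one miscalled color
```

A true single-base change alters two adjacent colors, because the changed base takes part in two overlapping di-base pairs. A single miscalled color, by contrast, changes only one position in color space, and decoding it naively alters every downstream base. An isolated color mismatch against a reference is therefore more plausibly a measurement error than a variant, which is the basis of the error recognition described above.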

2.4 Helicos HeliScope Genetic Analysis System

The HeliScope Genetic Analysis System, developed by Helicos BioSciences in 2007, is the first commercialized single-molecule DNA sequencer. It is based on True Single Molecule Sequencing (tSMS) technology, which stems from the work of Braslavsky et al. (2003) and relies on the cyclic interrogation of a dense array of sequencing features. By directly sequencing single molecules of DNA or RNA without the clonal amplification required by other systems, Helicos' tSMS technology significantly increases the speed and decreases the cost of sequencing.

In the HeliScope system (Figure 4), a DNA library is constructed by random fragmentation of DNA samples and 3′ end polyadenylation of the DNA fragments with the adenosine terminal transferase (Figure 4 (1)). Denatured poly-A fragments are captured on a flow cell surface by hybridization to surface-tethered poly-T oligomers to yield a disordered array of primed single-molecule sequencing templates. In each cycle of sequencing (Figure 4 (2)), DNA polymerase and one of four fluorescently labeled nucleotides are supplied to the flow cell. The template-dependent incorporation of a single dye-labeled nucleotide is imaged with a CCD camera for base calling. After dye-label cleavage and washing, the next cycle of nucleotide extension and imaging follows. Each sequencing cycle consists of the successive addition of polymerase and a different type of dye-labeled nucleotide. The total number of sequencing cycles performed ranges from 25 to 55, resulting in read-lengths from 25 to 55 bases. The HeliScope instrument is currently capable of imaging billions of single molecules per run and producing over 1 Gb of usable sequence data per day.

Figure 4.

    The Helicos true single molecule sequencing (tSMS) system:

  1. DNA templates are prepared by fragmentation, 3′ poly (A) tail addition, labeling, and blocking by terminal transferase. In contrast to other NGS technologies, no ligation or PCR amplification is required; and

  2. sequencing single-molecule by synthesis: the flow cell is incubated with a mixture of polymerase and one fluorescently labeled nucleotide (C, G, A or T); then the mixture is rinsed away and fluorescent labels are detected. The fluorescent group is finally removed from the incorporated nucleotide and the process continues through each of the other three nucleotides. In such multiple four-base nucleotide addition and detection cycles the sequence of each fragment is determined.

Similar to the Roche/454 platform, the HeliScope system is asynchronous, meaning that some DNA strands will fall behind or ahead of others in a sequence-dependent manner, and some DNA templates just fail to incorporate by chance on a given cycle; therefore, base substitution error is likely to occur. However, the substitution error rate is quite low (0.01–1% with one pass and 0.001% with two passes). On the other hand, there are no terminating moieties present on the labeled nucleotides, so homopolymers could be problematic. Helicos has since developed a Virtual Terminator technology to correct the homopolymer errors, increasing sequencing accuracy (Bowers et al., 2009). In general, as a result of incorporation of unlabeled bases, deletion is the dominant error type in the HeliScope system. The deletion error rate is 2–7% with one pass and 0.2–1% with two passes (Harris et al., 2008).

There are important differences among the aforementioned NGS technologies in terms of costs, advantages, limitations, and practical aspects of use for specific applications. For example, the Illumina and the Life Technologies platforms are particularly well suited for variant discovery by resequencing the human genome (Metzker, 2010), where a reference genome is available. The Roche/454 sequencer may be preferable for de novo sequencing due to its longer read-length. The Helicos platform is well suited for RNA-Seq applications that rely on tag counting (Wang, Gerstein and Snyder, 2009) or direct RNA sequencing (Ozsolak et al., 2009). Table 1 provides a summary of the characteristics of the NGS platforms from the four manufacturers mentioned above.

Table 1. The characteristics of next-generation sequencing platforms
| Product | Manufacturer | Read length (bases) | Number of reads (M/run) | Data volume (Gb/run) | Run time (days) | Price (US$) | Comments |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GS FLX Titanium | Roche/454 | ∼400 | > 1 | 0.4 ∼ 0.6 | ∼0.5 | 500 000 | Longer reads and fast run times; high reagent cost; high error rate in homopolymer repeats |
| Genome Analyzer IIx | Illumina | 1 × 35 | 225 ∼ 250 | 8.0 ∼ 9.0 | ∼2 | 540 000 | Currently the most widely used system |
| | | 2 × 35 | 225 ∼ 250 | 16.0 ∼ 18.0 | ∼4 | | |
| | | 2 × 50 | 225 ∼ 250 | 22.5 ∼ 50.0 | ∼5 | | |
| | | 2 × 75 | 225 ∼ 250 | 34.0 ∼ 38.0 | ∼7.5 | | |
| | | 2 × 100 | 225 ∼ 250 | 45.0 ∼ 50.0 | ∼9.5 | | |
| HiSeq 2000 | Illumina | 1 × 35 | 1000 | 26 ∼ 35 | ∼1.5 | NA | |
| | | 2 × 50 | 2000 | 75 ∼ 100 | ∼4 | | |
| | | 2 × 100 | 2000 | 150 ∼ 200 | ∼8 | | |
| SOLiD 3 plus | Applied Biosystems | 50 or 2 × 50 | ∼500 or 1000 | 25 ∼ 30 or 50–60 | 3.5–4.5 for 35 bp, 6–7 for 50 bp; or 8–9 for 2 × 35 bp; 12–14 for 2 × 50 bp | 595 000 | Two-base encoding provides inherent error correction |
| HeliScope Genetic Analysis System | Helicos BioSciences | 30 ∼ 35 | 600 ∼ 800 | 21 ∼ 28 | 8 | 999 000 | Non-bias single-molecule sequencing, high error rate |

3 Analyses of NGS Data

3.1 Overview of NGS Data Analyses

The new massively parallel sequencing technologies promise to refine and advance science across many fields. Moreover, their now tractable costs put these powerful systems into an increasing number of hands, thereby broadly accelerating science. However, the realization of many of these promises depends on progress in handling massive datasets and in developing tools to check and assure sequence quality, conduct sequence alignment and assembly, and biologically interpret and draw inferences from the data. NGS experiments generate immense volumes of short-read sequence data (Voelkerding, Dames and Durtschi, 2009) (Table 1). Data acquisition for such volumes is itself problematic, requiring an infrastructure with high-bandwidth pipelines between computationally intensive processes.

Translating such volumes of short-read data into biological results requires three analysis stages, as depicted in Figure 5. In the first stage, images from NGS sequencers are analyzed and converted into sequence reads using the manufacturer's base-calling system. In the second stage, the reads are filtered and aligned. Depending on the intended biological application as well as considerations of cost, labor intensity and time, the alignment can be done by de novo assembly or by mapping to a reference sequence, which can be a complete genome, a subset of a genome (e.g., expressed genes or individual chromosomes of interest), a transcriptome, or an exome. In the third and final stage, mapped and unmapped reads are used to answer specific biological questions, such as profiling the expression of genes, exons or isoforms; discovering novel transcripts, genes, splice variants, or single nucleotide polymorphisms (SNPs); and detecting transcription factors, methylation status, and histone modifications.

Figure 5.

    A typical workflow for the analyses of NGS data:

  1. conversion of images to sequence reads;

  2. alignment of sequence reads (map to a reference or de novo assembly); and

  3. experiment-specific downstream analyses that depend on the application.
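The second and third stages of this workflow can be sketched with a toy example. Everything in it is hypothetical and greatly simplified: base calling (stage 1) is assumed to have already produced reads with per-base quality scores, mapping is done by exact substring search rather than by a real aligner, and the reference names, sequences, and quality threshold are invented for illustration.

```python
from collections import Counter

def filter_reads(reads, min_mean_quality=20):
    """Stage 2a: keep reads whose mean base quality passes a threshold."""
    return [seq for seq, quals in reads if sum(quals) / len(quals) >= min_mean_quality]

def map_reads(seqs, references):
    """Stage 2b: list, for each read, the references that contain it exactly."""
    return [(seq, [name for name, ref in references.items() if seq in ref])
            for seq in seqs]

def count_unique_hits(mapped):
    """Stage 3: tag counting that uses uniquely mapped reads only."""
    return Counter(names[0] for _, names in mapped if len(names) == 1)

references = {"geneA": "ACGTACGTTAGC", "geneB": "TTGACCATGGCA"}
reads = [
    ("ACGTTAGC", [30] * 8),
    ("GACCATGG", [35] * 8),
    ("ACGTTAGC", [28] * 8),
    ("CATGGCA", [10] * 7),    # low quality, removed by the filter
    ("GGGGGGGG", [30] * 8),   # passes the filter but maps nowhere
]

counts = count_unique_hits(map_reads(filter_reads(reads), references))
print(dict(counts))
```

Here two reads are counted for `geneA` and one for `geneB`; one read is discarded for low quality and one remains unmapped. In a real analysis each stage is replaced by dedicated software, but the flow of data from filtered reads to alignments to application-specific counts is the same.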

3.2 NGS Quality Control

The rapid expansion of applications of NGS technologies to biological, biomedical, and clinical problems makes NGS quality control, including data quality, reliability, reproducibility, and biological relevance, increasingly important because of the inherently relatively high error rate in raw sequence data. It is preferable to establish an early consensus on standardized benchmarks for sequencing quality metrics (Editorial, 2008) to avoid the dilemmas in comparing data from different NGS platforms that arose for microarray platforms over the past few years (Shi et al., 2006; Shi et al., 2008). The third phase of the MicroArray Quality Control (MAQC) project (Shi et al., 2006), also called Sequencing Quality Control (SEQC), is such an endeavor, aimed at assessing the technical performance of NGS platforms. The SEQC project plans to generate benchmark datasets with reference samples and to evaluate the advantages and limitations of various NGS platforms and bioinformatics strategies in RNA and DNA sequencing.

3.3 Bioinformatics Tools for NGS Data Analyses

Currently, a number of bioinformatics tools are available for analyzing NGS data (Table 2) that can be grouped in four general categories: (i) base calling and polymorphism detection, (ii) alignment of reads to a reference, (iii) de novo assembly, and (iv) genome browsing and annotation. However, these current tools have some limitations, and many challenges and questions remain. Efficient data analysis pipelines are still needed for many applications and the relative advantages and limitations of existing tools need to be objectively evaluated.

Table 2. Bioinformatics tools for next-generation sequencing data analyses
| Tool | Function | Source or reference |
| --- | --- | --- |
| 454 Analysis Tools | Integrated solution | Roche |
| Bfast | Alignment | (Homer, Merriman and Nelson, 2009) |
| Bowtie | Alignment | (Langmead et al., 2009) |
| BreakDancer | Variation detection | (Chen et al., 2009) |
| CLC Genomics Workbench | Integrated solution | CLCBio |
| Cross_match | Alignment | Phil Green |
| EagleView | Assembler viewer | (Huang and Marth, 2008) |
| ELAND | Alignment | Anthony J. Cox, Illumina |
| Exonerate | Alignment | (Slater and Birney, 2005) ∼guy/exonerate |
| Galaxy | Integrated | (Taylor et al., 2007) |
| Genomatix | Integrated solution | Genomatix |
| GMAP | Alignment | (Wu and Watanabe, 2005) |
| JMP Genomics | Viewer and statistical analysis | SAS Institute |
| LookSeq | Viewer | (Manske and Kwiatkowski, 2009) |
| MapView | Viewer | (Bao et al., 2009) |
| MAQ | Alignment and assembly | (Li, Ruan and Durbin, 2008a) |
| Matlab Bioinformatics | Alignment and statistical analysis | The Mathworks™ |
| MUMmer | Alignment | (Ossowski et al., 2008) |
| NextGENe | Integrated solution | Softgenetics |
| PeakSeq | ChIP-Seq | (Rozowsky et al., 2009) |
| PIQA | Integrated pipeline | (Martinez-Alcantara et al., 2009) |
| Rtracklayer | Viewer | (Lawrence, Gentleman and Carey, 2009) |
| SAM | Sequence Assembly Manager | (Warren et al., 2005) |
| SeqMap | Alignment | (Jiang and Wong, 2008) ∼jiangh/SeqMap |
| SeqMan NGen | Integrated solution | DNAStar |
| SHORE | Mapping and analysis pipeline | Korbinian Schneeberger et al. |
| ShortRead | Input, quality assessment | (Morgan et al., 2009) |
| SHRiMP | Alignment | (Rumble et al., 2009) |
| Slider | Alignment and SNP detection | (Malhis et al., 2009) |
| SOAP | Alignment and analysis | (Li et al., 2008a) |
| SSAHA | Alignment | (Ning, Cox and Mullikin, 2001) |
| Vmatch | Alignment | Stefan Kurtz (Delcher et al., 1999) |
| ZOOM | Alignment | (Lin et al., 2008) |

For base calling, most researchers simply use the calls generated with the data-pipeline software provided by manufacturers, but alternative approaches implementing more advanced statistical methodologies are also being developed. For example, Erlich et al. (2008) created an Alta-Cyclic approach that uses machine learning to reduce noise factors, substantially improving the number of accurate reads. Rougemont et al. (2008) proposed an algorithm using model-based clustering and probability theory to improve base-call quality by identifying and removing ambiguous bases from read ends. However, these improvements must be evaluated for cost effectiveness given the need for substantial investment to handle large volumes of raw image data (Voelkerding, Dames and Durtschi, 2009).

Proper alignment is mandatory to render NGS data biologically meaningful. Because of the short read-length, the relatively high error rate in base calling, and the huge volume of data, alignment of data from NGS platforms is much more difficult than that of data from Sanger sequencing platforms (Trapnell and Salzberg, 2009). One limitation in aligning and assembling reads is that a large portion of reads cannot be uniquely aligned to a reference when the reads are too short and the reference is too complex (Voelkerding, Dames and Durtschi, 2009). In addition, the chance of unique alignment or assembly is reduced not only by the presence of repeat sequences in complex genomes, but also by shared homologies within closely related gene families and pseudogenes (Voelkerding, Dames and Durtschi, 2009).
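The effect of read-length on unique alignment can be illustrated by counting how many positions in a reference give rise to a read that occurs only once. The sketch below uses an invented 29-base reference containing two copies of an 8-base repeat; the sequence and function name are hypothetical, and sequencing errors are ignored.

```python
from collections import Counter

def unique_fraction(reference, read_length):
    """Fraction of start positions whose read-length substring occurs only once."""
    reads = [reference[i:i + read_length]
             for i in range(len(reference) - read_length + 1)]
    counts = Counter(reads)
    return sum(1 for r in reads if counts[r] == 1) / len(reads)

repeat = "ACGTTGCA"
reference = "TTGA" + repeat + "CCATG" + repeat + "GGCT"

for length in (4, 8, 9):
    print(length, round(unique_fraction(reference, length), 3))
```

In this toy reference, 4-base reads are unique at only 16 of 26 positions, whereas every read of 9 bases or more spans the repeat plus at least one flanking base and is therefore unique. Real genomes contain far longer repeats and families of near-identical sequences, so the same effect persists at the read-lengths produced by NGS platforms.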

Conventional alignment solutions like BLAST (Altschul et al., 1990) and BLAT (Kent, 2002) are efficient at aligning long reads such as those generated by Sanger sequencing, but are inadequate for handling NGS short reads. Recently, a variety of sequence alignment algorithms and software packages have been developed specifically for processing large numbers of short reads. Table 2 provides an overview of such programs. The algorithms implemented in these software packages vary with the application, but they include sequence alignment, de novo assembly, alignment viewing, and variant discovery. However, the state of the art in short-read alignment and assembly still requires a trade-off between alignment accuracy and computational efficiency.
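One common way to handle this trade-off is a seed-and-extend strategy: index the reference by short exact substrings, look up seeds taken from each read, and verify only the candidate positions. The following is a generic sketch of that idea and does not reproduce the algorithm of any specific tool in Table 2; the reference sequence, seed length, and function names are assumptions made for illustration.

```python
from collections import defaultdict

def build_index(reference, k):
    """Hash every k-mer of the reference to its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def align(read, reference, index, k, max_mismatches=1):
    """Seed with exact k-mer hits, then verify the full read by Hamming distance."""
    hits = set()
    for offset in range(0, len(read) - k + 1, k):   # non-overlapping seeds
        for pos in index.get(read[offset:offset + k], []):
            start = pos - offset
            if start < 0 or start + len(read) > len(reference):
                continue
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                hits.add((start, mismatches))
    return sorted(hits)  # list of (start position, number of mismatches)

reference = "TTGAACGTTGCACCATGACGTTGCAGGCT"
index = build_index(reference, k=4)

print(align("GCACCATG", reference, index, k=4))   # exact read
print(align("GCACCTTG", reference, index, k=4))   # same read with one substitution
```

Because an 8-base read is split into two non-overlapping 4-base seeds, a read with at most one mismatch always has at least one seed that matches exactly, so it is still found. Longer seeds produce fewer candidate positions and run faster, but tolerate fewer mismatches per read, which is a small-scale version of the speed versus accuracy trade-off noted above.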

4 Applications of NGS Technologies

Over the past five years, the NGS technologies have markedly accelerated multiple research areas, making feasible experiments that previously were not affordable or even technically feasible. Novel fields and applications in biology, life sciences and biomedicine are becoming reality. In this section, we describe some major applications of NGS.

4.1 De novo Sequencing or Resequencing of Genomes

The ultra-high throughput and low cost of NGS technologies have made sequencing numerous whole genomes tractable. NGS platforms have been used for de novo sequencing of many bacterial genomes (Chaisson and Pevzner, 2008; Margulies et al., 2005), viral genomes (Harris et al., 2008), and the giant panda genome (Li et al., 2010c), and for resequencing human genomes at dramatically increased speed and decreased cost (Li et al., 2010b; Lin et al., 2008; Pushkarev, Neff and Quake, 2009; Wheeler et al., 2008). These applications have demonstrated the power of NGS technologies for de novo sequencing and for sequencing personal genomes, which will be critical in moving toward personalized genomics and medicine.

4.2 Target Genomic Resequencing

Resequencing of genomic sub-regions or gene sets is fundamental in basic and clinical research seeking causative and predisposition mutations within populations (Dahl et al., 2007; Ding et al., 2008; Okou et al., 2007). The target resequencing strategy involves comparative analysis of candidate genes or genomic sub-regions from two groups of people with different phenotypes, and requires a high level of accuracy to identify low-frequency causative SNPs and structural variants/mutations of diseases that are implicated by linkage studies and genome-wide association studies (Porreca et al., 2007; Yeager et al., 2008). Traditional capillary electrophoresis methods provide the highest accuracy and are best suited for analyzing a limited set of amplicons in a large number of patient samples. However, they are burdensome in cost and labor when investigating a large number of genes or large sub-regions. In contrast, NGS technologies are highly advantageous in terms of both cost and labor, as evidenced by numerous recent studies (Albert et al., 2007; Chou et al., 2010; Hodges et al., 2007; Li et al., 2009b; Okou et al., 2007).

4.3 Chromatin Immunoprecipitation Followed by Sequencing (ChIP-Seq)

ChIP-Seq is a strategy that combines chromatin immunoprecipitation (ChIP), a technique used to determine the location of DNA binding sites for proteins, with NGS technologies to directly sequence DNA fragments and thereby interrogate DNA-protein interactions; it was an early application of NGS (Barski et al., 2007; Johnson et al., 2007; Mikkelsen et al., 2007). By directly sequencing DNA fragments that interact with proteins, ChIP-Seq provides substantially better data than the microarray-based ChIP-chip method, which has been the most commonly used approach for genome-wide profiling of DNA-binding proteins, histone modifications or nucleosomes (Park, 2009). Compared to ChIP-chip, ChIP-Seq has higher resolution, fewer artifacts, greater coverage, and a larger dynamic range. ChIP-Seq can also be used to identify the cistrome of DNA-associated proteins and precisely map global binding sites for any protein of interest (Kaufmann et al., 2010; Ouyang, Zhou and Wong, 2009; Visel et al., 2009).

4.4 Next-generation RNA Sequencing (RNA-Seq)

Applying NGS technologies to sequence RNA or complementary DNA (cDNA) reverse transcribed from the RNAs offers an alternative methodology for high-throughput transcriptome analysis (Marioni et al., 2008; Wang, Gerstein and Snyder, 2009; Wilhelm et al., 2010). In a typical RNA-Seq experiment, RNAs or cDNAs are first sequenced directly with NGS technologies; the sequence reads are then mapped to a reference genome to construct a whole-genome transcriptome map; finally, the transcripts (genes of interest) are characterized (e.g., for alternative splicing) and quantified (Wang, Gerstein and Snyder, 2009).
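The quantification step of this workflow can be sketched in a few lines. The example below is illustrative only: the gene coordinates and read positions are invented, reads are assumed to be already mapped and are represented by a single coordinate, and normalization by gene length and sequencing depth (reads per kilobase per million mapped reads, RPKM) is one common convention that is assumed here rather than taken from this article.

```python
from collections import Counter

# Hypothetical gene models: name -> (chromosome, start, end), 0-based half-open coordinates.
GENES = {
    "geneA": ("chr1", 100, 1100),   # 1000 bp
    "geneB": ("chr1", 2000, 2500),  # 500 bp
}


def count_reads(read_positions, genes):
    """Count mapped reads whose position falls inside each gene interval."""
    counts = Counter({name: 0 for name in genes})
    for chrom, pos in read_positions:
        for name, (g_chrom, start, end) in genes.items():
            if chrom == g_chrom and start <= pos < end:
                counts[name] += 1
                break  # assign each read to at most one gene
    return counts


def rpkm(counts, genes, total_mapped):
    """Normalize counts by gene length (kb) and by total mapped reads (millions)."""
    result = {}
    for name, (_, start, end) in genes.items():
        length_kb = (end - start) / 1000
        result[name] = counts[name] / length_kb / (total_mapped / 1e6)
    return result


# Invented mapped reads as (chromosome, position); the last read hits no gene.
reads = [("chr1", 150), ("chr1", 900), ("chr1", 2100), ("chr1", 5000)]
counts = count_reads(reads, GENES)
values = rpkm(counts, GENES, total_mapped=len(reads))
print(dict(counts), values)
```

Here geneA receives twice as many reads as geneB but is also twice as long, so the two genes obtain the same normalized value. This is the reason raw read counts are usually adjusted for transcript length before expression levels of different genes are compared.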

Thanks to the deep coverage and base-level resolution provided by next-generation sequencing instruments, RNA-Seq gives researchers an efficient way to measure the transcriptome experimentally, for example to determine how different alleles of a gene are expressed, to detect post-transcriptional mutations, or to identify gene fusions.

By directly sequencing the entire transcriptome without prior knowledge of transcribed regions and at deep coverage and base level resolution, RNA-Seq is revolutionary in its abilities to provide precision in measuring transcriptome data (Li et al., 2010a; Marioni et al., 2008). The far higher resolution improves discovery of novel transcripts, differential allele expression, alternative splice variants, post-transcriptional mutations and isoforms compared with more conventional Sanger sequencing and microarray-based approaches (Chepelev et al., 2009; Hittinger et al., 2010; Jiang and Wong, 2009; Perkins et al., 2009; Richard et al., 2010; Sultan et al., 2008; Tang et al., 2009; Trapnell, Pachter and Salzberg, 2009; Wilhelm et al., 2010). Recent studies (Guttman et al., 2009; Li et al., 2009a; Pan et al., 2008; Porreca et al., 2007; Wang et al., 2008) that used RNA-Seq to characterize the RNA populations have provided more complicated pictures of RNA regulation and expression, through alternative splicing, alternative polyadenylation, and RNA editing. These findings have expanded our traditional view of the extent and complexity of gene expression (Licatalosi and Darnell, 2010), and advanced our understanding of mechanisms of RNA expression regulation in both eukaryotic (Jacquier, 2009) and prokaryotic (Sorek and Cossart, 2010) genomes.

4.5 Comparison Between RNA-Seq and Microarrays

To evaluate the technical performance of NGS technologies in quantifying the expression level of transcripts, we recently used data generated from a rat toxicogenomics study to compare the performance of NGS (Illumina Genome Analyzer II) with a microarray-based approach (Affymetrix Rat Genome 230 2.0 arrays) in detecting differentially expressed genes (DEGs) (Su et al., 2010). The RNA samples were the same as those used in the MAQC-I (Shi et al., 2006) validation study, for which the microarray data already existed (Guo et al., 2006). Eight RNA samples, four treatment and four control, were collected from the kidneys of rats treated or not treated (controls) with 10 mg kg−1 body weight of the carcinogen aristolochic acid (AA) (Guo et al., 2006) and then sequenced on an Illumina Genome Analyzer II platform. The RNA sample from each rat was sequenced in one lane, generating over 16 million 36 bp reads per sample. Figure 6(a) shows a scatter plot of NGS log2 FCs (fold changes) versus microarray log2 FCs for 11,202 common genes; the Pearson's correlation coefficient (r) is 0.52. Figure 6(b) plots the 4619 DEGs that were detected by either NGS or microarrays, where in both cases differential expression was determined by the same two criteria, FC > 1.5 and p-value < 0.05. The 3322 red points and 372 magenta points represent genes detected only by NGS and only by microarrays, respectively, whereas the 522 blue points and 402 green points represent commonly selected DEGs that are up-regulated and down-regulated, respectively. One circled point represents the single common gene with differential expression in opposite directions. Among all 4619 DEGs, the Pearson's coefficient for log2 FCs is 0.65 (Figure 6(b)). Although 71% of the DEGs selected from microarray data were also selected from NGS data (Figure 6(d)), only 22% of the DEGs selected from the NGS data were selected from the microarray data, supporting the conjecture that NGS is more sensitive than microarrays in detecting DEGs under the same selection criteria. The log2 FCs for the 925 genes commonly selected by NGS and microarrays are shown in a scatter plot in Figure 6(c), with r = 0.91. Only one of the 925 genes disagrees in the direction of regulation. Hence, the concordance between genes commonly detected as differentially expressed by both platforms is very high.
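The DEG counts and overlap percentages quoted above can be checked with simple arithmetic. The sketch below uses only the numbers reported in this section; the function `is_deg` is a hypothetical helper, and its treatment of the fold-change cut-off as symmetric (up- or down-regulation by more than 1.5-fold) is an assumption about how the criterion was applied.

```python
import math


def is_deg(fold_change: float, p_value: float,
           fc_cutoff: float = 1.5, p_cutoff: float = 0.05) -> bool:
    """Assumed selection rule: |log2 FC| above log2(cutoff) and p-value below cutoff."""
    return abs(math.log2(fold_change)) > math.log2(fc_cutoff) and p_value < p_cutoff


# Counts reported in the text for the DEG scatter plot.
ngs_only = 3322        # red points
array_only = 372       # magenta points
common_up = 522        # blue points
common_down = 402      # green points
common_opposite = 1    # circled point

common = common_up + common_down + common_opposite
total_degs = ngs_only + array_only + common

# Share of each platform's DEG list that the other platform also selected.
pct_array_in_ngs = 100 * common / (array_only + common)
pct_ngs_in_array = 100 * common / (ngs_only + common)

print(common, total_degs)                               # 925 4619
print(round(pct_array_in_ngs), round(pct_ngs_in_array))  # 71 22
```

The totals reproduce the 925 common DEGs and the 4619 DEGs detected by either platform, and the two ratios reproduce the 71% and 22% overlap figures.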

Figure 6.

    Comparison between NGS and microarray data, based on the log2 fold changes (FCs): (a) all 11 202 common transcripts; (b) 4619 differentially expressed genes (DEGs) identified from either NGS or microarray data; (c) 925 DEGs commonly selected from both NGS and microarray data. For each platform, FCs were calculated by comparing four aristolochic acid treated samples to four control samples and DEGs were selected with an FC cut-off of 1.5 and p-value < 0.05. The magenta and red spots in panel (b) are DEGs determined only by microarrays and NGS, respectively. The blue and green spots in (b) and (c) represent up- or down-regulated DEGs commonly determined by NGS and microarrays. The circle in (b) and (c) denotes a gene up-regulated in NGS but down-regulated in microarrays. The Pearson's correlation coefficients (r) between NGS log2 FCs and microarrays log2 FCs are 0.51, 0.65, and 0.91 in (a), (b), and (c), respectively. The Venn diagram in panel (d) shows the overlap between NGS and microarray DEG lists of panel (b). Of 925 common DEGs (b), 924 are altered in the same directions (c).

5 Future Perspective

NGS technologies are substantially impacting basic genomics research, and many more far-reaching impacts are anticipated. Over the past few years, NGS technologies have demonstrated their immense potential for enabling scientific advancement in an ever-increasing diversity of biological and medical research areas (Sultan et al., 2008). In the next several years, NGS technologies are anticipated to transition into broader areas including disease etiology, new drug development, clinical diagnostics, personalized medicine and nutrition, as well as toxicogenomics. A continued successful transition will require further reduction in sequencing cost, improved read accuracy, more streamlined sample preparation, and, perhaps more importantly, computer-based analytics for data acquisition, management, validation, analysis, and biological interpretation.

6 Disclaimer

This article is not an official guidance or policy statement of the U.S. Food and Drug Administration (FDA). No official support or endorsement by the U.S. FDA is intended or should be inferred.


References

  • Albert TJ, Molla MN, Muzny DM, Nazareth L, Wheeler D, Song X, Richmond TA, Middle CM, Rodesch MJ, Packard CJ, Weinstock GM, Gibbs RA. 2007. Direct selection of human genomic loci by microarray hybridization. Nat. Methods 4(11): 903–905.
  • Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. 1990. Basic local alignment search tool. J. Mol. Biol. 215(3): 403–410.
  • Bao H, Guo H, Wang J, Zhou R, Lu X, Shi S. 2009. MapView: visualization of short reads alignment on a desktop computer. Bioinformatics 25(12): 1554–1555.
  • Barski A, Cuddapah S, Cui K, Roh TY, Schones DE, Wang Z, Wei G, Chepelev I, Zhao K. 2007. High-resolution profiling of histone methylations in the human genome. Cell 129(4): 823–837.
  • Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, Brown CG, Hall KP, Evers DJ, Barnes CL, Bignell HR, Boutell JM, Bryant J, Carter RJ, Keira Cheetham R, Cox AJ, Ellis DJ, Flatbush MR, Gormley NA, Humphray SJ, Irving LJ, Karbelashvili MS, Kirk SM, Li H, Liu X, Maisinger KS, Murray LJ, Obradovic B, Ost T, Parkinson ML, Pratt MR, Rasolonjatovo IM, Reed MT, Rigatti R, Rodighiero C, Ross MT, Sabot A, Sankar SV, Scally A, Schroth GP, Smith ME, Smith VP, Spiridou A, Torrance PE, Tzonev SS, Vermaas EH, Walter K, Wu X, Zhang L, Alam MD, Anastasi C, Aniebo IC, Bailey DM, Bancarz IR, Banerjee S, Barbour SG, Baybayan PA, Benoit VA, Benson KF, Bevis C, Black PJ, Boodhun A, Brennan JS, Bridgham JA, Brown RC, Brown AA, Buermann DH, Bundu AA, Burrows JC, Carter NP, Castillo N, Chiara ECM, Chang S, Neil Cooley R, Crake NR, Dada OO, Diakoumakos KD, Dominguez-Fernandez B, Earnshaw DJ, Egbujor UC, Elmore DW, Etchin SS, Ewan MR, Fedurco M, Fraser LJ, Fuentes Fajardo KV, Scott Furey W, George D, Gietzen KJ, Goddard CP, Golda GS, Granieri PA, Green DE, Gustafson DL, Hansen NF, Harnish K, Haudenschild CD, Heyer NI, Hims MM, Ho JT, Horgan AM, Hoschler K, Hurwitz S, Ivanov DV, Johnson MQ, James T, Huw Jones TA, Kang GD, Kerelska TH, Kersey AD, Khrebtukova I, Kindwall AP, Kingsbury Z, Kokko-Gonzales PI, Kumar A, Laurent MA, Lawley CT, Lee SE, Lee X, Liao AK, Loch JA, Lok M, Luo S, Mammen RM, Martin JW, McCauley PG, McNitt P, Mehta P, Moon KW, Mullens JW, Newington T, Ning Z, Ling Ng B, Novo SM, O'Neill MJ, Osborne MA, Osnowski A, Ostadan O, Paraschos LL, Pickering L, Pike AC, Pike AC, Chris Pinkard D, Pliskin DP, Podhasky J, Quijano VJ, Raczy C, Rae VH, Rawlings SR, Chiva Rodriguez A, Roe PM, Rogers J, Rogert Bacigalupo MC, Romanov N, Romieu A, Roth RK, Rourke NJ, Ruediger ST, Rusman E, Sanches-Kuiper RM, Schenker MR, Seoane JM, Shaw RJ, Shiver MK, Short SW, Sizto NL, Sluis JP, Smith MA, Ernest Sohna Sohna J, Spence EJ, Stevens K, Sutton N, Szajkowski L, Tregidgo CL, 
Turcatti G, Vandevondele S, Verhovsky Y, Virk SM, Wakelin S, Walcott GC, Wang J, Worsley GJ, Yan J, Yau L, Zuerlein M, Rogers J, Mullikin JC, Hurles ME, McCooke NJ, West JS, Oaks FL, Lundberg PL, Klenerman D, Durbin R, Smith AJ. 2008. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456(7218): 53–59.
  • Bowers J, Mitchell J, Beer E, Buzby PR, Causey M, Efcavitch JW, Jarosz M, Krzymanska-Olejnik E, Kung L, Lipson D, Lowman GM, Marappan S, McInerney P, Platt A, Roy A, Siddiqi SM, Steinmann K, Thompson JF. 2009. Virtual terminator nucleotides for next-generation DNA sequencing. Nat. Methods 6(8): 593–595.
  • Braslavsky I, Hebert B, Kartalov E, Quake SR. 2003. Sequence information can be obtained from single DNA molecules. Proc. Natl. Acad. Sci. USA 100(7): 3960–3964.
  • Chaisson MJ, Pevzner PA. 2008. Short read fragment assembly of bacterial genomes. Genome Res. 18(2): 324–330.
  • Chen K, Wallis JW, McLellan MD, Larson DE, Kalicki JM, Pohl CS, McGrath SD, Wendl MC, Zhang Q, Locke DP, Shi X, Fulton RS, Ley TJ, Wilson RK, Ding L, Mardis ER. 2009. BreakDancer: an algorithm for high-resolution mapping of genomic structural variation. Nat. Methods 6(9): 677–681.
  • Chepelev I, Wei G, Tang Q, Zhao K. 2009. Detection of single nucleotide variations in expressed exons of the human genome using RNA-Seq. Nucleic Acids Res. 37(16): e106.
  • Chou LS, Liu CS, Boese B, Zhang X, Mao R. 2010. DNA sequence capture and enrichment by microarray followed by next-generation sequencing for targeted resequencing: neurofibromatosis type 1 gene as a model. Clin. Chem. 56(1): 62–72.
  • Collins FS, Lander ES, Rogers J, Waterston RH, Consortium IHGS. 2004. Finishing the euchromatic sequence of the human genome. Nature 431(7011): 931–945.
  • Dahl F, Stenberg J, Fredriksson S, Welch K, Zhang M, Nilsson M, Bicknell D, Bodmer WF, Davis RW, Ji H. 2007. Multigene amplification and massively parallel sequencing for cancer mutation discovery. Proc. Natl. Acad. Sci. USA 104(22): 9387–9392.
  • Delcher AL, Kasif S, Fleischmann RD, Peterson J, White O, Salzberg SL. 1999. Alignment of whole genomes. Nucleic Acids Res. 27(11): 2369–2376.
  • Ding L, Getz G, Wheeler DA, Mardis ER, McLellan MD, Cibulskis K, Sougnez C, Greulich H, Muzny DM, Morgan MB, et al., 2008. Somatic mutations affect key pathways in lung adenocarcinoma. Nature 455(7216): 1069–1075.
  • Dohm JC, Lottaz C, Borodina T, Himmelbauer H. 2008. Substantial biases in ultra-short read data sets from high-throughput DNA sequencing. Nucleic Acids Res. 36(16): e105.
  • Editorial. 2008. Prepare for the deluge. Nat. Biotechnol. 26(10): 1099.
  • Erlich Y, Mitra PP, delaBastide M, McCombie WR, Hannon GJ. 2008. Alta-Cyclic: a self-optimizing base caller for next-generation sequencing. Nat. Methods 5(8): 679–682.
  • Fuller CW, Middendorf LR, Benner SA, Church GM, Harris T, Huang X, Jovanovich SB, Nelson JR, Schloss JA, Schwartz DC. 2009. The challenges of sequencing by synthesis. Nat. Biotechnol. 27(11): 1013–1023.
  • Guo L, Lobenhofer EK, Wang C, Shippy R, Harris SC, Zhang L, Mei N, Chen T, Herman D, Goodsaid FM, 2006. Rat toxicogenomic study reveals analytical consistency across microarray platforms. Nat. Biotechnol. 24(9): 1162–1169.
  • Guttman M, Amit I, Garber M, French C, Lin MF, Feldser D, Huarte M, Zuk O, Carey BW, Cassady JP. 2009. Chromatin signature reveals over a thousand highly conserved large non-coding RNAs in mammals. Nature 458(7235): 223–227.
  • Harris TD, Buzby PR, Babcock H, Beer E, Bowers J, Braslavsky I, Causey M, Colonell J, Dimeo J, Efcavitch JW. 2008. Single-molecule DNA sequencing of a viral genome. Science 320(5872): 106–109.
  • Hillier LW, Marth GT, Quinlan AR, Dooling D, Fewell G, Barnett D, Fox P, Glasscock JI, Hickenbotham M, Huang W, Magrini VJ, Richt RJ, Sander SN, Stewart DA, Stromberg M, Tsung EF, Wylie T, Schedl T, Wilson RK, Mardis ER. 2008. Whole-genome sequencing and variant discovery in C. elegans. Nat. Methods 5(2): 183–188.
  • Hittinger CT, Johnston M, Tossberg JT, Rokas A. 2010. Leveraging skewed transcript abundance by RNA-Seq to increase the genomic depth of the tree of life. Proc. Natl. Acad. Sci. USA 107(4): 1476–1481.
  • Hodges E, Xuan Z, Balija V, Kramer M, Molla MN, Smith SW, Middle CM, Rodesch MJ, Albert TJ, Hannon GJ, McCombie WR. 2007. Genome-wide in situ exon capture for selective resequencing. Nat. Genet. 39(12): 1522–1527.
  • Homer N, Merriman B, Nelson SF. 2009. BFAST: an alignment tool for large scale genome resequencing. PLoS One 4(11): e7767.
  • Huang W, Marth G. 2008. EagleView: a genome assembly viewer for next-generation sequencing technologies. Genome Res. 18(9): 1538–1543.
  • Hutchison CA, 3rd. 2007. DNA sequencing: bench to bedside and beyond. Nucleic Acids Res. 35(18): 6227–6237.
  • Jacquier A. 2009. The complex eukaryotic transcriptome: unexpected pervasive transcription and novel small RNAs. Nat. Rev. Genet. 10(12): 833–844.
  • Jiang H, Wong WH. 2008. SeqMap: mapping massive amount of oligonucleotides to the genome. Bioinformatics 24(20): 2395–2396.
  • Jiang H, Wong WH. 2009. Statistical inferences for isoform expression in RNA-Seq. Bioinformatics 25(8): 1026–1032.
  • Johnson DS, Mortazavi A, Myers RM, Wold B. 2007. Genome-wide mapping of in vivo protein-DNA interactions. Science 316(5830): 1497–1502.
  • Kaufmann K, Muino JM, Osteras M, Farinelli L, Krajewski P, Angenent GC. 2010. Chromatin immunoprecipitation (ChIP) of plant transcription factors followed by sequencing (ChIP-SEQ) or hybridization to whole genome arrays (ChIP-CHIP). Nat. Protoc. 5(3): 457–472.
  • Kent WJ. 2002. BLAT–the BLAST-like alignment tool. Genome Res. 12(4): 656–664.
  • Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, Funke R, Gage D, Harris K, Heaford A, Howland J, Kann L, Lehoczky J, LeVine R, McEwan P, McKernan K, Meldrim J, Mesirov JP, Miranda C, Morris W, Naylor J, Raymond C, Rosetti M, Santos R, Sheridan A, Sougnez C, Stange-Thomann N, Stojanovic N, Subramanian A, Wyman D, Rogers J, Sulston J, Ainscough R, Beck S, Bentley D, Burton J, Clee C, Carter N, Coulson A, Deadman R, Deloukas P, Dunham A, Dunham I, Durbin R, French L, Grafham D, Gregory S, Hubbard T, Humphray S, Hunt A, Jones M, Lloyd C, McMurray A, Matthews L, Mercer S, Milne S, Mullikin JC, Mungall A, Plumb R, Ross M, Shownkeen R, Sims S, Waterston RH, Wilson RK, Hillier LW, McPherson JD, Marra MA, Mardis ER, Fulton LA, Chinwalla AT, Pepin KH, Gish WR, Chissoe SL, Wendl MC, Delehaunty KD, Miner TL, Delehaunty A, Kramer JB, Cook LL, Fulton RS, Johnson DL, Minx PJ, Clifton SW, Hawkins T, Branscomb E, Predki P, Richardson P, Wenning S, Slezak T, Doggett N, Cheng JF, Olsen A, Lucas S, Elkin C, Uberbacher E, Frazier M, Gibbs RA, Muzny DM, Scherer SE, Bouck JB, Sodergren EJ, Worley KC, Rives CM, Gorrell JH, Metzker ML, Naylor SL, Kucherlapati RS, Nelson DL, Weinstock GM, Sakaki Y, Fujiyama A, Hattori M, Yada T, Toyoda A, Itoh T, Kawagoe C, Watanabe H, Totoki Y, Taylor T, Weissenbach J, Heilig R, Saurin W, Artiguenave F, Brottier P, Bruls T, Pelletier E, Robert C, Wincker P, Smith DR, Doucette-Stamm L, Rubenfield M, Weinstock K, Lee HM, Dubois J, Rosenthal A, Platzer M, Nyakatura G, Taudien S, Rump A, Yang H, Yu J, Wang J, Huang G, Gu J, Hood L, Rowen L, Madan A, Qin S, Davis RW, Federspiel NA, Abola AP, Proctor MJ, Myers RM, Schmutz J, Dickson M, Grimwood J, Cox DR, Olson MV, Kaul R, Raymond C, Shimizu N, Kawasaki K, Minoshima S, Evans GA, Athanasiou M, Schultz R, Roe BA, Chen F, Pan H, Ramser J, Lehrach H, Reinhardt R, McCombie WR, de la Bastide M, Dedhia N, Blocker H, Hornischer K, Nordsiek G, Agarwala R, Aravind 
L, Bailey JA, Bateman A, Batzoglou S, Birney E, Bork P, Brown DG, Burge CB, Cerutti L, Chen HC, Church D, Clamp M, Copley RR, Doerks T, Eddy SR, Eichler EE, Furey TS, Galagan J, Gilbert JG, Harmon C, Hayashizaki Y, Haussler D, Hermjakob H, Hokamp K, Jang W, Johnson LS, Jones TA, Kasif S, Kaspryzk A, Kennedy S, Kent WJ, Kitts P, Koonin EV, Korf I, Kulp D, Lancet D, Lowe TM, McLysaght A, Mikkelsen T, Moran JV, Mulder N, Pollara VJ, Ponting CP, Schuler G, Schultz J, Slater G, Smit AF, Stupka E, Szustakowski J, Thierry-Mieg D, Thierry-Mieg J, Wagner L, Wallis J, Wheeler R, Williams A, Wolf YI, Wolfe KH, Yang SP, Yeh RF, Collins F, Guyer MS, Peterson J, Felsenfeld A, Wetterstrand KA, Patrinos A, Morgan MJ, de Jong P, Catanese JJ, Osoegawa K, Shizuya H, Choi S, Chen YJ. 2001. Initial sequencing and analysis of the human genome. Nature 409(6822): 860–921.
  • Langmead B, Trapnell C, Pop M, Salzberg SL. 2009. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10(3): R25.
  • Lawrence M, Gentleman R, Carey V. 2009. rtracklayer: an R package for interfacing with genome browsers. Bioinformatics 25(14): 1841–1842.
  • Li H, Ruan J, Durbin R. 2008a. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 18(11): 1851–1858.
  • Li R, Li Y, Kristiansen K, Wang J. 2008b. SOAP: short oligonucleotide alignment program. Bioinformatics 24(5): 713–714.
  • Li JB, Levanon EY, Yoon JK, Aach J, Xie B, Leproust E, Zhang K, Gao Y, Church GM. 2009a. Genome-wide identification of human RNA editing sites by parallel DNA capturing and sequencing. Science 324(5931): 1210–1213.
  • Li R, Li Y, Fang X, Yang H, Wang J, Kristiansen K. 2009b. SNP detection for massively parallel whole-genome resequencing. Genome Res. 19(6): 1124–1132.
  • Li B, Ruotti V, Stewart RM, Thomson JA, Dewey CN. 2010a. RNA-Seq gene expression estimation with read mapping uncertainty. Bioinformatics 26(4): 493–500.
  • Li R, Zhu H, Ruan J, Qian W, Fang X, Shi Z, Li Y, Li S, Shan G, Kristiansen K. 2010b. De novo assembly of human genomes with massively parallel short read sequencing. Genome Res. 20(2): 265–272.
  • Li R, Fan W, Tian G, Zhu H, He L, Cai J, Huang Q, Cai Q, Li B, Bai Y, Zhang Z, Zhang Y, Wang W, Li J, Wei F, Li H, Jian M, Li J, Zhang Z, Nielsen R, Li D, Gu W, Yang Z, Xuan Z, Ryder OA, Leung FC, Zhou Y, Cao J, Sun X, Fu Y, Fang X, Guo X, Wang B, Hou R, Shen F, Mu B, Ni P, Lin R, Qian W, Wang G, Yu C, Nie W, Wang J, Wu Z, Liang H, Min J, Wu Q, Cheng S, Ruan J, Wang M, Shi Z, Wen M, Liu B, Ren X, Zheng H, Dong D, Cook K, Shan G, Zhang H, Kosiol C, Xie X, Lu Z, Zheng H, Li Y, Steiner CC, Lam TT, Lin S, Zhang Q, Li G, Tian J, Gong T, Liu H, Zhang D, Fang L, Ye C, Zhang J, Hu W, Xu A, Ren Y, Zhang G, Bruford MW, Li Q, Ma L, Guo Y, An N, Hu Y, Zheng Y, Shi Y, Li Z, Liu Q, Chen Y, Zhao J, Qu N, Zhao S, Tian F, Wang X, Wang H, Xu L, Liu X, Vinar T, Wang Y, Lam TW, Yiu SM, Liu S, Zhang H, Li D, Huang Y, Wang X, Yang G, Jiang Z, Wang J, Qin N, Li L, Li J, Bolund L, Kristiansen K, Wong GK, Olson M, Zhang X, Li S, Yang H, Wang J, Wang J. 2010c. The sequence and de novo assembly of the giant panda genome. Nature 463(7279): 311–317.
  • Licatalosi DD, Darnell RB. 2010. RNA processing and its regulation: global insights into biological networks. Nat. Rev. Genet. 11(1): 75–87.
  • Lin H, Zhang Z, Zhang MQ, Ma B, Li M. 2008. ZOOM! Zillions of oligos mapped. Bioinformatics 24(21): 2431–2437.
  • Malhis N, Butterfield YS, Ester M, Jones SJ. 2009. Slider–maximum use of probability information for alignment of short sequence reads and SNP detection. Bioinformatics 25(1): 6–13.
  • Manske HM, Kwiatkowski DP. 2009. LookSeq: a browser-based viewer for deep sequencing data. Genome Res. 19(11): 2125–2132.
  • Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, Berka J, Braverman MS, Chen YJ, Chen Z, Dewell SB, Du L, Fierro JM, Gomes XV, Godwin BC, He W, Helgesen S, Ho CH, Irzyk GP, Jando SC, Alenquer ML, Jarvie TP, Jirage KB, Kim JB, Knight JR, Lanza JR, Leamon JH, Lefkowitz SM, Lei M, Li J, Lohman KL, Lu H, Makhijani VB, McDade KE, McKenna MP, Myers EW, Nickerson E, Nobile JR, Plant R, Puc BP, Ronan MT, Roth GT, Sarkis GJ, Simons JF, Simpson JW, Srinivasan M, Tartaro KR, Tomasz A, Vogt KA, Volkmer GA, Wang SH, Wang Y, Weiner MP, Yu P, Begley RF, Rothberg JM. 2005. Genome sequencing in microfabricated high-density picolitre reactors. Nature 437(7057): 376–380.
  • Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y. 2008. RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 18(9): 1509–1517.
  • Martinez-Alcantara A, Ballesteros E, Feng C, Rojas M, Koshinsky H, Fofanov VY, Havlak P, Fofanov Y. 2009. PIQA: pipeline for Illumina G1 genome analyzer data quality assessment. Bioinformatics 25(18): 2438–2439.
  • Mckernan K, Blanchard A, Kotler L, Costa G. 2008. Reagents, methods, and libraries for bead-based sequencing. US patent US/20080003571A1.
  • Metzker ML. 2010. Sequencing technologies - the next generation. Nat. Rev. Genet. 11(1): 31–46.
  • Mikkelsen TS, Ku M, Jaffe DB, Issac B, Lieberman E, Giannoukos G, Alvarez P, Brockman W, Kim TK, Koche RP, Lee W, Mendenhall E, O'Donovan A, Presser A, Russ C, Xie X, Meissner A, Wernig M, Jaenisch R, Nusbaum C, Lander ES, Bernstein BE. 2007. Genome-wide maps of chromatin state in pluripotent and lineage-committed cells. Nature 448(7153): 553–560.
  • Morgan M, Anders S, Lawrence M, Aboyoun P, Pages H, Gentleman R. 2009. ShortRead: a bioconductor package for input, quality assessment and exploration of high-throughput sequence data. Bioinformatics 25(19): 2607–2608.
  • Ning Z, Cox AJ, Mullikin JC. 2001. SSAHA: a fast search method for large DNA databases. Genome Res. 11(10): 1725–1729.
  • Nyren P, Pettersson B, Uhlen M. 1993. Solid phase DNA minisequencing by an enzymatic luminometric inorganic pyrophosphate detection assay. Anal. Biochem. 208(1): 171–175.
  • Okou DT, Steinberg KM, Middle C, Cutler DJ, Albert TJ, Zwick ME. 2007. Microarray-based genomic selection for high-throughput resequencing. Nat. Methods 4(11): 907–909.
  • Ossowski S, Schneeberger K, Clark RM, Lanz C, Warthmann N, Weigel D. 2008. Sequencing of natural strains of Arabidopsis thaliana with short reads. Genome Res. 18(12): 2024–2033.
  • Ouyang Z, Zhou Q, Wong WH. 2009. ChIP-Seq of transcription factors predicts absolute and differential gene expression in embryonic stem cells. Proc. Natl. Acad. Sci. USA 106(51): 21521–21526.
  • Ozsolak F, Platt AR, Jones DR, Reifenberger JG, Sass LE, McInerney P, Thompson JF, Bowers J, Jarosz M, Milos PM. 2009. Direct RNA sequencing. Nature 461(7265): 814–818.
  • Pan Q, Shai O, Lee LJ, Frey BJ, Blencowe BJ. 2008. Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nat. Genet. 40(12): 1413–1415.
  • Park PJ. 2009. ChIP-seq: advantages and challenges of a maturing technology. Nat. Rev. Genet. 10(10): 669–680.
  • Perkins TT, Kingsley RA, Fookes MC, Gardner PP, James KD, Yu L, Assefa SA, He M, Croucher NJ, Pickard DJ, Maskell DJ, Parkhill J, Choudhary J, Thomson NR, Dougan G. 2009. A strand-specific RNA-Seq analysis of the transcriptome of the typhoid bacillus Salmonella typhi. PLoS Genet. 5(7): e1000569.
  • Porreca GJ, Zhang K, Li JB, Xie B, Austin D, Vassallo SL, LeProust EM, Peck BJ, Emig CJ, Dahl F. 2007. Multiplex amplification of large sets of human exons. Nat. Methods 4(11): 931–936.
  • Pushkarev D, Neff NF, Quake SR. 2009. Single-molecule sequencing of an individual human genome. Nat. Biotechnol. 27(9): 847–852.
  • Richard H, Schulz MH, Sultan M, Nurnberger A, Schrinner S, Balzereit D, Dagand E, Rasche A, Lehrach H, Vingron M, Haas SA, Yaspo ML. 2010. Prediction of alternative isoforms from exon expression levels in RNA-Seq experiments. Nucleic Acids Res. 38(10): e112.
  • Ronaghi M, Karamohamed S, Pettersson B, Uhlen M, Nyren P. 1996. Real-time DNA sequencing using detection of pyrophosphate release. Anal. Biochem. 242(1): 84–89.
  • Rothberg JM, Leamon JH. 2008. The development and impact of 454 sequencing. Nat. Biotechnol. 26(10): 1117–1124.
  • Rougemont J, Amzallag A, Iseli C, Farinelli L, Xenarios I, Naef F. 2008. Probabilistic base calling of Solexa sequencing data. BMC Bioinf. 9: 431.
  • Rozowsky J, Euskirchen G, Auerbach RK, Zhang ZD, Gibson T, Bjornson R, Carriero N, Snyder M, Gerstein MB. 2009. PeakSeq enables systematic scoring of ChIP-seq experiments relative to controls. Nat. Biotechnol. 27(1): 66–75.
  • Rumble SM, Lacroute P, Dalca AV, Fiume M, Sidow A, Brudno M. 2009. SHRiMP: accurate mapping of short color-space reads. PLoS Comput. Biol. 5(5): e1000386.
  • Sanger F, Nicklen S, Coulson AR. 1977. DNA sequencing with chain-terminating inhibitors. Proc. Natl. Acad. Sci. USA 74(12): 5463–5467.
  • Shendure J, Ji H. 2008. Next-generation DNA sequencing. Nat. Biotechnol. 26(10): 1135–1145.
  • Shendure J, Porreca GJ, Reppas NB, Lin X, McCutcheon JP, Rosenbaum AM, Wang MD, Zhang K, Mitra RD, Church GM. 2005. Accurate multiplex polony sequencing of an evolved bacterial genome. Science 309(5741): 1728–1732.
  • Shi L, Reid LH, Jones WD, Shippy R, Warrington JA, Baker SC, Collins PJ, de Longueville F, Kawasaki ES, Lee KY, Luo Y, Sun YA, Willey JC, Setterquist RA, Fischer GM, Tong W, Dragan YP, Dix DJ, Frueh FW, Goodsaid FM, Herman D, Jensen RV, Johnson CD, Lobenhofer EK, Puri RK, Scherf U, Thierry-Mieg J, Wang C, Wilson M, Wolber PK, Zhang L, Amur S, Bao W, Barbacioru CC, Lucas AB, Bertholet V, Boysen C, Bromley B, Brown D, Brunner A, Canales R, Cao XM, Cebula TA, Chen JJ, Cheng J, Chu TM, Chudin E, Corson J, Corton JC, Croner LJ, Davies C, Davison TS, Delenstarr G, Deng X, Dorris D, Eklund AC, Fan XH, Fang H, Fulmer-Smentek S, Fuscoe JC, Gallagher K, Ge W, Guo L, Guo X, Hager J, Haje PK, Han J, Han T, Harbottle HC, Harris SC, Hatchwell E, Hauser CA, Hester S, Hong H, Hurban P, Jackson SA, Ji H, Knight CR, Kuo WP, LeClerc JE, Levy S, Li QZ, Liu C, Liu Y, Lombardi MJ, Ma Y, Magnuson SR, Maqsodi B, McDaniel T, Mei N, Myklebost O, Ning B, Novoradovskaya N, Orr MS, Osborn TW, Papallo A, Patterson TA, Perkins RG, Peters EH, Peterson R, Philips KL, Pine PS, Pusztai L, Qian F, Ren H, Rosen M, Rosenzweig BA, Samaha RR, Schena M, Schroth GP, Shchegrova S, Smith DD, Staedtler F, Su Z, Sun H, Szallasi Z, Tezak Z, Thierry-Mieg D, Thompson KL, Tikhonova I, Turpaz Y, Vallanat B, Van C, Walker SJ, Wang SJ, Wang Y, Wolfinger R, Wong A, Wu J, Xiao C, Xie Q, Xu J, Yang W, Zhang L, Zhong S, Zong Y, Slikker W Jr. 2006. The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nat. Biotechnol. 24(9): 1151–1161.
  • Shi L, Jones WD, Jensen RV, Harris SC, Perkins RG, Goodsaid FM, Guo L, Croner LJ, Boysen C, Fang H, Qian F, Amur S, Bao W, Barbacioru CC, Bertholet V, Cao XM, Chu TM, Collins PJ, Fan XH, Frueh FW, Fuscoe JC, Guo X, Han J, Herman D, Hong H, Kawasaki ES, Li QZ, Luo Y, Ma Y, Mei N, Peterson RL, Puri RK, Shippy R, Su Z, Sun YA, Sun H, Thorn B, Turpaz Y, Wang C, Wang SJ, Warrington JA, Willey JC, Wu J, Xie Q, Zhang L, Zhang L, Zhong S, Wolfinger RD, Tong W. 2008. The balance of reproducibility, sensitivity, and specificity of lists of differentially expressed genes in microarray studies. BMC Bioinf. 9 (Suppl 9): S10.
  • Slater GS, Birney E. 2005. Automated generation of heuristics for biological sequence comparison. BMC Bioinf. 6: 31.
  • Sorek R, Cossart P. 2010. Prokaryotic transcriptomics: a new view on regulation, physiology and pathogenicity. Nat. Rev. Genet. 11(1): 9–16.
  • Su Z, Li Z, Chen T, Li Q-z, Fang H, Ding D, Ge W, Ning B, Hong H, Perkins R, Tong W, Shi L. 2010. Comparative Analysis of Gene Expression Detected by Next-Generation Sequencing (NGS) and Microarray Technologies in a Rat Toxicogenomics Study. BMC Bioinf. 9 (Suppl 9): S10.
  • Sultan M, Schulz MH, Richard H, Magen A, Klingenhoff A, Scherf M, Seifert M, Borodina T, Soldatov A, Parkhomchuk D, Schmidt D, O'Keeffe S, Haas S, Vingron M, Lehrach H, Yaspo ML. 2008. A global view of gene activity and alternative splicing by deep sequencing of the human transcriptome. Science 321(5891): 956–960.
  • Tang F, Barbacioru C, Wang Y, Nordman E, Lee C, Xu N, Wang X, Bodeau J, Tuch BB, Siddiqui A. 2009. mRNA-Seq whole-transcriptome analysis of a single cell. Nat. Methods 6(5): 377–382.
  • Tawfik DS, Griffiths AD. 1998. Man-made cell-like compartments for molecular evolution. Nat. Biotechnol. 16(7): 652–656.
  • Taylor J, Schenck I, Blankenberg D, Nekrutenko A. 2007. Using Galaxy to Perform Large-Scale Interactive Data Analyses. Curr. Protoc. Bioinf. 19(10): 110.
  • Trapnell C, Salzberg SL. 2009. How to map billions of short reads onto genomes. Nat. Biotechnol. 27(5): 455–457.
  • Trapnell C, Pachter L, Salzberg SL. 2009. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 25(9): 1105–1111.
  • Visel A, Blow MJ, Li Z, Zhang T, Akiyama JA, Holt A, Plajzer-Frick I, Shoukry M, Wright C, Chen F, Afzal V, Ren B, Rubin EM, Pennacchio LA. 2009. ChIP-seq accurately predicts tissue-specific activity of enhancers. Nature 457(7231): 854–858.
  • Voelkerding KV, Dames SA, Durtschi JD. 2009. Next-generation sequencing: from basic research to diagnostics. Clin. Chem. 55(4): 641–658.
  • Wang ET, Sandberg R, Luo S, Khrebtukova I, Zhang L, Mayr C, Kingsmore SF, Schroth GP, Burge CB. 2008. Alternative isoform regulation in human tissue transcriptomes. Nature 456(7221): 470–476.
  • Wang Z, Gerstein M, Snyder M. 2009. RNA-Seq: a revolutionary tool for transcriptomics. Nat. Rev. Genet. 10(1): 57–63.
  • Warren RL, Butterfield YS, Morin RD, Siddiqui AS, Marra MA, Jones SJ. 2005. Management and visualization of whole genome shotgun assemblies using SAM. Biotechniques 38(5): 715–716, 718, 720.
  • Wheeler DA, Srinivasan M, Egholm M, Shen Y, Chen L, McGuire A, He W, Chen YJ, Makhijani V, Roth GT, Gomes X, Tartaro K, Niazi F, Turcotte CL, Irzyk GP, Lupski JR, Chinault C, Song XZ, Liu Y, Yuan Y, Nazareth L, Qin X, Muzny DM, Margulies M, Weinstock GM, Gibbs RA, Rothberg JM. 2008. The complete genome of an individual by massively parallel DNA sequencing. Nature 452(7189): 872–876.
  • Wilhelm BT, Marguerat S, Goodhead I, Bahler J. 2010. Defining transcribed regions using RNA-seq. Nat. Protoc. 5(2): 255–266.
  • Wu TD, Watanabe CK. 2005. GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics 21(9): 1859–1875.
  • Yeager M, Xiao N, Hayes RB, Bouffard P, Desany B, Burdett L, Orr N, Matthews C, Qi L, Crenshaw A, Markovic Z, Fredrikson KM, Jacobs KB, Amundadottir L, Jarvie TP, Hunter DJ, Hoover R, Thomas G, Harkins TT, Chanock SJ. 2008. Comprehensive resequence analysis of a 136 kb region of human chromosome 8q24 associated with prostate and colon cancers. Hum. Genet. 124(2): 161–170.