Discovering genetic polymorphisms in next-generation sequencing data
* Correspondence (fax +61 (0)73346 2101; e-mail Dave.Edwards@uq.edu.au)
The ongoing revolution in DNA sequencing technology now enables the reading of thousands of millions of nucleotide bases in a single instrument run. However, this data quantity is often compromised by poor confidence in the read quality. The identification of genetic polymorphisms from these data is therefore problematic and, combined with the vast quantity of data, poses a major bioinformatics challenge. However, once these difficulties have been addressed, next-generation sequencing will offer a means to identify and characterize the wealth of genetic polymorphisms underlying the vast phenotypic variation in biological systems. We describe the recent advances in next-generation sequencing technology, together with preliminary approaches that can be applied for single nucleotide polymorphism discovery in plant species.
Advances in sequencing technology have enabled the production of massive amounts of data: ‘next-generation sequencing technology’ typically produces millions of short sequence reads (25–400 bp) per run. However, these large numbers of relatively short reads are usually achieved at the expense of read accuracy. The first commercially available next-generation sequencing system was developed by 454 and commercialized by Roche (Basel, Switzerland) as the GS20, capable of sequencing over 20 million base pairs, in the form of 100-bp reads, in just over 4 h. The GS20 was replaced during 2007 by the GS FLX model, capable of producing over 100 million base pairs of sequence in a similar amount of time. Roche and 454 continue to improve data production with the expectation of 400–500 Mbp of sequence per run, and an increase to 500-bp reads with the release of their Titanium system towards the end of 2008. Two alternative ultrahigh-throughput sequencing systems now compete with the GS FLX: Solexa technology, commercialized by Illumina (San Diego, California, USA), and the SOLiD system from Applied Biosystems (AB) (Carlsbad, California, USA). A summary of each of these technologies is given in Table 1. Next-generation sequencing systems can be conveniently split into those that produce short reads and those that produce relatively longer reads.
Table 1. Properties of the next-generation sequencing systems, including year of launch, sequence read length, number of reads, throughput per run and approximate cost per GB, current for September 2008
|Read length (bp)||800||250||35–75||25–35||32||?|
|Reads per run||96||400K||130M||150M||85M||?|
|Throughput per run||0.1 MB||100 MB||10 GB||5 GB||2 GB||?|
|Cost per GB||> $2500K||$84K||$2K||$4K||?||?|
Short-read sequencing systems
The Solexa sequencing system, sold as the Illumina Genome Analyser II, uses reversible terminator chemistry to generate more than 10 000 million bases of usable data per run. Sequencing templates are immobilized on a flow cell surface. Solid phase amplification creates clusters of up to 1000 identical copies of each DNA molecule, with densities of up to 10 million clusters per square centimetre. The sequencing reaction then proceeds with four proprietary fluorescently labelled nucleotides to sequence the millions of clusters on the flow cell surface. These labelled nucleotides possess a reversible termination property, allowing each cycle of the sequencing reaction to occur simultaneously in the presence of the four nucleotides. Each raw read base has an assigned quality score, assisting assembly and sequence comparisons. Solexa sequencing has been developed predominantly for re-sequencing, with more than 10-fold coverage ensuring high confidence in the determination of genetic differences. The system may also be suitable for de novo sequence assembly once suitable bioinformatics software becomes available.
The AB SOLiD System enables parallel sequencing of clonally amplified DNA fragments linked to beads (Schuster, 2008). The sequencing method is based on sequential ligation with dye-labelled oligonucleotides, and can generate more than 5000 million bases of ‘mappable’ data per run. The system features a two-base encoding mechanism that interrogates each base twice, providing a form of built-in error detection when applied to re-sequencing, and uses a novel form of sequence reading, termed ‘colour space’, to describe this two-base encoded sequence. This two-base encoding system enhances the reliability of re-sequencing, compensating for the relatively high raw read error rate.
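The two-base encoding described above can be illustrated with a minimal Python sketch. It assumes the commonly described colour-space scheme in which each base is given a two-bit code and the colour of a dinucleotide is the XOR of the two codes; the function names are hypothetical and are not taken from any AB software.

```python
# Two-bit codes per base; the colour of an adjacent base pair is the XOR of
# the two codes (assumed scheme, for illustration only).
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = {code: base for base, code in BASE_CODE.items()}


def encode_colour_space(seq):
    """Return the first base plus one colour (0-3) per adjacent base pair."""
    colours = [BASE_CODE[a] ^ BASE_CODE[b] for a, b in zip(seq, seq[1:])]
    return seq[0], colours


def decode_colour_space(first_base, colours):
    """Rebuild the base sequence from the first base and the colour calls."""
    bases = [first_base]
    for colour in colours:
        bases.append(CODE_BASE[BASE_CODE[bases[-1]] ^ colour])
    return "".join(bases)


def colour_mismatches(reference, read):
    """Positions at which the colour calls of two equal-length sequences differ."""
    _, ref_colours = encode_colour_space(reference)
    _, read_colours = encode_colour_space(read)
    return [i for i, (r, s) in enumerate(zip(ref_colours, read_colours)) if r != s]


if __name__ == "__main__":
    print(encode_colour_space("ATGGC"))         # first base and colour calls
    print(colour_mismatches("ATGGC", "ATCGC"))  # a true SNP changes two adjacent colours
```

Because every base contributes to two colours, a true single-base difference from the reference changes two adjacent colour calls, whereas an isolated colour mismatch is more likely to be a measurement error. This is the sense in which the encoding provides built-in error detection during re-sequencing.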
Long-read sequencing systems
The Roche 454 FLX system performs amplification and sequencing in a highly parallelized picolitre format. In contrast with Solexa and AB SOLiD, GS FLX read lengths average 200–300 bases per read and, with more than 400 000 reads per run, it is capable of producing over 100 Mbp of sequence in around 4 h with a single-read accuracy of greater than 99.5%. Emulsion polymerase chain reaction (PCR) enables the massively parallel amplification of DNA fragments immobilized on beads, going from a single DNA fragment on a bead to 10 million identical copies, generating sufficient DNA for the subsequent sequencing reaction. Sequencing involves the sequential flow of both nucleotides and enzymes over the picolitre plate holding the beads. The enzymes convert chemicals generated during nucleotide incorporation into a chemiluminescent signal that can be detected by a charge-coupled device (CCD) camera, and the light signal is quantified to determine the number of nucleotides incorporated during the extension of the DNA sequence.
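The conversion of quantified light signals into bases can be sketched as follows. This is a simplified illustration and not Roche's base caller: it assumes that nucleotides are flowed in a fixed repeating order (here `TACG`, an assumption), that signals are already normalized so that a value near 1.0 corresponds to one incorporated nucleotide, and that the function name `decode_flowgram` is hypothetical.

```python
def decode_flowgram(intensities, flow_order="TACG"):
    """Convert normalized per-flow light signals into a base sequence.

    Each signal is rounded to the nearest whole number, which is taken as the
    number of identical nucleotides incorporated during that flow.
    """
    bases = []
    for i, signal in enumerate(intensities):
        nucleotide = flow_order[i % len(flow_order)]
        # Round half up, and never call a negative number of bases.
        count = max(0, int(signal + 0.5))
        bases.append(nucleotide * count)
    return "".join(bases)


if __name__ == "__main__":
    # Flows: T, A, C, G, T, A
    print(decode_flowgram([1.02, 0.05, 2.10, 0.90, 0.00, 3.20]))
```

The rounding step shows why homopolymer runs are the characteristic weak point of this chemistry: a signal of 3.45 and a signal of 3.55 yield runs of three and four bases, respectively, although the measured light differs only slightly.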
Over the horizon
The current next-generation sequencer manufacturers are constantly increasing output in terms of number of reads and read length, as well as working to improve read quality. The figures presented in Table 1 represent a snapshot of the current claimed status, which is likely to change rapidly. As a result of these rapid changes and some of the innovative methods that the machine manufacturers use to measure data attributes, it is often difficult to compare technologies directly. In addition to the available next-generation sequencing technologies, new systems are being developed and commercialized which offer the potential to further dramatically increase sequence data generation capability. Helicos (Cambridge, Massachusetts, USA) has recently come on to the market with the first single molecule sequencing system, offering sequencing without the DNA amplification required for all previous systems. Both Pacific Biosciences (Menlo Park, California, USA) and Nanopore Technologies (Oxford, UK) have commercial products in the pipeline, and it is expected that DNA sequencing will continue to grow at an exponential rate for several years to come.
Next-generation sequencing is still a nascent technology and has mainly been applied to the re-sequencing of genomes, although more recent applications have focused on the de novo assembly of 454 and Solexa sequence data. Re-sequencing is only an option where a reference genome is available, and efforts have been concentrated on human and bacterial applications. The de novo assembly of sequence data enables the discovery of new genes, and provides a template for single nucleotide polymorphism (SNP) discovery for the many plant species for which there is no closely related reference genome. In addition to the choice between re-sequencing and de novo sequencing, it is also possible to choose between a whole-genome or reduced-genome-complexity approach. For large and complex genomes, reduced-complexity sequencing approaches provide adequate sequence depth for SNP discovery without the requirement to sample the complete genome.
There are several methods which aim to reduce the complexity of the sequencing template, in particular, by reducing the representation of low-information-content repetitive sequences. One such method is expressed sequence tag (EST) generation. EST sequencing is a routine method for gene discovery, and it has been shown that EST data provide a valuable resource for the mining of plant SNPs (Batley and Edwards, 2007). A major advantage of using ESTs for SNP discovery is that the identified SNPs are within known expressed genes.
One of the first applications of next-generation sequencing in plants identified over 36 000 putative maize SNPs using 260 000 and 280 000 ESTs, sequenced using the Roche GS20. These SNPs were identified between B73 and Mo17 inbred maize lines (Barbazuk et al., 2007). Stringent post-processing reduced this number to > 7000 putative SNPs, and over 85% (94/110) of a sample of these putative SNPs were successfully validated by Sanger sequencing. Based on this validation rate, this pilot experiment conservatively identified > 4900 valid SNPs within > 2400 maize genes, demonstrating the suitability and potential of the approach. Next-generation sequencing platforms are also being used by Monsanto to identify SNPs and to characterize genetic variation in maize lines, with the hope of applying the technique to other agronomically important crop species.
SNP discovery from next-generation sequencing of complex plant genomes has been demonstrated in maize by combining 454 technology with amplified fragment length polymorphism (AFLP) analysis. The CRoPS (complexity reduction of polymorphic sequences) system (van Orsouw et al., 2007) overcomes problems associated with highly duplicated regions in complex plant genomes, which can hamper SNP discovery through the inability to distinguish between true SNPs and differences between paralogous genes. AFLP analysis (Vos et al., 1995) can be used for genome complexity reduction, which has been exploited in the CRoPS system through the parallel generation of very similar genome representations of many accessions of crop species for high-throughput sequencing analysis. Within the CRoPS system, tagged reduced-complexity libraries from at least two samples are produced by AFLP, using a methylation-sensitive enzyme. The AFLP fragment libraries are sequenced to 5–10-fold redundancy using 454 next-generation technology, and the sequences are aligned and mined for polymorphisms, with measures in place to distinguish between amplification and sequencing errors, and true polymorphisms. The system was validated in maize, and 75% of identified SNPs were confirmed as true polymorphisms.
Recently, Roche 454 technology was used to sequence and assemble 148 Mbp of EST sequences for Eucalyptus grandis (Novaes et al., 2008). The EST sequences were generated from a normalized cDNA pool comprising multiple tissues and genotypes. By aligning sequencing reads from multiple genotypes, 23 742 SNPs were predicted, 83% of which were validated.
SNP discovery from next-generation sequencing is not limited to long-read technology. The relatively high level of DNA methylation in repetitive regions of the genome has been used to enrich for and sequence the gene-rich regions of several genomes. Deschamps et al. (2008) have taken this method further and demonstrated SNP discovery from a subset of maize genomic sequences selected using a methylation-sensitive restriction endonuclease. The approach involves whole-genome DNA digestion with a methyl-sensitive restriction enzyme and the 4-bp cutter DpnII, followed by selective enrichments and unilateral end sequencing of digested fragments to generate 16-bp unique tag sequences immediately flanked by a 4-bp DpnII signature sequence. In a preliminary experiment to demonstrate the utility of the procedure, several million tags were produced, corresponding to 292 948 and 166 455 unique high-quality 16-bp tag sequences for the two maize inbred lines B73 and GA209, respectively. Alignment of the B73 tag sequences to assembled Zea mays contigs from the TIGR maize database produced 103 776 B73 tag sequences that perfectly matched the contigs, and these were aligned with the 166 455 GA209 tag sequences. This led to the identification of numerous putative SNPs, a significant proportion of which were successfully validated by Sanger sequencing. These results demonstrate that even very short reads generated by the Illumina Genome Analyser can be used for SNP discovery in complex crop genomes.
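The comparison of short tags between two lines can be sketched in Python. This is not the pipeline of Deschamps et al. (2008). It is a minimal illustration which assumes that tags are of equal length and that a putative SNP is a pair of tags, one from each line, differing at exactly one position. The function name `putative_snp_tags` is hypothetical.

```python
from collections import defaultdict


def putative_snp_tags(tags_a, tags_b):
    """Pair tags from two lines that differ at exactly one position.

    Returns (tag_from_a, tag_from_b, mismatch_position) tuples.
    """
    # Index every tag of line B under each "one position removed" key, so that
    # single-mismatch partners can be found without all-against-all comparison.
    index = defaultdict(set)
    for tag in tags_b:
        for i in range(len(tag)):
            index[(i, tag[:i] + tag[i + 1:])].add(tag)

    pairs = []
    for tag in sorted(set(tags_a)):
        for i in range(len(tag)):
            key = (i, tag[:i] + tag[i + 1:])
            for other in sorted(index.get(key, ())):
                if other != tag:  # identical tags carry no polymorphism
                    pairs.append((tag, other, i))
    return pairs


if __name__ == "__main__":
    line_a = ["ACGTACGTACGTACGT", "TTTTCCCCGGGGAAAA"]
    line_b = ["ACGTACGAACGTACGT", "TTTTCCCCGGGGAAAA", "GGGGGGGGGGGGGGGG"]
    for pair in putative_snp_tags(line_a, line_b):
        print(pair)
```

In practice, a tag that pairs with more than one partner would need to be discarded or examined further, because with reads this short a single mismatch may reflect a paralogous locus rather than an allelic difference.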
The use of next-generation sequencing technologies for SNP discovery and characterization has been demonstrated in large-scale animal studies, which were then used to generate genotyping tools for breeding applications. The most recognized of these applications is a United States Department of Agriculture (USDA) study in cattle (Van Tassell et al., 2008), in which reduced-representation library sequencing was performed using the Illumina Genome Analyser to identify, and validate by genotyping, approximately 23 000 new bovine SNPs. Three pools of DNA from 66 cattle were sequenced in five sequencing runs. The genome content of each was reduced using a restriction enzyme and by selecting fragments of a certain size for sequencing. This approach provides a random distribution of fragments from throughout the genome, with few over- or under-represented regions. Alignment of the sequencing data to the bovine genome identified over 62 000 candidate SNPs, and a false-positive rate of approximately 8% was identified. The application of this technology also allowed the comparison of predicted and actual allele frequencies, and the research led to a bovine genotyping array now commercialized by Illumina. Taking advantage of the low cost of the approach, researchers are now pursuing similar projects in other animals, including swine, sheep, songbirds, fish and water buffalo. However, the approach is also applicable to plant species, and is a cost-effective way to develop SNPs and molecular markers for both well-characterized species and non-model species with little or no available prior sequence information.
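The principle of a reduced-representation library, digestion with a restriction enzyme followed by size selection, can be illustrated with an in silico sketch. The recognition site, the cut position (taken here as the start of the site) and the size window are all assumptions made for illustration. They are not the parameters used by Van Tassell et al. (2008).

```python
import re


def reduced_representation(genome, site, min_len, max_len):
    """Digest a sequence at every occurrence of `site` and keep only
    fragments whose length falls within the selected size window."""
    # A zero-width lookahead finds every occurrence, including overlapping ones.
    cut_points = [m.start() for m in re.finditer("(?=" + re.escape(site) + ")", genome)]
    boundaries = [0] + cut_points + [len(genome)]
    fragments = [genome[a:b] for a, b in zip(boundaries, boundaries[1:]) if b > a]
    return [f for f in fragments if min_len <= len(f) <= max_len]


if __name__ == "__main__":
    toy_genome = "AAAA" + "GATC" + "CCCCCC" + "GATC" + "TT" + "GATC" + "GG"
    print(reduced_representation(toy_genome, "GATC", min_len=6, max_len=8))
```

Because the same enzyme and size window select largely the same fragments in every individual, reads from different samples pile up on a shared subset of the genome, which is what provides the depth needed for SNP calling at a fraction of the cost of whole-genome sequencing.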
The reduced-representation sequencing approach is being used for SNP discovery in the agronomically important crop soybean. A recently funded USDA proposal (project number 1275-21000-263-26) aims to identify and characterize 50 000 SNPs in soybean using the Solexa Genetic Analyser. A panel of 1536 SNP markers will then be applied to determine the positions of genes that control high and stable seed protein content in soybean breeding lines. The project will use the whole-genome sequence of the cultivar Williams 82, recently generated by the Department of Energy, Joint Genome Institute (JGI). Reduced-representation libraries of six soybean genotypes (Minsoy, Noir 1, Archer, Peking, Evans and Essex) will be sequenced and aligned with the Williams 82 whole-genome sequence for SNP discovery and to analyse SNP allele frequency. A selection of these SNPs will also be genetically mapped in a large population of 1000 recombinant inbred lines, and used to define the genome positions of genes that control the level of protein in the soybean seeds.
Re-sequencing is used to identify genetic variation between individuals, which can provide molecular genetic markers and insights into gene function. The process of whole-genome re-sequencing using short-read technologies involves the alignment of millions of reads to a reference genome sequence. Once this has been achieved, it is possible to determine the variation in nucleotide sequence between the sample and the reference. Re-sequencing has proved to be a valuable tool for studying genetic variation and, following the re-sequencing of James Watson's genome (Wheeler et al., 2008), the technical challenge of whole-genome re-sequencing has largely been met. Although whole-genome re-sequencing can be very useful, there are some drawbacks with this approach. First, a reference genome sequence is required and the quality of the re-sequenced genome is highly dependent on the quality of the reference sequence. In addition, for very large and complex plant genomes, a vast amount of sequence data is required to confidently call SNPs, with SNPs in repetitive sequences being particularly difficult to call. To date, there have only been preliminary reports of whole-genome re-sequencing in plants in the small genome of Arabidopsis (R. Clark, pers. commun.). However, as sequencing technology continues to improve, it is expected that whole-genome re-sequencing of crop genomes will become common.
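The two steps of re-sequencing, read alignment followed by comparison with the reference, can be sketched in a few lines of Python. This is a toy illustration and not a production aligner: it scans the reference exhaustively, allows mismatches but no gaps, and discards reads with more than one equally good position. The function names and the thresholds (`max_mismatches`, `min_depth`, `min_fraction`) are hypothetical.

```python
from collections import Counter, defaultdict


def map_read(read, reference, max_mismatches=2):
    """Return the start of the unique best ungapped hit, or None if the read
    has no hit within the mismatch limit or its best hit is ambiguous."""
    hits = []
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            hits.append((mismatches, pos))
    if not hits:
        return None
    hits.sort()
    if len(hits) > 1 and hits[0][0] == hits[1][0]:
        return None  # equally good placements, e.g. in a repeat
    return hits[0][1]


def call_variants(reads, reference, min_depth=3, min_fraction=0.8):
    """Pile up mapped reads and report positions where the majority base
    differs from the reference as (position, ref_base, alt_base, depth)."""
    pileup = defaultdict(Counter)
    for read in reads:
        pos = map_read(read, reference)
        if pos is None:
            continue
        for offset, base in enumerate(read):
            pileup[pos + offset][base] += 1

    variants = []
    for pos in sorted(pileup):
        base, count = pileup[pos].most_common(1)[0]
        depth = sum(pileup[pos].values())
        if base != reference[pos] and depth >= min_depth and count / depth >= min_fraction:
            variants.append((pos, reference[pos], base, depth))
    return variants


if __name__ == "__main__":
    reference = "ACGTTGCAAGGCTTAACCGT"
    sample = "ACGTTGCAAGTCTTAACCGT"  # differs from the reference at position 10
    reads = [sample[5:13], sample[6:14], sample[8:16], sample[10:18]]
    print(call_variants(reads, reference))
```

The rejection of ambiguous placements in `map_read` reflects the difficulty noted above: in repetitive regions a short read fits several positions equally well, so depth at those positions is lost and SNPs there are hard to call.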
Whole-genome re-sequencing for SNP discovery has been demonstrated in Caenorhabditis elegans (Hillier et al., 2008). Solexa technology was used to sequence two C. elegans strains, which were then compared with the reference genome sequence for SNP and indel identification. The software applications PyroBayes and Mosaik (described below) were used to differentiate between true polymorphisms and sequence errors.
There does not appear to be a major disadvantage to using short-read technology for SNP discovery where a reference genome is available. The approach is also low cost, at approximately $0.25 per SNP, compared with about $2.95 using Sanger sequencing. Where a draft reference genome is not available, it may be possible to combine the long- and short-read next-generation sequencing technologies, and use 454 sequencing to generate an assembly against which to align the short reads.
We are at the beginning of a revolution in which cheap, accurate genome sequencing is commonplace. Companies are quickly making advances in increasing sequence generation and improving sequence quality, and bioinformatics applications must be designed or adapted to be able to handle this type of data. To date, the main application of this sequencing technology has focused on re-sequencing, including whole-genome re-sequencing for SNP discovery. Software developed for SNP identification from Sanger sequence data is usually not suitable for next-generation sequence data because of the volume of data, incorrect estimation of homopolymer length and problems with base calling quality scores (Pop and Salzberg, 2008).
PyroBayes is a modification of the SNP discovery software PolyBayes, designed for pyrosequencing reads from 454 sequencing technology (Quinlan et al., 2008). PyroBayes permits accurate SNP calling in re-sequencing applications, even in shallow read coverage, primarily because it produces more confident base calls than the native base calling program. The method has been demonstrated in Drosophila and Escherichia coli, but can be applied to any organism for which suitable 454 data are available. Roche also produces two software applications which can be used for SNP identification from 454 data. GS Reference Mapper Software generates a consensus DNA sequence through the mapping and alignment of sequence reads to a reference sequence. The program also generates a list of high-confidence SNPs by the identification of individual bases that differ between the generated consensus DNA sequence and the reference sequence. This application can be used for re-sequencing applications in any organism in which a reference genome is available for alignment. For the discovery of rare SNPs, GS Amplicon Variant Analyser Software can be used to compare amplicon sequencing reads with a reference sequence. Through very deep amplicon sequencing, multiple alignment of several hundreds of clonal amplicon reads can be compared with a reference sequence. This permits the detection of rare sequence variations, even in heterogeneous samples, allowing SNP discovery at a population level.
Maq (http://maq.sourceforge.net/maq-man.shtml) is software that builds mapping assemblies from short reads generated by the next-generation sequencing machines. It was designed for Illumina-Solexa 1G Genetic Analyser data. However, it has preliminary functions to handle short-read data produced from the AB SOLiD system. Maq aligns reads to reference sequences and then produces the consensus sequence. When mapping to the reference sequence, it performs ungapped alignments. For single-end reads, Maq identifies all hits with up to two or three mismatches. For paired-end reads, it identifies all paired hits, with one of the two reads containing up to one mismatch. At the assembling stage of the program, Maq produces a consensus based on a statistical model, calling the base which maximizes the posterior probability and calculating a phred (Ewing and Green, 1998; Ewing et al., 1998) quality score at each position along the consensus. Heterozygotes are also called in this process. Through the identification of heterozygotes and mismatch sequences, the program can be used for SNP discovery.
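The idea of calling the base that maximizes the posterior probability, and reporting a phred-scaled quality for that call, can be sketched as follows. This is a deliberately simplified haploid model with a uniform prior and independent errors, in which a miscalled base is equally likely to be any of the other three. It is not Maq's statistical model, which also calls heterozygotes, and the function names are hypothetical. The only relation taken as given is the phred definition Q = −10 log10(P_error).

```python
import math


def phred_to_error(q):
    """Phred score to error probability: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_to_phred(p, cap=99):
    """Error probability to a rounded phred score, capped to avoid infinity."""
    if p <= 0:
        return cap
    return min(cap, int(round(-10 * math.log10(p))))


def call_consensus(observations):
    """Call one consensus base from (base, phred_quality) observations.

    Returns (called_base, phred_quality_of_call).
    """
    log_likelihood = {candidate: 0.0 for candidate in "ACGT"}
    for base, quality in observations:
        # Clamp so that a quality of 0 means "no information" rather than log(0).
        error = min(phred_to_error(quality), 0.75)
        for candidate in "ACGT":
            p = 1 - error if base == candidate else error / 3
            log_likelihood[candidate] += math.log(p)

    # Normalize in a numerically stable way (uniform prior over the four bases).
    top = max(log_likelihood.values())
    weights = {b: math.exp(v - top) for b, v in log_likelihood.items()}
    total = sum(weights.values())
    best = max(weights, key=weights.get)
    p_wrong = sum(w for b, w in weights.items() if b != best) / total
    return best, error_to_phred(p_wrong)


if __name__ == "__main__":
    # Two confident A calls outweigh one low-quality C call.
    print(call_consensus([("A", 30), ("A", 30), ("C", 10)]))
```

A position at which the consensus call is confident and differs from the reference is a candidate SNP. Extending the candidate set from four bases to the ten possible diploid genotypes is the natural step towards calling heterozygotes.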
A beta release of Mosaik software has also been produced by the Marth Laboratory (http://bioinformatics.bc.edu/marthlab/Mosaik). Mosaik consists of three modular programs: MosaikBuild, MosaikAligner and MosaikAssembler. MosaikBuild converts various sequence formats into Mosaik's native read format, MosaikAligner pair-wise aligns each read to a specified series of reference sequences, and MosaikAssembler parses the aligned sequence archive and produces a multiple sequence alignment which is then saved into an assembly file format. It is written in C++, with versions available for Microsoft Windows and Linux operating systems. Cluster-aware (MPI) versions have been tested on up to 160 processors. The software produces results in ace format, which can be viewed with utilities such as consed, Sequencher or EagleView (http://bioinformatics.bc.edu/marthlab/EagleView).
EagleView (Huang and Marth, 2008) has been developed to view genome assemblies of 454 and Solexa sequence data for the discovery of polymorphisms, allowing the combined viewing of data from both long- and short-read technologies. The software offers a compact assembly view and annotation for the interpretation of SNPs in a genomic context. An advantage of this program is the ability to view trace files from the different data types to identify candidate SNPs. The software has been demonstrated in C. elegans, but is applicable to any species with suitable available data.
The SNP discovery software AutoSNP (Barker et al., 2003; Batley et al., 2003) has been extended to produce the recently developed AutoSNPdb (Duran et al., 2009), combining the SNP discovery pipeline of AutoSNP with a custom relational database, hosting information on the polymorphisms, cultivars and gene annotations. This enables efficient mining and interrogation of the data, and users may search for SNPs within genes with specific annotation, or for SNPs between defined cultivars. AutoSNPdb can integrate both Sanger and Roche 454 pyrosequencing data, enabling efficient SNP discovery from next-generation sequencing technologies. AutoSNPdb principally uses redundancy to differentiate between sequence errors and real SNPs. Although this approach ignores potential SNPs that are poorly represented in the sequence data, it offers the advantage that trace files are not required and sequences may be used directly from GenBank. A co-segregation score is also calculated on the basis of whether multiple SNPs define a haplotype, and this is used as a second, independent measure of confidence.
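The use of redundancy and co-segregation as measures of confidence can be illustrated with a short Python sketch. It is not the AutoSNPdb implementation: it assumes an already computed multiple alignment supplied as equal-length strings, treats any character other than A, C, G or T as missing data when counting alleles, and uses a simplified co-segregation score (the number of other SNP columns that split the sequences into the same groups). All function names and the `min_redundancy` threshold are hypothetical.

```python
from collections import Counter


def redundancy_snps(alignment, min_redundancy=2):
    """Columns in which at least two alleles are each seen in at least
    `min_redundancy` sequences; singletons are treated as likely errors."""
    snp_columns = []
    for col in range(len(alignment[0])):
        counts = Counter(row[col] for row in alignment if row[col] in "ACGT")
        supported = [base for base, n in counts.items() if n >= min_redundancy]
        if len(supported) >= 2:
            snp_columns.append(col)
    return snp_columns


def cosegregation_scores(alignment, snp_columns):
    """For each SNP column, count the other SNP columns that partition the
    sequences into exactly the same groups (i.e. define the same haplotypes)."""
    def partition(col):
        groups = {}
        for i, row in enumerate(alignment):
            groups.setdefault(row[col], []).append(i)
        return frozenset(frozenset(members) for members in groups.values())

    partitions = {col: partition(col) for col in snp_columns}
    return {
        col: sum(1 for other in snp_columns
                 if other != col and partitions[other] == partitions[col])
        for col in snp_columns
    }


if __name__ == "__main__":
    alignment = [
        "ACGTAC",
        "ACGTAC",
        "ATGTCC",
        "ATGTCC",
        "ACGAAC",  # the lone A at column 3 is seen once and is not called
    ]
    snps = redundancy_snps(alignment)
    print(snps, cosegregation_scores(alignment, snps))
```

The example shows both sides of the trade-off described above: the difference seen in only one sequence is ignored, whether it is an error or a rare allele, while the two well-supported columns reinforce each other because they separate the sequences into the same two groups.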
Genetic markers have played a major role in our understanding of heritable traits. In the current genomics era, molecular genetic markers are bridging the divide between these traits and increasingly available genome sequence information. With the expansion of next-generation sequencing technologies, there has been a rapid growth in genetic polymorphism information and the use of genetic markers for diverse applications. As the cost of genome sequencing continues to decrease, it will become routine to re-sequence the genome of individual plants in place of the targeted genotyping with current SNP platforms. Bioinformatics tools are being developed to mine this vast sequence data resource for genetic diversity, and to present these results in a biologist-friendly manner. The combination of continued advances in sequencing technology and more advanced bioinformatics tools will undoubtedly lead to novel biological discoveries and advanced methods for crop improvement.