Plant genome sequencing: applications for crop improvement
DNA sequencing technology is undergoing a revolution with the commercialization of second generation technologies capable of sequencing thousands of millions of nucleotide bases in each run. The data explosion resulting from this technology is likely to continue to increase with the further development of second generation sequencing and the introduction of third generation single-molecule sequencing methods over the coming years. The question is no longer whether we can sequence crop genomes which are often large and complex, but how soon can we sequence them? Even cereal genomes such as wheat and barley which were once considered intractable are coming under the spotlight of the new sequencing technologies and an array of new projects and approaches are being established. The increasing availability of DNA sequence information enables the discovery of genes and molecular markers associated with diverse agronomic traits creating new opportunities for crop improvement. However, the challenge remains to convert this mass of data into knowledge that can be applied in crop breeding programs.
The recent advances in genome sequencing, through the development of second generation sequencing technologies and beyond, provide opportunities to develop millions of novel markers in non-model crop species, as well as to identify genes of agronomic importance. Identification of all genes within a species permits an understanding of how important agronomic traits are controlled, knowledge of which can be directly translated into crop improvement. Reference genome sequences for several crop species are now becoming available and this information permits both the rapid identification of candidate genes through bioinformatics analysis, and single nucleotide polymorphism (SNP) discovery through comparison of the reference with sequence data from different cultivars. For crop species such as wheat and barley where no reference sequence is available, gene discovery relies on unassembled genome sequence data and expressed sequence tags (EST). These data can also be applied for SNP and simple sequence repeat (SSR) molecular marker discovery, though without a reference genome sequence, genetic mapping of these markers is required to determine their genomic location. In the case of orphan crops where no sequence data are available, it is now relatively easy to generate sufficient data for computational gene and molecular marker discovery. Molecular genetic markers are based on variation in the genome that can be assayed and monitored between individuals and across generations. The association of markers with heritable traits is used to link the genotype of an organism with the expressed phenotype, and the ability to develop millions of novel markers will revolutionize plant genomic research. These markers can be used routinely in crop breeding programs, for rapid crop improvement, for genetic diversity analysis, cultivar identification, phylogenetic analysis, characterization of genetic resources and association with agronomic traits.
Markers are used in agricultural breeding programs to incorporate genetically characterized traits in place of field trials or glasshouse screens. The inheritance of many agronomic traits is difficult to quantify in field experiments. The assessment of disease resistance is dependent on the presence of a virulent pathogen, and complex traits such as drought tolerance and yield are influenced by many genetic and environmental factors. In these cases, the use of molecular markers to select for the underlying genetic determinants of the trait increases the efficiency of crop breeding.
Overview of second and third generation sequencing technologies
Second generation sequencing
The ability to generate sequence data is being advanced by increasingly high throughput technology. This is driven by what has been termed next or second generation sequencing. Second generation sequencing describes platforms that produce large amounts (usually millions) of short DNA sequence reads of length typically between 25 and 400 bp.
The first approach to second generation sequencing was pyrosequencing, commercialized by Roche (Basel, Switzerland) as the GS20 (Margulies et al., 2005). The current Roche 454 GS FLX Titanium system produces read lengths on average 300–500 bp and is capable of producing over 400 Mbp of sequence with a single-read accuracy of >99.5%. Two alternative ultra high throughput sequencing systems now compete with the Roche GS FLX: the SOLiD system from Applied Biosystems (ABI) (Carlsbad, California, USA) and the Solexa Genome Analyzer technology, now commercialized by Illumina (San Diego, California, USA). The Illumina Solexa Genome Analyzer (currently the GAIIx) uses reversible terminator chemistry to generate up to 50 Gbp of data per run with read lengths over 100 bp (Simpson et al., 2009). The Illumina system is used for genome re-sequencing, transcriptome profiling (RNA Seq) and Chromatin ImmunoPrecipitation sequencing (ChIP Seq), and the relatively low error rate of this system supports de novo sequencing applications. The ABI SOLiD system (currently version 3) has been used for tag based applications such as gene expression and ChIP Seq, but is predominantly used for re-sequencing, where comparison with a reference enables the identification and removal of erroneous sequence reads (Ondov et al., 2008). Each of these three systems is capable of producing mate-paired sequences, enabling the resolution of repetitive regions.
Third generation sequencing
Several companies are currently bringing to market different technologies termed third generation sequencing, taking DNA sequence production to a further level of scale and reducing costs. The first of these technologies to come to market uses single-molecule sequencing and was commercialized by Helicos Biosciences (Cambridge, MA, USA). Termed true single-molecule sequencing (tSMS), this approach differs from the existing second generation systems by sequencing without the requirement for DNA amplification, and it has been used to sequence the genome of the virus M13 (Harris et al., 2008). Pacific Biosciences have developed a single-molecule real-time sequencing system that promises to produce several Gbp of relatively long reads (>1 kbp) (Eid et al., 2009). Pacific Biosciences expect to have commercial machines available during 2010, and the ability to produce vast numbers of long sequence reads is likely to provide further opportunities for crop genomics. Other companies are also working on third generation sequencing systems and it is likely that yet more advanced sequencing technology will be available in the relatively near future.
Mining for genes of agronomic importance
The genes underlying many simply inherited traits have been identified and characterized in detail. In these cases, the biochemical functions of the encoded protein may be studied to gain a greater understanding of the mechanisms underlying the trait and whether variation in the gene structure or expression may further improve the trait. Knowledge of the gene underlying a trait enables the transfer of the trait between cultivars and even species using genetic modification. Alternatively, the gene conferring the favourable trait may be incorporated into a cultivar by marker-assisted selection (MAS) breeding.
While many of the simple traits have been well characterized at the genome level, there are many other traits which are poorly understood. This is particularly true for complex traits which are controlled by interacting gene networks. Although many aspects of a complex trait, such as yield, may be characterized individually, it is unlikely that the genetic basis underlying all components of yield heritability will be understood in the near future. The identification of all the genes for a crop is only one step towards understanding the inheritance of agronomic traits. The functions of many of the genes identified by genome sequencing remain unknown and the genetic control of the majority of agronomic traits has yet to be determined. Producing a finished genome sequence for a crop is an important first step and is becoming feasible for an increasing number of crop species (Imelfort and Edwards, in press).
Examples of crop genome sequencing projects
Rice was the first crop genome to be sequenced (Goff et al., 2002; Yu et al., 2002; Matsumoto et al., 2005), following shortly on from the sequencing of the first model plant genome, Arabidopsis thaliana (Arabidopsis Genome, 2000). Current crop genome sequencing projects are rapidly changing pace with the new technology and researchers are quickly adopting second generation sequencing to gain insight into their favourite genome. Roche 454 technology is being used to sequence the 430 Mbp genome of Theobroma cacao (Scheffler et al., 2009), while a combination of Sanger and Roche 454 sequencing (4× and 12× coverage respectively) is being used to interrogate the apple genome (Velasco, 2009; Velasco et al., 2009). A similar approach was applied to develop a draft consensus sequence for the 504 Mbp grape genome (Velasco et al., 2007), where a combination of 6.5× Sanger paired read sequences and 4.2× unpaired Roche 454 reads was assembled into 2093 metacontigs representing an estimated 94.6% of the genome. A combined Illumina Solexa and Roche 454 sequencing approach has been used to characterize the genomes of cotton (Wilkins et al., 2009). Roche 454 sequencing has been used to survey the genome of Miscanthus (Swaminathan et al., 2009), while Sanger, Illumina Solexa and Roche 454 sequencing are being used to interrogate the genome of banana (Hribova et al., 2009). Roche 454 sequencing, using a combination of whole genome shotgun and bacterial artificial chromosome (BAC) sequencing, has recently been used to complete the 1.7 Gbp oil palm genome (http://tinyurl.com/palmgenome). This is the first example of a complex genome sequence being completed without the use of Sanger sequencing.
Illumina GAII sequencing has been applied to generate more than 50× coverage of the Brassica rapa genome in a collaboration between the Beijing Genome Institute in Shenzhen (BGI) and the Institute of Vegetables and Flowers in Beijing, and it is expected that a high-quality genome sequence will be released during 2009 or early 2010. The sequencing of Brassica rapa follows from the success of sequencing cucumber, where a hybrid strategy consisting of 4× Sanger and 68× Illumina Solexa coverage, together with 20× and 10× coverage from end sequenced fosmid and BAC libraries, was used to assemble more than 96% of the cucumber genome (Huang et al., 2009). The BGI are currently sequencing a range of other crop species, though sequence assembly remains problematic for many of the larger genomes. The large cereal genomes in particular remain elusive, but advances in technology are starting to make de novo sequencing of these genomes feasible. Roche 454 technology has been applied for the sequencing of complex BAC from barley (Wicker et al., 2006; Stein, 2009) and this has been complemented by repeat characterization using whole genome shotgun Illumina Solexa data (Wicker et al., 2008). The size and hexaploid nature of the wheat genome create significant problems in elucidating its genome sequence. There are currently several international projects which will generate a substantial quantity of sequence data for wheat, and expectations are that by 2012, the majority of the gene-rich regions of hexaploid wheat will have been sequenced.
Gene identification in orphan or large complex crop species
While the sequencing and assembly of large and complex crop genomes remains a valuable goal, a significant amount of knowledge can be gained from low coverage shotgun sequencing of these genomes. Short paired read data produced using second generation technologies is particularly suited for the discovery of genes and gene promoters in crop plants. By generating up to 1× coverage of a crop genome sequence with short paired read sequence data, it is possible to identify numerous reads which correspond to homologous genes in related species. Designing polymerase chain reaction (PCR) primers to the read pairs enables the amplification and sequencing of the gene and corresponding genomic region in the target species. This relatively inexpensive approach to gene discovery offers the potential to identify genes, gene promoters and polymorphisms in a wide range of agronomically important crop species.
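The relationship between read number, read length and genome coverage in such a survey can be illustrated with a short calculation. The sketch below is not drawn from any of the projects cited here. It assumes the idealized Lander-Waterman model, in which reads are placed uniformly at random across the genome, an assumption that repeat-rich crop genomes will not fully satisfy. The function names and the example genome size are hypothetical.

```python
import math


def coverage(num_reads: int, read_length_bp: int, genome_size_bp: int) -> float:
    """Average depth of coverage, c = N * L / G."""
    return num_reads * read_length_bp / genome_size_bp


def reads_needed(target_coverage: float, read_length_bp: int, genome_size_bp: int) -> int:
    """Number of reads required to reach a target average coverage."""
    return math.ceil(target_coverage * genome_size_bp / read_length_bp)


def expected_fraction_covered(cov: float) -> float:
    """Expected fraction of bases sampled by at least one read, 1 - exp(-c).

    Assumes uniformly random read placement (idealized model).
    """
    return 1.0 - math.exp(-cov)


# Hypothetical example: a 1 Gbp genome surveyed with 100 bp reads at 1x.
genome_size = 1_000_000_000
n_reads = reads_needed(1.0, 100, genome_size)
c = coverage(n_reads, 100, genome_size)
print(f"reads needed: {n_reads}")
print(f"coverage: {c:.2f}x, expected fraction of genome sampled: "
      f"{expected_fraction_covered(c):.3f}")
```

Under these assumptions, 1× coverage is expected to sample roughly 63% of the bases in the genome at least once, which is consistent with the observation that a low coverage survey already touches a large share of the genes.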
Molecular marker applications
With the development of molecular techniques, MAS is now used to enhance traditional breeding programs to improve crops, and modern plant breeding is dependent on molecular markers for the rapid and precise analysis of germplasm and for trait mapping (Koebner and Summers, 2002). Molecular markers are complementary tools to traditional selection, used to select parental genotypes in breeding programs, eliminate linkage drag in back-crossing and select for traits that are difficult to measure using phenotypic assays. They can increase our understanding of phenotypic characteristics and their genetic association, which may modify the breeding strategy. MAS allows the breeder to achieve early selection of a trait in a breeding program, and it is particularly useful when the trait is under complex genetic control, or when phenotypic trials are unreliable or expensive. By increasing favourable allele frequency early in the breeding process, a larger number of small populations can be carried forward in the breeding process, each of which has been pre-screened to remove or reduce the frequency of unfavourable alleles. While second generation sequencing can readily be applied for the discovery of markers which can be applied for MAS, there is little if any benefit in using whole genome sequencing during selection as the vast majority of SNP are not associated with agronomic traits. The cost of genotyping SNP in large populations continues to decline and it is unlikely that whole genome sequencing will replace specific marker genotyping for MAS in the near future.
Molecular markers have revolutionized genome mapping over the last two decades, and the high density of markers that can now be generated from second generation sequence data offers the potential for generating very high density genetic maps. These markers can be used to develop haplotypes for genes or regions of interest, and complete genome mapping is now becoming a reality. Genetic mapping places molecular genetic markers in linkage groups based on their co-segregation in a population. The genetic map predicts the linear arrangement of markers on a chromosome and maps are prepared by analysing populations derived from crosses of genetically diverse parents, and estimating the recombination frequency between genetic loci. Many types of markers can be used for map construction, with population size and marker density being important for map resolution. SNP identified within whole genome sequence or large genomic fragments maintained within BAC can be applied for the genetic mapping of complex traits. This enables the genetic mapping of specific genes of interest and assists in the identification of linked or perfect markers for traits, as well as increasing the density of markers on genetic maps (Rafalski, 2002). The development of these markers also allows the integration of genetic and physical maps. The use of common molecular genetic markers across related species permits the comparison of linkage maps. This allows the translation of information between model species with sequenced genomes and non-model species (Moore et al., 1995). Furthermore, the integration of molecular marker data with genomics, proteomics and phenomics data allows researchers to link sequenced genome data with observed traits, bridging the genome to phenome divide. These markers can then be used routinely in crop breeding programs.
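The estimation of recombination frequency between two loci, and its conversion to a map distance, can be sketched as follows. This is an illustration only. It assumes a backcross or doubled haploid type population in which every individual carries one of two parental allele codes ('A' or 'B') at each marker, with '-' for missing data. The function names and genotype strings are hypothetical, and the Haldane and Kosambi mapping functions are used as standard textbook conversions rather than as the method of any study cited here.

```python
import math


def recombination_frequency(marker1: str, marker2: str) -> float:
    """Fraction of individuals whose parental allele differs between two markers.

    Each string holds one allele code ('A' or 'B') per individual;
    individuals with missing data ('-') at either marker are ignored.
    """
    pairs = [(a, b) for a, b in zip(marker1, marker2) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no individuals scored at both markers")
    recombinants = sum(1 for a, b in pairs if a != b)
    return recombinants / len(pairs)


def haldane_cm(r: float) -> float:
    """Haldane map distance in cM, d = -50 ln(1 - 2r); assumes no interference."""
    if not 0.0 <= r < 0.5:
        raise ValueError("r must be in [0, 0.5)")
    return -50.0 * math.log(1.0 - 2.0 * r)


def kosambi_cm(r: float) -> float:
    """Kosambi map distance in cM, d = 25 ln((1 + 2r) / (1 - 2r))."""
    if not 0.0 <= r < 0.5:
        raise ValueError("r must be in [0, 0.5)")
    return 25.0 * math.log((1.0 + 2.0 * r) / (1.0 - 2.0 * r))


# Hypothetical genotypes for ten individuals at two markers.
m1 = "AABBABABAB"
m2 = "AABBABABBA"
r = recombination_frequency(m1, m2)
print(f"r = {r:.2f}, Haldane = {haldane_cm(r):.1f} cM, Kosambi = {kosambi_cm(r):.1f} cM")
```

With only ten individuals the estimate of r is very coarse, which illustrates why population size limits map resolution.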
Association mapping is a further statistical method to identify genetic loci associated with phenotypic trait variation. Association mapping shares much in common with quantitative trait loci (QTL) mapping. QTL mapping generally involves the use of structured populations, in which relatively distant markers can segregate with the QTL, providing a wide genetic region within which the gene is located. The unstructured populations used in association mapping represent many more recombination events and are often many generations from a common ancestor, providing the potential for greater resolution for a given population size. Advances in genome sequencing technology, which allow the production of millions of markers, provide an increasing ability to generate large quantities of molecular marker genotyping data. This favours association studies over traditional QTL mapping, and association studies are therefore likely to become more common.
Molecular marker discovery
Single nucleotide polymorphisms now dominate molecular marker applications, because of recent advances in DNA sequence technology enabling their discovery, and the development of high throughput assays. As with most molecular markers, the factor limiting the implementation of SNP is the initial cost of their development (Duran et al., 2009b). SNP discovery involves finding differences between two sequences. Traditionally this has been performed through PCR amplification of genes/genomic regions of interest from multiple individuals selected to represent diversity in the species or population of interest, followed by either direct sequencing of these amplicons, or the more expensive method of cloning and sequencing. Sequences are then aligned and any polymorphisms identified. This approach is frequently prohibitively expensive and time consuming for the identification of the large number of SNP required for most applications such as genetic mapping and association studies.
In silico methods of SNP and SSR discovery are now being adopted, providing cheap and efficient methods for marker identification (Barker et al., 2003; Batley et al., 2003; Robinson et al., 2004; Jewell et al., 2006; Duran et al., 2009a,c). Large quantities of sequence data are being generated by the latest second generation sequencing technologies and these provide a valuable resource for the mining of molecular markers (Imelfort et al., 2009). While the large volume of next generation sequencing data is generally produced at the expense of sequence quality, the oversampling of genome data enables the differentiation between true SNP and sequence error. Whole genome sequencing is the most robust method to identify the great variety of genetic diversity in a population and gain a greater understanding of the relationship between the inherited genome and observed heritable traits. The continued rapid advances in genome sequencing technology will likely lead to whole genome sequencing becoming the standard method for genetic polymorphism discovery.
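The way in which oversampling separates true SNP from sequence error can be illustrated with a simple redundancy filter. A sequencing error is unlikely to recur at the same position in several independent reads, so a candidate allele is accepted only when it is observed a minimum number of times. The sketch below is a deliberately simplified illustration and does not reproduce any published SNP discovery tool. The function name and both thresholds are hypothetical, and the input is assumed to be the column of base calls from all reads aligned over a single position.

```python
from collections import Counter


def call_snp(bases: str, min_depth: int = 8, min_minor_count: int = 3):
    """Return the two alleles at a position if it looks polymorphic, else None.

    bases: base calls from all reads covering one position (e.g. "AAAGAAGA").
    min_depth: minimum number of usable reads (hypothetical threshold).
    min_minor_count: minimum number of reads supporting the second allele
        (hypothetical threshold); single discordant reads are treated as error.
    """
    counts = Counter(b for b in bases.upper() if b in "ACGT")  # ignore N and gaps
    depth = sum(counts.values())
    if depth < min_depth:
        return None  # too few reads to distinguish a SNP from error
    top_two = counts.most_common(2)
    if len(top_two) < 2:
        return None  # only one allele observed
    (major, _), (minor, minor_count) = top_two
    if minor_count < min_minor_count:
        return None  # minor allele not seen often enough
    return tuple(sorted((major, minor)))


print(call_snp("AAAAAAAAAG"))  # one discordant read out of ten
print(call_snp("AAAAAGGGGG"))  # two well supported alleles
```

More rigorous approaches weight each base call by its quality score, but the underlying principle of requiring redundant support is the same.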
One of the first applications of next generation sequencing in plants identified over 36 000 putative maize SNP using 260 000 and 280 000 EST, sequenced using the Roche GS20. These SNP were identified between B73 and Mo17 inbred maize lines (Barbazuk et al., 2007). Stringent post-processing reduced this number to >7000 putative SNP, and over 85% (94/110) of a sample of these putative SNP were successfully validated by Sanger sequencing. Based on this validation rate, this pilot experiment conservatively identified >4900 valid SNP within >2400 maize genes, demonstrating the suitability and potential of the approach.
Recently, Roche 454 technology was used to sequence and assemble 148 Mbp of expressed sequences (EST) for Eucalyptus grandis (Novaes et al., 2008). The EST sequences were generated from a normalized cDNA pool comprising multiple tissues and genotypes. Alignment of sequencing reads from multiple genotypes predicted 23 742 SNP, 83% of which were validated. The draft consensus sequence for the grape genome identified 1.7 million SNP which were mapped to chromosomes (Velasco et al., 2007). With the continual increase in read length and introduction of read pairs for Roche 454 sequencing, similar future projects may be undertaken using next generation sequencing without the expense of Sanger sequencing.
SNP discovery from next generation sequencing is not limited to long read technology. The relatively high level of DNA methylation in repetitive regions of the genome has been used to enrich for and sequence the gene-rich regions of several genomes. Deschamps et al. (2008) have taken this method further and demonstrated SNP discovery from a subset of maize genomic sequences selected using a methylation sensitive restriction endonuclease. The approach involves whole genome DNA digestion with a methyl-sensitive restriction enzyme and the 4 bp cutter DpnII, followed by selective enrichment and unilateral end sequencing of digested fragments to generate 16 bp unique tag sequences immediately flanked by a 4 bp DpnII signature sequence. In a preliminary experiment to demonstrate the utility of the procedure, several million tags were produced, corresponding to 292 948 and 166 455 unique high-quality 16 bp tag sequences for the two maize inbred lines B73 and GA209 respectively. Alignment of the B73 tag sequences to assembled Zea mays contigs from the TIGR maize database produced 103 776 B73 tag sequences that perfectly matched the contigs, and these were aligned with the 166 455 GA209 tag sequences. This led to the identification of numerous putative SNP, of which a significant portion was successfully validated by Sanger sequencing. These results demonstrate that even relatively short reads generated by the Illumina Genome Analyzer can be used for SNP discovery in complex crop genomes.
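The comparison of short tag sequences between two lines can be sketched in a few lines of code. The example below is an assumption-laden illustration of the general idea, not the pipeline of Deschamps et al. (2008). It treats a pair of equal-length tags from two lines that differ at exactly one base as a putative SNP, and it ignores complications such as paralogous sequences. The function name and the example tags are hypothetical.

```python
def single_mismatch_pairs(tags_a, tags_b):
    """Pair tags from line A with tags from line B that differ at exactly one base.

    Returns a list of (tag_a, tag_b, position) tuples, position being 0-based.
    Tags found identically in both lines are treated as monomorphic and skipped.
    """
    set_b = set(tags_b)
    hits = []
    for tag in sorted(set(tags_a)):
        if tag in set_b:
            continue  # identical tag in both lines, no polymorphism
        # Generate every single-base variant of the tag and look it up in line B.
        for pos, base in enumerate(tag):
            for alt in "ACGT":
                if alt == base:
                    continue
                variant = tag[:pos] + alt + tag[pos + 1:]
                if variant in set_b:
                    hits.append((tag, variant, pos))
    return hits


# Hypothetical 16 bp tags from two inbred lines.
line_a = ["ACGTACGTACGTACGT", "TTTTCCCCGGGGAAAA"]
line_b = ["ACGTACGTACGAACGT", "TTTTCCCCGGGGAAAA"]
for tag_a, tag_b, pos in single_mismatch_pairs(line_a, line_b):
    print(f"{tag_a} / {tag_b}: putative SNP at tag position {pos}")
```

Looking up single-base variants in a set avoids comparing every tag in one line against every tag in the other, which matters when each line contributes hundreds of thousands of tags.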
Second generation sequencing has been applied to study methylation of the Arabidopsis genome (Lister et al., 2008), and a greater understanding of the epigenetic modification of genomes and the impact of such modification on gene expression is likely to have outcomes for crop improvement approaches.
Simple sequence repeats have also been identified from second generation sequence data using the Roche 454 technology (Abdelkrim et al., 2009; Allentoft et al., 2009; Santana et al., 2009). These markers are predominantly applied for the study of non-model organisms. The SSR can be used along with SNP in MAS or for comparative genomics to bring useful traits from wild relatives.
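Computational SSR discovery in sequence reads can be illustrated with a regular expression search for perfect tandem repeats. The sketch below is a minimal example and does not reproduce any of the SSR discovery tools cited in this review. It assumes that motifs of 2 to 6 bp repeated at least five times are of interest, and both thresholds, like the function name, are hypothetical.

```python
import re


def find_ssrs(seq: str, min_repeats: int = 5):
    """Find perfect SSRs with motifs of 2-6 bp repeated at least min_repeats times.

    Returns a list of (start, motif, repeat_count) tuples, start being 0-based.
    """
    # A lazily matched motif followed by at least (min_repeats - 1) further copies.
    pattern = re.compile(r"([ACGT]{2,6}?)\1{%d,}" % (min_repeats - 1))
    results = []
    for match in pattern.finditer(seq.upper()):
        motif = match.group(1)
        if len(set(motif)) == 1:
            continue  # skip homopolymer runs such as AAAAAAAAAA
        repeat_count = len(match.group(0)) // len(motif)
        results.append((match.start(), motif, repeat_count))
    return results


print(find_ssrs("GGCATATATATATCCG"))
print(find_ssrs("TTCAGCAGCAGCAGCAGCAGGA"))
```

Flanking sequence on either side of each repeat would then be needed for PCR primer design, so in practice reads in which the SSR lies close to the read end are usually discarded.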
Recent advances in DNA sequencing technology are radically changing biological and biomedical research and will have a major impact on crop improvement. It is now possible to consider sequencing cereal genomes which were only recently considered intractable. Relatively inexpensive survey sequencing can identify all the genes in a genome as well as the gene promoters, making EST sequencing for gene discovery redundant. For moderately sized genomes, we can now cost effectively sequence multiple varieties for genome wide SNP discovery. This new wealth of information, combined with advanced genotyping methods, permits the application of detailed genome wide association studies to make the link between genetic variation and agronomic traits. The identification of genes and molecular markers underlying these agronomic traits will help accelerate the breeding process and lead to varieties with improved yield and quality, tolerance to unfavourable environmental conditions and resistance to disease.
The authors would like to acknowledge funding support from the Grains Research and Development Corporation (Project DAN00117) and the Australian Research Council (Projects LP0882095, LP0883462 and DP0985953). Support from the Australian Genome Research Facility (AGRF), the Queensland Cyber Infrastructure Foundation (QCIF), the Australian Partnership for Advanced Computing (APAC) and Queensland Facility for Advanced Bioinformatics (QFAB) is gratefully acknowledged.