Author for correspondence: Scott A. Jackson Tel: +1 706 542 4021 Email: email@example.com
Genomics and crop improvement
Complexity of plant genomes
Evolution of genome sequencing
Future of genome sequencing
Application of genomics for crop improvement
Unlocking the potential of genetic diversity through genomic approaches
Many challenges face plant scientists, in particular those working on crop production, such as a projected increase in population, decrease in water and arable land, changes in weather patterns and predictability. Advances in genome sequencing and resequencing can and should play a role in our response to meeting these challenges. However, several barriers prevent rapid and effective deployment of these tools to a wide variety of crops. Because of the complexity of crop genomes, de novo sequencing with next-generation sequencing technologies is a process fraught with difficulties that then create roadblocks to the utilization of these genome sequences for crop improvement. Collecting rapid and accurate phenotypes in crop plants is a hindrance to integrating genomics with crop improvement, and advances in informatics are needed to put these tools in the hands of the scientists on the ground.
I. Genomics and crop improvement
Genomics, as a scientific discipline, is relatively new. Advances in biology and molecular genetics technology predate the advent of genomics, for example the development of cloning vectors by Paul Berg (Jackson et al., 1972), but the inception of genomics was coincident with the genesis of the human genome project (HGP), which was conceptualized and endorsed at a US Department of Energy-sponsored meeting in Santa Fe, NM (USA), in 1986. The draft sequence of the human genome arrived 15 yr later (Lander et al., 2001). In many ways, the HGP was similar to the ‘race to the moon’ of the 1960s, in that it spurred technological advances that have had an impact beyond the human genome. Advances in cloning, robotics, DNA preparation, automation of DNA sequencing, computing, and informatics have led to a democratization of genomics such that producing a genome sequence is now affordable and conceivable for many, if not all, crop genomes.
Some of the questions for crop genomics are: how to sequence a crop genome; when is a sequence complete (complete enough for use); is the community prepared for a genome sequence (how will the community use it and how will it be maintained and curated?); and how best to integrate genomics with crop improvement, which in the end is the real goal? Sequencing of plant and crop genomes is becoming routine; however, the results are at various levels of completeness. Arabidopsis thaliana was the first plant genome to be sequenced (Arabidopsis Genome Initiative, 2000), even predating the completed human genome. To date, it is probably the best sequenced and assembled plant genome in terms of completeness. Rice (IRGSP, 2005) followed quickly, along with a litany of other plant/crop genomes (Ming et al., 2008; Paterson et al., 2009; Schnable et al., 2009; Schmutz et al., 2010). So, to some extent, the question has turned from ‘How do I sequence my genome of interest?’ to ‘What do I do with all this sequence?’ How does one make sense of the mountain of sequence data and turn it into something useful for crop improvement? A further consideration is the preparedness or sophistication of the end-user community to effectively implement genome-based tools into their improvement pipelines.
Crop improvement has been described as an ‘art’ (Lewis, 1945; Crow, 2001), and to some extent this is still the case. However, in the face of mounting global challenges the ‘art’ needs to become empowered such that yield gains are predictable and adaptable to challenging environments. The global population is predicted to increase by as much as 50% by 2050 (United Nations, 2004), so crop production will also have to increase. However, the area of arable land is predicted to either stay static or even decrease over the same time-frame (Bouwman, 1997), and more marginal land will need to be used for crop production. Exacerbating this problem is that water and crop production agents (fertilizer, in particular) will become limiting and the vagaries of climate change and shifts in weather patterns will lead to greater unpredictability in the target environments. Thus, the challenge to crop scientists and geneticists/breeders, in particular, is to take advantage of all available tools in order to tackle these issues. Genomics will be a key part of the crop improvement toolbox.
Breeding is an evolving science that traces back to the possibly unintentional domestication of plants and animals c. 10 000 yr ago, and to the breeding/hybridization societies and later Mendel and the elucidation of the laws of genetics which later merged with quantitative genetics to bring science to bear on breeding. The application of science to the breeding process was incredibly successful, as evidenced by the development of hybrid seed and the seed business (Crow, 1998) and the Green Revolution (Evenson & Gollin, 2003). Crop breeding follows a general cycle: evaluation of phenotypes (genetic diversity); selection of superior phenotypes; crossing; and back to evaluation, restarting the process. Out of this process come superior genotypes that can be tested and developed into varieties. The process is much more complex, though, as many types of phenotypes have to be evaluated (e.g. disease resistance, stress adaptation, yield, quality, etc.). Until the 1980s this process was done almost entirely at the phenotypic level with little consideration of the underlying genetic processes that contributed to the various phenotypes. With genomics, it is possible to identify all the genes in a plant and then to begin to understand the genetic properties and networks that contribute to the development of a superior plant; however, even with these tools, breeding a better variety is still a complicated process. But genomics tools and technological advances will continue to increase the rate of gain from breeding and the precision by which superior genotypes are chosen, and will be a major player in the production of enough food for a growing world population.
II. Complexity of plant genomes
Early targets for genome sequencing, apart from the human genome, were genomes that were relatively small and, therefore, easier and less expensive to sequence, such as that of Arabidopsis, a plant with five chromosomes and a haploid genome of c. 120 Mbp (Meinke et al., 1998). In fact, the Arabidopsis genome is far smaller than most plant genomes. The average angiosperm genome is > 6000 Mbp per haploid genome (Gregory et al., 2007), roughly 50 times the size of the Arabidopsis genome and approximately twice the size of the human genome. Many economically important plant genomes are even larger. Wheat, for instance, is c. 15 Gbp per haploid genome and pine has at least a 26 Gbp genome (Valkonen et al., 1993). Genome size is one contributor to plant genome complexity; other contributors include polyploidy and repetitive DNA sequences, and, in particular, transposable elements. Together, these attributes of plant genomes increase the cost of sequencing and negatively impact the quality of the resulting sequence, especially as the field migrates from map-based sequencing (to be described later) to short-read whole-genome shotgun (WGS) sequencing.
Two primary factors that contribute to plant genome size and complexity are polyploidy and repetitive DNA sequences (crop exemplars shown in Fig. 1). Polyploidy is the accumulation of additional sets of chromosomes through either autopolyploidy, the doubling of the same genome, or allopolyploidy, the combination of two diverged genomes in the same nucleus. Increased chromosome number and DNA content are immediate consequences of polyploidy, but depending on when the polyploidy event occurred, increased chromosome number may not be immediately apparent, as ancient polyploidy events are likely to be shared by sister taxa and/or diploidization of chromosome number may have occurred (reduction of chromosome number via loss and rearrangements). Most, if not all, land plants have undergone polyploidy events at various times in their evolution (reviewed in Soltis et al., 2004). For example, soybean (Glycine max) has undergone at least three polyploidy events that, as a consequence of having a high-quality genome sequence (Schmutz et al., 2010), can now be examined. The first event, and the most difficult to detect, was one early in plant evolution shared by many land plants (Bowers et al., 2003). The second event was c. 45–55 million yr ago (Mya) and should be shared with legumes that diverged after that event, such as Medicago (Cannon et al., 2006). The most recent event, c. 5 Mya, was most likely an allopolyploid event (Gill et al., 2009) that was coincident with the emergence of the Glycine genus (Innes et al., 2008). Thus, the 1.1 Gbp soybean genome has relics of at least three polyploidy events that resulted in a genome that is a mosaic of duplicated segments (Schmutz et al., 2010). In the Glycine genus, however, there is an even more recent allopolyploid event that occurred in perennial species found in Australia (Doyle et al., 2002). Thus, polyploidy is a recurrent process that molds and shapes plant genomes during evolution.
In addition to polyploidy, repetitive DNA sequences and, in particular, transposable elements (TEs) compose large fractions of most plant genomes and are impediments to efficient genome sequencing. TEs have been reviewed in depth (Bennetzen et al., 2005); here we will focus only on contribution to genome obesity in plants and organization in plant genomes as it contributes to obtaining accurate genome sequences. There are several instances of rapid amplification of a few TE families that have resulted in increased genome size. Oryza australiensis, for example, is approximately twice the size of its nearest relatives as a result of the amplification of three TE families (Piegu et al., 2006). Maize is the most prominent example of genome obesity resulting from TE amplification (SanMiguel et al., 1996).
The complicating factor of TE amplification on genome sequencing is not primarily the increase in the amount of DNA to sequence, but rather the effect of many copies of the same sequence throughout the genome that make mapping and assembly difficult. If a TE family amplified recently, it can have thousands of copies scattered throughout the genome, all with very high sequence identity. If a genome is sequenced via a shotgun approach, as most genomes are nowadays, then these highly similar TEs will complicate assembly unless there are sufficient mate-pair reads that span the repeats (Fig. 2).
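The ambiguity that repeats create can be illustrated with a toy example. The sketch below is not the algorithm of any particular assembler; it simply records, for each (k-1)-mer in a sequence, which bases follow it, in the manner of a simplified k-mer graph, and reports the nodes where the graph forks. The sequences and the function name `branching_nodes` are invented for illustration.

```python
from collections import defaultdict

def branching_nodes(sequence, k):
    """Return the (k-1)-mers that are followed by more than one distinct base.

    Each such node is a point where a k-mer graph forks, so an assembler
    cannot extend a contig through it without extra information.
    """
    successors = defaultdict(set)
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        successors[kmer[:-1]].add(kmer[-1])
    return sorted(node for node, bases in successors.items() if len(bases) > 1)

# Hypothetical toy sequences, invented for illustration.
unique_region = "ACGTTGCA"
element = "GATTACA"  # stands in for a recently amplified TE
with_repeats = "CC" + element + "TG" + element + "AG"

print(branching_nodes(unique_region, 4))  # no forks in unique sequence
print(branching_nodes(with_repeats, 4))   # a fork where the two copies end
```

In the second sequence the two identical copies of the element end in different flanking bases, so the graph forks at the end of the element. A read pair of known spacing with one mate anchored in unique sequence on each side of the element supplies the information needed to choose the correct branch.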
III. Evolution of genome sequencing
Plant genome sequencing methodology paralleled the sequencing of the human genome. Arabidopsis was the first plant genome completed, in 2000 (Arabidopsis Genome Initiative, 2000), concurrently with the draft sequence from the HGP. The Arabidopsis project, like the HGP, was a multi-laboratory, multi-nation collaboration that produced a clone-by-clone sequenced and finished genome. Currently, according to the TAIR 9 release (http://www.arabidopsis.org), the genome includes 119.1 Mb of sequence. Typically, these early clone-by-clone projects started with a physical map of cosmid or bacterial artificial chromosome (BAC) clones (Mozo et al., 1999) assembled with FingerPrinted Contigs software (FPC; Soderlund et al., 1997) to form clone contigs, from which a tiling path of clones could be selected to cover the mapped genome space. Once selected, these clones could then form the basis for a large-scale distributed sequencing project and be stitched back together into chromosome-scale sequences. The rice genome (IRGSP, 2005) was just such a project and the latest release, MSU6 (http://www.gramene.org), covers 373.2 Mbp of the clonable portion of the genome.
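Selecting a tiling path from a clone contig can be sketched as an interval-cover problem. The example below is a minimal illustration that assumes each clone already has start and end coordinates on the physical map (real FPC contigs are built from fingerprint overlaps, not known coordinates). The clone names, the coordinates (in kb) and the function `minimal_tiling_path` are hypothetical.

```python
def minimal_tiling_path(clones, region_start, region_end):
    """Greedy interval cover over clones given as (name, start, end).

    At each step, choose the clone that overlaps the sequence covered so
    far and extends furthest to the right.
    """
    path = []
    covered_to = region_start
    while covered_to < region_end:
        candidates = [c for c in clones if c[1] <= covered_to < c[2]]
        if not candidates:
            raise ValueError(f"gap in physical map at position {covered_to}")
        best = max(candidates, key=lambda c: c[2])
        path.append(best[0])
        covered_to = best[2]
    return path

# Hypothetical BAC clones on a 400 kb contig (coordinates in kb).
clones = [
    ("BAC1", 0, 120),
    ("BAC2", 50, 180),
    ("BAC3", 100, 260),
    ("BAC4", 170, 300),
    ("BAC5", 250, 400),
]
print(minimal_tiling_path(clones, 0, 400))
```

Only the clones on the returned path need to be sequenced, which is why a good physical map reduced the sequencing load in clone-by-clone projects.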
In the wake of the successful completion of the rice genome sequence, several clone-by-clone crop genome projects were begun; however, few have yet been completed. The principal difficulty with completing these clone-based BAC projects has not been sequencing throughput, which increased dramatically over the course of the HGP, but rather the difficulty of coordinating mapping and sequencing, and meshing the progress with funding, which has caused progress on these projects to stutter over their project lifetimes. In addition, genome centers were adopting strategies developed from the sequencing of vertebrates, including the application of WGS sequencing approaches, proposed by E. Myers (Weber & Myers, 1997) and popularized by Celera.
Whole-genome shotgun sequencing promised to rapidly accelerate the acquisition of genome space by eliminating the massive library making steps required in a clone-by-clone approach and, to a certain extent, eliminating the physical mapping steps, at the cost of fidelity in the repetitive portions of the genomes and at the added cost of significant computational steps to reconstruct the genome sequence (Batzoglou et al., 2002). Essentially, the WGS strategy entails making several different-sized inserts from genomic DNA, which are then sequenced from both ends (Fig. 2). These sequences are then compared by a computer algorithm, which attempts to reconstruct a single linear piece of DNA sequence, constrained by the estimated size difference between the end reads. This strategy is, by its nature, much faster than BAC clone library-based sequencing; however, it carries significant additional potential failure points and an increased degree of difficulty in order to produce a complete genome sequence. WGS does have the advantage of allowing one to see all of the genomic sequence from an organism at once, rather than relying on a narrower view of the assembled, mapped clone contigs.
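The constraint that paired end reads place on an assembly can be shown with a small calculation. The sketch below assumes an idealized library with a single known insert size and read pairs whose two mates land on different contigs; it estimates the size of the gap between the contigs. The function names and all numbers are hypothetical, and real assemblers use a distribution of insert sizes, not a single value.

```python
from statistics import median

def gap_from_pair(insert_size, contig_a_length, mate1_start_in_a, mate2_end_in_b):
    """Gap implied by one read pair whose mates fall on two different contigs.

    mate1_start_in_a: 0-based position in contig A where the insert begins.
    mate2_end_in_b:   number of bases of contig B covered by the insert,
                      counted from the start of contig B.
    """
    covered_in_a = contig_a_length - mate1_start_in_a
    return insert_size - covered_in_a - mate2_end_in_b

def estimate_gap(pairs, insert_size, contig_a_length):
    """Median gap over all linking pairs; robust to a few mis-sized inserts."""
    return median(gap_from_pair(insert_size, contig_a_length, a, b) for a, b in pairs)

# Hypothetical 3 kb insert library linking a 10 kb contig A to contig B.
linking_pairs = [(9000, 1500), (9200, 1720), (8800, 1280)]
print(estimate_gap(linking_pairs, insert_size=3000, contig_a_length=10_000))
```

Several insert sizes are used in practice because short inserts resolve local order while long inserts bridge larger repeats and gaps.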
Early attempts at WGS sequencing of plants were moderately successful (poplar (Tuskan et al., 2006), Chlamydomonas (Merchant et al., 2007), grapevine (Jaillon et al., 2007), Physcomitrella (Rensing et al., 2008) and papaya (Ming et al., 2008)) and paved the way for the high-point of Sanger-based reference plant genomes produced in the previous 3 yr. We have seen the completion of draft WGS plant genome sequence projects for two key crop species, Sorghum bicolor (Paterson et al., 2009) and G. max (Schmutz et al., 2010), and the completion and annotation of at least nine other high-quality reference WGS drafts, all produced at the Department of Energy Joint Genome Institute (see http://www.phytozome.net/). For the most part, these WGS assemblies include BAC end sequenced libraries and are assembled into pseudomolecules to facilitate use of these genomes as references. However, the largest plant reference sequence project produced to date was not a WGS, but rather the clone-by-clone maize (Schnable et al., 2009) genome sequencing project, begun in 2005 and completed in 2009, which covers 2 Gbp of the 3 Gbp of the Zea mays genome.
Even as the vast majority of plant genome reference sequences were being produced in the last few years, there has again been a shift in plant genome sequencing. With the introduction of short-read pyrosequencing, for example, 454 Life Sciences (Margulies et al., 2005), and sequencing by synthesis systems (for a review see (Fuller et al., 2009)), another shift in producing de novo references for plant genomes is occurring. These new systems, particularly the Illumina (previously Solexa) platform, can produce sequence data 100 times more cheaply than the Sanger-based technology used to produce the majority of plant references available today. These new projects usually include a mix of Sanger-based, long-insert fosmid or BAC end sequences combined with short contigs assembled from pyrosequencing or sequencing by synthesis systems. Much like the shift from clone-by-clone to a WGS sequencing strategy, these approaches with next-generation sequencing (NGS) have resulted in a proliferation of ongoing sequencing projects, most of which will produce references of significantly less accuracy and poorer completeness than the Sanger projects that have come before. This is because of the short length of sequence reads, lack of adequate pairing methodologies for all of the NGS platforms, and a bias against AT-rich sequences; all of these issues cause significant problems in reconstructing plant genome sequences.
To date, there have been two published plant genomes that combine pyrosequencing and Sanger-based sequencing (cucumber (Huang et al., 2009) and apple (Velasco et al., 2010)), neither of which has approached the quality and completeness of previous Sanger sequenced genomes. With the current NGS methodologies, producing a reference of that quality is not possible. Owing to the extremely low cost of sequencing by synthesis (SBS), there is already some movement towards producing plant genome sequences based solely on SBS; the only published example in the vertebrate world is the panda genome (Li et al., 2009). It is important to note here that SBS provides the ability to sample, at low cost, the unique space of a genome in question. However, it remains to be seen how successfully these short-read de novo WGS strategies will contribute to crop-based scientific goals.
IV. Future of genome sequencing
As NGS sequencing capability is rapidly expanding and the major producers of Sanger sequence have been reducing their capacity, we are unlikely to continue to see new Sanger-based plant reference genomes. In the near future, hybrid methods based on pyrosequencing and Sanger long pairs will likely continue to be produced. The products will capture the majority of gene space from a genome, but will suffer from unresolved repetitive sequences and a tendency to miss a substantial portion of the genetic code for an organism. These hybrid projects will likely only be a focus for a short time as the sequencing community moves toward producing de novo genomes based on sequencing by synthesis and, in particular, the Illumina sequencing platform. Although the Illumina platform was principally developed to resequence human genomes, many groups are working to adapt a genome strategy to take advantage of its incredibly low cost to produce data. In the past 2 yr, there has been a resurgence of de novo assembly algorithm development, a movement not seen since WGS was adopted by the major genome sequencing centers (Abyss (Simpson et al., 2009), Velvet (Zerbino & Birney, 2008), Allpaths (Butler et al., 2008), SOAPdenovo (http://soap.genomics.org.cn)). The widespread availability of these SBS machines and code to string together short reads into larger contigs and scaffolds, combined with the lower cost of data collections, has catalyzed plant genomics researchers to attempt genomic projects that were once viewed as impossible. It remains to be seen how complete and useful Illumina WGS sequenced plant genomes will be, as none of these projects has yet been brought to a conclusion that resembles a typical reference genome sequence.
Crop genomes will likely be particularly challenging for short-read-based sequencing, as in addition to the normal difficulties of plant repetitive sequences, many planted crop varieties can also have recent polyploidy events and high polymorphism rates. Therefore, simple-genome crop model plants are likely to be more amenable to short-read WGS sequencing. However, sequencing of polyploids, such as wheat, peanut or coffee, will be done using NGS, and several approaches can be used to assemble the short-read sequence contigs/scaffolds into longer-range scaffolds that might represent individual chromosomes. The divergence time between progenitor genomes in the polyploid will have an impact on WGS, as genomes that are not very diverged (e.g. autotetraploid or a recent allopolyploid) will confound sequence assembly, because NGS short reads from the subgenomes may not be diverged enough to assign unambiguously. This is ignoring the problem of heterozygosity, where the sequencing of highly heterozygous plants will essentially result in the assembly of haplotypes, and thus the sequencing depth will have to be higher to obtain sufficient coverage of each chromosome. There will still be problems in constructing completely ordered haplotypes/chromosomes, as some regions will be identical (e.g. identical by descent, IBD) and will collapse the two haplotypes. However, there are new technologies on the horizon, or in maturation, that may help to solve some of these problems. Physical maps of large insert clones with restriction site sequence information may be used to help align and order NGS sequence contigs into longer-range scaffolds (van Oeveren et al., 2011), and optical mapping (essentially an ordered restriction map of a genome; Zhou et al., 2009) may be useful to align sequence scaffolds to larger chromosome regions.
Perhaps even more promising are new single-molecule sequencing technologies that can produce sequence reads that are tens of kilobases long (http://www.pacificbiosciences.com). These long reads may be useful to ‘string’ together the small sequence scaffolds from short-read WGS.
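The earlier point that heterozygous genomes need greater sequencing depth can be made concrete with the standard Lander-Waterman (Poisson) approximation, in which mean coverage is c = NL/G and the expected fraction of bases left unsequenced is e^(-c). The sketch below is an idealized calculation that assumes reads are sampled uniformly and ignores repeats and sequence bias. The read count and read length are hypothetical; the 1.1 Gbp genome size is the soybean figure quoted above.

```python
import math

def mean_coverage(n_reads, read_length, genome_size):
    """Mean depth c = N * L / G."""
    return n_reads * read_length / genome_size

def expected_fraction_missed(coverage):
    """Poisson approximation: probability that a given base has zero reads."""
    return math.exp(-coverage)

genome_size = 1_100_000_000   # 1.1 Gbp haploid genome
n_reads = 110_000_000         # hypothetical number of reads
read_length = 100             # hypothetical read length (bp)

total = mean_coverage(n_reads, read_length, genome_size)
# If the two haplotypes of a heterozygous diploid assemble separately,
# each one receives only half of the reads.
per_haplotype = total / 2

print(total, expected_fraction_missed(total))
print(per_haplotype, expected_fraction_missed(per_haplotype))
```

Under these assumptions, halving the depth per haplotype raises the expected fraction of missed bases by orders of magnitude, which is one reason highly heterozygous or polyploid crops require deeper sequencing than inbred diploids.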
In the near term for crops, we will see the generation of large single nucleotide polymorphism (SNP) resources and the ability to resequence large germplasm collections for crops with reference genomes. Even without a reference genome, we will have the ability to carry out phenotypic studies in populations with direct short-read sequencing to identify collections of segregating alleles that can be used as functional markers or breeding targets. Even though, in the short term, we are likely to see a reduction in reference plant genomes, there is hope that as new technologies are developed, the ability to recover full genome sequences will once again become commonplace. New technologies such as single molecule sequencers (Pacific Biosciences (Eid et al., 2009) or nanopore technology (Clarke et al., 2009)) may provide us with the ability, once again, to generate the long, accurate sequence reads that gave us the current abundance of plant reference genomes that is necessary for truly detailed genome analysis and layering on additional functional information such as epigenetic marks, resequencing of diverse lines, etc.
V. Application of genomics for crop improvement
The inescapable threat of global climate change with its associated fluctuations in patterns of drought, heat, and flooding brings new challenges to strategies of crop improvement. The burgeoning world population that accompanies this new era for humankind makes the finding of solutions to improving food, feed, fuel and fiber more urgent than at any time in our history. Although traditional plant breeding has produced impressive gains in world food production and safety, it is unlikely that unassisted breeding will be up to the challenges at hand.
Marker-assisted selection (MAS) offers a method by which selection for specific traits can be greatly accelerated. However, many important crop traits are polygenic, have low heritability, and, by their nature, possess large genotype × environment (G × E) interactions (Fleury et al., 2010). Consequently, MAS is most successful with relatively simple traits and those inherited in a Mendelian fashion (Bouchez et al., 2002). However, genomic selection (Meuwissen et al., 2001), paired with the increase in marker resolution and the decrease in cost, will lead to improved breeding strategies that use large amounts of genomic information, together with estimated breeding values assigned to markers/haplotypes, to expedite the breeding process and increase the rate of gain. Recent work in maize and rice will lead to the widespread application of this approach in plant improvement. In maize, high-resolution mapping in a large number of families for flowering time, a quantitative trait, uncovered a large number of small-effect quantitative trait loci (QTLs) that acted in an additive fashion to determine flowering time (Buckler et al., 2009). In rice, resequencing and phenotyping of > 500 lines allowed the identification of a large number of QTLs controlling 14 different agronomic traits (Huang et al., 2010). It is easy to see how marker information for a large number of small- and even medium- to large-effect QTLs could expedite the selection process.
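The idea of assigning estimated effects to markers and summing them per candidate can be sketched in a few lines. The example below assumes a purely additive model and takes the marker effects as given; in genomic selection they would be estimated jointly from a phenotyped and genotyped training population, a step not shown here. The marker names, effect sizes (days to flowering per allele copy) and genotypes are all hypothetical.

```python
def genomic_estimated_breeding_value(genotype, marker_effects):
    """Additive model: sum of allele dosage (0, 1 or 2) times marker effect."""
    return sum(genotype[marker] * effect for marker, effect in marker_effects.items())

# Hypothetical marker effects, assumed to come from a training population.
marker_effects = {"m1": -0.5, "m2": 0.25, "m3": -1.0}

# Hypothetical candidate lines with allele dosages at each marker.
candidates = {
    "lineA": {"m1": 2, "m2": 0, "m3": 1},
    "lineB": {"m1": 0, "m2": 2, "m3": 0},
    "lineC": {"m1": 1, "m2": 1, "m3": 2},
}

gebv = {
    name: genomic_estimated_breeding_value(genotype, marker_effects)
    for name, genotype in candidates.items()
}
# Rank from earliest to latest predicted flowering.
ranked = sorted(gebv, key=gebv.get)
print(gebv)
print(ranked)
```

Because the prediction needs only genotypes, candidates can be ranked before they are phenotyped, which is the source of the expected gain in breeding speed.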
The genetic mapping of QTLs has been ongoing for many years. Identification of QTLs for a given trait is relatively straightforward as long as the population in which your data are collected possesses genetic variation for the trait. These QTLs tell us within a statistical range the region of a chromosome in which gene(s) affecting the trait are likely to reside. Ultimately, though, to fully take advantage of the QTLs, it is necessary to determine the identity of the genes responsible for the variation in the trait (Mackay, 2001) and to understand the molecular basis of the QTLs (Hansen et al., 2008). The integration of genetic maps with associated QTLs, physical maps, and whole-genome sequence is a necessary aid to making these connections (a partial example is shown in Fig. 3).
The application of whole-genome transcriptome analyses, by either microarray or various sequencing approaches, is beginning to provide us with an understanding of the regulation of gene expression (Hansen et al., 2008). Through transcriptomic analyses, the interactions of biochemical pathways and of QTLs are beginning to be uncovered. Indeed, a regulatory network governing flowering time has been successfully constructed in Arabidopsis using whole-genome expression analyses (Keurentjes et al., 2007). The association of gene expression data from a variety of tissues and developmental stages (Severin et al., 2010) with gene models gives us a completely new layer of information from which to begin to understand gene interactions and regulation (Fig. 4).
The availability of high-quality whole-genome sequence assemblies for major crops such as soybean (Schmutz et al., 2010) and maize (Schnable et al., 2009) creates a paradigm-shifting change in how we can approach crop improvement. We now have access to all of the many thousands of genes that make up an organism. Not only does this provide a wealth of candidate genes underlying important QTLs, but the sequence itself, as a framework, also provides a means by which a virtually unlimited amount of information may be mined. The cost of resequencing genomes has plunged and promises to decline even further. Resequencing of old varieties, landraces and even more newly released cultivars has the potential to uncover allelic diversity that has not been seen before, and to draw our attention to the regions of the genome that breeders have unknowingly focused upon in their traditional breeding efforts. These allelic differences provide a rich and nearly unlimited source of polymorphisms from which to create ‘perfect’ markers or to saturate specific regions of a genome (Hyten et al., 2010).
Genome sequence alone may not tell us much about where within the genome to focus our attention in breeding programs, but genome sequence coupled with transcriptomics may tell us a lot. The fusion of genetics and genomics to better understand gene function and gene interrelations has been termed ‘genetical genomics’ (Jansen & Nap, 2001). Expression QTLs (eQTLs) are detectable when genetic variation within the genome results in changes in transcript abundance (Potokina et al., 2008; Holloway & Li, 2010). When eQTLs are correlated with traditional QTL phenotypic information, the function of the allelic variation uncovered in genome sequencing projects may begin to be discerned (Holloway & Li, 2010).
Many of the underlying factors that control complex agronomic traits such as iron homeostasis, and consequently, that contribute to the QTLs for those traits, may be the result of transcription factors (O’Rourke et al., 2009). In support of this concept, two genes important in the domestication of maize (tb1 and tga1) were cloned and both were found to be transcription factors (Doebley et al., 1997; Wang et al., 2005). The magnitude of the effect of the factors responsible for eQTLs is determined by whether they are cis- or trans-acting (Hansen et al., 2008; Holloway & Li, 2010). Cis-acting factors reside in or near the gene responsible for the eQTLs, such as a polymorphism within a promoter, or internal to the gene itself. A trans-acting factor, however, resides at a location that is not co-located with the gene whose expression is measured (Hansen et al., 2008). Cis-acting factors generally have a stronger effect. A study in wheat using doubled haploid lines and a hybridization-based Affymetrix GeneChip recently uncovered a total of 542 distinct eQTLs contributing to seed development (Jordan et al., 2007). In this study, two chromosomes were found to have a rich source of trans-regulatory factors controlling this trait. Studies of physiological disease in rat using transcriptional profiling identified many cis-acting, monogenic traits that were good candidates to explain previously mapped physiological loci (Hubner et al., 2005). A set of 73 candidate genes for hypertension alone was identified. The same approaches can easily be carried out in plants.
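The cis/trans distinction described above is usually made operational with a distance rule. The sketch below is one simple convention, not the method of any study cited here: an eQTL is called cis if it lies on the same chromosome as the gene and within a chosen window of it, and trans otherwise. The 1 Mbp default window, the function name and the coordinates are assumptions for illustration; published studies choose different thresholds.

```python
def classify_eqtl(eqtl_chrom, eqtl_pos, gene_chrom, gene_start, gene_end, window=1_000_000):
    """Label an eQTL 'cis' if it maps within `window` bp of the gene, else 'trans'."""
    same_chromosome = eqtl_chrom == gene_chrom
    nearby = gene_start - window <= eqtl_pos <= gene_end + window
    return "cis" if same_chromosome and nearby else "trans"

# Hypothetical gene on chromosome 3 at 5.20-5.21 Mbp.
print(classify_eqtl("chr3", 5_150_000, "chr3", 5_200_000, 5_210_000))   # close to the gene
print(classify_eqtl("chr3", 40_000_000, "chr3", 5_200_000, 5_210_000))  # same chromosome, far away
print(classify_eqtl("chr7", 5_205_000, "chr3", 5_200_000, 5_210_000))   # different chromosome
```

The window should reflect the mapping resolution of the population, since a coarse QTL interval cannot separate a promoter polymorphism from a closely linked regulator.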
The adaptation of genomic-scale data analyses requires new breakthroughs in statistics and modeling (Chenu et al., 2009). Our concepts of comparative genomics need to be advanced to include comparative functional analyses, and analyses of the so-called ‘interactome’. The role of epigenetics in plant responses to the environment is only beginning to be understood and will be greatly advanced by research in crops such as rice and maize (Raghuvanshi et al., 2010).
One of the limitations of the adaptation of genomic tools to crop improvement lies in our reliance upon a reductionist approach to asking questions. Plant responses to the environment are incredibly complex and intertwined, and only the simplest steps can be explained through a single-gene approach. In order to make progress in many of the complex traits with which we work, we will need to understand interactions between stress responses and interactions among biochemical pathways (Fleury et al., 2010) and adopt a ‘systems biology’ approach. An exciting approach is the advent of high-throughput phenomics for crop plants, whereby, under controlled conditions, large numbers of plants can be screened for precise measurements of many phenotypic traits (e.g. http://www.plantphenomics.com/partners/). Coupled with NGS-based genotyping, this may help to accelerate plant improvement.
VI. Unlocking the potential of genetic diversity through genomic approaches
Crop genetic diversity, which refers to variation of the genes within a crop species, is the basis of the ability of crops to adapt to changes in their environments and to respond to natural selection. As a result of domestication and modern practices of crop production with relatively few and genetically similar high-yielding cultivars, crop genetic diversity has declined dramatically (Tanksley & McCouch, 1997; Hyten et al., 2006). Crop populations with little genetic variation are more vulnerable to new diseases, insect pests, and global climate changes. Since long-term food security is threatened by the inability of crops to quickly adapt to rapidly changing conditions (Brown & Funk, 2008; Turner et al., 2009), great effort is being devoted to enhancing the genetic diversity of elite breeding pools using mutants, landraces, and/or wild species closely related to the cultivated crop.
Before the advent of molecular genetics, plant accessions were profiled on plant morphology and phenotypic traits (Gilbert et al., 1999; Hoisington et al., 1999). Pedigree and geographical distribution analyses were also used for measuring genetic diversity (Hammer, 2003). A renewed impetus toward diversity analysis based on genotype rather than phenotype was made possible by the development of modern molecular marker techniques (Tanksley & McCouch, 1997). Various molecular markers have been developed (Gupta et al., 2001) and although these markers are useful for determination of phylogenetic relationships and population structure, QTL mapping, map-based cloning, and MAS (Moose & Mumm, 2008), they are not suitable for measuring adaptive genetic diversity. Therefore, diversity analysis should, more appropriately, be based on functional genes or whole-genome sequences.
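One common sequence-based measure of diversity is nucleotide diversity (π), the average number of pairwise differences per site among a set of aligned sequences. The sketch below is a minimal illustration that assumes ungapped alignments of equal length; the two sets of sequences are invented to mimic a diverse landrace pool and a narrow elite pool and are not real data.

```python
from itertools import combinations

def nucleotide_diversity(sequences):
    """Average pairwise differences per site across aligned sequences."""
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("sequences must be aligned to the same length")
    pairs = list(combinations(sequences, 2))
    differences = sum(
        sum(a != b for a, b in zip(seq1, seq2)) for seq1, seq2 in pairs
    )
    return differences / (len(pairs) * length)

# Hypothetical aligned alleles of one gene.
landraces = ["ACGTACGT", "ACGTTCGT", "ACCTACGA", "ACGTACGT"]
elite = ["ACGTACGT", "ACGTACGT", "ACGTACGT", "ACGTTCGT"]

print(nucleotide_diversity(landraces))
print(nucleotide_diversity(elite))
```

Computed gene by gene or in windows along a reference genome, the same statistic can highlight regions where diversity has been lost in elite material.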
Only a few plant/crop genomes have rich genome-based databases that incorporate many levels and types of information such as QTLs, expression data, mutants, physical maps, genetic markers, and genetic diversity, to name but a few (soybean, http://www.soybase.org; rice, http://www.gramene.org; Arabidopsis, http://www.tair.org). As more crop genomes are sequenced, the need for integrated databases will continue to grow in order to curate the genome sequences in a fashion that facilitates crop improvement. At least two roadblocks exist here: first, the lack of continued financial support for database development and maintenance, together with the perceived lack of intellectual contribution of database developers/managers; and second, the decreasing quality of genome sequences, which makes it difficult to organize additional data on top of a genome sequence. The more fragmented the genome, the more difficult it is to create a truly functional database with layered information.
Resequencing with NGS technologies, assuming a reference-like genome exists, is one of the most powerful applications of genomics for crop improvement. Resequencing requires a reference genome, whereas de novo assembly does not. However, de novo assembly of plant genomes using NGS with short-read lengths is not yet a suitable tool because of the high complexity of most plant genomes as a result of extensive duplication and the presence of repeat sequences (Varshney et al., 2009). Thus, NGS technologies may be widely applied for resequencing of species that have a complete reference genome sequence, primarily for identifying SNPs useful as DNA markers (Akhunov et al., 2009; Yan et al., 2010b; You et al., 2011), examining selection patterns either in advanced populations or during domestication (Gore et al., 2009; McMullen et al., 2009), or finding functional alleles (Thornsberry et al., 2001; Yan et al., 2010a). Genome-wide SNP genotyping is a powerful tool for association mapping and evolutionary studies (Akhunov et al., 2009). Community-developed SNP panels often have limited utility in broader sets of germplasm; however, genotyping by sequencing will overcome these limitations and provide many more polymorphic markers (Huang et al., 2010). These NGS technologies, and the large numbers of genome-wide markers developed with them, such as RAD (restriction site-associated DNA)-based markers (Baird et al., 2008), are also deployed for the construction of high-density maps and genetic diversity analysis (Gupta et al., 2008).
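The idea of identifying SNPs by resequencing against a reference can be illustrated with a minimal sketch. It assumes that each accession has already been reduced to a consensus sequence aligned base-for-base to the reference, with no indels and with 'N' marking a missing call; real pipelines work from read alignments with base and mapping qualities, which this toy example ignores. The function name `find_snps`, its parameters, and the sequences are hypothetical and serve only as illustration.

```python
from collections import Counter


def find_snps(reference, aligned_samples, min_non_ref=1):
    """Report sites where resequenced accessions differ from the reference.

    reference: reference sequence (string).
    aligned_samples: dict mapping accession name -> sequence aligned to the
        reference (same length, no indels; 'N' marks a missing call).
    min_non_ref: minimum number of accessions carrying a non-reference base
        for a site to be reported (a hypothetical filter for this sketch).
    Returns a list of (1-based position, reference base, allele counts).
    """
    for name, seq in aligned_samples.items():
        if len(seq) != len(reference):
            raise ValueError(f"{name} is not aligned to the full reference")

    snps = []
    for pos, ref_base in enumerate(reference):
        # Count the called bases at this site, skipping missing calls.
        counts = Counter(
            seq[pos] for seq in aligned_samples.values() if seq[pos] != "N"
        )
        non_ref = sum(n for base, n in counts.items() if base != ref_base)
        if non_ref >= min_non_ref:
            snps.append((pos + 1, ref_base, dict(counts)))
    return snps


if __name__ == "__main__":
    # Illustrative data only; these are not real sequences.
    reference = "ACGTACGT"
    accessions = {
        "acc1": "ACGTACGT",
        "acc2": "ACCTACGT",
        "acc3": "ACCTANGT",
    }
    for position, ref_base, alleles in find_snps(reference, accessions):
        print(position, ref_base, alleles)
```

Raising `min_non_ref` is one simple way to discard singleton differences, which in real data are often sequencing errors rather than true polymorphisms.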
The availability of NGS and whole reference genome sequences for major crops such as rice, soybean, and maize provides unique opportunities for exploring DNA-level diversity among members of a crop species and its relationship to phenotypic diversity (Paterson et al., 2010). The ultimate goal for resequencing members within a species is to understand the molecular basis for phenotype–genotype relationships. Diversity panels of hundreds to thousands of genotypes, selected to sample the spectrum of diversity in a given species and resequenced against a reference genome using NGS technologies, will provide a platform for understanding existing genetic diversity, associating gene(s) with phenotypes, and exploiting natural genetic diversity to help develop superior genotypes. In order to do this effectively, extensive phenotypic data will need to be collected for the diversity panels in a given species and combined with resequencing data. Collecting phenotypic data is potentially the biggest stumbling block for effective utilization of genomics technologies in advanced plant improvement. The primary reason for this is that many, if not most, phenotypic traits require an experienced eye and a skilled hand to score them effectively and consistently. Consequently, phenomics (mass collection of phenotypes) has not kept pace with advances in genomics; moreover, fewer and fewer people are being trained in disciplines that can collect relevant phenotypes.
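Combining resequencing data with phenotypes, as described above, can be sketched in its simplest form as a single-marker test: regress a quantitative phenotype on the allele dosage (0, 1 or 2 copies of the non-reference allele) at one SNP across a diversity panel. This is only an illustrative sketch. It assumes unrelated accessions and ignores population structure and multiple testing, both of which real association studies must account for. The function name and the data are hypothetical.

```python
def single_marker_association(dosages, phenotypes):
    """Least-squares regression of a phenotype on allele dosage at one SNP.

    dosages: non-reference allele count per accession (0, 1 or 2).
    phenotypes: one quantitative trait value per accession, in the same order.
    Returns (slope, intercept, r_squared), where the slope estimates the
    additive effect of one copy of the non-reference allele.
    """
    if len(dosages) != len(phenotypes) or len(dosages) < 3:
        raise ValueError("need matching lists with at least three accessions")

    n = len(dosages)
    mean_x = sum(dosages) / n
    mean_y = sum(phenotypes) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    syy = sum((y - mean_y) ** 2 for y in phenotypes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, phenotypes))
    if sxx == 0 or syy == 0:
        raise ValueError("marker or phenotype shows no variation in this panel")

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)  # share of trait variance explained
    return slope, intercept, r_squared


if __name__ == "__main__":
    # Illustrative values only; not measured data.
    dosages = [0, 0, 1, 1, 2, 2]
    trait = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
    slope, intercept, r_squared = single_marker_association(dosages, trait)
    print(f"effect per allele: {slope:.2f}, r^2: {r_squared:.3f}")
```

The sketch also shows why phenotyping is the limiting step: the estimate is only as good as the trait values supplied, and every accession in the panel needs one.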
Perhaps most importantly, a new paradigm is needed to train the next generation of plant scientists. Plant scientists are needed who are able to think in a systems biology manner. Breeders are needed who can apply genomics and develop new phenomics technologies to truly advance the improvement process and take advantage of the potential of genomics. Ancillary to this is a need for computational sciences to integrate genomic and phenotypic data in advanced ways, so that predictions can be made and rational crosses designed on the basis of these predictions. Engineers engaged with plant scientists are needed to create new platforms to rapidly and accurately collect phenotypes on thousands of plants at a time: phenotypes from disease to seed composition/quality and mineral content, to plant vigor and growth processes. Improvements are being made, but rapid and transformational advances are needed if we, as plant scientists, are going to meet the challenges facing the world. As is often the case with our tool-building species, our limitations may not be the result of a lack of technology, but of a lack of comprehension as to how best to apply that technology.
Funding from the US National Science Foundation (BIO 0822258) and the Next-Generation BioGreen21 Program (No. PJ0081172011), Rural Development Administration, Republic of Korea helped in the preparation of this manuscript.