With the development of molecular techniques, MAS is now used to enhance traditional breeding programs to improve crops, and modern plant breeding depends on molecular markers for the rapid and precise analysis of germplasm and for trait mapping (Koebner and Summers, 2002). Molecular markers complement traditional selection: they are used to select parental genotypes in breeding programs, eliminate linkage drag in back-crossing and select for traits that are difficult to measure using phenotypic assays. They can also increase our understanding of phenotypic characteristics and their genetic association, which may modify the breeding strategy. MAS allows the breeder to select for a trait early in a breeding program, and it is particularly useful when the trait is under complex genetic control, or when phenotypic trials are unreliable or expensive. Increasing the frequency of favourable alleles early in the breeding process allows a larger number of small populations to be carried forward, each of which has been pre-screened to remove or reduce the frequency of unfavourable alleles. While second generation sequencing can readily be applied to discover markers for MAS, there is little if any benefit in using whole genome sequencing during selection, as the vast majority of SNP are not associated with agronomic traits. The cost of genotyping SNP in large populations continues to decline, and it is unlikely that whole genome sequencing will replace specific marker genotyping for MAS in the near future.
Molecular markers have revolutionized genome mapping over the last two decades, and the high density of markers that can now be generated from second generation sequence data offers the potential for generating very high density genetic maps. These markers can be used to develop haplotypes for genes or regions of interest, and complete genome mapping is now becoming a reality. Genetic mapping places molecular genetic markers in linkage groups based on their co-segregation in a population. The genetic map predicts the linear arrangement of markers on a chromosome and maps are prepared by analysing populations derived from crosses of genetically diverse parents, and estimating the recombination frequency between genetic loci. Many types of markers can be used for map construction, with population size and marker density being important for map resolution. SNP identified within whole genome sequence or large genomic fragments maintained within BAC can be applied for the genetic mapping of complex traits. This enables the genetic mapping of specific genes of interest and assists in the identification of linked or perfect markers for traits, as well as increasing the density of markers on genetic maps (Rafalski, 2002). The development of these markers also allows the integration of genetic and physical maps. The use of common molecular genetic markers across related species permits the comparison of linkage maps. This allows the translation of information between model species with sequenced genomes and non-model species (Moore et al., 1995). Furthermore, the integration of molecular marker data with genomics, proteomics and phenomics data allows researchers to link sequenced genome data with observed traits, bridging the genome to phenome divide. These markers can then be used routinely in crop breeding programs.
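As a minimal sketch of how a recombination frequency is estimated and converted to a map distance, the Python below scores two markers across a hypothetical population in which each individual is coded by parent of origin ("A" or "B", with "-" for missing data). The function names and the coding scheme are illustrative assumptions; the Haldane and Kosambi mapping functions are the standard textbook forms, and dedicated mapping software would additionally handle marker ordering, genotyping error and other population types.

```python
import math


def recombination_fraction(marker1, marker2):
    """Estimate r between two markers scored on the same individuals.

    Each marker is a string with one character per individual: 'A' or 'B'
    for the parental allele, '-' for a missing call (assumed coding).
    Individuals whose calls differ between the markers are recombinants.
    """
    pairs = [(a, b) for a, b in zip(marker1, marker2) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no individuals scored at both markers")
    recombinants = sum(1 for a, b in pairs if a != b)
    return recombinants / len(pairs)


def haldane_cm(r):
    """Haldane map distance in centimorgans (assumes no interference)."""
    if not 0 <= r < 0.5:
        raise ValueError("r must be in [0, 0.5)")
    return -50.0 * math.log(1.0 - 2.0 * r)


def kosambi_cm(r):
    """Kosambi map distance in centimorgans (allows for interference)."""
    if not 0 <= r < 0.5:
        raise ValueError("r must be in [0, 0.5)")
    return 25.0 * math.log((1.0 + 2.0 * r) / (1.0 - 2.0 * r))


if __name__ == "__main__":
    m1 = "AABBAABBAB"
    m2 = "AABBABBBAB"  # one individual differs from m1
    r = recombination_fraction(m1, m2)
    print(r, round(haldane_cm(r), 2), round(kosambi_cm(r), 2))
```

With only ten individuals the estimate of r is very coarse, which illustrates why population size limits map resolution.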
Association mapping is a further statistical method to identify genetic loci associated with phenotypic trait variation, and it shares much in common with quantitative trait loci (QTL) mapping. QTL mapping generally involves the use of structured populations, in which relatively distant markers can segregate with the QTL, so the gene is located only within a wide genetic region. The unstructured populations used in association mapping represent many more recombination events and are often many generations from a common ancestor, providing the potential for greater resolution for a given population size. Advances in genome sequencing technology, which allow the production of millions of markers, provide an increasing ability to generate large quantities of molecular marker genotyping data. This favours association studies over traditional QTL mapping, and association studies are therefore likely to become more common.
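The simplest form of a marker-trait association test can be sketched as a chi-square test on a 2 x 2 table of allele counts in two trait classes (for example, resistant and susceptible lines). The function name and the example counts below are hypothetical, and the sketch deliberately ignores population structure and relatedness, which association analyses of unstructured populations normally need to account for.

```python
import math


def allelic_association(counts):
    """Chi-square test (1 degree of freedom) on a 2 x 2 allele-count table.

    counts = ((a, b), (c, d)): rows are the two trait classes, columns are
    the counts of the two alleles of a biallelic marker.
    Returns (chi_square, p_value).
    """
    (a, b), (c, d) = counts
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs at least one observation")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


if __name__ == "__main__":
    # Hypothetical counts: allele 1 is enriched in the first trait class.
    chi2, p = allelic_association(((30, 10), (10, 30)))
    print(f"chi2 = {chi2:.2f}, p = {p:.2e}")
```

In a genome-wide scan this test would be repeated for every marker, so the resulting p-values would also need correction for multiple testing.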
Molecular marker discovery
Single nucleotide polymorphisms now dominate molecular marker applications because of recent advances in DNA sequencing technology that enable their discovery, and the development of high throughput assays. As with most molecular markers, the factor limiting the implementation of SNP is the initial cost of their development (Duran et al., 2009b). SNP discovery involves finding differences between two or more sequences. Traditionally this has been performed through PCR amplification of genes or genomic regions of interest from multiple individuals selected to represent diversity in the species or population of interest, followed by either direct sequencing of these amplicons, or the more expensive method of cloning and sequencing. Sequences are then aligned and any polymorphisms identified. This approach is frequently prohibitively expensive and time consuming for identifying the large number of SNP required for most applications, such as genetic mapping and association studies.
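The final step of this traditional workflow, identifying polymorphisms once sequences are aligned, can be illustrated with a short Python sketch. It assumes the amplicon sequences have already been aligned to equal length by an external tool, with "-" for gaps and "N" for uncalled bases; the function name and the example individuals are hypothetical.

```python
def find_polymorphic_sites(aligned):
    """Report alignment columns where individuals carry different bases.

    aligned: dict mapping individual name -> aligned sequence. All sequences
    must have the same length. Gaps ('-') and ambiguous bases are ignored.
    Returns a list of (1-based position, {base: [individuals]}).
    """
    lengths = {len(seq) for seq in aligned.values()}
    if len(lengths) != 1:
        raise ValueError("sequences must be aligned to the same length")
    sites = []
    for pos in range(lengths.pop()):
        alleles = {}
        for name, seq in aligned.items():
            base = seq[pos].upper()
            if base in ("A", "C", "G", "T"):
                alleles.setdefault(base, []).append(name)
        if len(alleles) > 1:  # more than one base observed: candidate SNP
            sites.append((pos + 1, alleles))
    return sites


if __name__ == "__main__":
    alignment = {
        "ind1": "ACGTACGT",
        "ind2": "ACGTACGT",
        "ind3": "ACCTACGA",
    }
    for position, alleles in find_polymorphic_sites(alignment):
        print(position, alleles)
```

Every variable column is reported as a candidate here; with low-redundancy data some of these would be sequencing errors rather than true SNP.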
In silico methods of SNP and SSR discovery are now being adopted, providing cheap and efficient methods for marker identification (Barker et al., 2003; Batley et al., 2003; Robinson et al., 2004; Jewell et al., 2006; Duran et al., 2009a,c). Large quantities of sequence data are being generated by the latest second generation sequencing technologies and these provide a valuable resource for the mining of molecular markers (Imelfort et al., 2009). Although the large volume of next generation sequencing data is generally produced at the expense of sequence quality, oversampling of the genome enables true SNP to be distinguished from sequence errors. Whole genome sequencing is the most robust method for identifying the full range of genetic diversity in a population and for gaining a greater understanding of the relationship between the inherited genome and observed heritable traits. The continued rapid advances in genome sequencing technology will likely lead to whole genome sequencing becoming the standard method for genetic polymorphism discovery.
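The way oversampling separates true SNP from sequence errors can be sketched with a simple redundancy filter: a position is only called polymorphic if enough reads cover it and the second most common base is seen in more than one read. The function name and both thresholds below are illustrative assumptions rather than the settings of any published pipeline, which typically also weight base quality scores.

```python
from collections import Counter


def call_snp(column, min_depth=8, min_minor_reads=2):
    """Call a candidate SNP from the read bases covering one position.

    column: string of bases observed at the position, one per read.
    min_depth, min_minor_reads: hypothetical thresholds; a single discordant
    read is treated as a probable sequencing error.
    Returns (major_base, minor_base) or None if no SNP is called.
    """
    counts = Counter(base for base in column.upper() if base in "ACGT")
    if sum(counts.values()) < min_depth:
        return None  # too little redundancy to judge
    ranked = counts.most_common(2)
    if len(ranked) < 2 or ranked[1][1] < min_minor_reads:
        return None  # monomorphic, or minor base seen too rarely
    return ranked[0][0], ranked[1][0]


if __name__ == "__main__":
    print(call_snp("AAAAAAAAAC"))  # one discordant read
    print(call_snp("AAAAAACCCC"))  # minor base supported by four reads
    print(call_snp("AACC"))        # depth below threshold
```

Raising the thresholds reduces false positives at the cost of missing SNP in regions of low coverage.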
One of the first applications of next generation sequencing in plants identified over 36 000 putative SNP between the B73 and Mo17 inbred maize lines, using sets of 260 000 and 280 000 EST sequenced on the Roche GS20 (Barbazuk et al., 2007). Stringent post-processing reduced this number to >7000 putative SNP, and over 85% (94/110) of a sample of these putative SNP were successfully validated by Sanger sequencing. Based on this validation rate, this pilot experiment conservatively identified >4900 valid SNP within >2400 maize genes, demonstrating the suitability and potential of the approach.
Recently, Roche 454 technology was used to sequence and assemble 148 Mbp of expressed sequences (EST) for Eucalyptus grandis (Novaes et al., 2008). The EST were generated from a normalized cDNA pool comprising multiple tissues and genotypes. Alignment of sequencing reads from multiple genotypes predicted 23 742 SNP, 83% of which were validated. Analysis of the draft consensus sequence for the grape genome identified 1.7 million SNP which were mapped to chromosomes (Velasco et al., 2007). With the continual increase in read length and the introduction of read pairs for Roche 454 sequencing, similar future projects may be undertaken using next generation sequencing without the expense of Sanger sequencing.
SNP discovery from next generation sequencing is not limited to long read technology. The relatively high level of DNA methylation in repetitive regions of the genome has been used to enrich for and sequence the gene-rich regions of several genomes. Deschamps et al. (2008) have taken this method further and demonstrated SNP discovery from a subset of maize genomic sequences selected using a methylation sensitive restriction endonuclease. The approach involves digestion of whole genome DNA with a methyl-sensitive restriction enzyme and the 4 bp cutter DpnII, followed by selective enrichment and unilateral end sequencing of the digested fragments to generate unique 16 bp tag sequences immediately flanked by a 4 bp DpnII signature sequence. In a preliminary experiment to demonstrate the utility of the procedure, several million tags were produced, corresponding to 292 948 and 166 455 unique high-quality 16 bp tag sequences for the two maize inbred lines B73 and GA209, respectively. Alignment of the B73 tag sequences to assembled Zea mays contigs from the TIGR maize database produced 103 776 B73 tag sequences that perfectly matched the contigs, and these were aligned with the 166 455 GA209 tag sequences. This led to the identification of numerous putative SNP, of which a significant proportion were successfully validated by Sanger sequencing. These results demonstrate that even the relatively short reads generated by the Illumina Genome Analyzer can be used for SNP discovery in complex crop genomes.
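A tag comparison of this kind can be sketched in Python as follows. This is not the published pipeline: it assumes, for illustration, that each read begins with the DpnII recognition sequence GATC followed directly by the 16 bp tag, and it treats any pair of tags from the two lines that differ at exactly one position as a putative SNP. The function names and example reads are hypothetical.

```python
DPNII_SITE = "GATC"  # DpnII recognition sequence
TAG_LENGTH = 16


def extract_tag(read):
    """Return the 16 bp tag following a leading GATC, or None.

    Assumes (for illustration) that the read starts with the DpnII signature.
    Tags containing anything other than A, C, G or T are discarded.
    """
    read = read.upper()
    end = len(DPNII_SITE) + TAG_LENGTH
    if read.startswith(DPNII_SITE) and len(read) >= end:
        tag = read[len(DPNII_SITE):end]
        if set(tag) <= set("ACGT"):
            return tag
    return None


def putative_snp_tags(tags_a, tags_b):
    """Find tag pairs (one per line) that differ at exactly one position.

    Each tag from line A is indexed once per position with that base removed,
    so pairs are found without comparing every tag against every other tag.
    Returns sorted (tag_a, tag_b, 1-based position) tuples.
    """
    index = {}
    for tag in set(tags_a):
        for i in range(len(tag)):
            index.setdefault((i, tag[:i] + tag[i + 1:]), []).append(tag)
    pairs = set()
    for tag in set(tags_b):
        for i in range(len(tag)):
            for other in index.get((i, tag[:i] + tag[i + 1:]), []):
                if other != tag:
                    pairs.add((other, tag, i + 1))
    return sorted(pairs)


if __name__ == "__main__":
    line_a = [extract_tag("GATCACGTACGTACGTACGT")]
    line_b = [extract_tag("GATCACGTACGTACGTACGA")]
    print(putative_snp_tags(line_a, line_b))
```

A single mismatch between two short tags may also reflect paralogous sequences or sequencing error, which is why putative SNP from such a comparison still require validation.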
Second generation sequencing has been applied to study methylation of the Arabidopsis genome (Lister et al., 2008), and a greater understanding of the epigenetic modification of genomes and the impact of such modification on gene expression is likely to have outcomes for crop improvement approaches.
Simple sequence repeats have also been identified from second generation sequence data using the Roche 454 technology (Abdelkrim et al., 2009; Allentoft et al., 2009; Santana et al., 2009). These markers are predominantly applied for the study of non-model organisms. The SSR can be used along with SNP in MAS or for comparative genomics to bring useful traits from wild relatives.
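Mining SSR from sequence reads or contigs amounts to searching for short motifs repeated in tandem. The sketch below uses a regular expression with a backreference; the function name, the motif length range (2 to 6 bp) and the minimum of five repeats are illustrative assumptions and not the criteria used in the cited studies.

```python
import re


def find_ssrs(sequence, min_motif=2, max_motif=6, min_repeats=5):
    """Locate perfect tandem repeats in a DNA sequence.

    Returns a list of (1-based start, motif, repeat count). The motif is
    matched lazily so the shortest repeating unit is reported, and
    homopolymer runs (e.g. 'AA' repeated) are skipped.
    """
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_motif, max_motif, min_repeats - 1)
    )
    ssrs = []
    for match in pattern.finditer(sequence.upper()):
        motif = match.group(1)
        if len(set(motif)) == 1:
            continue  # mononucleotide run reported as a longer motif
        repeats = len(match.group(0)) // len(motif)
        ssrs.append((match.start() + 1, motif, repeats))
    return ssrs


if __name__ == "__main__":
    print(find_ssrs("GGACACACACACTT"))
    print(find_ssrs("TGAAGAAGAAGAAGAA"))
```

Only perfect repeats are detected; imperfect or compound SSR would need a more tolerant search, and primer design requires enough flanking sequence on both sides of each repeat.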