An understanding of the nature and extent of nucleotide sequence variation is required for programmes of discovery and characterization of single nucleotide polymorphisms (SNPs), which provide the most versatile class of molecular genetic marker. A majority of higher plant species are polyploids, and allopolyploidy, arising from hybrid formation between closely related taxa, is very common. Mutational variation may arise both between allelic (homologous) sequences within individual subgenomes and between homoeologous sequences among subgenomes, in addition to paralogous variation between duplicated gene copies. Successful SNP validation in allopolyploids depends on differentiation of the sequence variation classes. A number of biological factors influence the feasibility of discrimination, including degree of gene family complexity, inbreeding or outbreeding reproductive habit, and the level of knowledge concerning progenitor diploid species. In addition, developments in high-throughput DNA sequencing and associated computational analysis provide general solutions for the genetic analysis of allopolyploids. These issues are explored in the context of experience from a range of allopolyploid species, representing grain (wheat and canola), forage (pasture legumes and grasses), and horticultural (strawberry) crops. Following SNP discovery, detection in routine genotyping applications also presents challenges for allopolyploids. Strategies based on either design of subgenome-specific SNP assays through homoeolocus-targeted polymerase chain reaction (PCR) amplification, or detection of incremental changes in nucleotide variant dosage, are described.
Although a multiplicity of sequence polymorphism detection assays have been used for genetic analysis in plant species, single-nucleotide polymorphisms (SNPs) provide the ideal genotyping system. SNPs represent the most fundamental and abundant form of genomic sequence variation and are suitable for highly multiplexed automated analysis (Rafalski, 2002a). SNP genotyping based on candidate gene- or region-specific discovery has supported current applications in genetic mapping and marker-assisted selection of specific agronomic loci. However, high-density map construction, genome-wide association studies (Rafalski, 2002b) and genomic selection (Meuwissen et al., 2001) require large-scale SNP discovery activities to provide whole-genome coverage. In practice, such approaches have only been implemented for a small number of economically important species, although this situation is rapidly changing because of the proliferation of second-generation sequencing technologies (Metzker, 2010), so enabling faster SNP discovery rates.
Methodologies for discovery and validation of predicted SNP variants have been optimized for both inbreeding and outbreeding diploid plant species (Cogan et al., 2006; Chagné et al., 2007; Edwards et al., 2007). However, a large number of important crops are allopolyploids, for which sequence variation between subgenomes co-exists with allelic variation within subgenomes. The higher levels of genome complexity present in allopolyploid crop plants hence pose major additional challenges for SNP prediction. Understanding of a number of biological factors is of key importance for the resolution of this complexity. In parallel, advances in sequence generation technology and associated computational methods provide general solutions for genetic analysis of allopolyploids. This review describes those factors and approaches, and their influence to date on the efficiency of SNP discovery and subsequent detection in several representative species.
Polyploidy in plant species
Polyploid species contain chromosomal complements in excess of the singular (haploid) or double (diploid) states characteristic of many organisms. There are two conditions of polyploidy. Autopolyploidy is generally regarded as being due to failures of homologous chromosome disjunction during meiotic prophase I, either within a single taxon, or in an F1 hybrid between two related species with identical genomes, leading to reduplication of homologous chromosomes (Thompson and Lumaret, 1992). Allopolyploidy, on the other hand, results from the intergeneric or interspecific hybridization of related species with similar but not identical (homoeologous) chromosomes. The majority of allopolyploid taxa possess control mechanisms which prevent meiotic pairing between homoeologous chromosomes, leading to diploidized or disomic inheritance mechanisms (Riley and Chapman, 1958; Sears, 1976; Jenczewski et al., 2003).
Polyploidy is extremely common in plant lineages, accounting for c. 70% of living angiosperm species (Lewis, 1980; Leitch and Bennett, 1997). In addition, many species [such as apple (Malus domestica Borkh.) and maize (Zea mays L.)] show evidence of ancient polyploidy (palaeoploidy) in their lineages, but chromosomal convergence during relatively recent evolution has resulted in highly duplicated contemporary diploid genomes (Gaut, 2001; Evans and Campbell, 2002). The high levels of intragenomic gene duplication that are observed in all angiosperm lineages are consistent with multiple whole-genome duplication events, followed by selective gene loss and neofunctionalization (Adams and Wendel, 2005).
Contemporary polyploid crop plant species include autotetraploids (4×) such as potato (Solanum tuberosum L.), alfalfa (Medicago sativa L.) and cassava (Manihot esculenta Crantz), as well as allopolyploids at various levels such as: durum wheat (Triticum durum Desf.), cotton (Gossypium hirsutum L.), canola (Brassica napus L.), white clover (Trifolium repens L.), tobacco (Nicotiana tabacum L.) and peanut (Arachis hypogaea L.) (all allotetraploid: 4×); bread wheat (Triticum aestivum L.), oat (Avena sativa L.), tall fescue (Festuca arundinacea Schreb. syn. Lolium arundinaceum) and kiwifruit (Actinidia deliciosa C.F.Liang and A.R.Ferguson) (all allohexaploid: 6×); and dessert strawberry (Fragaria × ananassa Duch.) (allooctoploid: 8×). Higher levels of ploidy have been observed for various noncultivated species [for instance, up to dodecaploid (12×) in the Festuca genus], while sugarcane (Saccharum spp.) is a complex of heteroploid genotypes derived from interspecific hybridizations between S. officinarum and S. spontaneum (D’hont et al., 1996).
Categories of single-nucleotide sequence variation
The fundamental process of SNP discovery involves resequencing of target DNA regions from contrasted genotypes, followed by sequence alignment and interpretation (Edwards et al., 2007). For diploids and autopolyploids, the major factor influencing efficiency of discovery is discrimination against paralogous sequences, as well as increasing sampling depth to ensure recovery of all allelic variants. The process is more complicated for allopolyploids, as shown for the minimal (allotetraploid) condition in Figure 1. Homologous single-nucleotide sequence variants (i.e. SNPs) arise between identical chromosome pairs of either each subgenome, or the corresponding diploid progenitors. Homoeologous sequence variants (HSVs) are defined as arising between corresponding nucleotide coordinates in genes that are duplicated between subgenomes (Somers et al., 2003). Paralogous sequence variants (PSVs) arise between duplicated genes that have diverged from a common ancestral sequence and co-exist within contemporary single genomes (Fredman et al., 2004). Strictly speaking, HSVs represent a subset of all possible PSVs, but are sufficiently distinct to be categorized separately. PSVs are commonly defined as occurring within genomes or individual subgenomes [designated PSV type 1 (PSV1) in Figure 1]. However, PSVs may also be predicted between complementary duplicated genes across subgenomes in an allopolyploid [designated PSV type 2 (PSV2) in Figure 1]. Finally, by analogy with HSVs and PSVs, orthologous sequence variants (OSVs) may be predicted between corresponding genes in related diploid taxa, or between allopolyploid species and related contemporary diploid genera. These simplified relationships assume symmetrical patterns of genome evolution between subgenomes and diploid progenitors. In practice, differential rates of gene duplication and selective gene loss in individual lineages may complicate inferred sequence relationships. 
In general, the frequency of different categories would be expected to increase from SNP to HSV to PSV. However, processes such as gene conversion may potentially homogenize nonhomologous sequences and hence bias estimates of nucleotide sequence variation (Wendel, 2000).
Biological factors affecting efficiency of sequence variant discrimination in allopolyploids
Level of intragenomic duplication
As observed for diploids and autopolyploids, high levels of sequence duplication leading to complex gene family structures will create logistical problems for discrimination of the various sequence variation categories, as well as providing additional opportunities for nonlinear sequence evolution events. However, the presence of both type 1 and 2 PSVs in allopolyploids may generate even higher levels of complexity during sequence comparison. Gene classes that are prone to moderate or high copy number, such as histones, chalcone synthases (Ito et al., 1997; Shimizu et al., 1999; Goto-Yamamoto et al., 2002) and chlorophyll a/b binding proteins (Mackerness et al., 1998; Hofmann et al., 1999) would be predicted to be subject to such effects. The antiquity of intragenomic duplication events will also influence the comparison process, such that more recently generated paralogue pairs would be expected to display lower PSV frequencies than those that occurred more distantly in evolutionary time.
Reproductive breeding habit
Allopolyploid species may either be predominantly or exclusively inbreeding (autogamous) or outbreeding (allogamous), or exhibit facultative behaviour. SNP discovery through comparison between sequences derived from fully inbred (homozygous) genotypes is relatively simple, as by definition, any intragenotypic sequence variants must be HSVs or PSVs. In outbred (multiply heterozygous) genotypes, SNPs are present within and between genotypes, and may potentially coincide and be confused with HSVs and PSVs. Many facultative outbreeding species may undergo limited inbreeding to derive genotypes with reduced heterozygosity. However, similar approaches with obligate outbreeders result in very high levels of inbreeding depression and infertility. Artificial homozygous genotypes such as doubled haploids (DHs) may also be generated for outbreeders, using androgenic tissue culture techniques (Tuvesson et al., 2007). However, such methods are not technically feasible for many species.
The level of knowledge concerning the phylogenetic relationships between contemporary subgenomes and the genomes of the extant diploid taxa that are most closely related to putative progenitors varies among allopolyploid species. Such information may assist discrimination of nucleotide variant types through the ability to compare allopolyploid-derived sequence haplotypes with those obtained from diploid counterparts, permitting selective subtraction of homoeologous components. In addition, OSV frequency (and hence the effectiveness of comparison) will be influenced by the antiquity of the polyploidization event. For instance, allotetraploid cotton (Gossypium) species (AADD) such as G. hirsutum L. and G. barbadense L. arose c. 1.5 million years ago from subgenome donors most closely related to the contemporary taxa G. arboreum L. and G. herbaceum L. (A) and G. raimondii Ulbr. (D) (Senchina et al., 2003). In such cases, long-term evolutionary separation of the diverged lineages could potentially limit the value of comparative data. In contrast, the origin of allohexaploid wheats during human pre-history suggests that progenitor comparisons would be highly effective for assignment of subgenome-specific variants.
Technological and computational tools for SNP discovery in allopolyploids
Generation of sequence resources
To perform genetic analysis of allopolyploids, dideoxynucleotide chain termination (Sanger) sequencing technology may be appropriate for some applications. For instance, molecular phylogenetics studies (e.g. Hand et al., 2010b) are often based on data derived from resequencing of a small number of genes from both nuclear and organellar genomes. SNP identification for a particular candidate gene, in order to develop single-locus diagnostic markers, is also suitable for low-throughput sequencing technology. However, the inadequate capacity of Sanger sequencing to generate large sequence data sets at reasonable cost limits applicability. The relevant sequence generation technologies will generally be operational on second-generation platforms, which are capable of generating >10^6 reads for each template sample. These methods include massively parallel pyrosequencing (Margulies et al., 2005) on the Roche GS FLX platform, reiterative dinucleotide ligation-cleavage reactions such as the ABI SOLiD system (http://www.appliedbiosystems.com), and reversible dye-terminator solid-phase (Solexa) sequencing on the Illumina (http://www.illumina.com) GAII system. The capabilities and efficiencies of each platform, along with progress towards developing third-generation sequencing technologies, have been comprehensively reviewed (Deschamps and Campbell, 2010) in the context of SNP discovery from plant genomes of varying size and complexity.
Large-scale DNA sequence resources from a target allopolyploid taxon may be derived from sampling of either the transcriptome or the whole genome. Sequencing of tissue-specific mRNA populations as cDNAs to obtain expressed sequence tags (ESTs) is an efficient method for specifically accessing genic regions, especially those of large, complex genomes. However, discrimination between nonhomologous sequences in sequence alignments involving ESTs is anticipated to be more challenging than for genomic sequences. The latter case provides a capacity to identify not only transcribed (and hence highly conserved) regions, but also introns and nontranscribed sequences that are less likely to be constrained by selective pressures. Such genomic sequence data sets may be generated through techniques such as pooling of multiple PCR amplicons, or various complexity reduction methods such as high C0t selection (Peterson et al., 2002) and methylfiltration (Whitelaw et al., 2003), but increasingly will be produced directly by whole genome sequencing (WGS).
Methods for computational analysis
Given generation of a candidate data set from an allopolyploid based on the preceding considerations, the first obvious computational step is to attempt the elimination of paralogous sequence comparisons. As previously described, the level and relative antiquity of intragenomic evolutionary duplication events will influence the level of feasibility for this task. One viable approach is to exploit information from related model species, providing a unigene set for sorting of paralogues by BLAST alignment, assuming that ancestral gene divergence pre-dates separation of the target species and the relevant model. More recent duplications in evolutionary terms will not be as readily accessible, but comparisons of this nature could provide heuristic rules based on observations of nucleotide diversity values between known paralogous pairs (such as high incidences of codon third base variation) that may be used to inform new comparisons. For the species described in this review, comparisons with model Poaceae [rice, Brachypodium distachyon (L.) P. Beauv.], Brassicaceae [Arabidopsis thaliana (L.) Heynh.], Fabaceae (Medicago truncatula Gaertn., Lotus japonicus L.) and Rosaceae (Prunus persica L. Batsch., Fragaria vesca L.) species would be appropriate.
Alignment under conditions of low stringency using appropriate alignment parameters, such as lower threshold for minimum match percentage and minimum overlap, will lead to coalescence of homoeologous and homologous sequences. In the event of sequences being generated from a fully homozygous genotype, haplotype discrimination should be trivial, as all observed sequence variants must by definition be HSVs. In practice, residual heterozygosity within inbreeders and high levels of allelic variation in outbreeders will frequently complicate comparisons, although inclusion of intron sequence, as indicated for genomic DNA-derived resources, may permit immediate identification of subgenome-specific haplotypes. In the absence of such diagnostic features, two substrategies may be envisaged. In those instances for which equivalent DNA sequences from well-characterized progenitor taxa are available, haplotype inclusion will assist subgenome resolution. For ploidy level x, [(x/2)−1] progenitor comparisons will be required for complete decomposition of the contig. An allotetraploid (4×), for instance, would require data from a single progenitor, but for an allooctoploid (8×), three comparator taxa are necessary. The implications of these calculations are described later in this review for white clover and strawberry, respectively.
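The progenitor-count rule stated above, [(x/2)−1] comparisons for ploidy level x, can be checked with a few lines of Python. This is only an illustrative sketch: the function name progenitor_comparisons is hypothetical, and the allohexaploid value is simply what the formula gives rather than a figure quoted in the text.

```python
def progenitor_comparisons(ploidy: int) -> int:
    """Progenitor taxa needed to decompose a contig fully: (x / 2) - 1.

    Assumes an even ploidy level of at least 4, i.e. an allopolyploid
    composed of x / 2 diploid subgenomes.
    """
    if ploidy < 4 or ploidy % 2 != 0:
        raise ValueError("expected an even ploidy level of 4 or more")
    return ploidy // 2 - 1


# Worked values for the ploidy levels discussed in this review.
for label, x in [("allotetraploid", 4), ("allohexaploid", 6), ("allooctoploid", 8)]:
    print(f"{label} ({x}x): {progenitor_comparisons(x)} progenitor comparison(s)")
```

The rule reflects the fact that once all but one subgenome have been matched to a progenitor haplotype, the remaining haplotype may be attributed to the last subgenome by subtraction.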
In the absence of sufficient information to support the progenitor comparison approach, an approach based on progressively increased stringency criteria, exploiting the relatively higher frequency of HSVs than SNPs, will be required. The outcomes of such an approach are shown for a hypothetical outbreeding allotetraploid taxon in Figure 2, in which an increase in stringency level (minimum match percentage) from 75% to 95% is capable of resolution between two closely related subgenomes. This method, while not always ideal, is also capable of generating heuristic rules, provided that subsequent SNP validation studies are capable of calculating average incidences for each type of sequence variant. For species with high HSV frequency (possibly related to significant evolutionary divergence between subgenome donors) but low intraspecific SNP diversity, only moderate levels of stringency may be necessary for contig decomposition. In contrast, a combination of high allelic diversity and relatively lower divergence between subgenomes, particularly if HSVs and SNPs are frequently coincident, will dictate the use of much finer discrimination based on different stringency conditions.
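The effect of raising the minimum match percentage can be illustrated with a toy example. The sketch below is not the assembly software used in any of the cited studies: it assumes pre-aligned reads of equal length, uses simple single-linkage grouping, and the four 20-bp sequences (two alleles for each of two hypothetical subgenomes, A and B) are invented so that HSVs are more frequent than SNPs.

```python
from itertools import combinations


def percent_identity(a: str, b: str) -> float:
    """Percentage of matching positions between two pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_reads(reads: dict, min_match_pct: float) -> list:
    """Single-linkage grouping: reads join a contig if any pair meets the threshold."""
    parent = {name: name for name in reads}

    def find(name):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(reads, 2):
        if percent_identity(reads[a], reads[b]) >= min_match_pct:
            parent[find(a)] = find(b)

    groups = {}
    for name in reads:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in groups.values())


# Hypothetical data: subgenomes differ at three HSV positions; alleles
# within each subgenome differ at a single SNP position.
reads = {
    "A1": "ACGTACGTACGTACGTACGT",
    "A2": "ACGTGCGTACGTACGTACGT",
    "B1": "ATGTACGTATGTACGTATGT",
    "B2": "ATGTACGTATGTACATATGT",
}

print(cluster_reads(reads, 75))  # [['A1', 'A2', 'B1', 'B2']]
print(cluster_reads(reads, 95))  # [['A1', 'A2'], ['B1', 'B2']]
```

At 75% all four haplotypes coalesce into one contig, whereas at 95% the contig decomposes into two subgenome-specific groups, within which the remaining differences are candidate SNPs. Real data would need to account for sequencing error and for coincident HSV and SNP positions, which this sketch ignores.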
Status of SNP discovery in allopolyploid crops
The polyploid wheats are natural inbreeders, with large basic genome sizes and high levels of gene duplication that are characteristic of the Poaceae family. Evolutionary origins from diploid progenitors are well understood, revealing a two-step process. First, Triticum urartu Tumanian ex Gandilyan (the A subgenome donor) hybridized with Aegilops speltoides Tausch (the B subgenome donor) to generate the tetraploid T. turgidum L. (genome constitution = AABB; 2n = 4x = 28), which was the ancestor of the contemporary durum wheats. Second, T. turgidum hybridized with Ae. tauschii Coss. (the D subgenome donor), resulting in the evolution of contemporary hexaploid bread wheats (genome constitution = AABBDD; 2n = 6x = 42) (Kihara, 1944; McFadden and Sears, 1946; Sarkar and Stebbins, 1956; Dvorak and Zhang, 1990; Dvorak et al., 1993). Sequence comparisons between modern polyploids and the diploid progenitors are consequently effective for discrimination between HSVs and SNPs that arise among cultivars or breeding lines.
Bread wheat possesses a collection of more than 1 million ESTs, derived from at least 42 cDNA libraries obtained from different stages of plant development and from imposition of different biotic and abiotic stress conditions (for recent review, see Francki, 2010). These sequences are available in the public domain through GenBank (http://ncbi.nlm.nih.gov/), and the collection provides an extensive resource for exploitation of the expressed genome portion for SNP discovery. Database mining has been combined with bioinformatic analysis to align wheat EST sequence into contigs and allow sequence variant discrimination (Somers et al., 2003; Rustgi et al., 2009). A process for BLASTN comparison of hypothetical ESTs in wheat, discrimination of HSVs between the three subgenomes and identification of SNPs between cultivars is summarized in Figure 3. In addition to progenitor comparisons, hexaploid wheat possesses a unique set of aneuploid genotypes such as nullisomic-tetrasomic lines (Sears, 1954) which permit one-step assignment of polymorphic loci to homoeologous chromosomes. Potential HSVs may hence be allocated to specific chromosomes through development of subgenome-specific PCR assays. Further analysis of subcontigs containing ESTs derived from different wheat varieties leads to identification of cultivar-specific SNPs, which when used in combination with subgenome-specific sequence variation (based on HSV diversity) may be used to develop locus-specific bi-allelic genetic markers (Figure 2).
HSV incidence in the wheat genome based on EST-derived data can vary from 1 in 24 bp (Somers et al., 2003) to 1 in 136 bp (Rustgi et al., 2009), while locus-specific SNP frequency can vary between 1 in 274 bp (Rustgi et al., 2009) and 1 in 540 bp (Somers et al., 2003). Discrepancies between studies probably reflect differences of EST sample size. Moreover, HSV and SNP frequency may vary across different regions of the wheat genome. In particular, loci with elevated homologous and homoeologous sequence variation may indicate the effects of diversifying selection for plant adaptation traits, while lower levels of variation are expected when mutational effects are likely to be deleterious.
Exploitation of randomly selected ESTs from genomic databases has provided a solid foundation for SNP discovery and genetic marker development, with potential for high-density genome coverage. However, targeted approaches to develop functionally associated markers are preferred in some instances, in order to saturate particular regions of the wheat genome, or to identify candidate genes controlling trait variation at a specific locus. A number of studies have focused on the development of SNP markers for particular genes associated with specific traits important for wheat improvement. Examples include Dehydration-Responsive Element Binding (DREB) protein genes involved in abiotic stress tolerance (Wei et al., 2009); starch biosynthesis genes, including ADP-glucose pyrophosphorylase (Agp-L), sucrose transporter (SUT), waxy (Wx) and starch synthase I (SSI), leading to measurable changes in grain yield (Blake et al., 2004); genes for the low-molecular-weight glutenin subunits (LMW-GS) Glu-B3 and Glu-D3, associated with grain-quality characteristics (Zhao et al., 2007); and the phytoene synthase (Psy1) gene involved in xanthophyll accumulation and controlling flour colour variation for wheat-quality improvement (Crawford et al., 2011).
Identification of SNPs for specific genes provides the basis for incremental correlation of allele function with specific trait variation. However, analysis of genome-wide SNP distribution would significantly enhance large-scale efforts to identify all loci controlling multiple traits for crop improvement. Current efforts to fully sequence the bread wheat genome (http://www.wheatgenome.org/) will eventually support systematic SNP discovery activities, but at present, alternative strategies are required. Comparative genomics provides opportunities based on annotated genes from a whole-genome reference sequence of closely related Poaceae species such as rice (Oryza sativa L.) or Brachypodium distachyon (L.) P. Beauv., followed by translation to identify ortholoci in wheat (for review, see Francki and Appels, 2007). PCR primer design for SNP discovery may be performed based on gene sequence templates from the model species, particularly through targeting of regions within exons and near to intron splice junctions, which are often conserved across large phylogenetic distances. Such conservation has been demonstrated for invertase and fructosyltransferase genes of rice, wheat and perennial ryegrass (Francki et al., 2006) and for sodium ion transport genes of wheat, rice and Arabidopsis thaliana (Mullan et al., 2007). Resequencing of wheat intron-located sequences is an attractive option, because of the presence of generally higher nucleotide diversity than in exons. The development of a comprehensive suite of conserved orthologous set (COS) sequences for cereal genomics (Quaraishi et al., 2009) provides support for such activities. Judicious use of comparative methods, in concert with intron-derived sequences, also provides a method to discriminate against comparisons of duplicated genes and hence, PSV detection.
A recent study (You et al., 2009) developed a web interface-based bioinformatic tool (ConservedPrimers 2.0) to enable alignment between wheat ESTs and model species genome sequences for comparative analysis of intron–exon junctions and to establish parameter settings for conserved primer design (http://wheat.pw.usda.gov/demos/ConservedPrimers). A total of 4003 of 6045 candidate ESTs (66%) obtained matches with rice genome pseudomolecules, identifying 1922 unique collinear exon blocks from which to design and evaluate PCR assays for SNP discovery. A total of 1527 loci were identified as containing HSVs for design of subgenome-specific SNP assays. New bioinformatics tools of this kind will not only be useful for wheat SNP discovery prior to availability of the whole-genome sequence, but will also permit further exploitation of translational genomics.
In contrast to the situation for other crop (e.g. rice and maize) and model (e.g. A. thaliana) plant species, the number of wheat SNPs that are currently publicly available is low, but will increase in the near future when large-scale discovery projects come to fruition. A USA-based National Science Foundation (NSF) funded project for wheat SNP development provides an example of recently developed public domain resources. The Wheat SNP Database (http://probes.pw.usda.gov:8080/snpworld/Search) contains information on over 17 000 forward and reverse amplification primers, 11 200 polymorphic loci and 2200 polymorphic sequence tag sites (STS) that have been attributed to diploid progenitors. Further discovery efforts may exploit resequencing of gene-space enriched genomic fractions, such as by transcriptome sequencing using second-generation sequencing technologies (Barbazuk et al., 2007). Much larger SNP discovery outcomes will be achieved through resequencing of primary and secondary wheat gene pools based on a template provided by the completed whole genome, similar to the method used for the identification of c. 400 000 rice SNPs and indels through genome alignments from the indica and japonica subspecies (Feltus et al., 2004). Second-generation sequencing technologies provide computational tools for base-calling and postprocessing data analysis for comparison with reference sequence and variant detection that are suitable for cost-effective large-scale sequencing projects (Deschamps and Campbell, 2010), and offer the opportunity to substantially accelerate SNP discovery in wheat.
Cultivated Brassica species include a number of vegetable and oilseed crops exhibiting a broad spectrum of adaptation to cultivation under diverse climatic conditions (Choi et al., 2007). Three of the six widely cultivated species are diploids [B. rapa L. (AA; 2n = 2x = 20), B. nigra L. (BB; 2n = 2x = 16) and B. oleracea L. (CC; 2n = 2x = 18)], while the remaining species [B. juncea L. Czern. (AABB; 2n = 4x = 36), B. napus (AACC; 2n = 4x = 38) and B. carinata A. Braun (BBCC, 2n = 4x = 34)] are allotetraploids derived from pair-wise hybridizations between each diploid (U, 1935). The most economically important crop in this group is Brassica napus, represented by a root vegetable morphotype (swede) and an oilseed morphotype (oilseed rape or canola). The evolutionary origin of canola based on identified progenitors is hence well understood and is thought to have occurred relatively recently in historical time (Allender and King, 2010).
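The pair-wise relationships described above (the 'triangle of U') imply that each allotetraploid chromosome number is the sum of the two contributing diploid numbers. The short Python sketch below checks this arithmetic using only the genome constitutions and 2n values quoted in the text; the dictionary and function names are illustrative choices.

```python
# Diploid genomes and their somatic chromosome numbers (2n), as quoted above.
DIPLOID_2N = {
    "A": ("B. rapa", 20),
    "B": ("B. nigra", 16),
    "C": ("B. oleracea", 18),
}

# Allotetraploids, their genome constitutions and reported 2n values.
ALLOTETRAPLOIDS = {
    "B. juncea": ("AB", 36),
    "B. napus": ("AC", 38),
    "B. carinata": ("BC", 34),
}


def expected_2n(subgenomes: str) -> int:
    """Sum the diploid 2n values of the contributing subgenomes."""
    return sum(DIPLOID_2N[g][1] for g in subgenomes)


for species, (subgenomes, reported) in ALLOTETRAPLOIDS.items():
    donors = " + ".join(DIPLOID_2N[g][0] for g in subgenomes)
    status = "consistent" if expected_2n(subgenomes) == reported else "inconsistent"
    print(f"{species}: {donors} -> 2n = {expected_2n(subgenomes)} ({status})")
```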
Canola is a partially outcrossing species, but the majority of fertilization events are based on self-pollination, allowing derivation of inbred genotypes. In addition, microspore culture is routinely used to obtain homozygous DH lines (Takahira et al., 2011). Discrimination between HSVs and SNPs based on resequencing from fully homozygous genotypes, in concert with comparison to sequence haplotypes from the contemporary diploids, should consequently provide a relatively simple system for SNP discovery, analogous to the situation for polyploid wheats. However, residual heterozygosity in partially inbred lines may confound this situation. In addition, extensive evolutionary duplication has occurred within the Brassica lineage. Comparative genome structure analyses with the related model species A. thaliana (Lagercrantz and Lydiate, 1996; Lysak et al., 2005) have provided evidence for genome triplication, and current gene discovery (Hu et al., 2006) and genome sequencing activities (Mun et al., 2010) support this general conclusion. The presence of significant intragenomic sequence paralogy in Brassica polyploids will require rigorous discrimination between PSVs, HSVs and SNPs.
The genetic homogeneity of DH canola lines was exploited for SNP discovery through deep transcriptome sequencing from contrasted genotypes (Tapidor and Ningyou7), which also function as genetic mapping population parents (Trick et al., 2009). Sequences generated using the Illumina GA system were aligned with a set of Brassica unigenes as reference templates, in the absence of fully compiled genome sequence. Polymorphism detection was then performed on the aligned sequences. In this study, polymorphisms between subgenomes were termed ‘inter-homoeologue polymorphisms’ and sequence variants corresponding to HSVs were called ‘hemi-SNPs’. The large majority of all detected single-nucleotide variants (87.5–91.2%, depending on sequencing depth) were HSVs, but were not attributed to the A or C subgenomes. The remaining set of putative SNPs was available for validation by genetic mapping in a DH mapping population derived from the parental cross.
Progenitor comparisons were used more explicitly in a study based on Sanger sequencing of candidate gene-derived amplicons across a varietal panel designed to represent genetic diversity in both rapeseed germplasm and the A and C progenitor genomes (Durstewitz et al., 2010). Following HSV-SNP discrimination, a subset of predicted SNPs was validated using a 1536-plex Illumina GoldenGate™ genotyping assay. In both individual and multiplexed assays, c. 80% of SNPs proved informative in canola mapping populations, and a large number of markers were assigned to genetic map loci. As the 19 chromosome pairs of canola have previously been attributed to two homoeologous sets, SNP assignment to individual linkage groups may be used to retrospectively predict subgenomic origin.
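Retrospective prediction of subgenomic origin from a map position amounts to a simple lookup, sketched below in Python. The sketch assumes that linkage groups are labelled A1 to A10 and C1 to C9, which follows from the haploid chromosome numbers of the A (n = 10) and C (n = 9) genomes; other naming schemes would need their own mapping. The marker names are invented for illustration.

```python
import re

# Assumed linkage group counts: A genome n = 10, C genome n = 9 (19 pairs in total).
GROUPS_PER_SUBGENOME = {"A": 10, "C": 9}


def subgenome_of(linkage_group: str) -> str:
    """Return 'A' or 'C' for a label such as 'A3', 'A03' or 'c9'."""
    match = re.fullmatch(r"([AC])(\d+)", linkage_group.strip().upper())
    if match is None:
        raise ValueError(f"unrecognised linkage group label: {linkage_group!r}")
    subgenome, number = match.group(1), int(match.group(2))
    if not 1 <= number <= GROUPS_PER_SUBGENOME[subgenome]:
        raise ValueError(f"linkage group out of range: {linkage_group!r}")
    return subgenome


# Hypothetical mapped SNP markers and their linkage groups.
mapped_markers = {"snp_0001": "A03", "snp_0002": "C9", "snp_0003": "A10"}
for marker, group in mapped_markers.items():
    print(f"{marker}: linkage group {group} -> subgenome {subgenome_of(group)}")
```

Such an assignment is only as reliable as the genetic map itself, and homoeologous exchanges between the A and C genomes would not be detected by a lookup of this kind.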
More recently, second-generation sequencing studies have been used to generate large-scale sets of putative canola SNPs. Sidebottom et al. (2010) used the Roche GS FLX sequencing platform to target sequence variation in both specific gene families and global 3′-anchored transcript populations. Wiggins et al. (2010) reported in silico SNP prediction based on c. 500 000 canola EST templates, generating a total of 3000 predicted high-confidence SNPs. The same group used restriction site-associated DNA (RAD) marker technology (Baird et al., 2008) in concert with Illumina GAIIx-based paired-end sequencing to identify 108 551 putative SNPs (Tang et al., 2010). Data sets of this kind, based on sequencing of the whole genome or of complexity-reduced representations, would benefit greatly from the capacity to compare with reference genome sequences from the diploid progenitors. The culmination of current WGS efforts for B. rapa and B. oleracea (Ayele et al., 2005; Mun et al., 2010) is hence likely to provide a robust system for one-step SNP identification and in silico assignment to subgenome and chromosome of origin in B. napus.
A broad range of grass and legume species have been domesticated for use in pastoral systems of temperate and tropical regions, supporting grazing industries for milk, meat and fibre production. The majority of temperate pasture grass and legume species are outbreeding, and a number of important taxa have allopolyploid chromosome constitutions (Forster et al., 2008). To date, the most intensively studied of these latter species has been white clover (2n = 4x = 32), a member of the cool-season Galegoid clade in the Papilionoideae subfamily of the legume family Fabaceae. The genomes of all members of this group show evidence for several cycles of large-scale evolutionary duplication. The origins of the two white clover subgenomes have been a matter of frequent speculation. However, a molecular phylogenetics study based on chloroplast (cp) DNA and nuclear ribosomal DNA internal transcribed spacer (nrDNA ITS) data identified the diploid species T. occidentale D.E. Coombe (Western clover) and T. pallescens Schreber as the contemporary taxa that are most closely related to the putative diploid progenitor species (Ellison et al., 2006).
SNP discovery in white clover has so far been based on template sequences from a collection of ESTs that defined 15 989 unigenes (Sawbridge et al., 2003). In the first instance, an in silico discovery strategy (Buetow et al., 1999; Picoult-Newberg et al., 1999) was used for computational identification of predicted SNPs in EST contigs generated from multiple heterogeneous individuals of a single heterogeneous variety, Grasslands Huia (Sawbridge et al., 2003). A sample of 236 EST contigs was selected for validation of predicted polymorphisms through Mendelian transmission tests using the parents and progeny of two-way pseudo-testcross mapping families (based on mating between multiply heterozygous individuals) (Cogan et al., 2007). Intensive analysis confirmed 58 clusters (25%) as containing validated segregating SNPs. Such a high level of attrition was interpreted as arising at least partially (aside from the contribution of sequence errors) from identification of HSVs and PSVs as SNPs.
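The logic of the Mendelian transmission test may be illustrated by a minimal Python sketch. It assumes a putative SNP that is heterozygous in one parent and homozygous in the other, so that 1 : 1 segregation is expected among the progeny of a two-way pseudo-testcross. The function name, the progeny counts and the 5% significance threshold are hypothetical choices for illustration and are not taken from the cited studies. A variant scored as 'heterozygous' in every individual, as expected for an HSV or PSV, fails the test.

```python
# Chi-square critical value for 1 degree of freedom at the 5% level.
CHI2_CRIT_1DF = 3.841

def segregates_1_to_1(n_het: int, n_hom: int) -> bool:
    """Return True if progeny counts are consistent with 1:1 segregation."""
    total = n_het + n_hom
    if total == 0:
        raise ValueError("no progeny scored")
    expected = total / 2
    chi2 = ((n_het - expected) ** 2 + (n_hom - expected) ** 2) / expected
    return chi2 <= CHI2_CRIT_1DF

# Hypothetical counts: a segregating SNP versus a nonallelic variant that
# appears 'heterozygous' in all 92 progeny individuals.
print(segregates_1_to_1(48, 44))  # True  (chi-square = 0.17)
print(segregates_1_to_1(92, 0))   # False (chi-square = 92)
```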
An in vitro SNP discovery process (Zhang and Hewitt, 2003) was hence designed to address these problems. In a preliminary study (Lawless et al., 2009), a total of 43 white clover cDNAs were selected from public databases and from the unigene resource. Amplicons were obtained from the mapping family parental genotypes and were individually cloned and subjected to Sanger sequencing. Substantial haplotypic complexity was observed for the majority of template genes, at levels greater than predicted based on amplification solely from homologous sequences. High rates of failed validation (82%) for predicted SNPs were also observed. As observed for the in silico discovery study, homoeologous gene amplification (Cronn and Wendel, 1998) provides a likely explanation for this effect.
Methods based on progenitor comparison were clearly required to improve the efficiency of in vitro SNP discovery, and to allow discrimination of sequence variants. To exemplify this approach, nine ESTs corresponding to genes associated with tolerance to abiotic stresses (drought and salinity) were selected, including members of the dehydration-responsive element binding (DREB) and late embryogenesis abundant protein (LEA) gene families. A total of 7290 bp of genomic DNA was resequenced from the parental genotypes of the F1(Haifa2 × LCL2) and F1(S1846 × LCL6) genetic mapping families and from single genotypes of both T. occidentale and T. pallescens (Hand et al., 2008). White clover-derived sequence haplotypes were separated into two putative subgenome-specific clusters, one of which closely resembled the sequences from T. occidentale, the other being more distantly related to both T. occidentale and T. pallescens. The two putative subgenomes were hence designated O and P’ respectively. Detailed sequence comparison and phenetic clustering analysis confirmed the divergent status of T. pallescens, and suggested that a third, as yet uncharacterized taxon had contributed the P’ subgenome.
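The principle of progenitor comparison can be sketched in Python as follows. This is a deliberately simplified illustration and not the phenetic clustering procedure of the cited study: each aligned haplotype is assigned to the O cluster if its proportion of mismatched sites relative to a T. occidentale reference falls below a threshold, and to the P’ cluster otherwise. The sequences, the function names and the 2% threshold are all hypothetical.

```python
def p_distance(a: str, b: str) -> float:
    """Proportion of mismatched sites between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def assign_subgenome(haplotype: str, occidentale_ref: str,
                     max_dist: float = 0.02) -> str:
    """Label a haplotype O if it is close to the progenitor reference, else P'."""
    return "O" if p_distance(haplotype, occidentale_ref) <= max_dist else "P'"

# Hypothetical 20-bp aligned sequences.
ref  = "ACGTACGTACGTACGTACGT"   # T. occidentale reference
hap1 = "ACGTACGTACGTACGTACGT"   # identical to the reference
hap2 = "ACGTTCGTACCTACGTACGA"   # three mismatches (p-distance 0.15)

print(assign_subgenome(hap1, ref))  # O
print(assign_subgenome(hap2, ref))  # P'
```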
Resolution of homoeologous sequence clusters permitted assessment of SNP frequency within the subgenomes (1 per 87 bp and 1 per 97 bp, respectively), and an HSV frequency of 1 per 27 bp. A large proportion of SNPs co-located with HSVs when compared across genotypes, and in the majority of instances (88%), the coincident SNP-HSV shared the same nucleotide variant. This result provided a partial explanation for the very low rate of SNP validation observed in the preliminary study. This rate was subsequently improved to levels comparable with those seen in equivalent diploid systems (Cogan et al., 2006). Identification of two distinct copies related to the TrLEAa EST also permitted resolution of PSVs, at a frequency of 1 per 14 bp. Complementary SNPs for both subgenomes were subsequently assigned to homoeologous linkage groups. Identification of subgenome-specific sequence haplotypes also permitted isolation and characterization of homoeologous bacterial artificial chromosome (BAC) clone pairs (Hand et al., 2010a).
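Once haplotypes have been resolved into subgenome clusters, the distinction between SNPs, HSVs and coincident SNP-HSV sites can be expressed as a simple per-site rule. The Python sketch below is an illustration under stated assumptions: the haplotypes are hypothetical, they are taken to be already aligned and assigned to the O and P’ clusters, and the four category labels are a simplification introduced here rather than the scheme used in the cited work.

```python
def classify_site(o_bases: str, p_bases: str) -> str:
    """Classify one aligned position from the bases seen in each subgenome cluster."""
    o_set, p_set = set(o_bases), set(p_bases)
    polymorphic_within = len(o_set) > 1 or len(p_set) > 1
    differs_between = o_set != p_set
    if polymorphic_within and differs_between:
        return "SNP+HSV"   # allelic variation coinciding with a subgenome difference
    if polymorphic_within:
        return "SNP"       # identical allelic variation in both subgenomes
    if differs_between:
        return "HSV"       # fixed difference between subgenomes
    return "invariant"

# Hypothetical aligned haplotypes for each cluster.
o_haps = ["ACGTA", "ACGTG", "ACGAA"]
p_haps = ["ACTTG", "ACTTG", "ACTAG"]

calls = [
    classify_site("".join(h[i] for h in o_haps), "".join(h[i] for h in p_haps))
    for i in range(len(o_haps[0]))
]
print(calls)  # ['invariant', 'invariant', 'HSV', 'SNP', 'SNP+HSV']
```

In this toy alignment the final site illustrates the coincident case, in which one O allele matches the P’ base, so that the site behaves as an HSV in some genotypes and not in others.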
Progenitor comparison provided a highly improved system for in vitro SNP discovery in white clover, at least for specific mapping family sib-ships. However, both SNP-HSV coincidence and the presence of identical nucleotide variation at equivalent positions in both the O and P’ subgenomes (Hand et al., 2008) continue to present problems for discovery in large-scale germplasm collections. The rate of analysis has been accelerated by massively parallel pyrosequencing on the Roche GS FLX platform, through pooling of amplicons from multiple genes (100–150) after unimolecular amplification reactions, followed by sequestration of the template in genotype-specific sectors. Products of the sequencing reaction were also compared to data derived from direct sequencing of an inbred T. occidentale reference sample. Prior knowledge of SNP and HSV frequencies was used in concert with computational sequence assembly to identify variable loci and assess the feasibility of Illumina GoldenGate™ multiplexed SNP genotyping (Cogan et al., 2008). Further implementation of this approach should be capable of generating large-scale SNP resources for the molecular breeding of white clover.
Experience gained with white clover is applicable to other outbreeding allopolyploid pasture species, of which tall fescue is the most important (Easton et al., 1994). The phylogenetics of this taxon is still poorly understood, although restriction fragment length polymorphism (RFLP) studies (Xu and Sleper, 1994) and genomic in situ hybridization (GISH) analysis (Humphreys et al., 1995) suggested that summer-active varieties derived from the Continental morphotype may originate from the tetraploid ancestor F. glaucescens Boiss., which contributed two subgenomes (G1 and G2), and the contemporary diploid meadow fescue (F. pratensis Huds.), which contributed a single subgenome (P). This inference has been supported by amplicon resequencing from cp DNA, nrDNA ITS and protein-coding genes (Hand et al., 2010b). However, an independent origin for the winter-active Mediterranean morphotype from a nonoverlapping set of diploid progenitors was also revealed, suggesting the presence of two distinct taxa and a requirement for taxonomic revision.
Preliminary data from resequencing of multiple pooled amplicons from Continental tall fescue genotypes (M. Hand, unpublished) supported this hypothesis, revealing up to three well-differentiated haplotype groups, one of which closely resembles those obtained from contemporary F. pratensis. The method of progenitor comparison for SNP-HSV discrimination is hence well suited to allohexaploid tall fescue, although identification of the two diploid contributors to the contemporary F. glaucescens genome, which has not yet been achieved, would be ideal for full resolution. Although genomic studies of tall fescue are still rudimentary in comparison with other Poaceae species, rapid progress in SNP discovery may be anticipated, at least for the Continental morphotype.
The Fragaria genus of the Rosaceae family comprises a group of c. 22 wild species with genetic constitutions ranging from diploids (2n = 2x = 14) to decaploids (2n = 10x = 70) (Rousseau-Gueutin et al., 2009). The dessert strawberry is a highly valued cultivated fruit crop which originated very recently (during the mid-18th century) from a chance hybridization between the wild octoploids F. virginiana Mill. and F. chiloensis (L.) Mill. (Folta and Davis, 2006). Dessert strawberry is predominantly outcrossing, but is capable of partial inbreeding by self-pollination.
The subgenome composition of allooctoploid strawberry (2n = 8x = 56) is still unclear. An initial model based on four relatively distinct subgenomes (AABBCCDD) (Federova, 1946) was subsequently modified to a system based on minimal differentiation and partial autopolyploid status (AAA’A’BBBB) (Senanayake and Bringhurst, 1967). Subsequently, a fully allopolyploid model based on two pairs of related subgenomes (AAA’A’BBB’B’) was suggested (Bringhurst, 1990). This model is consistent with the observation of disomic inheritance during construction of molecular genetic marker-based linkage maps (Sargent et al., 2009).
The diploid taxon alpine strawberry, F. vesca L., has long been considered to be the contemporary species most closely related to at least one (probably A) of the predicted subgenomes. A second diploid, F. iinumae, is the other current leading candidate, with F. bucharica and F. mandshurica providing additional possibilities (Folta and Davis, 2006). Recent molecular phylogenetic studies (Rousseau-Gueutin et al., 2009) suggested that F. vesca or F. mandshurica was highly likely to have contributed the maternal chloroplast genome as well as a nuclear subgenome to F. × ananassa, while F. iinumae is closely related to a second subgenome. Although the precise composition of the octoploid genome is still a matter of conjecture, the various current models are all based on the assumption that the subgenomes have very similar sequence structures. In combination with the possibility of SNP-HSV coincidence owing to intragenotypic heterozygosity, sequence variant discrimination in strawberry is likely to represent an extreme example of difficulty, in excess of that observed for white clover. However, one positive factor is the apparent absence of large-scale evolutionary duplications within diploid Fragaria genomes (Shulaev et al., 2011), possibly because of preferential gene loss following genome size reduction. This observation suggests that misattribution of PSVs may be a less significant factor than for the other species described in this review.
To explore the potential problems, a ‘proof-of-concept’ study was designed for SNP discovery by resequencing, based on the method previously used for white clover and tall fescue. A small set (5) of candidate genes related to fruit yield and quality traits (Kaur et al., 2008) were selected from a collection of c. 20 000 ESTs (Keniry et al., 2006). PCR amplicons generated from the heterozygous parents of a two-way pseudo-testcross family were cloned and sequenced. Up to eight distinct sequence haplotypes per genotype for each gene are anticipated in an outbred allooctoploid genetic background. Among the sequenced genes, the number of identified haplotypes per gene varied from three to six (Figure 4). Phenetic analysis of haplotypes derived from an alcohol dehydrogenase (ADH) gene detected one highly distinct (haplotype 1) and several closely related (haplotypes 2/5 and 3/4) variants, which were differentiated both by nucleotide changes and intron-located insertion-deletion (indel) events. Affinities were observed between one of the haplotypes and sequences obtained from contemporary F. vesca. The different haplotype groups may hence derive from separate subgenomes. Similar results have been obtained from analysis of gene-pair haplotypes across a range of octoploid strawberry cultivars (Davis et al., 2008; Tombolato et al., 2008), which further implicated F. iinumae as a potential subgenome donor. However, minimal similarity was observed with F. vesca, raising the possibility of nonuniform genome composition between different germplasm pools.
Although sequence variants may be readily predicted from in vitro resequencing studies of strawberry, the validation rate for predicted SNPs was low in the absence of definitive subgenome discrimination data. A subset of putative SNPs predicted from the Sanger sequencing activity (Kaur et al., 2008) showed a low (9%) validation rate in a Mendelian transmission test, presumably (as for white clover) because of misidentification of and coincidence with HSVs.
Larger-scale data sets for strawberry SNP discovery have been generated through pooled amplicon resequencing (S. Kaur, unpublished) and deep transcriptome sequencing (Clancy et al., 2008). The latter study included both diploid F. vesca and F. iinumae as reference samples. However, effective improvement of the SNP prediction-validation process is likely to require a broader survey of genome structure across the Fragaria genus, including all contemporary diploids. The relatively small genome size of species such as F. vesca (c. 240 Mbp) enhances the feasibility of such an approach (Shulaev et al., 2011). However, there is also a formal possibility that partial autopolyploidy may indeed have played a role in the evolution of the contemporary octoploid, followed by establishment of functional diploidy. In this instance, WGS and assembly for F. × ananassa itself may be required to fully decouple the subgenomic components.
SNP genotyping in allopolyploids
It is apparent from previous sections that considerable efforts are required for accurate SNP discovery in allopolyploid species. Subsequently, efficient methods must be developed for high-throughput SNP genotyping, and implementation of such methods may also be affected by complexities of genome architecture, because of the desirability of deriving information on subgenome-specific allelic variation.
A wide range of methods for high-throughput SNP genotyping have been developed for use with complex genomes, including single-step homogeneous methods such as TaqMan and molecular beacons, pyrosequencing analysis, high-density solid-phase array-based systems, bead-based methods and mass spectrometry-based assays. The relevant technology has been comprehensively reviewed (Tsuchihashi and Dracapoli, 2002), and specific applications for plant genomes have been described (Appleby et al., 2009). Although several genotyping systems have been used for species addressed in this review, such as wheat, canola and white clover, the majority of effort in this area has been directed at polyploid wheats. Results from these studies, which should be broadly applicable to a range of different polyploid species, are used here to exemplify the more general problem.
The ideal method for SNP genotyping in an allopolyploid genetic system is to reduce complexity to the diploid level through design of subgenome-specific assays. In practice, this requires substantial knowledge of HSV incidence, and will be difficult for those species with closely related subgenomes and high levels of SNP-HSV coincidence. Individual loci with a high degree of prior knowledge may be subjected to this form of analysis through the use of homoeolocus-specific PCR amplification reactions. The padlock probe genotyping technology (Nilsson et al., 1994) is also capable of supplying subgenome-specific data. This method relies on circularization of allele- and locus-specific oligonucleotides by DNA ligation in the presence of the target DNA sequence, and subsequent detection through fluorescence of conjugated complementary oligonucleotides on a microarray. The reaction requires identification of both the query SNP and c. 25 bp of perfect homologous sequence on either side of the target locus, which assists discrimination against homoeologous sequences. Applicability of the method for allohexaploid wheat was demonstrated through differentiation of the Rht-B1 and Rht-D1 semi-dwarfing homoeoloci (Edwards et al., 2009). Further analysis was performed with 336 probe pairs, for which prior sequence data suggested the capacity to perform homoeolocus-specific analysis. However, only 64% of the probes produced reproducible genotyping results. Therefore, high-density genome coverage of SNP markers would require considerably higher numbers of synthesized padlock probes than the target number of markers required for any particular genetic or association mapping study. In addition, padlock probe technology relies on an array-based hybridization system for detection. Together with the added expense involved in probe synthesis, this factor could be cost-prohibitive for high-throughput and multiplex SNP genotyping to implement high-density genetic and association mapping or marker-assisted selection in breeding.
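The consequence of the 64% conversion rate for probe synthesis can be made concrete with a short calculation, sketched here in Python. The sketch assumes that the reported success rate applies uniformly to any new set of probes, which may not hold in practice; the function name and the target of 1536 markers are hypothetical and chosen only for illustration.

```python
def probes_needed(target_markers: int, success_percent: int = 64) -> int:
    """Number of probes to synthesize so that the expected number of
    working assays reaches the target (integer ceiling division)."""
    if not 0 < success_percent <= 100:
        raise ValueError("success_percent must be in (0, 100]")
    return -(-target_markers * 100 // success_percent)

# Hypothetical target of 1536 working assays at a 64% success rate.
print(probes_needed(1536))  # 2400
```

Under these assumptions, more than half as many probes again as the target number of markers would need to be synthesized.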
Although design of subgenome-specific assays is desirable, and in many cases feasible, each locus must be assessed individually, and appropriate HSVs may not always be available for discriminatory amplification. In such cases, quantitative genotyping approaches may be appropriate, especially in specific pedigrees for which genomic constitutions for individual genotypes can be predicted. As an example, the hexaploid wheat genotypes CC[A]GG[B]GG[D], CC[A]CG[B]GG[D], CC[A]CC[B]GG[D] and CC[A]CC[B]CG[D] would appear identical to a qualitative nonspecific assay, but present 1 : 2, 1 : 1, 2 : 1 and 5 : 1 ratios of C:G sequence variant content in a quantitative analysis. Genotyping methods such as matrix-assisted laser desorption-ionization time-of-flight mass spectrometry (MALDI-TOF-MS) are capable of delivering information on sequence variant dosage, based on peak height variation and ion mass discrimination determined by the variant base (Ross et al., 2000). The Sequenom MassARRAY™ MALDI-TOF-MS technology has been used for SNP genotyping in the complex heteroploid species, sugar cane (Bundock et al., 2009). Pyrosequencing (Rickert et al., 2002) and quantitative real-time PCR methods (Wilhelm and Pingoud, 2003) are also applicable.
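The dosage ratios quoted for the hexaploid wheat example follow directly from counting variant bases across the three subgenomes, as the following minimal Python sketch shows. The dictionary representation of a genotype and the function name are assumptions made for illustration; the sketch models expected ratios only and does not represent any particular genotyping platform.

```python
from math import gcd

def variant_ratio(genotype: dict, first: str = "C", second: str = "G") -> tuple:
    """Return the reduced first:second dosage ratio summed over all subgenomes."""
    alleles = "".join(genotype.values())
    n_first, n_second = alleles.count(first), alleles.count(second)
    divisor = gcd(n_first, n_second)
    if divisor == 0:
        raise ValueError("neither variant is present")
    return n_first // divisor, n_second // divisor

# The four genotypes from the text, keyed by subgenome (A, B, D).
genotypes = [
    {"A": "CC", "B": "GG", "D": "GG"},
    {"A": "CC", "B": "CG", "D": "GG"},
    {"A": "CC", "B": "CC", "D": "GG"},
    {"A": "CC", "B": "CC", "D": "CG"},
]
ratios = [variant_ratio(g) for g in genotypes]
print(ratios)  # [(1, 2), (1, 1), (2, 1), (5, 1)]
```

All four genotypes contain both C and G and would therefore be scored identically by a qualitative assay, whereas the four ratios are distinct.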
For polyploid wheats, equivalent data has been obtained from two commercial systems that employ oligonucleotide ligation assay (OLA) technology. The Applied Biosystems SNPlex™ OLA system provided 39 reliable SNP assays from an initial group of 48 markers, revealing similar levels of SNP genotyping accuracies across a panel of tetraploid and hexaploid wheat (Bérard et al., 2009). The Illumina GoldenGate™ OLA assay was used to genotype 1536 SNP loci across 384 accessions of wheat using pairs of allele-specific oligonucleotides that anneal to regions flanking the query SNP, and a locus-specific oligonucleotide for SNP detection (Akhunov et al., 2009). At least 84% of 135 SNPs selected from the Wheat SNP Database could be converted to genotypic assays with accuracy across a panel of 91 tetraploid and hexaploid wheat lines. The Illumina GoldenGate™ assay may also be adapted to deliver quantitative information based on Cy3:Cy5 fluorescence ratios of the type described earlier, through resolution of multiple signal clusters during data analysis.
The relative value of genotyping methods that deliver information with greater or lesser levels of subgenome-specificity will differ according to the genetic architecture of the target species. Interpretation of quantitative genotypic information is assisted by detailed knowledge of parental allelic constitution and linkage phase, which is likely to be more accessible for obligate inbreeders. The value of locus-specific assays may hence be considerably higher for allogamous species.
Comparison between SNP discovery activities for the several species described in this review has illustrated the interplay of the various biological factors. For instance, canola exhibits the favourable characteristics of minimal polyploid status, capacity to generate homozygous genotypes and well-described progenitor species, but displays substantial levels of evolutionary gene sequence duplication. In contrast, analysis of strawberry is complicated by the unfavourable characteristics of higher ploidy level, outbreeding habit and limited knowledge of subgenome donor taxa, but benefits from relatively lower levels of gene duplication.
All studies to date have been based on targeted resequencing of individual genes, or complexity-reduced genomic representations. However, WGS technologies will clearly have a major impact for genetic analysis of allopolyploids. For those inbreeding allopolyploids with known ancestral relationships, derivation of reference genomic sequences for each subgenome and their diploid counterparts is both desirable and feasible. Canola is particularly well placed in this regard, as both B. rapa and B. oleracea are important vegetable crop species in their own right. There are no obvious theoretical impediments to systematic genome-wide SNP marker development in either wheat or canola, although substantial logistical challenges remain.
In contrast, specific challenges remain for the allogamous allopolyploids. Although complexity of SNP characterization is broadly correlated with ascending ploidy level, incomplete knowledge of progenitor species, and potential high levels of similarity between those species and the corresponding subgenomes, present the most serious problems. As suggested for strawberry, concerted efforts to obtain whole-genome sequences for a full range of related diploid taxa provide a viable strategy. However, this approach presupposes that the descendants of the subgenome donor are still represented among living species. For the P’ subgenome donor of white clover, and for each diploid component of Mediterranean tall fescue, this assumption may be incorrect.
An alternative approach to HSV-SNP discrimination that is applicable to both inbreeders and outbreeders is high-density genotyping of full-sib mapping populations in order to eliminate nonallelic variants, and retrospective subgenome assignment based on genetic map position. This approach appears to be most advanced for canola. To date, large-scale adoption of such a method has been prohibited by cost. However, the imminent transition from genotyping per se to ‘genotyping-by-sequencing’ (GBS) methods (Huang et al., 2009; Elshire et al., 2011) will change both the cost structure and feasibility of this method for SNP validation. Detailed knowledge of homoeolocus variation based on completion of reference genome sequences will permit assignment of specific sequence variants to subgenomes, either through whole-genome resequencing, or computational analysis of reduced complexity representations from specific genotypes (Baird et al., 2008; Wang et al., 2011). The transition to GBS approaches also implies a future coalescence of SNP discovery and genotyping activities at the technical level.
The authors acknowledge support from the Victorian Department of Primary Industries. Original research in the molecular genetics of temperate forage species has been funded by Dairy Australia, the Geoffrey Gardiner Dairy Foundation, Meat and Livestock Australia and the Victorian Department of Primary Industries through the Molecular Plant Breeding and Dairy Futures Cooperative Research Centres. Activities in strawberry molecular genetics were co-funded by Horticulture Australia Ltd., AgGenomics Pty. Ltd. and the Victorian Department of Primary Industries. The authors thank Prof. German Spangenberg, Dr Noel Cogan and Dr Matthew Hayden for helpful critical comments.