Over the past few years, considerable progress has been made in high-throughput single nucleotide polymorphism (SNP) genotyping technologies, largely through the investment of the human genetics community. These technologies are well adapted to diploid species. For plant breeding purposes, it is important to determine whether these genotyping methods are adapted to polyploidy, as most major crops are former or recent polyploids. To address this problem, we tested the capacity of the multiplex technology SNPlex™ with a set of 47 wheat SNPs to genotype DNAs of 1314 lines that were organized in four 384-well plates. These lines represented different taxa of tetra- and hexaploid Triticum species and their wild diploid relatives. We observed 40 markers which gave less than 20% missing data. Different methods, based on either Sanger sequencing or the MassARRAY® genotyping technology, were then used to validate the genotypes obtained by SNPlex™ for 11 markers. The concordance of the genotypes obtained by SNPlex™ with the results obtained by the different validation methods was 96%, except for one discarded marker. Furthermore, a mapping study on six markers showed the expected genetic positions previously described. To conclude, this study showed that high-throughput genotyping technologies developed for diploid species can be used successfully in polyploids, although there is a need for manual reading. For the first time in wheat species, a core of 39 SNPs is available that can serve as the basis for the development of a complete SNPlex™ set of 48 markers.
Single nucleotide polymorphisms (SNPs) are the most frequent type of polymorphism in genomes. They can provide a huge number of useful markers for many genetic analyses (e.g. phylogenetic analysis, ultra-dense genetic mapping, genotype/phenotype association studies) and important applications (e.g. cultivar identification, marker-assisted selection), which are simplified as these markers are most often biallelic. In the last decade, the constitution of very large SNP collections for humans (http://www.hapmap.org), several animal species (http://www.livestockgenomics.csiro.au/ibiss) and even model plants (for Arabidopsis and rice, see http://www.arabidopsis.org and http://irfgc.irri.org respectively) has made possible the development of genome-wide analyses which require the genotyping of a large number of SNPs. As such genotyping methods are limited by the cost and time for scoring SNPs, there has been an increasing demand for the development of high-throughput and low-cost genotyping methods.
Automation and multiplexing enhance the effectiveness of SNP genotyping enormously. Several high-throughput platforms are now available for genotyping a variable number of genomic DNA (gDNA) samples at between one and one million SNPs in parallel. These technologies were first developed for human genomic studies, i.e. for a diploid genome. They include, but are not limited to, TaqMan® and SNPlex™ technologies from Applied Biosystems (Foster City, CA, USA) and array-based technologies from Illumina (San Diego, CA, USA) (GoldenGate® and Infinium®) or Affymetrix (Santa Clara, CA, USA) (for a review, see Sobrino et al., 2005; Syvänen, 2005; Khlestkina and Salina, 2006; for a comparison, see De La Vega et al., 2005; Giancola et al., 2006).
Today, except for model plant species (Arabidopsis and rice) and a small number of other species (e.g. maize, barley, pine, grapevine), the number of SNPs available is still low. In wheat, SNP databases are available (Gupta et al., 2008), but with a limited number of SNPs for hexaploid material. This will certainly change as many SNP discovery projects in plants reach completion. Nevertheless, the utilization of SNPs in breeding programmes, such as marker-assisted selection, depends on high-throughput SNP genotyping methods. Recently, some of these methods have been applied in plants. For instance, SNPlex™ technology has been used in Arabidopsis (Drouaut et al., 2007; Simon et al., 2008) and grapevine (Vitis vinifera) (Lijavetzky et al., 2007; Pindo et al., 2008), and GoldenGate® technology has been used in spruce (Picea glauca and Picea mariana) (Pavy et al., 2008) and soybean (Glycine max) (Hyten et al., 2008). These current technologies have been applied to diploid species. However, many plant species are former or recent polyploids, and nearly all crop plants are polyploid (for a review, see Adams and Wendel, 2005), and therefore their genotyping represents an extra challenge.
Bread wheat (Triticum aestivum L.) is a major cereal crop in both human and animal nutrition. Its large genome, consisting of three highly related genomes (homoeologous A, B and D genomes), originated from two independent polyploidization events (for a review, see Dubcovsky and Dvorak, 2007). The first event involved the hybridization of two diploid progenitors, an ancestor of Triticum urartu (AA genome) and an unconfirmed species related to Aegilops speltoides (BB genome), which resulted in wild and cultivated allotetraploid wheats (T. turgidum ssp.). The second hybridization between an ancestor of the diploid Aegilops tauschii var. strangulata (DD genome) and an allotetraploid wheat formed the hexaploid. Few studies have been carried out on nucleotide diversity in wheat because of the presence of two or three homoeologous genome copies. Cultivated wheat species are reported to have a low level of nucleotide diversity, explained by their evolutionary history, including several demographic bottlenecks and selective events (Ravel et al., 2006; Haudry et al., 2007). Therefore, to date, SNP discovery in these species has been an arduous task.
SNPlex™ technology (De La Vega et al., 2005) allows the investigation of one to several hundred samples with sets of 48 SNP markers, and is a good compromise between very-high-throughput genotyping methods, such as array-based technologies (large number of SNPs but a limited number of samples), and simplex technologies (for example, the TaqMan® assay allows the analysis of one SNP at a time in thousands of samples). Therefore, the purpose of this work was to study whether a multiplex genotyping method could be used in polyploid wheat species. To address this problem, we simultaneously genotyped 47 SNPs by SNPlex™ using gDNAs from different taxa of wheat (tetra- and hexaploids) and their wild diploid relatives.
We submitted 75 SNPs to Applied Biosystems (Table 1). Among these, 73 passed the design rules and were included in two sets of compatible markers for the multiplex reaction: one with 47 markers and one with 26 markers (sets 1 and 2 in Table 1). We considered that the second set had an insufficient number of markers, and so only used the complete set (set 1).
Table 1. Description of submitted single nucleotide polymorphisms (SNPs)
The SNP code corresponds to the code of the gene followed by the homoeologous genome, the type of SNP using the IUB code and its position in the consensus sequence that we obtained. For GDH genes, the genome is followed by a number indicating the location of the fragment (fragments A1 and B9, from exon 1 to exon 4; fragments A8 and B5, from exon 4 to exon 6).
The flanking sequences of unpublished SNPs are given in Table S1 (see Supporting information).
In this study, wheat gDNAs with different ploidy levels were used (see Experimental procedures). Four plates were constituted and named Di-Tetra, CC1, CC2 and RILs-ReR for diploids and tetraploids, hexaploid core collection 1, hexaploid core collection 2 and hexaploid wheat recombinant inbred lines (RILs), respectively (Table 2).
Table 2. Number of accessions per wheat taxon genotyped by SNPlex™ technology
Indeterminate diploid species or T. turgidum subspecies.
Allele calling was performed by an automatic analysis of SNPlex™ signals with the software GeneMapper v3.7, followed by a manual reading. Samples giving signals which could not be discriminated from the negative control (water) were treated as missing data. We observed more than 20% missing data in all plates for all four markers developed in a hypothetical gene on the 5A chromosome (HG_A_Y_335, HG_A_M_489, HG_A_K_615 and HG_A_R_695), as well as for the marker PSY_B_M_640. No data were obtained for GSP_A_R_366 and Hd1_A_K_1545 (Tables 1 and 3). Thus, 40 of 47 markers gave less than 20% missing data for all plates and, henceforth, these are referred to as selected markers.
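The selection rule described above (a marker is kept only if it gives less than 20% missing data on every plate) can be sketched in a few lines of Python. This is a minimal illustration, not the procedure actually used in the study: the data layout (`calls[plate][marker]` as a list of calls, with `None` standing for a sample that could not be distinguished from the water control) and the function names are assumptions introduced here.

```python
from typing import Dict, List, Optional

# Hypothetical layout: calls[plate][marker] -> list of genotype calls.
# None marks a sample indistinguishable from the negative control (missing data).
Calls = Dict[str, Dict[str, List[Optional[str]]]]


def missing_rate(genotypes: List[Optional[str]]) -> float:
    """Fraction of samples with no usable call for one marker on one plate."""
    return sum(g is None for g in genotypes) / len(genotypes)


def select_markers(calls: Calls, max_missing: float = 0.20) -> List[str]:
    """Keep markers whose missing rate is below max_missing on every plate."""
    # Only markers genotyped on all plates are considered.
    markers = set.intersection(*(set(per_plate) for per_plate in calls.values()))
    return sorted(
        m for m in markers
        if all(missing_rate(calls[plate][m]) < max_missing for plate in calls)
    )


# Toy data (not from the study): m2 has 40% missing data on plate P1.
toy = {
    "P1": {"m1": ["C", "C", "T", "C", "C"], "m2": ["A", None, None, "G", "A"]},
    "P2": {"m1": ["C", "T", "T", "C", "C"], "m2": ["A", "A", "G", "G", "A"]},
}
selected = select_markers(toy)
```

With the toy data, only `m1` passes the threshold, because `m2` exceeds 20% missing data on one plate.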
Table 3. Percentage of missing data. Validation markers are indicated in bold and discarded markers are indicated in italic
For all markers, the average number of missing data points (calculated as the sum of the number of missing data for each marker/total number of markers, i.e. 47) was higher in CC1 (67.7) than in CC2 (48.1). The correlation between the number of missing data points for each marker in these two plates was high (0.93). CC1 contained more DNAs that were unable to give analysable signals than did CC2. Twenty-one lines were detected which gave no signal regardless of the marker in CC1. The technical repeat gave similar results. This suggests that the DNA in CC1 is of low quality, which may be critical for a few markers (AAP_A_R_99, GSP_A_Y_716, PSY_A_R_830 and SPA2_A_Y_1292) (Table 3). The correlations between CC1 or CC2 and the Di-Tetra plate were lower (0.85 and 0.79, respectively). This lower correlation was probably a result of the fact that several wheat species were included in the Di-Tetra plate.
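The two summary statistics used in this comparison, the mean number of missing data points per marker and the correlation between per-marker missing counts on two plates, can be sketched as follows. The text does not state which correlation coefficient was used; Pearson's coefficient is assumed here, and the counts in the example are invented for illustration only.

```python
import math
from typing import Sequence


def mean_missing(counts: Sequence[float]) -> float:
    """Sum of missing data points over markers, divided by the number of markers."""
    return sum(counts) / len(counts)


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equal-length series of per-marker counts."""
    if len(x) != len(y):
        raise ValueError("both plates must list the same markers")
    mx, my = mean_missing(x), mean_missing(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Invented per-marker missing counts for two plates (same marker order).
plate_a = [10, 25, 40, 300, 12]
plate_b = [5, 20, 30, 280, 8]
r = pearson(plate_a, plate_b)
```

A coefficient close to 1 indicates that the markers that fail most often on one plate are also those that fail most often on the other, which is the pattern reported for CC1 and CC2.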
For the 40 selected markers, the success rates were 92.1%, 91.1% and 98.1% for plates containing diploids and tetraploids, CC1 and CC2, respectively (Table 4).
Table 4. Success rate of the assay, excluding the seven discarded markers: number and percentage of called lines for each plate
No discordance was detected between independent technical repeats, i.e. the same genotype was observed in each analysis.
Three independent methods (M1, M2, M3) were used to validate the SNPlex™ data for the CC1 plate (see Experimental procedures). M1 corresponded to the Sanger sequencing of all markers in 42 lines, M2 to the Sanger sequencing of all lines for four validation markers, and M3 to the MassARRAY® technology on all lines for 11 validation markers.
For 42 lines, the data obtained by SNPlex™ genotyping were compared with the data obtained by sequencing (M1 in Table 5). The marker AAP_A_R_99, which showed a lot of missing data in the CC1 plate that included the 42 reference lines, was discarded. We observed no discordance for 34 of the 39 selected markers. For two markers (GDH_A1_Y_67 and GDH_B9_S_133), we detected only one discordance. The markers LDD_B_M_924 and Sal1_A_S_149 gave four and five (about 10%) discordant values, respectively. The worst results were obtained for Sal1_B_M_509 (M2 in Table 5). It appeared to be monomorphic for allele C, whereas the frequency of the minor allele (A) was about 35% in sequenced lines.
Table 5. Percentage of discordant values observed for validation markers using three methods of validation in CC1. First, alleles obtained by SNPlex™ and Sanger sequencing of 42 reference lines were compared (method M1). The two other methods involved the comparison of SNPlex™ data and sequences (M2) or data generated by MassARRAY® technology (M3) for all of the lines in CC1
Percentage of discordant SNP values
–, method was not used; SNP, single nucleotide polymorphism.
For the 11 validation markers, the concordance of the genotypes given by SNPlex™ technology with those obtained by Sanger sequencing and MassARRAY® technology approached 96%. The genotype calls given by SNPlex™ and MassARRAY® technology were in perfect agreement for three markers (AAP_D_Y_450, GDH_B5_W_259 and Sal1_B_S_416), and we observed greater than 5% discordance for only two markers (Sal1_A_S_149 and Sal1_B_M_509 with 6% and 30%, respectively) (M3 in Table 5). For the latter marker, we observed many discordant values, as described above for the M1 sequencing results. Sanger sequencing and MassARRAY® technology showed that 114 lines contained the A allele; all were classified as the C allele by SNPlex™.
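A per-marker discordance rate of the kind reported in Table 5 can be computed by comparing the two call sets line by line. The sketch below is one possible way to do this; it assumes that lines with a missing call in either method are left out of the comparison, which the text does not specify, and the function name and example calls are hypothetical.

```python
from typing import List, Optional


def discordance_rate(
    snplex: List[Optional[str]], reference: List[Optional[str]]
) -> Optional[float]:
    """Fraction of lines whose SNPlex call differs from the reference call.

    Assumption: lines with a missing call (None) in either method are skipped.
    Returns None when no line can be compared.
    """
    compared = [
        (a, b) for a, b in zip(snplex, reference)
        if a is not None and b is not None
    ]
    if not compared:
        return None
    return sum(a != b for a, b in compared) / len(compared)


# Invented calls for one marker in four lines (same line order in both lists).
snplex_calls = ["C", "C", "A", None]
reference_calls = ["C", "A", "A", "A"]
rate = discordance_rate(snplex_calls, reference_calls)  # one mismatch out of three
```

The concordance is simply one minus this rate, so a marker with 4% discordant lines has 96% concordance.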
The number of discordant genotypes observed for the M1 and M3 validation methods based on 11 markers was highly correlated (0.98). The correlation between the number of discordant genotypes observed for the M1 and M2 validation methods was based on only four markers and was 0.99. These high correlations suggest that the results obtained with a set of reference lines are representative for the whole sample, and thus can be used for validation.
We also mapped the six markers showing a polymorphism between Renan and Récital cultivars (SPA_B_S_142, LDD_B_M_924, GDH_A1_R_891, GDH_A1_Y_970, GDH_B5_W_259 and Sal1_B_S_416). The results shown in Figure 1 are in accordance with previous mapping results for three genes, SPA (storage protein activator), LD (lumini dependens) and GDH (glutamate dehydrogenase) (Guillaumie et al., 2004; Fontaine et al., in press), and present the first mapping of the supernumerary aleurone layer 1 gene (Sal1).
Seventy-five pre-selected SNPs were submitted and all but two were included in two sets of markers. This good result can probably be attributed to the manual pre-selection step (see Experimental procedures), which is required to avoid SNPs which cannot be developed by this technology.
The software GeneMapper v3.7 allows the automatic calling of alleles and is well adapted to diploids. For each SNP, the software creates a genotype plot (two neighbouring peaks), which shows the intensity of each allele, and a Cartesian plot using a clustering algorithm to assign genotypes (Figure 2a) (De La Vega et al., 2005; Tobler et al., 2005). However, the polyploidy of the wheat genome complicates the interpretation of these plots. The allele calling of each SNP is hindered by the presence of the homoeologous genomes. Consequently, we observed a modification of the location of clusters relative to the diploid plot and a mis-assignment of genotypes, as the software considers one of the two homozygous genotypes as heterozygous. Thus, additional manual reading and corrections are required (Figure 2b). This manual genotype reading is based on the semi-quantitative nature of the SNPlex™ method. Figure 3 gives an example of a tetraploid and hexaploid genetic context for an SNP on one homoeologue: assuming that all genomes are amplified, for a homozygous sample, we expect a ratio of 1 : 1 and 1 : 2 for the size of both allelic peaks for tetraploids and hexaploids, respectively. Although manual reading is feasible, it is time consuming, decreases the throughput and is likely to generate errors.
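The expected peak ratios can be reproduced with a simple dose count. The sketch below assumes, as a reading of the paragraph above rather than a statement from the authors, that the other homoeologous genomes are fixed for one of the two alleles and that every genome copy is amplified with equal efficiency. The function names are hypothetical.

```python
from collections import Counter
from math import gcd
from typing import Tuple


def allele_doses(target_genotype: str, n_other_genomes: int, other_allele: str) -> Counter:
    """Count allele copies seen by the assay across all homoeologous genomes.

    target_genotype: the two alleles carried by the genome bearing the SNP, e.g. "AA".
    n_other_genomes: 1 for a tetraploid, 2 for a hexaploid.
    other_allele: allele assumed fixed in the other homoeologous genomes.
    """
    doses = Counter(target_genotype)
    doses[other_allele] += 2 * n_other_genomes  # each genome contributes two copies
    return doses


def peak_ratio(doses: Counter, first: str, second: str) -> Tuple[int, int]:
    """Reduce the two allele doses to their simplest integer ratio."""
    a, b = doses[first], doses[second]
    g = gcd(a, b) or 1  # avoid dividing by zero if both doses are zero
    return a // g, b // g


tetra_hom = peak_ratio(allele_doses("AA", 1, "C"), "A", "C")  # (1, 1)
hexa_hom = peak_ratio(allele_doses("AA", 2, "C"), "A", "C")   # (1, 2)
hexa_het = peak_ratio(allele_doses("AC", 2, "C"), "A", "C")   # (1, 5)
hexa_ref = peak_ratio(allele_doses("CC", 2, "C"), "A", "C")   # (0, 1)
```

Under these assumptions, the homozygote for the allele that differs from the other genomes shows two peaks and therefore resembles a diploid heterozygote, which is consistent with the mis-assignment described above. A true heterozygote in a hexaploid would give a 1 : 5 ratio, lying between the two homozygous clusters.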
As the taxa under study are produced by self-fertilization, we expected very few heterozygous genotypes (Enjalbert and David, 2000; Charlesworth and Wright, 2001). However, we can expect that heterozygosity will complicate the interpretation of the Cartesian plot by producing a third cluster between the two homozygous ones. For the future use of this genotyping method in plant breeding, this difficulty will need to be solved. The addition of heterozygous control samples (natural or artificial) in the assay will facilitate the separation of the three expected clusters. Moreover, for an efficient use of SNPlex™ technology with polyploid genomes, an adaptation of the GeneMapper® clustering algorithm needs to be developed.
We obtained reliable genotypes for 35–39 SNPs of the 47 simultaneously genotyped by SNPlex™ (as previously described in Results, four markers gave reliable results on only one of the three plates). Two markers failed to give genotypes for all the lines under study. This is in agreement with the results reported by Pindo et al. (2008) and our experience in other species, as we often obtain up to four failed markers per set of 48 markers. Some markers failed because they gave too much missing data, making them unusable for genetic studies. We observed more than 50% missing data for five markers. Four are located in a hypothetical gene and one in a gene coding for phytoene synthase (Psy-2). The sequence variations in this hypothetical gene were probably too frequent, thus preventing hybridization of the probes. We also observed a high level of polymorphism in the B homoeologue of the Psy-2 gene. These results highlight the difficulty in developing SNPlex™ probes (i.e. allelic and locus-specific probes) in highly polymorphic genomic regions. It is expected that the same difficulty will be encountered in other genotyping technologies based on hybridization probes (VeraCode™, GoldenGate®, Infinium®).
Validation by Sanger sequencing and MassARRAY® genotyping technology showed that the number of lines that were well called was high. For most markers, we observed more than 95% of correctly called lines. This is less than the concordance rate given by the manufacturer (> 99.2%) (http://www.appliedbiosystems.com). However, this rate is given for human samples with automated allele calling. In our context (polyploidy and manual reading), the rate of 5% misclassified lines appears to be acceptable. This assay failed to call the correct allele for two markers (Sal1_A_S_149 and Sal1_B_M_509). The discordance observed for the first marker (11.9% and 6.6% for M1 and M3 validation methods, respectively) may be caused by the presence of the same polymorphism in its D homoeologous sequence. Indeed, Sanger sequencing and MassARRAY® genotyping technology use a specific homoeologous amplification, whereas SNPlex™ technology amplifies the three genomes. This underlines the importance of knowing the sequence of each homoeologue and of pre-selecting SNPs for SNPlex™ technology. For the second marker, the result cannot be explained by the two other homoeologous sequences which have no polymorphism at this locus. Despite a good representation of the A allele, we detected only the C allele, as if the A allele could not be recognized by the probe. This may suggest a problem of synthesis of the probe. It also could be a result of a dilution of the A allele, if we suppose gene duplication of one homoeologue (the homozygous C): the ratio between the allelic peaks would then be 1 : 3 and the clusters would be more difficult to distinguish.
As an example of how the data can be used, we undertook the genetic mapping of six markers. The marker SPA_B_S_142 was mapped at 1.2 cM from the previous mapping position found by Guillaumie et al. (2004) (SPA). The genetic locations of the two markers on the A homoeologue of GDH (GDH_A1_R_891, GDH_A1_Y_970) are 124.90 cM and 125.60 cM, respectively. These different values are included in the confidence interval. We mapped the B homoeologue of GDH at 37.3 cM from gwm271, which is in agreement with the results reported by Fontaine et al. (in press). The B copy of Sal1 has not yet been mapped in wheat. We mapped this gene in the telomeric region of the 7BL chromosome. In barley, Sal1 was mapped in a syntenic position (Jestin et al., 2008). Therefore, these mapping results confirmed that the genotyping data used are reliable (Figure 1).
This study demonstrates the possibility of using existing high-throughput multiplex technologies, such as SNPlex™, for the analysis of polyploid genomes. For the first time, we have generated a set of 39 SNPs that can be used routinely and could serve as the basis for the development of a set of 48 markers.
Nevertheless, these results have highlighted some important points to be considered. It is evident that studying individuals from related species for which the polymorphism is largely or completely unknown should be avoided. Indeed, the larger number of lines giving missing data in the Di-Tetra plate, as well as the lower correlation between the number of missing data in CC1 or CC2 and Di-Tetra plates, suggests that the allelic and locus probes, which are the specific components, probably do not hybridize to their genomic target sequence in all species studied. Ideally, it is necessary to obtain sequences for several genotypes in order to validate markers that function in the assay. Moreover, the importance of having reference controls of known genotypes must be emphasized.
We also detected other problems, which are more specifically related to the polyploid structure of the wheat genome. A previous knowledge of the sequences of the three homoeologues allows the construction of a set of SNPlex™ markers that is more likely to succeed. Unfortunately, this situation is rare in wheat because of the difficulties presented by this species for Sanger sequencing. Thus, efforts are needed to develop SNP markers for this species. The major handicap for the routine use of this technique is the obligation to perform a manual verification of the results, and the lack of software that takes into account the polyploidy of genomes. These factors increase both the time required for the analysis and the risk of errors. We expect that the same limitations will hold for other SNP genotyping platforms, which, like SNPlex™, were developed for the analysis of the human genome.
In conclusion, we have shown the efficiency of the high-throughput genotyping technique SNPlex™ for species that are tetraploid or hexaploid, such as wheat. Our results suggest that other multiplex genotyping techniques (VeraCode™, GoldenGate® and even Infinium®) may also give good results for polyploid genomes. The routine use of such techniques awaits the development of software that is adapted to polyploid genomes. For the first time, a core of 39 SNPs is available for genotyping the complex of wheat species. This core can serve as the basis for the development of a set of 48 markers.
Plant material consisted of 1314 lines of wheat with three different ploidy levels (2×, 4× and 6×).
gDNA from these lines was distributed in four 384-well plates. In each plate, we included four positive controls of known genotype and two negative controls (water). Table 2 describes the samples of Triticum taxa used. For each taxon, we used a core set of individuals representing the highest allelic diversity. These individuals were chosen on the basis of simple sequence repeat (SSR) polymorphism data using mstrat (Gouesnard et al., 2001). One plate, called Di-Tetra, contained 374 gDNAs from diploid progenitors of polyploid wheat species (n = 50) and tetraploid species (n = 324). Among the latter, tetraploid wheats (T. turgidum) were represented by the wild T. turgidum ssp. dicoccoïdes (n = 62), the primary domesticated T. turgidum ssp. dicoccum (n = 97), the cultivated durum wheat T. turgidum ssp. durum (n = 71) and additional species and subspecies (T. turgidum ssp. carthlicum and ssp. polonicum, and T. timopheevii ssp. timopheevii). In this plate, four positive controls were included: one line per diploid progenitor (T. urartu, Aegilops speltoïdes and T. tauschii) and the cultivar Langdon (durum wheat).
Two plates contained 2 × 372 accessions of the hexaploid T. aestivum. This sample of 744 accessions comprised the 372 lines of the core collection defined by Balfourier et al. (2007), called CC1, and 372 supplementary accessions (CC2) chosen by these authors to balance the size of the geographical groups which structured the core collection. In these two plates, two of these 744 accessions (Récital and Redman cultivars) were each spotted twice and considered as positive controls. Among the 372 lines in the CC1 plate, 42 were used as reference lines for validation, because data from other genotyping methods are available for them.
Seeds were obtained from the Biological Resource Centre for Cereal Crops (INRA, Clermont-Ferrand, France) and from Pierre Roumet (INRA, Montpellier, France; tetraploid wheats and Aegilops species). All seeds were obtained from a single self-pollinated head. Fresh leaves of five to six plants per accession were pooled, and bulk gDNA was extracted using a cetyltrimethylammonium bromide (CTAB) protocol, as described previously (Tixier et al., 1998).
A fourth plate, RILs-ReR, consisted of 194 F7 RILs of the mapping population developed from the cross between Renan and Récital plus the two parental lines (Groos et al., 2002).
The SNPlex™ genotyping system uses an Applied Biosystems oligonucleotide ligation assay (OLA) combined with multiplex polymerase chain reaction (PCR) to achieve allelic discrimination and target amplification. The specificity of the ligation probes (allelic and locus-specific probes) for the target is critical.
All SNPs used came from French genomic projects which focused on SNP discovery. As we expected that polyploidy would engender difficulties for reading allele values obtained by SNPlex™, to the extent possible we used SNPs present in gene fragments for which we had sequenced the three homoeologues. We sent 75 SNPs to Applied Biosystems Assays-by-Design Service℠ (http://www.appliedbiosystems.com) in order to form a set of 48 SNPs (Table 1). We initially examined and pre-selected all these SNPs, following the SNPlex™ design rules, with the advice of Applied Biosystems. Generally, polymorphisms that were too close to one another were discarded, as were long insertions–deletions. In rare cases, we detected the same polymorphism in homoeologous genomes. Such SNPs were usually discarded.
The SNPlex™ assay was carried out in 2 days using the manufacturer's instructions (http://www.appliedbiosystems.com), taking care to perform pre-PCR and post-PCR steps in different locations. To facilitate the reading of genotyping data, it is necessary to have a uniform DNA concentration among samples on the same plate. All gDNAs were initially quantified by the Quant-iT™ PicoGreen® dsDNA Assay Kit (Invitrogen™, Carlsbad, CA, USA) in order to obtain an overview of the homogeneity of the DNA concentration. One modification has been made in the manufacturer's protocol: after testing different concentrations of gDNA, we performed the SNPlex™ reaction with a concentration of gDNA around 50 ng/µL, instead of 18.5 ng/µL, as is used for human gDNA. The need for a higher concentration of DNA could be explained by the different size of the wheat (16.5 Gb for T. aestivum) and human (3.5 Gb) genomes. Thus, in a given quantity of DNA, the target locus is about fourfold less represented in wheat than in humans.
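The dilution argument can be checked with a short calculation. The sketch below assumes that the number of copies of a single-copy target per unit mass of DNA is inversely proportional to genome size, and uses only the genome sizes and concentrations quoted above; the function names are hypothetical.

```python
def fold_dilution(large_genome_gb: float, small_genome_gb: float) -> float:
    """How many times rarer a single-copy target is per unit mass of DNA."""
    return large_genome_gb / small_genome_gb


def relative_target_copies(
    conc_a: float, genome_a_gb: float, conc_b: float, genome_b_gb: float
) -> float:
    """Target copies per microlitre in sample A relative to sample B.

    Assumption: copies per ng scale as 1 / genome size, so copies per
    microlitre scale as concentration / genome size.
    """
    return (conc_a / genome_a_gb) / (conc_b / genome_b_gb)


# Values quoted in the text: wheat 16.5 Gb at 50 ng/uL, human 3.5 Gb at 18.5 ng/uL.
dilution = fold_dilution(16.5, 3.5)                       # about 4.7
wheat_vs_human = relative_target_copies(50, 16.5, 18.5, 3.5)  # about 0.57
```

With these figures the genome size ratio comes out at roughly 4.7, the same order as the "about fourfold" quoted above. The 2.7-fold increase in DNA concentration compensates only partly, so under this simple model each wheat reaction still holds a little more than half as many target copies as a human reaction run under the standard protocol.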
Samples were run on a 3730xl DNA Analyser (Applied Biosystems). The raw data were analysed with GeneMapper® software version 3.7 (Applied Biosystems), and the resulting calls were corrected by manual reading. For each SNP, this software analyses the raw capillary electrophoresis data, creates different plots and provides automated allele calling.
To validate SNPlex™ data in T. aestivum, one independent technical repetition was carried out. For all SNPs, sequence data from a set of 42 reference lines from the CC1 plate were available. Genotypes from these sequences were compared with SNPlex™ data generated in both assays (validation method 1 – M1). Furthermore, for all CC1 samples, 11 of the 47 SNPlex™ markers (23%) were characterized previously either by Sanger sequencing (validation method 2 – M2) or by MassARRAY® technology (Sequenom Inc., San Diego, CA, USA), a simplex genotyping method based on matrix-assisted laser desorption ionization-time of flight (MALDI-TOF) mass spectrometry (validation method 3 – M3) (Balfourier et al., 2006) (Table 5). These 11 markers are called validation markers. For three of these, localized in the supernumerary aleurone layer 1 gene, Sal1 (Shen et al., 2003), both sets of genotyping data were available for the B homoeologous fragment.
Linkage analyses of six polymorphic markers in Renan and Récital cultivars were performed using Mapmaker/exp 3.06 (Lander et al., 1987). The Kosambi mapping function (Kosambi, 1944) was applied to transform recombination frequencies into additive distances in centiMorgans (cM).
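For reference, the Kosambi mapping function converts a recombination fraction r (0 ≤ r < 0.5) into a map distance d = 25 ln[(1 + 2r)/(1 − 2r)] cM. The sketch below is a standalone illustration of this formula and its inverse; it is not the Mapmaker implementation, and the function names are chosen here for convenience.

```python
import math


def kosambi_cm(r: float) -> float:
    """Map distance in centiMorgans for a recombination fraction r (0 <= r < 0.5)."""
    if not 0.0 <= r < 0.5:
        raise ValueError("recombination fraction must satisfy 0 <= r < 0.5")
    return 25.0 * math.log((1.0 + 2.0 * r) / (1.0 - 2.0 * r))


def kosambi_r(d_cm: float) -> float:
    """Inverse function: recombination fraction for a distance in centiMorgans."""
    return 0.5 * math.tanh(2.0 * d_cm / 100.0)


d = kosambi_cm(0.10)   # 25 * ln(1.5), roughly 10.14 cM
r_back = kosambi_r(d)  # recovers 0.10
```

For small r the distance in cM is close to 100r, and it grows faster than r as the recombination fraction approaches 0.5, which is what makes the resulting distances approximately additive along a chromosome.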
We are grateful to Heather McKhann for critical reading of the manuscript. We thank Steven McGinn from Ivo Gut's group at the CEA-IG/CNG (Evry, France), directed by Mark Lathrop, for providing advice on the SNPlex™ technology.