Discovering single nucleotide polymorphisms (SNPs) in specific genes in a heterozygous polyploid plant species, such as sugarcane, is challenging because of the presence of a large number of homologues. To discover SNPs for mapping genes of interest, 454 sequencing of 307 polymerase chain reaction (PCR) amplicons (> 59 kb of sequence) was undertaken. One region of a four-gasket sequencing run, on a 454 Genome Sequencer FLX, was used for pooled PCR products amplified from each parent of a quantitative trait locus (QTL) mapping population (IJ76-514 × Q165). The sequencing yielded 96 755 (IJ76-514) and 86 241 (Q165) sequences with perfect matches to a PCR primer used in amplification, with an average sequence depth of approximately 300 and an average read length of 220 bases. Further analysis was carried out on amplicons whose sequences clustered into a single contig using an identity of 80% with the program cap3. In the more polymorphic sugarcane parent (Q165), 94% of amplicons (227/242) had evidence of a reliable SNP – an average of one every 35 bases. Significantly fewer SNPs were found in the pure Saccharum officinarum parent – with one SNP every 58 bases and SNPs in 86% (213/247) of amplicons. Using automatic SNP detection, 1632 SNPs were detected in Q165 sequences and 1013 in IJ76-514. From 225 candidate SNP sites tested, 209 (93%) were validated as polymorphic using the Sequenom MassARRAY system. Amplicon re-sequencing using the 454 system enables cost-effective SNP discovery that can be targeted to genes of interest and is able to perform in the highly challenging area of polyploid genomes.
Sugarcane is a heterozygous, highly polyploid crop species, and modern commercial clones are generally the outcome of interspecific hybridization of Saccharum officinarum with Saccharum spontaneum, followed by backcrossing of these hybrids to S. officinarum (a process known as ‘nobilization’). An added complication is that double transmission of chromosomes (2n) from the backcrossed S. officinarum female parent is a frequent occurrence (Bhat and Gill, 1985; D’Hont et al., 1996). This increases the complement of chromosomes in the progeny of these crosses to well over 100 chromosomes. To create linkage maps of these large genomes, a number of different DNA marker types have been utilized in sugarcane, including restriction fragment length polymorphisms (RFLPs) (da Silva et al., 1993), simple sequence repeats (SSRs) (Rossi et al., 2003), amplified fragment length polymorphisms (AFLPs) (Hoarau et al., 2001) and random amplification of polymorphic DNAs (RAPDs) (Al-Janabi et al., 1993).
The use of publicly available ESTs as the basis for direct SNP discovery runs the risk of developing assays for sites that are not polymorphic in the population of interest. The re-sequencing strategy can be employed to target a population of interest, but the use of conventional sequencing is complicated in a highly polyploid and heterozygous species such as sugarcane. This is because the PCR template for conventional re-sequencing requires cloning to achieve the necessary template purity for dideoxy sequencing and, in addition, relatively large numbers of clones must be sequenced for each amplified product to achieve the sequencing depth to enable a confident determination of haplotype or SNP frequency (McIntyre et al., 2006). The cost of cloning and conventional sequencing of more than a modest number of products is prohibitive for most budgets. In addition, haplotype assignment can be confusing with this system as a result of bacterial host mismatch repair of cloned PCR heteroduplexes – which can produce apparent ‘recombinant’ haplotypes (Longeri et al., 2002).
We attempted to utilize SNPs discovered from ESTs in the public domain for the development of markers, but this proved to be inefficient, with only a small proportion of these SNPs segregating in our cross and less than 50% of the selected EST contigs harbouring any trustworthy (each allele represented by two or more sequences) SNP sites (P. C. Bundock et al., unpubl. data). In this paper, we describe the use of a new-generation DNA sequencing technology, 454 sequencing (Margulies et al., 2005), for the discovery of SNPs for marker development in a large number of sugarcane genes. The use of 454 technology has enabled the re-sequencing of pooled PCR products with pure template derived by amplification from a single strand during the emulsion PCR (emPCR) step of the 454 protocol. Heteroduplexes at the completion of PCR are not problematic, and the cost per sequence is sufficiently low to be able to obtain the required sequence depth for the identification of low-dose alleles and the estimation of SNP frequency. To our knowledge, this is the first time the re-sequencing of amplicons using a next-generation sequencing system has been reported for a plant species, and the first time this method has been used to find polymorphisms in a polyploid genome. This approach should be applicable to polymorphism detection in polyploid species generally.
Sequencing and SNP discovery
Sequences obtained by 454 sequencing for each parent, Q165 and IJ76-514, were binned according to matches to the 614 primers used for PCR amplification, with each amplicon potentially represented by a set of forward and reverse sequences. On average, more than 90 000 sequences with a perfect match to a PCR primer were obtained by 454 Life Sciences for each of the two parents (Table 1). These sequences were from a commissioned sequencing run of two regions of an FLX PicoTiterPlate™ with a four-region gasket. Some of the 307 PCR amplicons which were at low concentration before pooling were not found to have been sampled by the 454 sequencing.
Table 1. Summary of 454 sequences of pooled amplicons from the parents of the PJ2 sugarcane mapping population. Rows reported for each parent: No. of sequence reads (all amplicons); No. of amplicons represented; average No. of reads per amplicon; average read length (bp).
Of the 298 amplicons represented in the Q165 sequences, 242 produced a single consensus sequence after clustering and alignment of resulting sequence reads using cap3 (Table 2). Sequences from these 242 amplicons were analysed for candidate SNPs using the SNP detection program PolyBayes. An SNP was considered to be a candidate if it fulfilled either of the following conditions: it was present at a frequency of 4% or more with a sequence depth greater than 25 sequences, or at a frequency of 5% or more with a sequence depth greater than 21 sequences. This ensured that SNPs were only considered to be candidates if the minor allele was present in at least two sequences and at a frequency of at least 4%. Using this criterion, 1632 candidate SNPs from the 454 sequences were determined from the Q165 genotype. The average frequency of candidate SNPs across the 242 amplicons for the Q165 genotype was one SNP every 35 bases. Transition mutations made up approximately 62% of the total.
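The candidate criterion can be illustrated with a short sketch. It assumes that the two frequency/depth conditions are alternatives and that the minor allele must be seen in at least two reads; the function name and the example counts are hypothetical and are not part of the PolyBayes workflow used in this study.

```python
def is_candidate_snp(minor_count: int, depth: int) -> bool:
    """Return True if a site passes the frequency/depth thresholds.

    Assumed reading of the criterion: minor allele seen in >= 2 reads, and
    either (frequency >= 4% with depth > 25) or (frequency >= 5% with depth > 21).
    Integer arithmetic is used to avoid floating-point edge cases.
    """
    if depth <= 0 or minor_count < 2:
        return False
    passes_4_percent = depth > 25 and minor_count * 100 >= 4 * depth
    passes_5_percent = depth > 21 and minor_count * 100 >= 5 * depth
    return passes_4_percent or passes_5_percent


# Hypothetical sites: (minor allele count, sequence depth)
sites = [(2, 26), (1, 26), (2, 22), (2, 21), (3, 100), (4, 100)]
results = [is_candidate_snp(m, d) for m, d in sites]
```

With these example sites, the second is rejected because the minor allele is seen only once, the fourth because the depth is too low, and the fifth because the minor allele frequency is only 3%.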
Table 2. Summary of single nucleotide polymorphism (SNP) detection in 454 sequences of polymerase chain reaction (PCR) amplicons from the parents of the PJ2 sugarcane mapping population. Rows reported for each parent: number of products with a single consensus sequence; length of amplicons (bases); products with one or more SNPs; products with SNPs absent; SNPs with minor allele frequency ≥ 4%; mean SNPs per amplicon; average sequence depth at SNP sites; candidate single-dose SNPs (frequency 4–15%); likely multidose SNPs (frequency > 15%).
Of the 302 amplicons represented in the IJ76-514 sequences, 247 produced a single consensus sequence after cap3 alignment (Table 2). As for the Q165 parent, analysis with PolyBayes was carried out on amplicon sequences represented by a single consensus (247 amplicons). Candidate SNPs (defined as above) were considerably fewer in number in the S. officinarum clone, with 1013 compared with 1632 for Q165. The average number of SNPs per amplicon was significantly different between the two parents [analysis of variance (ANOVA), P = 4 × 10⁻⁶]. However, the difference was much more striking when only those SNPs that were considered to be single-dose candidates were taken into account, with more than three times as many SNPs in this category for Q165 (788) than IJ76-514 (216). The average frequency of candidate SNPs across the 247 amplicons for this genotype was one SNP every 58 bases. The proportion of transition mutations was approximately 59% amongst the 1013 SNPs.
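The split of candidate SNPs into single-dose and multidose classes can be sketched as a simple frequency threshold. The 4% and 15% bounds below are taken from the row labels of Table 2; the function name, the default arguments and the example frequencies are hypothetical, and the bounds are left adjustable because the spreadsheet-based workflow described later uses a different range.

```python
def classify_dose(minor_freq: float, min_freq: float = 0.04,
                  single_dose_max: float = 0.15) -> str:
    """Classify a SNP by the proportion of reads carrying the minor allele."""
    if minor_freq < min_freq:
        return "below threshold"
    if minor_freq <= single_dose_max:
        return "single-dose candidate"
    return "likely multidose"


# Hypothetical minor allele frequencies for four sites
labels = [classify_dose(f) for f in (0.02, 0.08, 0.15, 0.33)]

# Ratio of single-dose candidates between the parents, from the counts in the text
single_dose_ratio = 788 / 216  # roughly 3.6 : 1
```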
To reduce the possibility of bias when comparing the frequency of SNPs between Q165 and IJ76-514, a set of 214 shared amplicons was analysed and compared. In this set of 214 amplicons, there were 1364 SNPs in Q165 and 855 SNPs in IJ76-514, a ratio of 1.6 : 1, the same ratio as obtained when all amplicons (shared and unshared) were included in the analysis.
Previous results using this mapping population indicated that single-dose polymorphisms were much more common in Q165 than in IJ76-514 (Aitken et al., 2005). A frequency histogram for each parent allowed a comparison of the frequency of occurrence of SNP alleles with respect to their proportions in sequence reads (Figure 1). This clearly demonstrated that low-frequency SNP alleles identified in the 454 sequences occurred more often in Q165 (Figure 1a) than in IJ76-514 (Figure 1b).
In practice, the identification of SNPs and the sequence information used for the design of SNP assays require more information than that provided by the output from PolyBayes. In particular, we wished to avoid redundancy, in which a single-dose haplotype is represented by a number of SNPs (haplotype information is not provided by PolyBayes). We also wished to avoid SNPs that were single dose in one parent and multidose in the second parent. This information was difficult to obtain from the PolyBayes output, because it is difficult to align the SNP sites from the two parents as a result of the different start sites originating from the length of sequence runs. In addition, the consensus sequences from the two parents may be in opposite orientations depending on the number of sequences obtained in forward and reverse orientations in each case. The other factor is that, for the assay design phase to verify the SNPs using the Sequenom MassARRAY, consensus sequences with coded SNP sites are required for the design of assays.
To accommodate the information requirement for the SNP assay design process and the problems with combining forward and reverse sequence runs, we designed macros in the Microsoft spreadsheet program Excel to semi-automate the SNP detection and annotation process. The sequence alignments from cap3 were parsed to a format suitable for use in Excel using an in-house executable program (available from the authors). Macros and formulae in Excel were used to detect SNPs, to calculate minor allele frequencies and to output sequences with SNP sites encoded for Sequenom Assay Design software. The criteria used to define candidate single-dose SNPs were similar to those used for the parsed PolyBayes results – frequency of the minor allele between 5% and 25% with a sequence depth of more than 21 sequences. Only amplicons which produced a single contig from the cap3 alignment process were analysed.
The validation of a subset of SNPs discovered from the 454 sequences was carried out by designing SNP assays for the Sequenom MassARRAY [matrix-assisted laser desorption ionization-time of flight (MALDI-TOF) mass spectrometer] system. Two hundred and twenty-five candidate SNPs were evaluated across the parents (IJ76-514 and Q165) and 21 progeny of the PJ2 cross. For 209 SNPs, the polymorphism identified in the 454 sequences was validated by Sequenom assays. For the purposes of linkage mapping, single-dose markers (present on one homologue) that are inherited from one parent are the most useful, as they are expected to segregate 1 : 1 and all progeny scored are informative (Aitken et al., 2005). For 106 of the 209 polymorphic SNPs, the segregation pattern amongst the progeny fitted an expected 1 : 1 segregation ratio for a single-dose marker (P > 0.05 from the binomial distribution, where 17 or more progeny were scored). Thus, 93% of the evaluated SNPs discovered from the 454 sequences were validated by the Sequenom assays, with 50% of these SNPs fitting a 1 : 1 segregation pattern in the test panel, indicating the utility of the 454 sequencing approach for the detection of single-dose SNP markers in sugarcane.
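A test of fit to 1 : 1 segregation can be sketched with an exact binomial calculation. The text states only that P > 0.05 from the binomial distribution was required where 17 or more progeny were scored; the two-sided form of the test, the function names and the example counts below are assumptions made for illustration.

```python
from math import comb


def binomial_two_sided_p(present: int, scored: int, p: float = 0.5) -> float:
    """Exact two-sided binomial P value.

    Sums the probabilities of all outcomes that are no more likely than
    the observed count (assumed form of the test).
    """
    probs = [comb(scored, k) * p**k * (1 - p) ** (scored - k)
             for k in range(scored + 1)]
    observed = probs[present]
    total = sum(q for q in probs if q <= observed * (1 + 1e-9))
    return min(1.0, total)


def fits_single_dose(present: int, scored: int, alpha: float = 0.05,
                     min_scored: int = 17) -> bool:
    """True if the marker was scored in enough progeny and fits 1 : 1."""
    if scored < min_scored:
        return False
    return binomial_two_sided_p(present, scored) > alpha


# Hypothetical markers: (progeny carrying the allele, progeny scored)
balanced = fits_single_dose(10, 21)       # close to 1 : 1
skewed = fits_single_dose(3, 21)          # strongly distorted
under_sampled = fits_single_dose(8, 16)   # too few progeny scored
```

With a panel of 21 progeny, a test of this form has limited power, so a fit to 1 : 1 is best read as consistency with single-dose inheritance rather than proof of it.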
The discovery of SNPs in heterozygous polyploid species, such as sugarcane, is more challenging than in diploid species because a given gene may be represented by a number of different alleles. For alleles present in a single dose (or low dose), direct re-sequencing of PCR products using conventional Sanger sequencing methods will fail to detect the corresponding SNP sites, and indels can create uninterpretable sequence reads. The sequencing of cloned products is necessary to detect these SNPs, a process which becomes extremely expensive using conventional dideoxy methods. However, the application of highly parallel sequencing from clonally amplified DNA to discover SNPs and indels is ideal for polyploids, because, even for a relatively large number of amplicons, it is possible to achieve a sequence depth which makes it highly probable that SNPs at low dosage (those present on one or a few homologues) can be discriminated from sequencing errors. Indeed, for the 300 or so amplicons re-sequenced in this study, it was possible to obtain a sequence depth that enabled reliable SNP frequency (relative dosage) estimation for most SNPs. These relative dosage estimates were then used to reduce the SNP pool for assay design to those SNPs that were most likely to be single dose (which are the most informative for linkage mapping). With the large number of reads per amplicon (average of 320 for IJ76-514 and 289 for Q165), the sequence depth necessary for estimating the relative dosage of SNPs can be provided for hundreds of amplicons using partial 454 sequencing runs.
A high proportion of the amplicons analysed had one or more SNPs present in the sequences (227/242, or 94%, for the Q165 parent and 213/247, or 86%, for the IJ76-514 parent). Given that most amplicons were only around 200 bp in length, this indicates that, for most loci, there is significant diversity between homologues in the sugarcane genomes studied here. We detected a larger number of SNPs overall and a significantly larger mean number of SNPs per amplicon in Q165, the pollen parent. A larger number of polymorphisms was also observed for this parent from Ecotilling (data not shown) and from the use of AFLP and SSR marker systems (Aitken et al., 2005). The greater polymorphism in Q165 can perhaps be explained by the fact that this parent clone has been derived from introgression of S. spontaneum material into the S. officinarum genome. The introgressed chromosomes are likely to carry sequence variation outside of the variation observed within S. officinarum. SNPs present in Q165 and inherited from S. spontaneum would also be expected to be at low dosage in the genome. This high frequency of low-dosage SNP alleles in Q165 is observed in the histogram of SNP frequency for the Q165 parent. The female parent, IJ76-514, also has a high proportion of low-dose alleles, but not to the extent observed in Q165. The histograms of the SNP data bear a striking resemblance to the marker frequency histograms derived from AFLP and SSR data for Q165 and IJ76-514 (Aitken et al., 2007), indicating that, as for the AFLP and SSR marker data, the relative proportion of multidose SNPs in the S. officinarum parent is larger. It will be of great interest to observe the patterns of SNP dosages in other sugarcane cultivars to determine whether any patterns can be discerned with regard to the distribution of SNP dosages, and how this might relate to recent breeding history.
Two earlier approaches for the development of SNP markers for this mapping population were tested before trying 454 sequencing (P. C. Bundock et al., unpubl. data). Firstly, assays were developed based on in silico SNPs from public domain ESTs; however, as less than 50% of the 300 candidate genes had an SNP site, this approach was not ideal. In addition, the small proportion of in silico SNPs that segregated as single-dose markers in the target mapping population meant that a lot of effort was wasted at the assay design and testing stages. The use of Ecotilling as a method to develop markers directly (Cordeiro et al., 2006b) was not attractive because of the cost per genotype of capillary electrophoresis. However, the 454 sequencing approach presented here has enabled the discovery of a large number of SNPs in the sequenced amplicons, which bodes well for the mapping of a high proportion of the targeted genes. With more than 50% of the 209 validated SNPs predicted to be single dose, a much higher success rate was achieved than for in silico- and Ecotilling-based approaches. The first round of mining and resultant SNP/indel assay development could enable the mapping of up to 75% of the 300 amplicons (these being sequences with one or more SNPs detected), a much higher proportion than possible with the other approaches.
Other new-generation sequencing technologies (which also incorporate clonal amplification strategies), such as the Illumina Genome Analyser and the AB SOLiD, have reduced per base sequencing costs further, and should enable the sequence depth necessary for even larger numbers of target products per sequence run. However, at present, the shorter read lengths of these technologies may not be as suitable as 454 sequences for SNP detection, especially for genes (amplicons) with uncharacterized introns or substantial polymorphism.
In a number of plant species, 454 sequencing has been used to sample the transcriptome in EST sequencing projects, e.g. maize (Emrich et al., 2007; Ohtsu et al., 2007), Arabidopsis (Weber et al., 2007) and Medicago (Cheung et al., 2006). SNP discovery using transcriptome-derived 454 sequences was carried out in maize (Barbazuk et al., 2007). Recently, the use of the 454 GS-FLX for the sequencing of pooled amplicons from human genomic DNA has been published (Bordoni et al., 2008). However, targeting of SNP discovery to genes of interest in plant species using 454 sequencing has not been reported to date. This approach should be of interest in studies similar to that reported here, where SNP markers are required for a number of candidate genes, and will be of particular interest for assessing diversity in candidate genes in association studies. It should also be applicable to population genetic or ecologically based projects. As this approach is successful for sugarcane, it should also be applicable to other polyploids, of which there is a considerable number just amongst plant crop species.
Plant material and genomic DNA
Two varieties, IJ76-514 and Q165, are the parents (female and male, respectively) of the PJ2 population, which has been used extensively for linkage mapping (Aitken et al., 2005, 2007). The female parent is S. officinarum, whereas Q165 is a commercial clone derived from crossing interspecific hybrids of S. officinarum and S. spontaneum. Genomic DNA was extracted from leaf material of the parents and progeny of the PJ2 population using the method of Hoisington (1992). Genomic DNA from the parent clones was used as template for the generation of PCR products for 454 sequencing. PCR products derived from each parent were pooled, with the two parent pools kept separate for 454 pyrosequencing. Genomic DNA from a subset of the progeny of the cross (21 individuals) was used as a test panel to detect SNP segregation.
Candidate genes were mined from gene lists generated previously (Casu et al., 2003, 2004, 2007). In all, 311 sequences that are represented as tentative consensus sequences (or singletons) in the Institute for Genomic Research (TIGR) gene indices for sugarcane (http://compbio.dfci.harvard.edu/tgi/) were targeted. Primers for PCR amplification were designed using the software application MassARRAY® Assay Design 3.1 (Sequenom®, San Diego, CA, USA). False SNP sites were inserted into each sequence and ‘capture’ PCR primers were designed for PCR amplicons with a minimum length of 200 bp, an optimum length of 250 bp and a maximum length of 460 bp. A 20-µL PCR amplification was carried out for each combination of primers and template, with reaction components at the following concentrations: 1 U (0.2 µL) of Platinum Taq (Invitrogen, Carlsbad, CA, USA), one-tenth volume of 10 × Platinum Taq reaction buffer, 1.5 mM MgCl₂, 80 µM of each deoxynucleoside triphosphate (dNTP), 0.2 µM of each PCR primer and 25 ng of sugarcane genomic DNA as template. Thermocycling was carried out according to the following temperatures and times: 94 °C for 15 min; 94 °C for 20 s, 56 °C for 30 s, 72 °C for 1 min for 45 cycles; followed by 72 °C for 3 min and 4 °C hold. Test PCRs were carried out for each primer pair to determine the success of amplification and the length of the final product. Owing to the maximum amplicon length of 500 bp for emPCR during 454 sequencing, the maximum length for PCR products was 460 bp (to allow for the ligation of 20 bp 454 emPCR/sequencing adapters at each end). Some products which exceeded this length because of the presence of unknown intronic sequences were excluded. Successful amplification was determined by agarose gel electrophoresis of the amplified products, with poor amplifications subsequently excluded. A subset of 307 PCR products was pooled for 454 sequencing.
A list of the primers used for amplification, alongside the gene index identifier from TIGR, is shown in Table S1 (see Supporting information).
Amplifications were carried out in 96-well microtitre plates, and double-stranded products were quantified using PicoGreen on a fluorometer. PCR products were pooled to obtain an approximate equimolar concentration of each product based on the length and observed PicoGreen estimates of concentration. Pooled products were then purified using magnetic beads from an AMPure® PCR purification kit (Agencourt® Bioscience Corporation, Beverly, MA, USA). A sample of the purified product was electrophoresed on a Bioanalyser (Agilent, Santa Clara, CA, USA) DNA series II LabChip to determine the purity, and quantified using a Nanodrop spectrophotometer (NanoDrop Products, Wilmington, DE, USA). The pooled, cleaned and dried samples were sent to 454 Life Sciences (Branford, CT, USA) for 454 pyrosequencing, where sequencing/emPCR adapter primers were ligated to each end of the products to prepare them for emPCR and the following steps of 454 sequencing (Margulies et al., 2005). Products from each parent were sequenced on the Genome Sequencer FLX (Roche Diagnostics, Indianapolis, IN, USA) with a single region of a four-region gasket used for each of the two parents.
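The equimolar pooling step can be illustrated with a short calculation. The sketch assumes the commonly used approximation of 660 g/mol per base pair of double-stranded DNA; the function names, the target amount and the example concentrations are hypothetical and are not values from this study.

```python
AVG_BP_MASS = 660.0  # g/mol per base pair of double-stranded DNA (approximation)


def fmol_per_ul(conc_ng_per_ul: float, length_bp: int) -> float:
    """Convert a mass concentration (ng/µL) to a molar concentration (fmol/µL)."""
    return conc_ng_per_ul * 1e6 / (AVG_BP_MASS * length_bp)


def pooling_volumes(products: dict, target_fmol: float) -> dict:
    """Volume (µL) of each product that contributes target_fmol to the pool.

    products maps a product name to (concentration in ng/µL, length in bp).
    """
    return {name: target_fmol / fmol_per_ul(conc, length)
            for name, (conc, length) in products.items()}


# Hypothetical PicoGreen readings for three amplicons
products = {
    "amplicon_A": (6.6, 250),   # 40 fmol/µL
    "amplicon_B": (13.2, 250),  # twice as concentrated, same length
    "amplicon_C": (6.6, 500),   # same mass concentration, twice the length
}
volumes = pooling_volumes(products, target_fmol=100.0)
```

In this example the longer product needs twice the volume of a product of half its length at the same mass concentration, which is why both length and concentration enter the calculation.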
Fasta-formatted sequences and quality scores were returned from 454 Life Sciences for analysis, together with binned sequences based on matches to PCR primers in either the forward or reverse orientation. For an analysis which included the forward and reverse sequences for each product together, the forward and reverse sequence files were concatenated using the ‘cat’ command in Linux. To retrieve quality scores and sequences in the same order for analysis, the binned sequence files were converted to lists on a Linux system using the ‘sed’ command, and the lists were used to extract corresponding sequence and quality score files using the ‘gawk’ command.
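The binning and extraction steps can be sketched in a few lines. This is an illustration only, not the procedure used by 454 Life Sciences or the shell commands used in this study: it assumes that a read is assigned to a primer when its 5′ end matches the primer exactly, and all names, primers and reads below are hypothetical.

```python
def parse_fasta(text: str, joiner: str = "") -> dict:
    """Parse FASTA-style text into {read_id: record}, preserving input order."""
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].split()[0]
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {rid: joiner.join(parts) for rid, parts in records.items()}


def bin_by_primer(seqs: dict, primers: dict) -> dict:
    """Assign each read to the first primer that its 5' end matches exactly."""
    bins = {name: [] for name in primers}
    for rid, seq in seqs.items():
        for name, primer in primers.items():
            if seq.upper().startswith(primer.upper()):
                bins[name].append(rid)
                break
    return bins


def extract_in_order(read_ids: list, seqs: dict, quals: dict) -> list:
    """Return (id, sequence, quality scores) tuples in the order of read_ids."""
    return [(rid, seqs[rid], [int(q) for q in quals[rid].split()])
            for rid in read_ids]


# Hypothetical reads, quality scores and primers for one amplicon
seq_text = ">r1\nACGTTTGA\n>r2\nGGCCAATT\n>r3\nACGTAAAA\n"
qual_text = (">r1\n30 30 30 30 20 20 20 20\n"
             ">r2\n25 25 25 25 25 25 25 25\n"
             ">r3\n40 40 40 40 10 10 10 10\n")
primers = {"amp1_F": "ACGT", "amp1_R": "GGCC"}

seqs = parse_fasta(seq_text)
quals = parse_fasta(qual_text, joiner=" ")
bins = bin_by_primer(seqs, primers)
# Forward and reverse bins joined, analogous to concatenating the two files
amp1_reads = bins["amp1_F"] + bins["amp1_R"]
paired = extract_in_order(amp1_reads, seqs, quals)
```

Keeping a single ordered list of read identifiers and using it to pull both sequences and quality scores ensures that the two outputs stay in the same order.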
The forward and reverse sequences for each product were used as input for the alignment program cap3 (Huang and Madan, 1999). The options employed for cap3 were ‘-p 80 -r 1 -o 40’; otherwise, default values were used. cap3 contigs were employed as input for the program cross_match (P. Green, University of Washington; http://www.phrap.org/), together with the sequence files and the following options, ‘-discrep_lists -tags -masklevel 5’, as recommended for the use of the output for PolyBayes 3.0 (Marth et al., 1999). The pair-wise alignment output from cross_match was used as input for the multiple alignment and SNP identification of PolyBayes. The standard PolyBayes output was redirected to text files, and alignment files were saved for reference.
The PolyBayes standard output was parsed using an in-house executable program (available from the authors). This parses the SNP summary information into comma-separated values (CSV) format for input into Microsoft Excel. Owing to the large number of sequence errors that are caused by variable base calling of homopolymeric regions, indels were excluded from the analysis with PolyBayes (Barbazuk et al., 2007). To identify genuine SNPs as opposed to sequencing errors, all SNPs that were based on a single sequence were excluded and, more broadly, any SNPs present in fewer than 4% of sequence reads.
In a process to enable SNP and indel information to be output more directly for use by the Sequenom assay design software and to enable manual screening, cap3 alignments obtained as above, but from only a single orientation (forward sequences separate from reverse), were used. The padded sequences from cap3 ace format files were parsed to enable input into Microsoft Excel with a base for each cell. Macros and formulas were developed in Microsoft Excel to identify SNPs and indels, to calculate minor allele frequencies and to write consensus sequences with SNP/indels encoded as required for input into Sequenom Assay Design software. The resulting summary for each parent for an amplicon could then be compared to determine SNPs unique to each parent and to ensure that selected SNPs had an allele that was infrequent in one parent and absent in the other. SNPs with one allele at low frequency in the 454 sequences were preferentially targeted for assay development where present. As for PolyBayes analysis, SNPs or indels that were present in only a single sequence were excluded, as were indels based on homopolymers. Polymorphisms that were present in fewer than 4% of sequences were also excluded.
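The column-wise logic of this step can be sketched outside a spreadsheet. The code below is a minimal illustration, not the Excel macros used in this study: it assumes an alignment supplied as equal-length padded strings in which any character other than A, C, G or T is treated as a gap, it handles only base substitutions, and the bracketed `[major/minor]` notation in the consensus is an assumed stand-in for the encoding expected by the assay design software.

```python
from collections import Counter


def call_snps(alignment: list, min_freq: float = 0.04, min_minor_count: int = 2):
    """Scan a padded alignment column by column and flag candidate SNPs.

    Returns (consensus, snps), where snps holds tuples of
    (column index, major allele, minor allele, minor allele frequency).
    """
    consensus = []
    snps = []
    for col in range(len(alignment[0])):
        counts = Counter(seq[col].upper() for seq in alignment
                         if seq[col].upper() in ("A", "C", "G", "T"))
        depth = sum(counts.values())
        if depth == 0:
            continue  # column contains only gaps or padding
        ranked = counts.most_common()
        major = ranked[0][0]
        if len(ranked) > 1:
            minor, minor_n = ranked[1]
            # Exclude variants seen in a single read or below the frequency cutoff
            if minor_n >= min_minor_count and minor_n >= min_freq * depth:
                snps.append((col, major, minor, minor_n / depth))
                consensus.append(f"[{major}/{minor}]")
                continue
        consensus.append(major)
    return "".join(consensus), snps


# Hypothetical alignment: 8 reads of one haplotype, 2 of a second,
# and 1 read with a variant seen only once (treated as a likely error)
reads = ["ACGTA"] * 8 + ["ACATA"] * 2 + ["ACGTC"]
consensus, snps = call_snps(reads)
```

In this example only the third column is flagged; the variant in the last column is carried by a single read and is therefore excluded, as described above.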
Assays for SNP genotyping were designed for a subset of SNPs detected from 454 sequencing. SNP assays were designed for genotyping on the Sequenom® MassARRAY® system (MALDI-TOF mass spectrometer) using Assay Design 3.1 software from Sequenom for the design of capture (PCR) primers which flank the SNP site and extension primers which bind immediately adjacent to the SNP site. PCR reactions (5 µL volume) were carried out in 384-well plates as per the protocols in the Sequenom iPLEX® Application Guide (Version 1, Sequenom, San Diego, CA) using Platinum Taq DNA polymerase (Invitrogen). Extension primers were extended by enzymatic addition of a terminator nucleotide dependent on the template sequence using the iPLEX® Gold Reagent Kit (Sequenom) and the analytes detected by the MALDI-TOF mass spectrometer for SNP genotyping. Those SNPs which were deemed to be potential single dose and inherited from one parent, based on sequence frequencies (or the best candidate derived from an EST cluster), were assayed on the DNA of the PJ2 parents (IJ76-514 and Q165) and 21 progeny using the Sequenom system.
This work is part of a project funded by the Co-operative Research Centre for Sugar Industry Innovation through Biotechnology (CRC-SIIB), University of Queensland, St Lucia, Qld 4072, Australia.