This is the first report of the combined use of massively parallel sequencing and long-range PCR to discover new SNP variants in a specific target group of genes (starch-related genes) in rice as a model plant.
High-throughput sequencing of pooled DNA was applied to polymorphism discovery in candidate genes involved in starch synthesis. This approach employed semi- to long-range PCR (LR-PCR) followed by next-generation sequencing technology. A total of 17 rice starch synthesis genes encoding seven classes of enzymes, namely ADP-glucose pyrophosphorylase (AGPase), granule-bound starch synthase (GBSS), soluble starch synthase (SS), starch branching enzyme (BE), starch debranching enzyme (DBE), starch phosphorylase (SPHOL) and phosphate translocator (GPT1), were PCR amplified from 233 genotypes using semi- to long-range PCR. The amplification products were equimolarly pooled and sequenced using massively parallel sequencing technology (MPS). By detecting single nucleotide polymorphisms (SNPs)/Indels in both coding and noncoding areas of the genes, we identified genetic differences and characterized the SNP/Indel variation and distribution patterns among individual starch candidate genes. Approximately 60.9 million reads were generated, of which 54.8 million (90%) mapped to the reference sequences. The average coverage ranged from 12 708 times for SSIIa to 38 300 times for SSIIIb. SNPs and single/multiple-base Indels were analysed in a total assembled length of 116 403 bp. In total, 501 SNPs and 113 Indels were detected across the 17 starch-related loci. The ratio of nonsynonymous to synonymous SNPs (Ka/Ks) indicated GBSSI and isoamylase 1 (ISA1) as the least diversified (most purified) and conservative genes, as the studied populations have been through cycles of selection. This report demonstrates a useful strategy for screening germplasm by MPS to discover variants in a specific target group of genes.
The capacity to perform massively parallel sequencing to simultaneously assay millions of single nucleotide polymorphisms (SNPs) has made genome-wide studies possible (Schuster, 2008). The use of next-generation sequencing (Thomas et al., 2006) platforms for population-based sequencing of targeted genomic regions enables the discovery of new variants and their frequencies across the selected important genes (Harismendy and Frazer, 2009). These new technologies have the power to detect new mutations or allow identification of errors in previously published reference sequences (Bentley, 2006) or SNP databases (Velicer et al., 2006). The massively parallel sequencing technology (genome analyser) is a groundbreaking, flexible and high-throughput platform for genetic analysis and functional genomics which is based on ultra deep sequencing of short read lengths and a huge number of sequencing reactions (Imelfort et al., 2009). This platform utilizes a sequencing-by-synthesis approach in which all four nucleotides are added simultaneously followed by an optic imaging procedure which occurs at each base incorporation step (Mardis, 2008b) and has widely been used by researchers to discover SNPs associated with human genetic diseases and particularly cancer studies (Bentley, 2006; Mardis, 2008a). This platform can be utilized in different ways, from whole genome sequencing (WGS) of plants and animals to specific genomic regions or even functional encoding genes or loci (Bentley et al., 2008; Hillier et al., 2008; Kim et al., 2009).
Massively parallel sequencing (MPS) is an attractive cost-efficient technology that enables characterization of genetic traits on an unprecedented scale, in terms of the number of genes, number of samples and allele frequency, which is necessary if variants and rare alleles in selected genes are to be found (Kaiser, 2008; Pettersson et al., 2009).
Recently, targeted MPS has been effectively integrated with long-range PCR (LR-PCR) of pooled DNA samples to minimize the cost of sequencing, amplification, oligonucleotides and labour (Out et al., 2009). This LR-PCR targeted MPS can be employed to deeply sequence regions surrounding candidate genes containing SNPs/Indels (Varley and Mitra, 2008). Utilizing this approach, the full extent of allelic variation in a vast number of encoding genes involved in various aspects of physiology, disease, etc. can be recovered and large linkage disequilibrium regions (approximately 5–11 kb) identified (Bodmer and Bonilla, 2008).
One of the major advantages of this approach is the capacity of MPS and targeted gene amplification to provide a high sequence depth in all studied loci simultaneously, which can reliably report the existence, frequency and nature of SNP/Indels. For example, a total sequence yield of 1 Gb means a fragment of 10 kb will be read approximately 100 000 times (Out et al., 2009), which meets the requirements for discovery of rare alleles (Druley et al., 2009; Ingman and Gyllensten, 2008; Thomas et al., 2006). The flexibility of the platform is extended when multiple genomic regions of numerous individuals from wild or segregating populations are pooled. The application of next-generation sequencing (NGS) to plant and animal breeding research programmes can be very powerful as we move towards the full characterization of candidate genes associated with critical plant traits and their variation at the molecular level.
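The depth figure quoted above is a simple ratio of sequence yield to target length. A minimal Python sketch of that arithmetic follows; the function name is illustrative, and it assumes every sequenced base falls evenly on the target, which real amplicon pools only approximate.

```python
def expected_depth(total_yield_bp: int, target_length_bp: int) -> float:
    """Mean number of times each target base is read, assuming all
    sequenced bases are spread evenly over the target (an idealization)."""
    if target_length_bp <= 0:
        raise ValueError("target length must be positive")
    return total_yield_bp / target_length_bp


# 1 Gb of sequence over a single 10 kb fragment
depth = expected_depth(1_000_000_000, 10_000)
print(f"{depth:,.0f}x")
```

When several amplicons share one run, the same yield is divided over their combined length, so per-locus depth falls accordingly.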
Rice (Oryza sativa L.) starch, a complex carbohydrate, is one of the most important crop products for humankind (Fitzgerald, 2004). Starch is synthesized by the activity of several enzymes and has been subjected to extensive studies (Morell et al., 2003).
Each of the enzymes exists as a number of different isoforms and is usually classified into one of several specific groups of genes, such as ADP-glucose pyrophosphorylase (AGPase and GPT1), starch synthase (SS), starch branching enzyme (BE), starch debranching enzyme (DBE) and starch phosphorylase (PHO) (James et al., 2003).
In this study, we equimolarly pooled DNA of 233 individuals from a breeding population and amplified 17 rice starch quality-related genes, encoding seven classes of starch enzymes that are part of the starch biosynthesis pathway, by an LR-PCR or semi-LR-PCR (SLR-PCR) protocol. The pooled, targeted amplification products were subsequently sequenced using MPS technology (Illumina Inc., San Diego, CA). By detecting SNP/Indels in both coding and noncoding areas of the genes, we were able to identify genetic differences and characterize SNP/Indel variation and its distribution pattern among individuals. We have demonstrated the potential to find genetic variation in a specific group of genes in rice germplasm by sequencing long amplicons from large pools of DNA.
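Equimolar pooling means each sample contributes the same number of molecules rather than the same mass, so longer or more dilute samples need larger volumes. The following Python sketch illustrates the calculation for amplicons of different lengths. It is not the authors' protocol: the amplicon names, concentrations and target amount are hypothetical, and the mass of 660 g/mol per base pair is a commonly used approximation for double-stranded DNA.

```python
AVG_BP_MASS = 660.0  # g/mol per base pair; common approximation for dsDNA


def fmol_per_ul(conc_ng_per_ul: float, length_bp: int) -> float:
    """Molar concentration (fmol/ul) of a dsDNA fragment."""
    return conc_ng_per_ul * 1e6 / (AVG_BP_MASS * length_bp)


def pooling_volumes(amplicons: dict, target_fmol: float) -> dict:
    """Volume (ul) of each amplicon that contributes target_fmol molecules.

    amplicons maps a name to (concentration in ng/ul, length in bp).
    """
    return {
        name: target_fmol / fmol_per_ul(conc, length)
        for name, (conc, length) in amplicons.items()
    }


# Hypothetical amplicons at the same mass concentration but different lengths
example = {"amplicon_A": (50.0, 5_000), "amplicon_B": (50.0, 10_000)}
volumes = pooling_volumes(example, target_fmol=100.0)
for name, vol in volumes.items():
    print(f"{name}: {vol:.1f} ul")
```

At equal mass concentration, the amplicon that is twice as long needs twice the volume to supply the same number of molecules.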
Number of reads and average coverage
Sequencing of LR-PCR products of all 17 studied loci generated approximately 60.9 million reads, of which 54.8 million (90%) mapped to the reference sequences. Table 1 shows the summary statistics of the mapping report. The average coverage differed among loci and ranged from 12 708 to 38 300 times for SSIIa and SSIIIb, respectively (Figure 1). We presumed that this significant difference may be related to a set of different factors such as concentration of amplicons and PCR efficiency, number of nonspecific products and even contamination with external PCR products. For example, LR-PCR products of the SSIIa gene revealed a number of nonspecific bands on agarose gel which led to higher unmapped reads and lower coverage. The highest and lowest number of reads was counted for SSIIIa and SSIIa, with 5 920 785 and 876 986 reads, respectively.
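The mapping percentage can be reproduced from the read counts listed in Table 1. The short Python sketch below does so; the variable names are illustrative.

```python
# Read counts as listed in Table 1
total_reads = 60_985_472
mapped_reads = 54_813_065

unmapped_reads = total_reads - mapped_reads
mapped_fraction = mapped_reads / total_reads

print(f"unmapped reads: {unmapped_reads:,}")
print(f"mapped: {mapped_fraction:.1%}")
```

The unmapped count obtained this way agrees with the 6 172 407 reads listed in Table 1.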
Table 1. Summary statistics of the mapping report generated by Illumina Genome Analyser sequencing
Number of reads (count)
Average length of reads (bp)
*Reference sequence was taken from NCBI database.
60 985 472
3 927 761 087
Mapped to reference
54 813 065
3 549 512 020
6 172 407
378 249 067
42 720 746
Polymorphism discovery and SNP/Indel detection
In this study, we successfully sequenced all studied loci in starch quality genes of 233 breeding lines, covering different chromosomes, to great depth and with high-confidence coverage.
SNPs and single/multiple-base Indels were discovered in a total length of 116 403 bp assembled by Genome Analyser (GA). In total, 501 SNPs and 113 Indels were detected across the 17 starch-related loci (Table S1). The total number of polymorphisms was then compared to SNPs available at the OryzaSNP MSU database (http://oryzasnp.plantbiology.msu.edu/) (Table 2). A total of 399 SNPs for the targeted loci have already been reported in this database for 20 rice cultivars. As expected, the total number of polymorphisms in this experiment was higher than that reported in the rice database. In addition, the confidence score was higher as a result of the very high read coverage. On average, the SNP rate was 4.31 SNPs/kb and the Indel rate 0.97 Indels/kb. Previous data have reported an average rate of one SNP every 170 bp and one Indel every 540 bp in the genome-wide approach (Goff et al., 2002; Yu et al., 2002). Our data indicate one SNP every 232 bp and one Indel every 1030 bp within this set of germplasm and for these candidate genes. Although the average rate of SNPs is gene specific and related to the species and structure of the studied population, our results are similar to previous reports (Nasu et al., 2002; Yu et al., 2002). Out of 501 identified SNPs, 75 (approximately 14.9%) caused an amino acid change, making them potentially functional. All Indels resided in the intronic regions and were thus not responsible for any stop codons, frameshift mutations or amino acid changes. The Indel rate was slightly higher than previously reported (Goff et al., 2002; Yu et al., 2002), which may have been attributable to the lower stringency criteria for mapping the short reads in CLC Workbench (software package applied for assembly of short reads and SNP detection provided by Illumina), with Min Cost 2, Min Insert 2 and Similarity 0.7. The largest and smallest Indels were 8 and 1 bp, respectively.
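The density figures in this paragraph follow from the pooled totals. A short Python sketch of the arithmetic is given below; the helper names are illustrative, and the calculation uses totals across all 17 loci rather than per-gene averages.

```python
ASSEMBLED_LENGTH_BP = 116_403  # total assembled length across the 17 loci
N_SNPS = 501
N_INDELS = 113
N_NONSYNONYMOUS = 75


def per_kb(count: int, length_bp: int) -> float:
    """Variants per 1000 bp."""
    return count / length_bp * 1000


def bp_per_variant(count: int, length_bp: int) -> float:
    """Average spacing between variants in bp."""
    return length_bp / count


snp_per_kb = per_kb(N_SNPS, ASSEMBLED_LENGTH_BP)
indel_per_kb = per_kb(N_INDELS, ASSEMBLED_LENGTH_BP)
bp_per_snp = bp_per_variant(N_SNPS, ASSEMBLED_LENGTH_BP)
bp_per_indel = bp_per_variant(N_INDELS, ASSEMBLED_LENGTH_BP)
nonsyn_percent = N_NONSYNONYMOUS / N_SNPS * 100

print(f"SNPs/kb: {snp_per_kb:.2f}, one SNP every {bp_per_snp:.0f} bp")
print(f"Indels/kb: {indel_per_kb:.2f}, one Indel every {bp_per_indel:.0f} bp")
print(f"nonsynonymous share of SNPs: {nonsyn_percent:.1f}%")
```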
Table 2. Total polymorphism detected across the 17 starch quality-related genes
NC-number Genbank no.
Gene ID in NCBI
Gene ID in OryzaSNP@MSU database
Average coverage times
Length assembled by Illumina (bp)
Number of variants Detected by GA
Number of SNPs in OryzaSNP@MSU database
Number of functional (amino acid changes)
Perlegen + machine learning†
*The lower stringency used for Indels was as follows: Min Cost 2; Min Insert 2; Similarity 0.7.
†SNPs in the OryzaSNP@MSU database were detected and analysed using the Perlegen model-based method as well as a machine learning method. In total, over 158 000 high-quality SNPs have been identified in the rice genome by these two technologies.
SNP variation across the starch-related candidate loci
To evaluate the capacity of massively parallel sequencing to detect new variants in pools of starch-synthesizing enzyme genes, we conducted a comprehensive experiment on the Illumina GA platform covering 17 different rice starch-related genes. Table 2 summarizes the newly discovered variation in the studied genes. Seven classes of starch-related enzymes which affect starch structure and quality, namely ADP-glucose pyrophosphorylase (AGPase), granule bound starch synthase (GBSS), SS, BE, DBE, PHO and glucose phosphate translocator (GPT), were pool sequenced. The details of each gene member and its detected polymorphism are as follows:
The AGPase enzymes/genes reside at the top of the starch biosynthesis pathway and are regarded as the starting point of grain starch production. Glucose is first activated by the addition of ADP by AGPase and then becomes the substrate for starch synthase enzymes. There are several genes/isozymes in this class, but AGPS2b has the highest expression level in rice endosperm (Hirose et al., 2006).
AGPS2b (small subunit) The role of this subunit in starch granule synthesis has been identified by way of its association with rice shrunken mutants (Kawagoe et al., 2005). A dramatic inhibition of starch synthesis has been observed in AGPase-deficient rice mutants and some other species and results in increased soluble sugars, a large number of underdeveloped granules, small grains and pleomorphic amyloplasts (Rolletschek et al., 2002).
In total, 30 SNPs and four Indels were found across the population for this gene. None of them caused an amino acid change, suggesting this gene has little impact on starch quality in this population. However, the number of SNPs found was significantly higher than those previously reported in rice databases (Table 2).
SPHOL (alpha 1,4 glucan starch phosphorylase) This gene is generally considered to be involved in starch degradation, but recent studies suggest some important roles in starch biosynthesis. Although its precise mechanism and influence are still not well known, the mechanism appears to be associated with phosphorylation of some starch-related enzymes and proteins such as starch branching enzymes (SBEs) and starch synthase (SSIIa) (Tetlow et al., 2004). In total, five nonfunctional SNPs and seven nonfunctional Indels were found in this gene. The SNP rate was lower than that reported in databases (Tables 2 and 3).
Table 3. The polymorphism analysis of starch-related candidate genes in rice
Gene coordinates in genbank
Length assembled by Illumina (bp)
Total polymorphism rate
Nonsynonymous SNP rate
SNP per Kb
Indel per Kb
Ka/Ks ratio: The proportion of nonsynonymous (Ka) relative to synonymous (Ks) substitutions can reveal whether a gene has been under purifying, neutral or diversifying selection. The data for calculation of Ka/Ks (numbers of Ka and Ks) can be found in columns R and S of Table S1. Column R shows the SNPs located in the coding region (marked as CDS or mRNA) and column S shows the number of nsSNPs. The total polymorphism rate was calculated as TSI/TL × 100, where TSI is the total number of SNPs and Indels and TL is the total length of each candidate gene. The functional or nonsynonymous SNP rate was calculated as NS/TL × 1000, where NS is the number of nonsynonymous SNPs in each locus and TL is the total length of each candidate gene.
ADP-glucose pyrophosphorylase (small unit)
15 760 599–15 754 206
alpha 1,4-glucan phosphorylase
32 183 093–32 190 581
5 138 640–5 142 712
Granule-bound starch synthase I (Waxy gene)
1 764 623–1 769 657
GBSSII (expressed in leaves)
Granule-bound starch synthase II
13 584 483–13 576 435
Starch synthase I
3 078 060–3 085 809
Starch synthase IIa
6 747 562–6 751 981
SSIIb (expressed in leaves)
Starch synthase IIb
32 125 071–32 119 749
Starch synthase IIIa
5 351 108–5 362 370
Starch synthase IIIb
32 149 493–32 158 120
Starch synthase IVa
31 786 842–31 797 321
Branching enzyme I
31 775 431–31 782 688
Branching enzyme IIa
20 260 837–20 265 349
BEIIb (Amylose extender)
Branching enzyme IIb
20 213 965–20 224 864
Debranching enzyme-isoamylase 1
25 981 756–25 988 347
Debranching enzyme-isoamylase 2
23 596–25 998
4 399 980–4 410 318
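The rate definitions in the Table 3 footnote can be sketched in a few lines of Python. The function names and the `scale` parameter are illustrative and not part of the original analysis. The gene length is taken here as the inclusive span of the BEIIb coordinates in Table 3 (20 213 965–20 224 864), which is an assumption about how TL was measured. With the 53 SNPs and 17 Indels reported for BEIIb, a factor of 1000 reproduces the value of 6.422 quoted in the text for this gene, whereas the factor of 100 stated in the footnote gives 0.642, so the scale is left as an explicit parameter.

```python
def polymorphism_rate(n_snps: int, n_indels: int, length_bp: int,
                      scale: int = 100) -> float:
    """Total polymorphism rate, TSI/TL x scale (the footnote states scale=100)."""
    if length_bp <= 0:
        raise ValueError("gene length must be positive")
    return (n_snps + n_indels) / length_bp * scale


def nonsynonymous_rate(n_nonsyn: int, length_bp: int) -> float:
    """Nonsynonymous SNP rate, NS/TL x 1000."""
    if length_bp <= 0:
        raise ValueError("gene length must be positive")
    return n_nonsyn / length_bp * 1000


# BEIIb: inclusive span of the Table 3 coordinates (assumed to equal TL)
beiib_length = 20_224_864 - 20_213_965 + 1

print(beiib_length)
print(round(polymorphism_rate(53, 17, beiib_length), 3))
print(round(polymorphism_rate(53, 17, beiib_length, scale=1000), 3))
print(round(nonsynonymous_rate(3, beiib_length), 3))
```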
Glucose-6-phosphate translocator (GPT1) Glucose-6-phosphate translocator is strongly expressed in the endosperm. This gene is believed to be responsible for the import of essential carbon substrates such as Glc6P into the plastids during grain development (Fischer and Weber, 2002; Jiang et al., 2003). We found 16 SNPs, one of which causes an amino acid change, and eight Indels. A ‘C/T’ SNP at reference position 1188 changes an amino acid from Leu to Phe (Leu24Phe). This is a conservative nonpolar amino acid substitution (L→F) and therefore might not significantly alter protein activity. However, this is a new functional SNP which has not previously been reported in databases.
Granule bound starch synthase (GBSS) gene family
This family of genes is responsible for production of the amylose component of starch in plants.
Granule bound starch synthase I (GBSSI) GBSSI, or the waxy gene, is one of the most important genes involved in starch synthesis and influences cereal grain quality, particularly in rice. The major role of GBSSI in determining amylose content is well known, and several SNPs associated with starch quality have been characterized in rice (Chen et al., 2008). Previous studies have shown that three SNPs, one each at the intron/exon 1 boundary, exon 6 and exon 10, have the most significant impact on starch quality (Cai et al., 1998). Only one functional ‘A/C’ SNP was detected, at position 1086 of the reference sequence; it corresponds to the previously reported exon 6 SNP and causes a Tyr→Ser substitution at amino acid position 224. This substitution is nonconservative and changes the polarity of the amino acid, the function of the waxy protein, enzyme activity and amylose content. Larkin and Park (2003) have already reported that this SNP affects amylose content. One nonfunctional Indel was also found in this gene. The sole SNP detected in GBSSI in this population contrasts with the eight SNPs retrieved from the OryzaSNP database, indicating that significant selection pressure has been imposed on this locus in this population during the course of breeding. A multiplex SNP verification experiment was conducted to validate the data (Masouleh et al., 2009). The results showed that only this SNP, at very low frequency, exists in this population. The breeding selection criteria applied to this population have restricted the polymorphism of the GBSSI gene. Our Ka/Ks data also suggest that GBSSI is a gene under purifying selection in this population. A high Ka/Ks ratio of 2.00 was calculated for this gene (Table 3).
Granule bound starch synthase II (GBSSII) This gene/enzyme is predominantly expressed in leaf, leaf sheaths, culm, and pericarp tissue at a low level, particularly during preheading and 1–3 days after flowering (Ohdan et al., 2005). The impact of GBSSII on elongation of amylose in nonstorage tissues of cereals has been confirmed (Vrinten and Nakamura, 2000). GBSSII is found exclusively bound to starch granules in green tissues and synthesizes amylose which is subsequently consumed by the plant or accumulated in the endosperm (Dian et al., 2003).
We found four SNPs and eight Indels. One of the SNPs occurred at coordinate 1638 of the reference sequence and altered Leu to Ser at position 523 of GBSSII. This A/G SNP changes the polarity of the amino acid and hence may affect the activity and function of the protein. All Indels were detected in introns.
Starch synthase (SS) family
This family is primarily involved in the production of amylopectin component of starch in plants.
SSI This protein is presumed to be expressed in the endosperm and leaf of rice (Fujita et al., 2006). The transcript level of SSI is higher in endosperm than leaf sheaths and blades and has therefore been classified as an endosperm and nonendosperm expressing gene (Hirose et al., 2006). The measurement of SSI transcript levels at different seed developmental stages found high expression at 1–3 DAF, peaking at 5 DAF, and remaining almost constant during starch synthesis in endosperm, suggesting SSI is the major SS form in cereals (Cao et al., 1999).
A comprehensive analysis of mutant rice with a retrotransposon inserted into the SSI-encoding gene revealed SSI has a capacity for the synthesis of chains with DP8–12 with the extension of smaller chains (Nakamura, 2002). This gene has a very small phenotypic effect on rice eating quality although a significant negative correlation between the ratio of short chains (DP 6–12) and gelatinization temperature has been reported (Umemoto et al., 2008).
There were 73 SNPs and eight Indels detected in this gene. No functional SNP/Indels were found, in comparison with two amino acid changes that have been reported in the OryzaSNP database.
SSIIa SSIIa is known to have a major effect on starch quality. This gene is expressed in the endosperm at very high levels and presumably affects the amylopectin structure of starch (Craig et al., 1998; Morell et al., 2003). The effect of this gene on cooking quality and starch texture has clearly been shown (Umemoto et al., 2004, 2008). The gelatinization temperature (GT), alkali disintegration and eating quality of rice starch have been explained by polymorphism at two SNPs, [A/G] and [GC/TT], within exon 8 of the alk locus (Umemoto and Aoki, 2005; Waters et al., 2006). In total, 31 SNPs and one Indel were detected in this gene, considerably more than the 12 SNPs reported in the OryzaSNP database. Surprisingly, 22 out of 31 SNPs were functional and introduced an amino acid change as determined by CLC Workbench. SNP distribution analysis revealed that 80% of these low-frequency SNPs (25) were located at the beginning of the reference sequence, between coordinates 13 and 553, bringing about 17 amino acid changes. This high SNP rate may be associated with inefficient PCR and consequent low coverage (45–129 times) of this GC-rich region (Table S1). Re-sequencing verified only four SNPs between coordinates 13 and 553, with a minimum frequency of 8%–10% for the minor allele. Taking the high false positive rate into account, a total of nine SNPs and one Indel (six amino acid changes) were identified in this gene (Table S1). Three single-nucleotide triallelic SNPs [G/T/A] and a G/T SNP, at positions 72, 77, 81 and 87 of the reference sequence, respectively, are new polymorphisms. Of the three triallelic SNPs, one ‘G/T/A’ SNP is presumed to cause the most critical amino acid substitution, Arg26Met/Lys, which induces a polar to nonpolar alteration in the protein.
SSIIb SSIIb is believed to be a low-level, early expressed gene which is primarily expressed in sink and source leaf blades and sheaths (leaf specific) at an early stage of grain filling (Hirose and Terao, 2004). However, a recent study presented evidence that it contributes, together with six other starch genes, to altering some Rapid Visco Analyser (RVA) parameters in glutinous rice (Yan et al., 2010). We found only three SNPs and two Indels, all of which were nonfunctional, suggesting that this gene does not contribute to phenotypic variation in this population.
SSIIIa The highest rate of polymorphism in terms of amino acid changes was observed in this gene. In total, 83 SNPs (including 23 nonsynonymous and nine synonymous substitutions) and 15 Indels were detected, indicating this is the most diverse gene in our population. Previous findings have detected 52 SNPs. Of the 23 nonsynonymous substitutions, 10 are amino acid changes which alter polarity and may produce significant changes in the protein structure (Table S1). The SSIIIa-encoding gene is highly expressed in the endosperm, although some reports have revealed its expression in green tissues (Dian et al., 2005). A recent study of amylopectin chain length in an SSIIIa-deficient mutant suggests SSIIIa plays an important role in the elongation of amylopectin B2 to B4 chains. Furthermore, in these mutants, the amylose content and the extra-long chains of amylopectin increased by 1.3- and 12-fold, respectively, because of an increase in GBSSI activity (Fujita et al., 2007). Conversely, no functional effect of SSIIIa differentiation was observed on RVA parameters, at least between glutinous cultivars (Yan et al., 2010).
SSIIIb SSIIIb is mainly expressed in the endosperm, but transient expression in leaf sheaths and leaves has also been reported (Hirose et al., 2006). This might be attributed to the existence of two divergent groups of SSIIIb in rice that are expressed in different tissues (Dian et al., 2005). It has also been classified into two different categories on the basis of timing of expression in the developing seed. In the late expression category, the gene is expressed in the mid-to-later stage of grain filling (Hirose and Terao, 2004), and in the early expression category the transcript level usually increases to a peak at 3–5 days after flowering (Ohdan et al., 2005). An association study of rice glutinous near-isogenic lines suggested SSIIIb has a significant impact on RVA parameters such as peak time and pasting temperature (Yan et al., 2010).
In total, 26 SNPs and 11 Indels were found in this gene. No functional Indels were detected, and of the seven amino acid changes, three changed the polarity of the amino acid: Thr1176Ala, Glu634Gly and Ser756Ile.
SSIVa SSIVa is one of the least well-known starch genes in plants. Like most starch synthase genes, SSIVa is exclusively involved in amylopectin biosynthesis. Expression analysis with reverse transcription PCR has indicated SSIVa is preferentially expressed in rice endosperm and to a degree in leaf blades as a late or steady expresser gene during grain filling (Hirose and Terao, 2004). QTL mapping and expression profile analysis have shown that high temperature during grain filling can increase the transcription level of SSIVa up to 1.11-fold, which is considerably higher than the other starch synthase genes, with a general expression level range of 0.8–1.2 (Yamakawa et al., 2007), and may contribute to grain chalkiness (Yamakawa et al., 2008). SSIVa may also affect some secondary RVA parameters such as breakdown and setback (Yan et al., 2010). We found 27 SNPs, of which five were nonsynonymous, and six intronic Indels. Only one SNP modified amino acid polarity: a ‘C/T’ SNP at coordinate 4048 of the gene nucleotide sequence induced a Gly708Asp substitution.
Starch branching enzymes (SBEs)
Starch branching enzymes determine the structure of amylopectin by breaking α-(1→4)-linkages in existing chains and attaching the released reducing ends to C6 hydroxyls, forming the elongated and branched glucan, amylopectin (Tetlow et al., 2004). The nucleotide polymorphisms of different isoforms of BEs were studied, and the results are as follows.
BEI BEI is mainly expressed in the endosperm. Biochemical observations with purified BEI from maize endosperm indicate that BEI preferentially branches amylose-type polyglucans and has a high capacity for branching less branched α-glucans (Takeda et al., 1993).
Analysis of the catalytic properties of BEI has indicated the N- and C-termini play a critical role in chain length transfer and substrate preference (Kuriki et al., 1997). BEI transcript levels increase rapidly 3–5 days after flowering. A rice BEI-deficient mutant induced by mutagenesis exhibited modified amylopectin structure and grain morphology but the same quantity of starch as the wild type (Satoh et al., 2003), and the BEI-encoding gene also affects the RVA profile (Yan et al., 2010). The maize sugary gene arises from a mutation in the maize BEI-encoding orthologue (Boyer and Preiss, 1978). In total, 18 SNPs and six Indels were found in this gene. One of the SNPs is a nonsynonymous ‘C/T’ SNP causing a Gly607Asp substitution, which is potentially very important as it changes the polarity of the amino acid.
BEIIa BEIIa is a leaf-expressed gene involved in amylopectin synthesis. BEIIa is also expressed in the endosperm but at levels tenfold lower than in leaf tissue (Gao et al., 1997). Variation in this gene/enzyme may have a significant influence on rice starch properties, considering that BEIIa is preferentially expressed along with at least one important starch synthesis gene expressed in leaf and endosperm (both tissue expressing genes) (Hirose et al., 2006). An association study including the gene and RVA properties demonstrated a low F value (6.60) with a very slight influence on glutinous rice (Yan et al., 2010). The use of BEIIa-specific antibodies has demonstrated that this protein is present in both soluble and granule bound forms in developing wheat endosperm (Rahman et al., 2001). In total, six SNPs were detected, including a nonsynonymous ‘T/G’ SNP which causes a Tyr140Ser substitution, with no polarity alteration. No SNP/Indels have previously been reported for this gene, suggesting BEIIa might be one of the most conservative starch-related genes in rice.
BEIIb A relatively high variation rate of 6.422 was detected for this important gene (Table 3), which is also known as amylose extender (ae) in maize and other cereals (Yun and Matheson, 1993). Many studies have reported the significance of this gene for starch properties in various plant species (Fisher et al., 1993; Sun et al., 1997, 1998). This is a granule- and soluble-associated enzyme which is only expressed in the endosperm. Expression of three different functional maize SBE genes in BE-deficient yeast strains demonstrated that the presence of BEIIb is necessary to activate BEI and BEIIa (Seo et al., 2002). A recent association study has determined a very high F value of 11.12 between BEIIb and RVA properties in rice (Yan et al., 2010). Additionally, a 0.5–0.7-fold decrease in the expression of BEIIb (amylose extender) during grain filling creates chalky rice (Tanaka et al., 2004). We found 53 SNPs, three of which were nonsynonymous, and 17 Indels in amylose extender. No functional polymorphism was recorded in the available databases, but we detected three nonsynonymous SNPs, ‘C/T’ (Val403Ile), ‘C/T’ (His196Arg) and ‘C/A’ (Leu97Val), none of which changed amino acid polarity.
Debranching enzymes (DBEs)
DBEs belong to the α-amylase family, of which two classes exist in plants, isoamylase and pullulanase. These enzymes debranch (hydrolyse) α-(1→6)-linkages in amylopectin and pullulan. Defective DBEs in plants are thought to be responsible for the accumulation of phytoglycogen rather than starch, which in turn changes the phenotypic appearance of the endosperm (Bustos et al., 2004).
Iso1 (ISA1) In wheat, the expression of ISA1 cDNA was highest in the developing endosperm and undetectable in mature grains. This suggests a fundamental biosynthetic role of isoamylase 1 in plant starch, although the precise roles of DBEs are not yet known (Tetlow et al., 2004). The regulation of the ISA1 gene at the transcriptional level during grain filling of rice in response to high temperatures has been reported (Yamakawa et al., 2008). In rice endosperm, antisense inhibition of isoamylase 1 altered the structure of amylopectin and the physicochemical properties of starch (Fujita et al., 2003). The ISA genes are also presumed to contribute to the degree of setback in glutinous rice cultivars (Yan et al., 2010). No functional polymorphism was found in this gene in the studied population. Only nine Indels in the intronic regions were detected. This suggests that this gene has little or no effect on variation in starch properties in this population.
Iso2 (ISA2) The existence of this type of isoamylase was first reported in maize endosperm (Doehlert and Knutson, 1991). It was suggested that two isoamylase isoforms, I and II, exist in maize endosperm and are distinguishable by anion-exchange chromatography. On the basis of enzymic characteristics, the sugary1 (su1) protein corresponds to the isoamylase II form in the maize endosperm (Beatty et al., 1999). The association between ISA2 and rice grain quality is unknown. There is no intron in this relatively small gene (2625 bp); thus, each detected SNP/Indel can be potentially important. The polymorphism rate was high, at about 0.66. There were 16 SNPs, including nine nonsynonymous SNPs, and no Indels in the ISA2 gene. Three of the nonsynonymous SNPs altered the polarity of amino acids as follows: T/C, C/A and T/G at coordinates 960, 1712 and 2067 of the reference sequence, which cause Thr482Ala, Arg231Leu and Thr113Pro substitutions, respectively.
Pullulanase In rice endosperm, a defect in pullulanase-type DBE activity triggers and modulates some phenotypic effects (Nakamura et al., 1998). In maize endosperm, it is believed that pullulanase has a dual role, contributing to both starch synthesis and degradation (Dinges et al., 2003). Kubo et al. (1999) suggest pullulanase plays a predominant and essential role in amylopectin synthesis and compensates for shortages of isoamylase activity in the construction of the multiple cluster structure of amylopectin. A recent association study between pullulanase and RVA profile parameters in glutinous rice has shown strong associations of this gene with peak viscosity, hot paste viscosity, breakdown viscosity and peak time (Yan et al., 2010). The highest polymorphism rate (1.14) was seen in pullulanase, where in total 109 SNPs and ten Indels were detected. This number of SNPs exactly equals the number already reported in the OryzaSNP database. In our population, only one nonsynonymous SNP was detected, at coordinate 2319 of the reference, which replaces Ser with Asn at position 217 of the protein. This alteration might not be very influential as it does not change the polarity of the molecule.
Distribution of SNPs across the loci
Distributions of detected polymorphisms and coverage patterns of short reads across the length of the candidate genes indicated no specific correlation among the 17 studied loci (Figure S1). Although some genes, such as SPHOL and GBSSI, exhibited similar distribution patterns, there was no general association among the patterns of different genes. Based on the distribution patterns, it can be concluded that most of the candidate genes showed a higher polymorphism rate in the central intron/exon regions than at the UTR ends.
Ka/Ks ratio (‘purifying’ versus ‘diversifying’ genes)
The proportion of nonsynonymous (Ka) relative to synonymous (Ks) substitutions can reveal whether a gene has been under purifying, neutral or diversifying selection. We used the Ka/Ks ratio to classify candidate genes into two main categories, ‘purifying’ and ‘diversifying’ genes. Under neutral evolution at the amino acid level, Ka should equal Ks and hence Ka/Ks = 1. Any deviation from this value indicates selection pressure on the genetic structure of the population or candidate genes. A Ka/Ks ratio < 1 indicates negative (purifying) selection, and a Ka/Ks ratio > 1 indicates positive (adaptive) selection (Roth and Liberles, 2006). The Ka/Ks ratios calculated for the candidate genes ranged from 0.11 for SSI to 2.40 for SSIIIa (Table 3). These results indicate that genes such as SSIIIa are under adaptive selection, whereas others such as SSI are still under purifying selection.
We classified GBSSI and GBSSII as highly conservative genes that are passing through an adaptive phase under the artificial selection pressure applied by rice breeders, as they showed a low polymorphism rate and a high Ka/Ks ratio. We suggest that genes with a Ka/Ks ratio close to one, such as AGPS2b, SSIIa and SPHOL, are neutral, which means they have probably not undergone selection pressure.
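The classification rule described above can be sketched in a few lines of Python. This is only an illustration: the function name and the `tolerance` used to decide when a ratio counts as "close to one" are assumptions, since no numerical cut-off for neutrality is stated here.

```python
def classify_selection(ka_ks: float, tolerance: float = 0.1) -> str:
    """Label a gene by its Ka/Ks ratio.

    `tolerance` is an illustrative assumption: ratios within this distance
    of 1 are treated as neutral. The text gives no explicit cut-off.
    """
    if ka_ks < 0:
        raise ValueError("Ka/Ks cannot be negative")
    if abs(ka_ks - 1.0) <= tolerance:
        return "neutral"
    return "purifying" if ka_ks < 1.0 else "diversifying"


# The two extreme ratios reported for the candidate genes (Table 3).
for gene, ratio in {"SSI": 0.11, "SSIIIa": 2.40}.items():
    print(gene, ratio, classify_selection(ratio))
```

With a different tolerance, genes near the boundary would move between the neutral and selected classes, so any cut-off should be reported alongside the classification.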
This MPS analysis of rice starch metabolism candidate genes identified a relatively high SNP/Indel variation at all loci. In total, 501 SNPs and 113 Indels were detected in comparison with 399 SNPs that are already available in the public domain. No Indels are recorded in public databases such as OryzaSNP. Out of 501 SNPs, 75 SNPs (approximately 15.0%) were nonsynonymous, leading to amino acid changes. All Indels resided in the intronic regions and so no obvious functional Indels were found. The highest and lowest polymorphism rates were observed in pullulanase (11.443) and GBSSI (0.574), respectively (Table 3).
Massively parallel sequencing applied with LR-PCR ensured high sequence depth in terms of the number of candidate genes and number of samples at all studied loci (Pettersson et al., 2009). Among the numerous elements involved in the MPS, amplification efficiency and pooling strategy are the most important parameters.
The error rate of BioRad iProof™ (BioRad Laboratories, Hercules, CA) High-Fidelity DNA polymerase is low, 4.4 × 10⁻⁷, which is approximately 50-fold lower than that of normal polymerases, and its extension is fast, 5–30 s/kb, about four times faster, which shortens the PCR and makes it the enzyme of choice. However, establishment of an efficient long-range PCR for large genomic regions can be costly and time consuming (Ingman and Gyllensten, 2008). To solve this problem, we suggest the semi-long-range PCR (SLR-PCR) strategy, which is generally more robust and saves time and primer cost. An optimal SLR-PCR will increase the performance of MPS and must be established before pooling genomic DNA samples.
The error rate of the Illumina GA is reportedly about 0.5%–1.0%. In our experiment, we pooled 233 DNA samples. This means that a variant carried by only one of the 233 samples will have a frequency of approximately 0.43%, which is lower than the reported error rate.
To avoid this problem and to reduce the risk of false SNP detection, we firstly used an accurate two-step pooling strategy and secondly generated a high coverage depth. These two strategies substantially reduce the effect of the error rate of the Illumina GA platform.
Figure 1 indicates the high coverage for all loci, ranging from approximately 12 000 to 38 000 times. Our raw data show that the maximum coverage in some regions reaches 240 000 times (data not shown here).
This very high coverage depth gives us confidence that the effect of the error rate has been substantially reduced in our experiment.
Out et al. (2009) have clearly discussed the correlations between allele frequencies, pool size, coverage depth and error rates in the Illumina GA. They demonstrated that a coverage depth of 25 000 times would be enough for detection of SNP frequencies on or above 0.3%.
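The arithmetic behind this argument can be reproduced with a short Python sketch. It assumes, for illustration only, that every line contributes equally to the pool and that a line carrying a variant is homozygous for it; the function names are hypothetical. The two coverage values are the lowest and highest average coverages reported in the Abstract.

```python
def singleton_frequency_pct(pool_size: int) -> float:
    """Frequency (%) of a variant carried by exactly one line in the pool.

    Assumes equal contribution of each line and a homozygous carrier.
    """
    return 100.0 / pool_size


def expected_variant_reads(coverage: float, frequency_pct: float) -> float:
    """Expected number of reads supporting a variant at a given depth."""
    return coverage * frequency_pct / 100.0


POOL_SIZE = 233
REPORTED_ERROR_RATE_PCT = (0.5, 1.0)  # range quoted for the Illumina GA

freq = singleton_frequency_pct(POOL_SIZE)
print(f"singleton frequency: {freq:.2f}%")
print("below lowest reported error rate:", freq < REPORTED_ERROR_RATE_PCT[0])

# Average coverage reported for SSIIa (lowest) and SSIIIb (highest).
for gene, coverage in {"SSIIa": 12708, "SSIIIb": 38300}.items():
    reads = expected_variant_reads(coverage, freq)
    print(f"{gene}: about {reads:.0f} reads expected for a singleton variant")
```

The sketch only shows how many supporting reads a singleton variant would be expected to have; it does not model how sequencing errors are distributed among the three alternative bases.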
Currently, there is much interest in applying the Illumina GA platform to targeted sequencing of specific candidate genes, particularly for finding SNPs in a large number of individuals in the targeted populations (Hodges et al., 2007). An incorrect pooling strategy is another important issue that may be encountered in generating and analysing data. DNA samples from different individuals may not be amplified with similar efficiency in PCRs, creating random bias. To address this issue, we used smaller pools of individuals (20 individuals each) and thereby minimized the chance of biased amplification of target regions. Using this strategy, we were able to detect rare SNPs which occurred at a frequency lower than 1%.
The coverage rate is also critical. It is believed that 20-fold sequencing coverage is sufficient for accurate detection of real SNPs (Dohm et al., 2008). An even coverage pattern is also highly desirable. If this were the case, an average coverage of 90 times should have yielded reliable SNP data for the beginning of the SSIIaH1 fragment, but we found that only 18.5% of the SNPs in this region were real. This can be attributed to the difficulties encountered in amplifying this GC-rich region.
The main reason for the different coverage patterns is still unknown (Mardis, 2008b). The highest peak was observed at position 4885 of BEIIa, with 239 019 times coverage. However, the coverage depth only affects the accuracy of SNP frequency and not the number of discovered SNPs (Morozova and Marra, 2008). Ingman and Gyllensten (2008) studied the effect of different pooling strategies and coverage levels to evaluate SNP frequencies of pooled and un-pooled individuals in an approximately 17 kb region, and they found that all SNPs, including low-frequency SNPs (not under 0.4%), can be detected at coverage levels above 500 times. They suggested that in pooled PCR products, 50 times coverage would be sufficient for SNP frequencies on or above 4%. As we obtained a very high level of coverage, we were able to discover very rare SNPs with frequencies lower than 0.5%. Sequencing errors are common in next-generation sequencing (NGS) and are easily confounded with low-frequency SNPs if the minimum number of reads is too low (Futschik and Schlotterer, 2010). The high level of coverage for all candidate genes enabled us to recognize rare SNPs effectively.
It has been demonstrated that allelic variation of amino acids and structure of proteins correlate with the effect of natural selection seen as an excess of rare SNPs which affect actual phenotypes (Sunyaev et al., 2000). The distribution of genetic variation in the 17 starch candidate genes clearly indicates they have been selected for starch properties. The pressure of natural selection can significantly influence the extant pattern of genetic variation (Akey et al., 2002; Barreiro et al., 2008). In our study, the total polymorphism rate and distribution pattern indicate that the candidate genes have been subjected to selection pressure by breeders, as some important genes with high impact on starch properties, such as GBSSI and ISA1, have shown unusually low levels of polymorphism. Artificial selection during the breeding programme has had a major influence on the genetic variation of the population studied. These changes in population structure mainly arise from narrowing of the gene pool and from changes in the balance between genetic drift and population size during the breeding process.
All plant material was supplied by Industry and Investment NSW, Yanco Agricultural research institute, Australia. Two hundred and thirty-three rice lines from a breeding programme were analysed. These lines were significantly diverse in starch quality properties and provided a high rate of variation for starch value traits.
Variability of genotypes
This population comprised a series of lines at the F6 stage, from harvested pedigree rows entering the first stage of plot testing. This was to capture the widest possible variability in the set of lines from a breeding programme focused on temperate (japonica-type) rice. The selection procedure was primarily carried out on plant height, the capacity to flower and set seed in our temperate environment of Australia, and visual inspection of grain size and shape. No selection had taken place for quality traits like gelatinization temperature and RVA curve data. The samples were provided after we had selected and sown the lines based on the previous simple selection criteria (Table S2).
Sample preparation and DNA extraction
Rice seeds of each line were germinated at 25 °C, and 10–20 seedlings from each line were selected for DNA extraction. Total genomic DNA was extracted from 15-day-old seedlings using the DNeasy 96 Plant kit (Qiagen, Valencia, CA) according to the manufacturer’s instructions.
Designation of starch-metabolizing enzymes/genes involved in starch synthesis
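This study focused on the genes encoding seven groups of enzymes, namely ADP-glucose pyrophosphorylase (AGPase), granule bound starch synthase (GBSS), starch synthase (SS), starch branching enzyme (BE), starch debranching enzyme (DBE), starch phosphorylase (SPHOL) and phosphate translocator (GPT1).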
Designing primers to capture target genes
The sequence of each target gene, including exons and introns, was divided into two fragments of approximately equal length. Each selected sequence included 500 bp up- and downstream of the coding regions (5′ and 3′ UTRs) and an approximately 300 bp overlap in the middle. A set of specific primers was designed for each half using Clone Manager V9.1 (Sci-Ed Software, Cary, NC) (Table S3).
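As a rough illustration of this layout, the following Python sketch computes the coordinates of the two overlapping fragments for one gene. It is not the procedure used in Clone Manager; the function name, the 1-based inclusive coordinate convention and the example gene coordinates are assumptions made for the example.

```python
def split_target(gene_start: int, gene_end: int,
                 flank: int = 500, overlap: int = 300):
    """Return two (start, end) fragments in 1-based inclusive coordinates.

    The target is the gene plus `flank` bp on each side, split at its
    midpoint so that the two halves share `overlap` bp in the middle.
    """
    if gene_end <= gene_start:
        raise ValueError("gene_end must be greater than gene_start")
    target_start = max(1, gene_start - flank)
    target_end = gene_end + flank
    if target_end - target_start + 1 <= overlap:
        raise ValueError("target is too short for the requested overlap")
    midpoint = (target_start + target_end) // 2
    left_share = overlap // 2            # overlap bases right of the midpoint
    right_share = overlap - left_share   # overlap bases left of the midpoint
    first = (target_start, midpoint + left_share)
    second = (midpoint - right_share + 1, target_end)
    return first, second


# Hypothetical gene occupying positions 1001-6000 of a reference sequence.
first, second = split_target(1001, 6000)
print("fragment 1:", first)
print("fragment 2:", second)
print("overlap (bp):", first[1] - second[0] + 1)
```

In practice the primer positions would be shifted to suitable priming sites, so the real fragments are only approximately equal in length.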
Long-range PCR protocol (LR-PCR)
The concentration of extracted DNA was quantified using the automated fluorometric PicoGreen protocol (PicoGreen dsDNA Quantification kit; Invitrogen, San Diego, CA), and the DNA was then diluted to 30–40 ng/μL for amplification. A unified LR-PCR approach was applied to amplify all genes (with few exceptions) simultaneously. BioRad iProof™ High-Fidelity DNA polymerase was used for PCR amplifications in 10 μL reactions containing 20 ng of pooled genomic DNA from 20 individuals. The extreme fidelity of iProof™ makes it the enzyme of choice for SNP detection in long amplicons. PCRs were performed using 2 μL of HF or GC buffer (HF buffer for normal sequences and GC buffer for GC-rich sequences), 0.2 μL of dNTPs (10 mM), 2 μL each of forward and reverse primer (2.5 μM), 1 μL of pooled DNA (20 ng/μL), 0.1 μL of iProof™ polymerase (0.2 unit) and 2.7 μL of sterile water. Although the different genes have different optimal amplification conditions, we used a single touchdown PCR method to amplify all targeted genes. The touchdown PCR protocol was performed using a Corbett PCR thermal cycler as follows: 98 °C for 1 min (1 cycle), followed by ten touchdown cycles of 98 °C for 10 s, annealing at 72–62 °C (10 °C touchdown) and extension at 72 °C for 4 min, followed by 28 cycles of normal amplification at 98 °C for 10 s, 62 °C for 20 s and 72 °C for 4 min. The final extension was one cycle of 72 °C for 10 min. Prior to Illumina sequencing, the PCR products were Sanger sequenced using BigDye Terminator version 3.1 (Applied Biosystems, Foster City, CA). The generated sequences were aligned with the reference sequence to ensure the correct gene had been captured.
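The reaction set-up and the touchdown schedule can be checked with a small Python sketch. The component volumes are those listed above; the assumption that the annealing temperature falls in equal steps of 1 °C per cycle is ours, as the text gives only the start and end temperatures and the number of touchdown cycles.

```python
# Component volumes (uL) of one reaction, as listed in the protocol.
REACTION_UL = {
    "HF or GC buffer": 2.0,
    "dNTPs (10 mM)": 0.2,
    "forward primer (2.5 uM)": 2.0,
    "reverse primer (2.5 uM)": 2.0,
    "pooled DNA (20 ng/uL)": 1.0,
    "iProof polymerase": 0.1,
    "sterile water": 2.7,
}
total_volume = round(sum(REACTION_UL.values()), 2)
print("total reaction volume (uL):", total_volume)


def touchdown_annealing(start_c: float = 72.0, end_c: float = 62.0,
                        cycles: int = 10):
    """Annealing temperature of each touchdown cycle.

    Assumes equal decrements per cycle (1 degree C with the defaults), so
    the final touchdown cycle anneals one step above `end_c`, which is then
    used for the normal cycles that follow.
    """
    step = (start_c - end_c) / cycles
    return [start_c - i * step for i in range(cycles)]


temps = touchdown_annealing()
NORMAL_CYCLES = 28
print("touchdown annealing temperatures:", temps)
print("total amplification cycles:", len(temps) + NORMAL_CYCLES)
```

Summing the components confirms that the listed volumes account for the full 10 μL reaction.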
DNA equimolar pooling
A uniform pooling strategy was applied to all samples. The genomic DNA of the 233 breeding lines, which had already been normalized for PCR in the previous stage (30 ng/μL), was divided into 12 sections, each containing a pool of approximately 20 individuals, and LR-PCRs were carried out. The concentration of PCR products from these pools was measured using PicoGreen (PicoGreen dsDNA Quantification kit, Invitrogen). A second pool was then made for each fragment from the PCR products. To facilitate the final equimolar pooling of PCR products, the concentrations of the second pools (33 second pools/amplicons) were individually normalized to 25 ng/μL and then equimolarly pooled into a mega pool according to the expected amplicon lengths, taking into account that larger amplicons require a greater mass of DNA than smaller fragments to contribute the same number of copies. The final mega pool was prepared with the aim of obtaining 2.5 μg of long amplicons representing all 233 individuals.
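The length-based weighting used for equimolar pooling can be illustrated with the following Python sketch. For equal molar amounts, the mass taken from each amplicon is proportional to its length. The three amplicon names and lengths are hypothetical; only the 25 ng/μL stock concentration and the 2.5 μg target come from the protocol.

```python
def equimolar_volumes(lengths_bp: dict, total_mass_ng: float,
                      stock_ng_per_ul: float) -> dict:
    """Volume (uL) of each normalized amplicon stock to add to the pool.

    Equal copy numbers require mass proportional to amplicon length:
        mass_i = total_mass * length_i / sum(lengths)
    """
    total_length = sum(lengths_bp.values())
    return {
        name: (total_mass_ng * length / total_length) / stock_ng_per_ul
        for name, length in lengths_bp.items()
    }


# Hypothetical amplicon lengths (bp); the real set comprised 33 amplicons.
amplicons = {"frag_A": 3000, "frag_B": 5000, "frag_C": 2000}
volumes = equimolar_volumes(amplicons, total_mass_ng=2500.0,
                            stock_ng_per_ul=25.0)
for name, vol in volumes.items():
    print(f"{name}: {vol:.1f} uL")
```

In this example the 5000 bp amplicon contributes 2.5 times the volume of the 2000 bp amplicon, so each fragment is represented by the same number of molecules.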
Massively parallel sequencing
The final mega pool was subjected to Illumina GA sequencing (Illumina Inc.). PCR products were fragmented and the library was prepared according to the manufacturer’s instructions. Fragments of approximately 200 bp were selected for sequencing, and 4 pmol of the library was loaded onto one flow cell lane.
SNP detection and data analysis
Data analysis, including filtering, trimming and mapping to the reference sequences, was performed with CLCbio Workbench 4 using 17 reference sequences with the specified coordinates, extracted from GenBank (Table 3). The CLC Workbench general parameters were set as follows: conflict resolution was changed to all four nucleotides (vote A, C, G, T), and nonspecific matches and reference masking were ignored. The read parameters were left at their defaults: the min–max distance, mismatch cost, length fraction and similarity were 100–1000, 3, 0.9 and 0.9, respectively, for both single and paired-end reads. This set of parameters was selected to minimize read alignment ambiguities as well as to detect rare SNPs. To ensure that all low-frequency SNPs would be captured by the software, we set the minimum coverage and the minimum variant frequency to 20 and 0.5%, respectively, which means all variations on or above 0.5% were considered as SNPs.
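The effect of the two thresholds can be illustrated with a minimal Python sketch. This is not the algorithm implemented in the CLC Workbench; the function, the input format (per-position base counts) and the example positions are hypothetical and serve only to show how a minimum coverage of 20 and a minimum variant frequency of 0.5% act as filters.

```python
def call_variants(pileup, min_coverage: int = 20, min_freq_pct: float = 0.5):
    """Report non-reference bases that pass both thresholds.

    `pileup` is an iterable of (position, reference_base, base_counts),
    where base_counts maps each observed base to its read count.
    """
    calls = []
    for position, ref_base, counts in pileup:
        depth = sum(counts.values())
        if depth < min_coverage:
            continue  # too few reads to evaluate this position
        for base, n_reads in sorted(counts.items()):
            if base == ref_base:
                continue
            freq_pct = 100.0 * n_reads / depth
            if freq_pct >= min_freq_pct:
                calls.append((position, ref_base, base, round(freq_pct, 3)))
    return calls


# Hypothetical positions: one rare variant at the threshold, one below it,
# and one with insufficient coverage.
example = [
    (100, "A", {"A": 19900, "G": 100}),  # 0.5% -> reported
    (200, "C", {"C": 19950, "T": 50}),   # 0.25% -> filtered out
    (300, "G", {"G": 10, "A": 5}),       # depth 15 -> skipped
]
print(call_variants(example))
```

Because the frequency threshold sits at the lower end of the reported sequencing error rate, such a filter depends on deep and even coverage to separate rare variants from errors.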
Total polymorphism rate and functional SNPs
The total polymorphism rate was calculated as TSI/TL × 100, where TSI = total number of SNPs and Indels and TL is the total length of each candidate gene. The functional or nonsynonymous SNP rate was also calculated as NS/TL × 1000, where NS = number of nonsynonymous SNPs in each locus and TL is the total length of each candidate gene.
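The two rates defined above can be expressed as a short Python sketch. As a worked example it uses the overall totals reported in this study (501 SNPs, 113 Indels, 75 nonsynonymous SNPs and 116 403 bp of assembled sequence) rather than any single gene; the function names are illustrative.

```python
def polymorphism_rate(n_snps: int, n_indels: int, length_bp: int) -> float:
    """Total polymorphism rate: TSI / TL x 100."""
    return (n_snps + n_indels) / length_bp * 100


def nonsynonymous_rate(n_nonsyn: int, length_bp: int) -> float:
    """Functional (nonsynonymous) SNP rate: NS / TL x 1000."""
    return n_nonsyn / length_bp * 1000


# Totals over all 17 loci, used here only to illustrate the formulas.
TOTAL_LENGTH_BP = 116403
print(f"polymorphism rate: {polymorphism_rate(501, 113, TOTAL_LENGTH_BP):.3f}")
print(f"nonsynonymous SNP rate: {nonsynonymous_rate(75, TOTAL_LENGTH_BP):.3f}")
```

Note that the two rates use different scaling factors (per 100 bp and per 1000 bp), so they are not directly comparable without rescaling.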
We are grateful to Striling Bowen in Southern Cross Plant Genomics for providing technical support in Illumina GA analysis and Timothy Sexton in the Centre for Plant Conservation Genetics for his valuable assistance with analysis.