Advances in next-generation sequencing technologies have aided discovery of millions of genome-wide DNA polymorphisms, single nucleotide polymorphisms (SNPs) and insertions–deletions (InDels), which are an invaluable resource for marker-assisted breeding. Whole-genome resequencing of six elite indica rice inbreds (three cytoplasmic male sterile and three restorer lines) resulted in the generation of 338 million 75-bp paired-end reads, which provided 85.4% coverage of the Nipponbare genome. A total of 2 819 086 nonredundant DNA polymorphisms including 2 495 052 SNPs, 160 478 insertions and 163 556 deletions were discovered between the inbreds and Nipponbare, providing an average of 6.8 SNPs/kb across the genome. Distribution of SNPs and InDels in the chromosome was nonrandom with SNP-rich and SNP-poor regions being evident across the genome. A contiguous 4.3-Mb region on chromosome 5 with extremely low SNP density was identified. Overall, 83 262 nonsynonymous SNPs spanning 16 379 genes and 3620 nonsynonymous InDels in 2625 genes have been discovered which provide valuable insights into the basis underlying performance of the inbreds and the hybrids between these inbred combinations. SNPs and InDels discovered from this diverse set of indica rice inbreds not only enrich SNP resources for molecular breeding but also enable the study of genome-wide variations on hybrid performance.
Genetic polymorphisms are major determinants of phenotypic variations, and their interaction with environment is vital for the expression of a trait in an individual. These variations in the DNA have been the basis for the development of a vast array of molecular markers, from restriction fragment length polymorphism to simple sequence repeats (SSR), for use in genetic analysis (Jones et al., 2009). The advent and cost-effective implementation of next-generation sequencing technologies have significantly improved our ability to study genome-wide genetic variation through large-scale resequencing of whole genomes (Bentley, 2006). Next-generation sequencing technologies make it possible to discover a massive number of DNA polymorphisms such as single nucleotide polymorphisms (SNPs) and insertion–deletion polymorphisms (InDels) by comparing the whole-genome sequences of individuals with high-quality reference genome sequences.
Single nucleotide polymorphisms (Henry and Edwards, 2009) have gained importance over other DNA markers in marker-assisted breeding because of their inherent advantage of huge number, stability, high-throughput capability and cost-effectiveness. SNPs are being employed in breeding programmes for marker-assisted and genomic selection, association and QTL mapping, positional cloning, haplotype and pedigree analysis, seed purity analysis and variety identification (McCouch et al., 2010). In addition to massive number of SNPs, whole-genome resequencing also yields information on substantial numbers of InDels. InDels are valuable DNA markers that have been used for QTL mapping (Vasemagi et al., 2010), fine mapping (Liang et al., 2011), marker-assisted selection (Hayashi et al., 2006) and varietal testing (Steele et al., 2008) and have the potential for map-based cloning of genes (Pan et al., 2008) because they are cheap, require relatively simple genotyping with low technical needs and show easy transferability between populations in addition to being abundant, stable and high throughput in nature.
Shen et al. (2004) reported 2 182 582 DNA polymorphisms (including 1 703 176 SNPs and 479 406 InDels) between the whole-genome shotgun sequences of the japonica rice cultivar Nipponbare and the indica cultivar 93-11, whereas Feltus et al. (2004) reported 408 898 high-quality DNA polymorphisms (including 384 341 SNPs and 24 557 InDels) between the same two cultivars after eliminating multiple-copy and low-quality sequences. Recently, the whole-genome resequencing approach has been utilized in rice for identifying SNPs, taking advantage of the high-quality reference genome sequence of the japonica cultivar Nipponbare (IRGSP, 2005) and massively parallel sequencing approaches. Based on hybridization-based resequencing, a total of 159 879 nonredundant SNPs distributed over 100 Mb of the rice genome were identified in 20 diverse rice cultivars and landraces (McNally et al., 2009). Through whole-genome resequencing by sequencing by synthesis (SBS), 67 051 SNPs were discovered in the japonica cultivar Koshihikari (Yamamoto et al., 2010) and 168 228 DNA polymorphisms (including 132 462 SNPs and 35 766 InDels) in the japonica landrace Omachi (Arai-Kichise et al., 2011). An important consideration in the use of the available SNP resources is that they are based on the differences between a limited number of cultivars, and only a subset of them may be applicable to other combinations of genotypes. To enlarge the SNP-discovery pool for rice, diverse Oryza accessions including both wild and cultivated Asian and African rice are being resequenced by international collaborators throughout the world (Tung et al., 2010).
Heterosis or hybrid vigour refers to the superior performance of a heterozygous F1 hybrid in terms of higher biomass, size, speed of development, fertility and yield compared with its homozygous parental inbred lines (Hochholdinger and Hoecker, 2007). Despite its successful commercial exploitation in crop improvement, the underlying genetic and molecular principles of heterosis remain an unresolved mystery. The development of a successful hybrid with superior performance is still a question of hit or miss, depending entirely on a large number of test crosses and extensive testing. Dissecting the genetic basis of heterosis is a basic step in unravelling the genetic requirements for its expression and facilitates its exploitation in hybrid improvement. The current era of genomic sequencing provides powerful tools to study allelic variations at the whole-genome level, including the diversity in the sequences (genic/intergenic, coding/noncoding and unique/repetitive), which can be used for mapping or association studies.
Hybrid rice yields 10%–20% more than the elite inbred varieties and accounts for 50% of the total rice area in China (Chen, 2010) and is gaining importance in other rice-producing countries such as India, Indonesia and Vietnam. Currently, hybrid rice breeding is primarily based on a three-line breeding system, involving a cytoplasmic male sterile (CMS) or A line, an iso-nuclear maintainer or B line and genetically diverse restorer or R line. Considerable efforts have been made to study the genetic basis of heterosis in rice using molecular markers but with little consensus, with reports of dominance, overdominance and epistasis as the cause of heterosis (Xiao et al., 1995; Yu et al., 1997; Li et al., 2001; Luo et al., 2001; Hua et al., 2002, 2003). Attempts have also been made to study the patterns of gene expression using cDNA microarrays (Huang et al., 2006) and differential gene expression in the parental lines and their F1 hybrids through genome-wide transcriptome analysis (Zhang et al., 2008; Wei et al., 2009).
With this background, the present study was carried out with the aim of discovering genome-wide DNA polymorphisms in elite indica rice inbred lines, including both CMS lines and R lines. Whole-genome resequencing of six elite indica rice inbreds (three CMS lines, namely 9001 A, 9002 A and 9003 A, and three R lines, namely 9001 R, 9002 R and 9003 R) from a hybrid rice breeding programme was carried out by SBS using an Illumina Genome Analyser IIx (GA IIx). The sequence reads generated were then mapped to the high-quality Nipponbare genomic sequence, and genome-wide variations were uncovered through comprehensive detection of SNPs and InDels across the genome. The genome-wide sequence variations discovered in this study will provide vital clues that will help unravel the genetic basis underlying heterosis in hybrid rice.
Mapping of GA reads to the Nipponbare genome
Whole-genome resequencing of the six elite rice inbreds yielded 338 million 75-bp paired-end reads, which comprised 24.4 Gb of high-quality raw data. After appropriate processing, the short reads were mapped to the high-quality genomic sequence of the japonica rice cultivar Nipponbare using CLC Genomics Workbench 4.0. A total of 287.67 and 24.96 million reads were mapped to the nuclear and organellar genomes, respectively (Figure 1). Of the reads mapped to the nuclear genome, 222.62 million were uniquely mapped to the 12 chromosomes, corresponding to 16.2 Gb of sequence, while 65.05 million reads mapped to multiple locations (Table 1). The number of reads mapped to unique reference sequences ranged from 30.04 million in inbred 9001 R to 44.5 million in inbred 9003 A (Figure S1). On average, an effective depth of ×43.2 coverage was achieved across the whole genome, with a sequencing depth ranging from 38.6 in chromosome 11 to 48.2 in chromosome 3. Individually, the mean sequencing depth varied from ×5.7 for inbred 9001 R to ×8.9 for 9003 A (Table S1). The uniquely mapped reads (316 652 843 bp) covered approximately 85.4% of the Nipponbare genome (370 792 118 bp), with coverage ranging from a minimum of 81.9% in chromosome 11 to a maximum of 91.0% in chromosome 3 (Table 1). About 25.37 million reads obtained from the six inbreds could not be mapped onto the organellar or nuclear genome of Nipponbare.
Table 1. Coverage of the reads from resequencing of six elite inbred parents using Illumina Genome Analyzer to the Nipponbare genome
Chromosome   Nipponbare genome (bp)*   Length of consensus sequence aligned to reference (bp)†   Uniquely mapped reads   Mapped bases (bp)
1            43 261 740                38 067 736                                                27 012 387              1 960 018 801
2            35 954 743                31 889 071                                                23 765 016              1 724 389 561
3            36 192 742                32 938 018                                                24 086 909              1 747 746 117
4            35 498 469                29 156 010                                                19 856 483              1 440 786 406
5            29 737 217                25 732 320                                                18 783 399              1 362 923 431
6            30 731 886                25 991 135                                                18 536 789              1 345 029 410
7            29 644 043                24 713 303                                                16 996 652              1 233 277 069
8            28 434 780                23 987 843                                                16 629 683              1 206 649 798
9            22 696 651                19 119 121                                                13 328 131
10           22 685 906                19 138 394                                                13 438 388
11           28 386 948                23 255 593                                                15 116 942              1 096 885 312
12           27 566 993                22 664 299                                                15 065 493              1 093 152 172
Total‡       370 792 118               316 652 843                                               222 616 272             16 153 036 696
‡Total number except for the column of coverage and sequencing depth, which are average values.
§Coverage is calculated based on the consensus of aligned length to the Nipponbare reference. The value is slightly different from average coverage of respective chromosome.
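The coverage percentages quoted in the text can be reproduced from the lengths in Table 1. The sketch below is illustrative only: the function names are hypothetical, and the depth calculation (mapped bases divided by reference length) is an assumed simplification that lands close to, but not always exactly on, the reported depths, which may have been averaged differently.

```python
# Values for chromosomes 3 and 11 and the genome total, copied from Table 1:
# (reference length in bp, aligned consensus length in bp, mapped bases in bp)
TABLE1 = {
    "chr3": (36_192_742, 32_938_018, 1_747_746_117),
    "chr11": (28_386_948, 23_255_593, 1_096_885_312),
    "total": (370_792_118, 316_652_843, 16_153_036_696),
}


def coverage_percent(reference_bp, consensus_bp):
    """Share of the reference covered by the aligned consensus, in percent."""
    return 100.0 * consensus_bp / reference_bp


def mean_depth(reference_bp, mapped_bases):
    """Assumed simplification: mapped bases per reference base (fold coverage)."""
    return mapped_bases / reference_bp


for name, (ref_bp, cons_bp, bases) in TABLE1.items():
    cov = coverage_percent(ref_bp, cons_bp)
    depth = mean_depth(ref_bp, bases)
    print(f"{name}: coverage {cov:.1f}%, depth about {depth:.1f}x")
```

The coverage values from this calculation match the 91.0%, 81.9% and 85.4% given in the text.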
Detection of DNA polymorphisms in the elite indica inbred lines
A total of 5 111 333 polymorphisms (4 636 538 SNPs and 474 795 InDels) were detected in six elite indica rice inbred lines with minimum parameters (as defined in Experimental procedure). In order to minimize the rate of false-positive SNPs and InDels, two filters were applied: coverage ≥10 and ≤100, and SNPs/InDels from repetitive regions were eliminated. After applying these filters, the total number of DNA polymorphisms was 2 819 086 including 2 495 052 SNPs, 160 478 insertions and 163 556 deletions (Table 2). For the detection of SNPs and InDels between each of the six inbreds and Nipponbare, all the parameters were identical except minimum coverage that was ≥5 reads. Individually, the total DNA polymorphisms detected ranged from 879 916 in the inbred 9001 R to 2 445 994 in 9003 R (Table S2).
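The two filters described above can be sketched as a simple predicate. This is a minimal illustration, not the CLC Genomics Workbench implementation: the `Variant` record, its field names and the toy candidates are hypothetical, and the repeat flag is assumed to come from an external repeat annotation.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int          # 1-based position on the reference
    coverage: int     # number of reads covering the position
    in_repeat: bool   # True if the position lies in an annotated repeat (assumed input)


def passes_filters(v, min_cov=10, max_cov=100):
    """Keep variants with coverage between 10 and 100 that lie outside repeats."""
    return min_cov <= v.coverage <= max_cov and not v.in_repeat


# Toy candidates (invented for illustration)
candidates = [
    Variant("chr1", 1200, 43, False),   # kept
    Variant("chr1", 5300, 7, False),    # coverage too low
    Variant("chr2", 880, 250, False),   # coverage too high
    Variant("chr5", 4100, 40, True),    # repetitive region
]
kept = [v for v in candidates if passes_filters(v)]
print(len(kept), "of", len(candidates), "candidates retained")
```

An upper coverage bound is a common way to exclude collapsed repeats or paralogous regions that attract excess reads, which is presumably why both bounds were used here.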
Table 2. Polymorphism in genomic DNA observed in the six elite indica rice inbreds
Columns: DNA polymorphism/100 kb; No. of SNPs; No. of insertions; No. of deletions
Total: 2 819 086* DNA polymorphisms, including 2 495 052* SNPs
SNP, single nucleotide polymorphisms.
*Total number of polymorphisms.
†Average polymorphism per 100 kb.
Distribution of SNP and InDel variations across rice genome
The total number of SNPs and InDels detected varied across chromosomes. The highest number of SNPs (284 078) and InDels (20 137 insertions and 20 287 deletions) was observed in chromosome 1, while chromosome 9 had the lowest number of these polymorphisms (Table 2). The individual inbreds showed the same trend, with the maximum in chromosome 1 and the minimum in chromosome 9, with two exceptions: in 9001 A, both SNPs and InDels were least frequent in chromosome 12, and in 9003 A, SNPs were least frequent in chromosome 12 (Figure 2; Table S2).
The genomic distribution of the patterns of DNA polymorphism between the six inbreds and the japonica Nipponbare genome was explored by calculating the frequency of polymorphisms observed for each 100-kb interval along the chromosome. The average densities of DNA polymorphisms detected per 100-kb window across the genome were 677.8 (SNPs), 43.2 (insertions) and 44.1 (deletions) in these six indica rice inbreds (Table 2). The SNP frequency showed variations within the genome with chromosome 11 and chromosome 5 having the highest (790.8) and lowest SNP density (547.9) per 100-kb interval (Table 2). At an individual level, the average density of the SNPs per 100-kb interval ranged from 215.4 (9002 A) to 288.3 (9003 R), while in case of InDels, 9001 R had the least density (10.7 insertions and 11.3 deletions/100 kb) and 9003 A had the highest density with 31.4 insertions and 31.3 deletions per 100 kb (Table S2). Overall, the number and density of the insertions and deletions detected was found to be proportionate with very slight bias on the higher side towards deletions except chromosomes 3, 5 and 11 (Tables 2 and S2). Additionally, exceptions in this trend were also observed in two individuals namely 9003 A (chromosomes 1, 2, 6, 8 and 9) and 9003 R (chromosomes 1 and 2) where the insertions detected were slightly higher than the deletions (Table S2).
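The per-window frequencies described above amount to binning variant positions into 100-kb intervals. A minimal sketch follows, assuming 1-based positions and non-overlapping windows; the function names and the toy positions are hypothetical.

```python
from collections import Counter

WINDOW = 100_000  # 100-kb interval, as in the text


def window_counts(positions, chrom_length, window=WINDOW):
    """Count variants in consecutive non-overlapping windows along one chromosome."""
    n_windows = -(-chrom_length // window)  # ceiling division
    counts = Counter((pos - 1) // window for pos in positions)  # positions are 1-based
    return [counts.get(i, 0) for i in range(n_windows)]


def mean_density_per_100kb(n_variants, length_bp):
    """Average number of variants per 100 kb over a region of the given length."""
    return n_variants / length_bp * 100_000


# Toy example: five variants on a 300-kb chromosome
toy_positions = [5, 99_999, 100_000, 100_001, 250_000]
print(window_counts(toy_positions, 300_000))
```

Positions 1 to 100 000 fall in the first window, so the first three toy variants are counted together and the remaining two fall in the second and third windows.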
The distribution of SNPs within the chromosomes was nonrandom (Figure 3). Within each chromosome, there were regions with significant variations in SNP frequency, and as many as 60 intervals (100 kb each), with significantly higher (50 intervals) or lower (10 intervals) SNP frequencies, were observed across the genome. These include a 500-kb contiguous region between 8.7 and 9.2 Mb in chromosome 8, where an unusually high SNP density was observed. No SNPs could be detected in one interval each in chromosomes 4 (between 9.1 and 9.2 Mb) and 9 (between 6.5 and 6.6 Mb), while a longer interval of 200 kb in chromosome 11 (between 27.2 and 27.4 Mb) also did not contain any SNP. This was confirmed in the SNP analysis of each individual inbred. The SNPs in each individual also showed a nonrandom distribution within the chromosomes (Figure S2). The distribution patterns of insertions and deletions, in both the combined mapping (Figures 4 and 5) and each individual mapped to Nipponbare (Figures S3 and S4), were also nonrandom and similar to those of SNPs. Overall, the SNP variations in the inbred lines showed extremely low SNP rates in specific segments, with chromosome 5 showing a large contiguous region (Figure 6). Most of the SNP-poor regions detected in all six inbreds were 200 kb or shorter, except two regions, which were 500 kb or longer. Even though SNP-poor regions longer than 200 kb with extremely low SNP frequency were observed in individual inbred lines or groups of them, higher SNP rates were observed in one or more of the remaining individuals. The SNP-poor region in chromosome 5 was the longest (4.3 Mb), followed by that in chromosome 1 (0.5 Mb).
Analysis of SNPs and InDels
The SNPs detected in each of the six inbreds were classified as transitions (C/T and G/A) or transversions (C/G, T/A, A/C and G/T) based on nucleotide substitutions (Table 3). The proportions of transitions (Ts) were significantly higher than the proportion of transversions (Tv) in all the six inbreds. Among the transitions, the number of C/T transitions was slightly higher than G/A transitions, while T/A transversions were relatively higher compared with other transversions namely A/C, G/T and C/G. The ratio between transitions and transversions (Ts/Tv) was 2.0 in all the six inbred lines.
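The transition/transversion classification follows directly from base chemistry: a substitution within the purines (A, G) or within the pyrimidines (C, T) is a transition, and any purine-pyrimidine exchange is a transversion. The sketch below illustrates this; the function names and the toy SNP list are hypothetical and are not data from the study.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def substitution_class(ref, alt):
    """Return 'transition' or 'transversion' for a single-base substitution."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt or ref not in BASES or alt not in BASES:
        raise ValueError(f"not a valid substitution: {ref}>{alt}")
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def ts_tv_ratio(snps):
    """Ratio of transitions to transversions for (ref, alt) pairs."""
    ts = sum(1 for ref, alt in snps if substitution_class(ref, alt) == "transition")
    tv = len(snps) - ts
    return ts / tv


# Toy list: four transitions and two transversions
toy_snps = [("C", "T"), ("G", "A"), ("T", "C"), ("A", "G"), ("A", "T"), ("C", "G")]
print(ts_tv_ratio(toy_snps))
```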
Table 3. Classification of nucleotide substitutions in the single nucleotide polymorphisms detected in the six elite indica rice inbreds
Of the 160 478 insertions and 163 556 deletions detected, the insertions ranged in length from 1 to 7 bp, while the deletions were up to 8 bp long (Figure 7). The majority of the InDels (65.2%) were mononucleotide (105 721 insertions and 105 439 deletions), 31.2% were 2- to 4-bp InDels (49 968 insertions and 51 068 deletions) and a meagre 3.7% were ≥5 bp. The InDels detected in each of the six inbreds individually also showed a bias towards mononucleotide insertions and deletions, which ranged from 42 115 (9001 R) to 116 273 (9003 A) and from 39 985 (9001 R) to 116 735 (9003 A), respectively (Figure S5).
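The length-class percentages can be checked from the counts given in the paragraph above. The short calculation below uses only those counts; the variable names are illustrative.

```python
# Counts taken from the text
total_insertions, total_deletions = 160_478, 163_556
mono = 105_721 + 105_439           # 1-bp insertions + deletions
two_to_four = 49_968 + 51_068      # 2- to 4-bp insertions + deletions

total_indels = total_insertions + total_deletions
five_plus = total_indels - mono - two_to_four  # remainder: InDels of 5 bp or longer

shares = {
    "1 bp": 100 * mono / total_indels,
    "2-4 bp": 100 * two_to_four / total_indels,
    ">=5 bp": 100 * five_plus / total_indels,
}
for label, pct in shares.items():
    print(f"{label}: {pct:.1f}%")
```

The three classes account for 211 160, 101 036 and 11 838 of the 324 034 InDels, respectively.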
Annotation of SNPs and InDels
The annotated reference sequence of Nipponbare enabled us to annotate the SNPs and InDels detected between these elite inbreds and Nipponbare. A total of 2 495 052 SNPs, 160 478 insertions and 163 556 deletions were detected in the nonrepeat regions of the genome, of which 497 250 SNPs (19.9%), 35 871 insertions (22.4%) and 36 731 deletions (22.4%) were located in 25 591, 10 269 and 10 321 genes, respectively. Among these, 146 604 SNPs (3.2% of the total), 1733 insertions (0.9% of the total) and 1887 deletions (0.9% of the total) were found in coding sequences. Altogether, 83 262 nonsynonymous SNPs (1.8% of the total SNPs), located in 16 379 genes, were detected across the six elite inbred lines (Figure 8). The annotation of SNPs in each of the six individuals revealed that the number of nonsynonymous SNPs ranged from 21 501 (9002 A) to 30 659 (9003 R), while the numbers of insertions and deletions in coding sequences ranged from 337 (9001 R) to 1417 (9003 A) and from 528 (9002 A) to 1729 (9003 A), respectively (Figure S6). Annotation of insertions and deletions showed that 72 601 InDels were located in 17 852 genes in the inbred lines.
The number of nonsynonymous SNPs per kb of each gene had a wide distribution, ranging from 0.04 to 35.71, with a mean of 1.56. The five-number summary of the box and whisker plot was used to calculate the outlier values, and 1347 genes were classified as outliers with >3.9 nonsynonymous SNPs per kb (Figure 9). Individually, the genotypes had 740 (9002 A) to 934 (9003 R) genes classified as outliers based on the higher number of nonsynonymous SNPs per kb detected in the respective individuals (Figure S7).
Whole-genome resequencing for detecting genome-wide DNA polymorphisms
Genome-wide analysis based on sequencing of two rice cultivars, Nipponbare and 93-11, has led to the discovery of millions of SNPs across the rice genome (Feltus et al., 2004; Shen et al., 2004). In spite of rapid advances in next-generation sequencing technologies, only a few japonica rice cultivars, namely Koshihikari (Yamamoto et al., 2010), Omachi (Arai-Kichise et al., 2011), and Eiko and Rikuu 132 (Nagasaki et al., 2010), have been sequenced by whole-genome resequencing. Therefore, efforts are underway to sequence a diverse set of additional rice genotypes through whole-genome sequencing to enlarge the pool of DNA polymorphisms (SNPs and InDels) occurring throughout the rice genome, which could be utilized for genetic analysis as well as for the improvement of rice (McCouch et al., 2010).
In the present study, the whole genome of six elite indica rice inbred lines was resequenced and mapped to Nipponbare as a reference genome for discovering genome-wide DNA variations. The uniquely mapped reads from these lines covered 85.4% of the reference genome, providing an average coverage of ×43.2 across the genome. About 7.5% (25.37 million reads) of the total reads from the indica rice inbreds did not map to either the organellar or the nuclear genome of Nipponbare. At least 6% of the genome has been shown to be unusually divergent between the japonica rice cultivar Nipponbare and the indica rice 93-11 (Tang et al., 2006), and the unmapped reads observed when mapping reads from indica inbreds to the japonica reference may be due to the inherent genomic differences accumulated through genetic differentiation between the indica and japonica subspecies.
A total of 2 495 052 SNPs and 324 034 InDels were detected between the six inbreds and Nipponbare, with an average density of 677.8 SNPs per 100 kb (Table 2), which is about four times the SNP density identified between the japonica cultivar Nipponbare and the indica variety 93-11 (Feltus et al., 2004). Tenaillon et al. (2001) suggested that greater SNP rates could be correlated with a higher level of diversity, and the SNPs identified in the six elite indica inbreds are more than four times those previously reported between Nipponbare and 93-11, perhaps because of the genetic diversity in the inbreds. The average polymorphism rate detected in our study (6.78 SNPs/kb in the nonrepetitive region of the rice genome) is significantly higher than the previous estimates of 4.31 SNPs/kb (Nasu et al., 2002) and 1.70 SNPs/kb (Feltus et al., 2004), offering high-density coverage across the entire rice genome, which could be utilized for high-resolution scans and genome-wide linkage disequilibrium studies in indica rice.
Nonrandom distribution of DNA polymorphisms in rice genome
An analysis of the number of SNPs and InDels in the six elite indica rice inbreds showed that the observed number of these polymorphisms in each chromosome was significantly different from the number expected from its physical size. Further analysis of the distribution of SNPs in each chromosome of the six inbreds revealed SNP-rich and SNP-poor regions along the chromosomes. The occurrence of such regional variations in the genome has been reported previously in rice (Feltus et al., 2004) as well as in Arabidopsis (Nordborg et al., 2005), wheat (Ravel et al., 2006) and human (Smith and Lercher, 2002). In general, the regions with higher SNP frequency were shorter than the regions with lower SNP frequency. Noteworthy among the SNP-poor regions was an interval of 4.3 Mb between 9.1 and 13.4 Mb on chromosome 5 in all six inbreds. Interestingly, this phenomenon, popularly termed ‘SNP deserts’, has also been reported between Nipponbare and 93-11 (Wang et al., 2009) and between Nipponbare and three other japonica rice cultivars, Koshihikari, Rikuu 132 and Eiko (Nagasaki et al., 2010). Although selective sweeps during the domestication of rice are a plausible explanation for the reduction in SNP frequencies across the genome (Caicedo et al., 2007), this phenomenon in chromosome 5, observed in both indica and japonica rice, needs elucidation through further in-depth study.
Bias in base substitutions
The ratio of transitions to transversions (Ts/Tv) in the six inbred lines was 2.0, an upward bias towards transitions from the expected ratio of 0.5. This phenomenon known as ‘transition bias’ has previously been reported in rice (Morton, 1995) and maize (Batley et al., 2003). This has been attributed to a higher frequency of transitional mutations over transversions because of conformational advantage in case of mispairing, and better tolerance of transitions during natural selection, because transitions are more likely to conserve the protein structure than transversions (Wakeley, 1996). Although the ratio observed in this study is significantly higher compared with earlier reports in rice, such a high bias has also been reported in SNPs from the chicken genome with Ts/Tv ratio ranging from 2.3 to 4.0 (Vignal et al., 2002). The higher ratio may be due to the inclusion of SNPs from throughout the rice genome (both genic and intergenic) in the present study.
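The expected ratio of 0.5 follows from simple counting, assuming as a null model (not a result of this study) that every base substitution is equally likely. Each of the four bases can change to three others, of which one change is a transition and two are transversions, so the observed ratio is four times the expectation:

```latex
\left(\frac{\mathrm{Ts}}{\mathrm{Tv}}\right)_{\text{expected}}
  = \frac{4 \times 1}{4 \times 2} = \frac{4}{8} = 0.5,
\qquad
\frac{(\mathrm{Ts}/\mathrm{Tv})_{\text{observed}}}{(\mathrm{Ts}/\mathrm{Tv})_{\text{expected}}}
  = \frac{2.0}{0.5} = 4
```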
Within transitions, the C/T transitions were higher in number, possibly due to the higher frequency of the C to T mutation after methylation (Coulondre et al., 1978). The relative abundance of C/T mutations has also been reported in other crops such as common bean (Ramírez et al., 2005), maize (Batley et al., 2003), grape (Lijavetzky et al., 2007) and citrus (Terol et al., 2008). Among the transversions, T/A transversions were found in larger numbers than the other transversions, which has also been reported in citrus (Terol et al., 2008). Because of the higher stringency of the mapping parameters, the maximum length of the InDels detected in the present study was restricted to only 8 bp, as compared to 36 bp in Omachi/Nipponbare (Arai-Kichise et al., 2011) and 82 bp in 93-11/Nipponbare (Shen et al., 2004).
Nonsynonymous polymorphisms in rice breeding
Annotation of the polymorphisms in the six rice inbreds shows that more than half of the SNPs occur in the nonrepeat regions, while 10.7% of the total SNPs have been found in 25 591 genes across the rice genome. Among the SNPs in genic regions, only 29.5% have been found in coding sequences, with 2.5% of them being nonsynonymous in nature. The proportion of nonsynonymous SNPs (>52.0%) is higher than that of synonymous SNPs in coding sequences. Overall, 83 262 nonsynonymous SNPs spanning 16 379 genes and 3620 nonsynonymous InDels in 2625 genes have been identified from the six rice inbreds. As these polymorphisms are more likely to have phenotypic effects, they should provide valuable insights into the basis underlying the performance of the inbreds and of the hybrids between these inbred combinations.
In the present study, whole-genome variations in indica rice were analysed by detecting SNPs, insertions and deletion polymorphisms among the six elite indica rice inbred lines. Even though individual SNPs are not very informative compared with SSR markers because of lower expected heterozygosity, constructing haplotypes using a set of SNPs enhances the capability of SNPs in genetic analysis enormously (Rafalski, 2002). Additionally, InDels have also been proved to be potential markers with applicability in genetic studies. SNP and InDel discovery by next-generation sequencing has become the main option for genetic marker discovery (Bundock et al., 2010; Imelfort et al., 2010).
Genome-wide association analysis with the performance of the hybrids generated from these elite inbreds using the SNPs discovered in our study will provide valuable insights into the molecular basis of heterosis. With further progress in techniques for analysing SNPs, the millions of SNPs and InDels identified from a diverse set of indica rice inbreds in the present study in addition to the already available SNP resources provide high-density coverage across the rice genome, which will be useful for the analysis of genetic diversity, QTL mapping, marker-assisted breeding, positional cloning, comparative mapping and association studies in rice.
Sample preparation and sequencing
DNA of the six elite indica rice inbreds was extracted from leaf tissue using a Qiagen DNeasy kit (Qiagen Pty Ltd., Melbourne, Victoria, Australia). Approximately 1 μg of total DNA was sheared, polished and prepared following the manufacturer’s instructions, Kit FC-102-1002 (Illumina sample preparation protocol for paired-end sequencing) with the following modifications. Briefly, DNA was sheared using the adaptive focused acoustics method on a Covaris S2 device with the following settings: duty cycle 10%; intensity 5; cycles per burst 200 for 180 s at 6 °C. Ligation products were purified by electrophoresis on Invitrogen E-gel SizeSelect 2% (Invitrogen Corporation, Carlsbad, CA). The fragments in the range of predominantly 450-bp size were selected from the gel. PCR products were further purified with a QIAquick PCR Purification Kit (Qiagen Pty Ltd.) and quantified using a DNA 1000 chip on an Agilent BioAnalyzer 2100 (Agilent Technologies, Palo Alto, CA). Approximately 4 pmol per individual was sequenced with 76 × 2 cycles carried out on an Illumina Genome Analyser (GAIIx) using sequencing kit v4. Base calling was performed using Illumina software Pipeline 1.4 (Illumina, San Diego, CA).
Mapping of reads to the reference
Paired-end sequence reads were trimmed of low-quality data (quality score limit of 0.01) and adaptor sequence in CLC Genomics Workbench 4.0 (http://www.clcbio.com), and reads of <30 base pairs (bp) in length were discarded. Trimmed short-read sequences were first aligned to the published rice organellar genomes (chloroplast genome: GenBank accession AY522331.1; mitochondrial genome: GenBank accession DQ167400.1), and the unmapped reads were taken up for further assembly against the nuclear genome (GenBank accession AACV000000000.1). The reads were assembled to the Nipponbare reference with CLC Genomics Workbench using the following parameters: mismatch cost, 2; insertion cost, 3; deletion cost, 3; length fraction, 0.9; and similarity, 0.9. Reads that aligned to more than one position of the reference genome were filtered out and counted as reads mapping to multiple positions, and unmapped reads were counted separately. A combined assembly was performed by assembling the reads from all six inbreds together, and an individual assembly was also performed separately for each of the six inbreds.
Discovery of SNPs and InDel polymorphisms
The assembled contigs were screened for SNPs and InDels using CLC Genomics Workbench 4.0. To detect variants across all six inbreds mapped to Nipponbare, contigs from the combined assembly were screened under the following criteria: a minimum coverage of 10× (average 43×) and a minimum frequency of the least frequent allele of 10%, with the additional SNP-detection parameters of a 21-bp window, a central base quality score of 30 and a surrounding base quality of 20 or more. SNPs and InDels for each of the six inbreds were also detected from the contigs of the individual assemblies with a minimum coverage of 5× and other parameters as in the combined assembly. Additionally, the minimum variant frequency for individual inbreds was very stringent: ≥90% for SNP and ≥30% for InDel detection. Assembly and SNP detection, using identical parameters, of a resequenced sample of rice cultivar Nipponbare against the reference Nipponbare genome identified 1011 SNPs. These SNPs could be resequencing errors, errors in the reference Sanger sequence or within-cultivar sequence differences that arise from drift. Further, the SNPs from the combined assembly were cross-validated by comparing them with the SNPs detected by individual assembly, identifying SNPs with only one variant and filtering SNPs with a minimum frequency of ≥50%. Details are provided in Table S3.
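The frequency and base-quality criteria above can be sketched as two small checks. This is an illustration under stated assumptions, not the CLC Genomics Workbench algorithm: the function names are hypothetical, the "surrounding base quality" is assumed here to mean the average Phred score of the other bases in the 21-bp window, and the 10× coverage floor is reused in both examples purely for illustration.

```python
def passes_quality_window(quals, idx, window=21, central_min=30, surround_min=20):
    """Check the central base and its neighbours within a window of one read.

    quals: per-base Phred scores of a read; idx: 0-based index of the candidate base.
    Assumption: the surrounding quality is the mean over the other window bases.
    """
    half = window // 2
    lo, hi = max(0, idx - half), min(len(quals), idx + half + 1)
    neighbours = quals[lo:idx] + quals[idx + 1:hi]
    if quals[idx] < central_min or not neighbours:
        return False
    return sum(neighbours) / len(neighbours) >= surround_min


def variant_alleles(ref_base, allele_counts, min_coverage, min_freq):
    """Return non-reference alleles whose read frequency reaches min_freq."""
    coverage = sum(allele_counts.values())
    if coverage < min_coverage:
        return []
    return sorted(
        allele for allele, n in allele_counts.items()
        if allele != ref_base and n / coverage >= min_freq
    )


# Combined-assembly style thresholds: coverage >= 10, variant frequency >= 10%
print(variant_alleles("A", {"A": 36, "G": 7}, min_coverage=10, min_freq=0.10))
# Stringent 90% frequency, as used for SNPs in the individual inbreds
print(variant_alleles("A", {"A": 6, "G": 14}, min_coverage=10, min_freq=0.90))
```

A 90% threshold suits homozygous inbred lines, where a true variant should be carried by nearly all reads, whereas the 10% threshold in the pooled assembly allows a variant present in only one of the six lines to be detected.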
Analysis of genome-level variations
To calculate the variations in DNA polymorphisms across the genome, a sliding window of 100 kb was used to analyse each chromosome and determine the rates of SNPs and InDels in each window. To identify regions deviating significantly from the expected SNP rate, a box–whisker plot was constructed, from which SNP-rich and SNP-poor regions were obtained. Additionally, regions with <10 SNPs per 100 kb were considered regions with extremely low SNP variation.
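The box–whisker classification of windows can be sketched as follows. The text does not state the whisker rule, so the conventional Tukey fences at 1.5 times the interquartile range are assumed here; the function names, labels and toy counts are hypothetical.

```python
import statistics


def tukey_fences(values, k=1.5):
    """Lower and upper fences from the quartiles (assumed whisker rule: k * IQR)."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def classify_windows(snp_counts, extremely_low=10):
    """Label each 100-kb window by its SNP count."""
    low_fence, high_fence = tukey_fences(snp_counts)
    labels = []
    for count in snp_counts:
        if count < extremely_low:      # <10 SNPs per 100 kb, as in the text
            labels.append("extremely low")
        elif count > high_fence:
            labels.append("SNP-rich")
        elif count < low_fence:
            labels.append("SNP-poor")
        else:
            labels.append("typical")
    return labels


# Toy SNP counts per 100-kb window (invented for illustration)
toy_counts = [600, 650, 700, 720, 680, 640, 710, 690, 5, 1500]
for count, label in zip(toy_counts, classify_windows(toy_counts)):
    print(count, label)
```

The same fence calculation could be applied to the per-gene nonsynonymous SNP densities to flag outlier genes.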
The annotations of the SNPs and InDels were derived from the annotated sequence of the Nipponbare reference genome and used for further analysis. The outliers for genes with nonsynonymous SNPs higher than the expected number were calculated using five-number summary of the box and whisker plot.
Gopala Krishnan, S., acknowledges Department of Science and Technology, Government of India for the financial support under the BOYSCAST Fellowship and Indian Council for Agricultural Research, New Delhi for granting permission to avail the fellowship. The authors acknowledge technical assistance by Mark Edwards and Stirling Bowen from Southern Cross Plant Genomics, Southern Cross University for Illumina sequencing as well as Peter Bundock for helpful discussions.