• Open Access

Genome-wide DNA polymorphisms in elite indica rice inbreds discovered by whole-genome sequencing

Authors


(Tel +61 7 3346 0551; fax +61 7 3346 0555; email robert.henry@uq.edu.au)

Summary

Advances in next-generation sequencing technologies have aided discovery of millions of genome-wide DNA polymorphisms, single nucleotide polymorphisms (SNPs) and insertions–deletions (InDels), which are an invaluable resource for marker-assisted breeding. Whole-genome resequencing of six elite indica rice inbreds (three cytoplasmic male sterile and three restorer lines) resulted in the generation of 338 million 75-bp paired-end reads, which provided 85.4% coverage of the Nipponbare genome. A total of 2 819 086 nonredundant DNA polymorphisms including 2 495 052 SNPs, 160 478 insertions and 163 556 deletions were discovered between the inbreds and Nipponbare, providing an average of 6.8 SNPs/kb across the genome. Distribution of SNPs and InDels in the chromosome was nonrandom with SNP-rich and SNP-poor regions being evident across the genome. A contiguous 4.3-Mb region on chromosome 5 with extremely low SNP density was identified. Overall, 83 262 nonsynonymous SNPs spanning 16 379 genes and 3620 nonsynonymous InDels in 2625 genes have been discovered which provide valuable insights into the basis underlying performance of the inbreds and the hybrids between these inbred combinations. SNPs and InDels discovered from this diverse set of indica rice inbreds not only enrich SNP resources for molecular breeding but also enable the study of genome-wide variations on hybrid performance.

Introduction

Genetic polymorphisms are major determinants of phenotypic variations, and their interaction with environment is vital for the expression of a trait in an individual. These variations in the DNA have been the basis for the development of a vast array of molecular markers, from restriction fragment length polymorphism to simple sequence repeats (SSR), for use in genetic analysis (Jones et al., 2009). The advent and cost-effective implementation of next-generation sequencing technologies have significantly improved our ability to study genome-wide genetic variation through large-scale resequencing of whole genomes (Bentley, 2006). Next-generation sequencing technologies make it possible to discover a massive number of DNA polymorphisms such as single nucleotide polymorphisms (SNPs) and insertion–deletion polymorphisms (InDels) by comparing the whole-genome sequences of individuals with high-quality reference genome sequences.

Single nucleotide polymorphisms (Henry and Edwards, 2009) have gained importance over other DNA markers in marker-assisted breeding because of their inherent advantage of huge number, stability, high-throughput capability and cost-effectiveness. SNPs are being employed in breeding programmes for marker-assisted and genomic selection, association and QTL mapping, positional cloning, haplotype and pedigree analysis, seed purity analysis and variety identification (McCouch et al., 2010). In addition to massive number of SNPs, whole-genome resequencing also yields information on substantial numbers of InDels. InDels are valuable DNA markers that have been used for QTL mapping (Vasemagi et al., 2010), fine mapping (Liang et al., 2011), marker-assisted selection (Hayashi et al., 2006) and varietal testing (Steele et al., 2008) and have the potential for map-based cloning of genes (Pan et al., 2008) because they are cheap, require relatively simple genotyping with low technical needs and show easy transferability between populations in addition to being abundant, stable and high throughput in nature.

Shen et al. (2004) reported 2 182 582 DNA polymorphisms (including 1 703 176 SNPs and 479 406 InDels) between the whole-genome shotgun sequences of the japonica rice cultivar Nipponbare and the indica cultivar 93-11, whereas Feltus et al. (2004) reported 408 898 high-quality DNA polymorphisms (including 384 431 SNPs and 24 557 InDels) between the same two cultivars after eliminating multiple-copy and low-quality sequences. Recently, the whole-genome resequencing approach has been utilized in rice for identifying SNPs, taking advantage of the high-quality reference genome sequence of the japonica cultivar Nipponbare (IRGSP, 2005) and massively parallel sequencing approaches. Based on hybridization-based resequencing, a total of 159 879 nonredundant SNPs distributed over 100 Mb of the rice genome were identified in 20 diverse rice cultivars and landraces (McNally et al., 2009). Through whole-genome resequencing by sequencing by synthesis (SBS), 67 051 SNPs were discovered in the japonica cultivar Koshihikari (Yamamoto et al., 2010) and 168 228 DNA polymorphisms (including 132 462 SNPs and 35 766 InDels) in the japonica landrace Omachi (Arai-Kichise et al., 2011). An important consideration in the use of the available SNP resources is that they are based on the differences between a limited number of cultivars, and only a subset of them may be applicable to other combinations of genotypes. To enlarge the SNP-discovery pool for rice, diverse Oryza accessions including both wild and cultivated Asian and African rice are being resequenced by international collaborators throughout the world (Tung et al., 2010).

Heterosis or hybrid vigour refers to the superior performance of a heterozygous F1 hybrid in terms of higher biomass, size, speed of development, fertility and yield compared with its homozygous parental inbred lines (Hochholdinger and Hoecker, 2007). Despite its successful commercial exploitation in crop improvement, the underlying genetic and molecular principles of heterosis remain an unresolved mystery. The development of a successful hybrid with superior performance is still a question of hit or miss, depending entirely on a large number of test crosses and extensive testing. Dissecting the genetic basis of heterosis is a basic step in unravelling the genetic requirements for its expression and facilitates its exploitation in hybrid improvement. The current era of genomic sequencing provides powerful tools to study allelic variations at the whole-genome level, including the diversity in the sequences (genic/intergenic, coding/noncoding and unique/repetitive), which can be used for mapping or association studies.

Hybrid rice yields 10%–20% more than the elite inbred varieties and accounts for 50% of the total rice area in China (Chen, 2010) and is gaining importance in other rice-producing countries such as India, Indonesia and Vietnam. Currently, hybrid rice breeding is primarily based on a three-line breeding system, involving a cytoplasmic male sterile (CMS) or A line, an iso-nuclear maintainer or B line and genetically diverse restorer or R line. Considerable efforts have been made to study the genetic basis of heterosis in rice using molecular markers but with little consensus, with reports of dominance, overdominance and epistasis as the cause of heterosis (Xiao et al., 1995; Yu et al., 1997; Li et al., 2001; Luo et al., 2001; Hua et al., 2002, 2003). Attempts have also been made to study the patterns of gene expression using cDNA microarrays (Huang et al., 2006) and differential gene expression in the parental lines and their F1 hybrids through genome-wide transcriptome analysis (Zhang et al., 2008; Wei et al., 2009).

With this background, the present study was carried out with the aim of discovering genome-wide DNA polymorphisms in elite indica rice inbred lines, including both CMS lines and R lines. Whole-genome resequencing of six elite indica rice inbreds (three CMS lines, namely 9001 A, 9002 A and 9003 A, and three R lines, namely 9001 R, 9002 R and 9003 R) from a hybrid rice breeding programme was carried out by SBS using an Illumina Genome Analyser IIx (GA IIx). The sequence reads generated were then mapped to the high-quality Nipponbare genomic sequence, and genome-wide variations were uncovered through comprehensive detection of SNPs and InDels across the genome. The discovery of these genome-wide sequence variations will provide vital clues that will help unravel the genetic basis underlying heterosis in hybrid rice.

Results

Mapping of GA reads to the Nipponbare genome

Whole-genome resequencing of the six elite rice inbreds yielded 338 million 75-bp paired-end reads, which comprised 24.4 Gb of high-quality raw data. After appropriate processing, the short reads were mapped to the high-quality genomic sequence of the japonica rice cultivar Nipponbare using CLC Genomics Workbench 4.0. A total of 287.67 and 24.96 million reads were mapped to the nuclear and organellar genomes, respectively (Figure 1). Of the reads mapped to the nuclear genome, 222.62 million were uniquely mapped to the 12 chromosomes, corresponding to 16.2 Gb of sequence, while 65.05 million reads mapped to multiple locations (Table 1). The number of reads mapped to unique reference sequences ranged from 30.04 million reads in inbred 9001 R to 44.5 million reads in inbred 9003 A (Figure S1). On average, an effective depth of ×43.2 coverage was achieved across the whole genome, with a sequencing depth ranging from 38.6 in chromosome 11 to 48.2 in chromosome 3. Individually, the mean sequencing depth varied from ×5.7 for inbred 9001 R to ×8.9 for 9003 A (Table S1). The uniquely mapped reads (316 652 843 bp of consensus sequence) covered approximately 85.4% of the Nipponbare genome (370 792 118 bp), from a minimum coverage of 81.9% in chromosome 11 to a maximum of 91.0% in chromosome 3 (Table 1). About 25.37 million reads obtained from the six inbreds could not be mapped onto the organellar or nuclear genome of Nipponbare.
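The read accounting described above can be checked with a few lines of arithmetic. The sketch below uses only the counts quoted in the text and in Figure 1 (in millions of reads); the variable names and the `percent` helper are illustrative and not part of the original analysis.

```python
# Read counts in millions, as quoted in the text and Figure 1.
total_reads = 338.01
nuclear = 287.67      # reads mapped to the 12 chromosomes
organellar = 24.96    # reads mapped to chloroplast/mitochondrial genomes
unmapped = 25.37
unique = 222.62       # nuclear reads with a single mapping position
multiple = 65.05      # nuclear reads with several mapping positions


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# The three top-level classes should add up to the total (within rounding).
accounted = nuclear + organellar + unmapped

print(f"accounted for: {accounted:.2f} of {total_reads:.2f} million reads")
print(f"unmapped share: {percent(unmapped, total_reads):.1f}%")
print(f"unique share of nuclear reads: {percent(unique, nuclear):.1f}%")
```

The small difference between the summed classes and the reported total reflects rounding of the published figures to two decimals.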

Figure 1.

 Classification of the reads from six elite indica rice inbreds mapped onto the Nipponbare genome. The total number of reads (338.01 × 10^6) generated through resequencing of six genotypes is in the centre circle. The number of reads mapped onto nuclear genome, organelle genome and unmapped reads is shown in the middle circle. The outer circle represents reads with unique mapping, multiple mapping on the chromosomes as well as the reads mapped to organelle and unmapped reads.

Table 1.   Coverage of the reads from resequencing of six elite inbred parents using Illumina Genome Analyzer to the Nipponbare genome

Chromosome | Nipponbare genome (bp)* | Length of consensus sequence aligned to reference (bp) | Coverage (%) | Uniquely mapped reads (total number) | Uniquely mapped reads (bp) | Sequencing depth (fold)
Chromosome 1 | 43 261 740 | 38 067 736 | 87.99 | 27 012 387 | 1 960 018 801 | 45.17
Chromosome 2 | 35 954 743 | 31 889 071 | 88.69 | 23 765 016 | 1 724 389 561 | 47.84
Chromosome 3 | 36 192 742 | 32 938 018 | 91.01 | 24 086 909 | 1 747 746 117 | 48.17
Chromosome 4 | 35 498 469 | 29 156 010 | 82.13 | 19 856 483 | 1 440 786 406 | 40.48
Chromosome 5 | 29 737 217 | 25 732 320 | 86.53 | 18 783 399 | 1 362 923 431 | 45.76
Chromosome 6 | 30 731 886 | 25 991 135 | 84.57 | 18 536 789 | 1 345 029 410 | 43.71
Chromosome 7 | 29 644 043 | 24 713 303 | 83.37 | 16 996 652 | 1 233 277 069 | 41.56
Chromosome 8 | 28 434 780 | 23 987 843 | 84.36 | 16 629 683 | 1 206 649 798 | 42.41
Chromosome 9 | 22 696 651 | 19 119 121 | 84.24 | 13 328 131 | 967 089 185.4 | 42.54
Chromosome 10 | 22 685 906 | 19 138 394 | 84.36 | 13 438 388 | 975 089 433.3 | 42.97
Chromosome 11 | 28 386 948 | 23 255 593 | 81.92 | 15 116 942 | 1 096 885 312 | 38.62
Chromosome 12 | 27 566 993 | 22 664 299 | 82.22 | 15 065 493 | 1 093 152 172 | 39.68
Total/average | 370 792 118 | 316 652 843 | 85.40§ | 222 616 272 | 16 153 036 696 | 43.24

  1. *Nipponbare reference genome (GenBank accession AACV000000000.1).

  2. Length of the consensus sequence.

  3. Total number except for the columns of coverage and sequencing depth, which are average values.

  4. §Coverage is calculated based on the consensus of aligned length to the Nipponbare reference. The value is slightly different from the average coverage of the respective chromosomes.
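The coverage column of Table 1 can be reproduced as the aligned consensus length divided by the reference length. The following minimal sketch does this for three rows of the table; the function name `coverage_percent` is illustrative, and the sequencing-depth column is deliberately left out because the text does not state the denominator used for it.

```python
# (reference length in bp, aligned consensus length in bp), copied from Table 1.
TABLE1_ROWS = {
    "Chromosome 3": (36_192_742, 32_938_018),
    "Chromosome 11": (28_386_948, 23_255_593),
    "Total": (370_792_118, 316_652_843),
}


def coverage_percent(reference_bp, consensus_bp):
    """Percentage of the reference covered by the aligned consensus."""
    return 100.0 * consensus_bp / reference_bp


for name, (reference_bp, consensus_bp) in TABLE1_ROWS.items():
    print(f"{name}: {coverage_percent(reference_bp, consensus_bp):.2f}%")
```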

Detection of DNA polymorphisms in the elite indica inbred lines

A total of 5 111 333 polymorphisms (4 636 538 SNPs and 474 795 InDels) were detected in six elite indica rice inbred lines with minimum parameters (as defined in Experimental procedure). In order to minimize the rate of false-positive SNPs and InDels, two filters were applied: coverage ≥10 and ≤100, and SNPs/InDels from repetitive regions were eliminated. After applying these filters, the total number of DNA polymorphisms was 2 819 086 including 2 495 052 SNPs, 160 478 insertions and 163 556 deletions (Table 2). For the detection of SNPs and InDels between each of the six inbreds and Nipponbare, all the parameters were identical except minimum coverage that was ≥5 reads. Individually, the total DNA polymorphisms detected ranged from 879 916 in the inbred 9001 R to 2 445 994 in 9003 R (Table S2).
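The two filters described above (a coverage window and exclusion of repetitive regions) amount to a simple predicate on each candidate polymorphism. The sketch below is an illustration only: the `Variant` record, its field names and the toy candidates are hypothetical, and the actual filtering was carried out in the software used by the authors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """Hypothetical record for one candidate polymorphism."""
    chrom: str
    pos: int
    kind: str        # "SNP", "INS" or "DEL"
    coverage: int    # number of reads covering the site
    in_repeat: bool  # True if the site lies in a repetitive region


def passes_filters(variant, min_cov=10, max_cov=100):
    """Keep sites with min_cov <= coverage <= max_cov outside repeats."""
    return min_cov <= variant.coverage <= max_cov and not variant.in_repeat


def filter_variants(variants, min_cov=10, max_cov=100):
    return [v for v in variants if passes_filters(v, min_cov, max_cov)]


# Toy candidates, not real data.
candidates = [
    Variant("chr1", 1_200, "SNP", 43, False),   # kept
    Variant("chr1", 5_400, "SNP", 7, False),    # coverage too low
    Variant("chr2", 9_900, "DEL", 250, False),  # coverage too high
    Variant("chr3", 3_100, "INS", 40, True),    # repetitive region
]

kept_combined = filter_variants(candidates)               # combined mapping
kept_individual = filter_variants(candidates, min_cov=5)  # per-inbred setting
print(len(kept_combined), len(kept_individual))
```

Lowering `min_cov` to 5 mirrors the relaxed minimum coverage used for the individual inbreds, where the sequencing depth per line was much lower than in the combined mapping.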

Table 2.   Polymorphism in genomic DNA observed in the six elite indica rice inbreds

Chromosome no. | DNA polymorphism | DNA polymorphism/100 kb | No. of SNPs | SNP/100 kb | No. of insertions | Insertions/100 kb | No. of deletions | Deletions/100 kb
Chromosome 1 | 325 177 | 749.5 | 284 078 | 656.1 | 20 137 | 46.5 | 20 287 | 46.9
Chromosome 2 | 279 398 | 774.2 | 243 923 | 677.6 | 17 269 | 48.0 | 17 496 | 48.6
Chromosome 3 | 242 594 | 668.4 | 211 527 | 584.3 | 15 390 | 42.5 | 15 069 | 41.6
Chromosome 4 | 264 523 | 743.2 | 236 006 | 664.8 | 13 460 | 37.9 | 14 361 | 40.5
Chromosome 5 | 185 567 | 622.9 | 162 723 | 547.9 | 11 157 | 37.6 | 11 107 | 37.4
Chromosome 6 | 228 656 | 741.1 | 201 956 | 655.7 | 13 010 | 42.2 | 13 320 | 43.2
Chromosome 7 | 212 521 | 713.4 | 188 046 | 633.2 | 11 707 | 39.4 | 12 110 | 40.8
Chromosome 8 | 223 359 | 781.1 | 197 285 | 692.2 | 12 550 | 44.0 | 12 788 | 44.9
Chromosome 9 | 171 504 | 752.5 | 151 888 | 669.1 | 9362 | 41.2 | 9576 | 42.2
Chromosome 10 | 198 664 | 871.6 | 176 433 | 777.2 | 10 380 | 45.7 | 11 063 | 48.7
Chromosome 11 | 252 366 | 885.9 | 224 589 | 790.8 | 13 521 | 47.6 | 13 483 | 47.5
Chromosome 12 | 242 853 | 876.9 | 216 598 | 784.8 | 12 535 | 45.4 | 12 896 | 46.7
Total*/Average | 2 819 086* | 765.1 | 2 495 052* | 677.8 | 160 478* | 43.2 | 163 556* | 44.1

  1. SNP, single nucleotide polymorphisms.

  2. *Total number of polymorphisms.

  3. Average polymorphism per 100 kb.

Distribution of SNP and InDel variations across rice genome

The total number of SNPs and InDels detected varied across different chromosomes. The highest number of SNPs (284 078) and InDels (20 137 insertions and 20 287 deletions) was observed in chromosome 1, while chromosome 9 had the lowest number of these polymorphisms (Table 2). Detection of SNPs and InDels in each of the individuals also produced the same trend with the maximum numbers in chromosome 1 and minimum in chromosome 9 except 9001 A where the DNA polymorphisms (both SNPs and InDels) and 9003 A where SNPs detected were the least in chromosome 12 (Figure 2; Table S2).

Figure 2.

 Distribution of single nucleotide polymorphisms (SNPs) and insertions–deletions (InDels) identified between each of the six elite indica rice inbreds and Nipponbare. The x-axis represents the chromosomes. The y-axis indicates the number of SNPs/InDels. The six genotypes are A—9001 A, B—9002 A, C—9003 A, D—9001 R, E—9002 R and F—9003 R, and the figures in the parenthesis are the total number of DNA polymorphisms identified in the respective genotypes.

The genomic distribution of the patterns of DNA polymorphism between the six inbreds and the japonica Nipponbare genome was explored by calculating the frequency of polymorphisms observed for each 100-kb interval along the chromosome. The average densities of DNA polymorphisms detected per 100-kb window across the genome were 677.8 (SNPs), 43.2 (insertions) and 44.1 (deletions) in these six indica rice inbreds (Table 2). The SNP frequency showed variations within the genome with chromosome 11 and chromosome 5 having the highest (790.8) and lowest SNP density (547.9) per 100-kb interval (Table 2). At an individual level, the average density of the SNPs per 100-kb interval ranged from 215.4 (9002 A) to 288.3 (9003 R), while in case of InDels, 9001 R had the least density (10.7 insertions and 11.3 deletions/100 kb) and 9003 A had the highest density with 31.4 insertions and 31.3 deletions per 100 kb (Table S2). Overall, the number and density of the insertions and deletions detected was found to be proportionate with very slight bias on the higher side towards deletions except chromosomes 3, 5 and 11 (Tables 2 and S2). Additionally, exceptions in this trend were also observed in two individuals namely 9003 A (chromosomes 1, 2, 6, 8 and 9) and 9003 R (chromosomes 1 and 2) where the insertions detected were slightly higher than the deletions (Table S2).
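Counting polymorphisms per 100-kb interval, as described above, is a simple binning operation on variant positions. The sketch below shows one way to do it for a single chromosome; the function name, the 1-based coordinate convention and the toy positions are assumptions made for illustration, not details taken from the study.

```python
WINDOW = 100_000  # window size in bp


def window_counts(positions, chrom_length, window=WINDOW):
    """Count variants in consecutive windows along one chromosome.

    positions are assumed to be 1-based coordinates; the last window
    may be shorter than the others.
    """
    n_windows = -(-chrom_length // window)  # ceiling division
    counts = [0] * n_windows
    for pos in positions:
        counts[(pos - 1) // window] += 1
    return counts


# Toy example: four variants on a 300-kb chromosome.
toy_positions = [1, 100_000, 100_001, 250_000]
toy_counts = window_counts(toy_positions, chrom_length=300_000)
print(toy_counts)
```

Averaging such window counts over a chromosome, or over the genome, gives a per-100-kb density of the kind reported in Table 2.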

The distribution of SNPs within the chromosomes was nonrandom (Figure 3). Within each of the chromosomes, there were regions with significant variations in SNP frequencies, and as many as 60 intervals (100 kb each), with significantly higher (50 intervals) or lower (10 intervals) SNP frequencies, were observed across the genome. These include a 500-kb contiguous region between 8.7 and 9.2 Mb in chromosome 8, where an unusually high SNP density was observed. No SNPs could be detected in one interval each in chromosomes 4 (between 9.1 and 9.2 Mb) and 9 (between 6.5 and 6.6 Mb), while a longer interval of 200 kb in chromosome 11 (between 27.2 and 27.4 Mb) also did not have any SNP. This was also confirmed in the SNP analysis with each of the individual inbreds. The SNPs in each of the single individuals also reflected a nonrandom distribution within the chromosomes (Figure S2). The distribution patterns of insertions and deletions in both the combined mapping (Figures 4 and 5) and each of the individuals mapped to Nipponbare (Figures S3 and S4) were also nonrandom, similar to the distribution of SNPs. Overall, the SNP variations in the inbred lines showed extremely low SNP rates in specific segments, with chromosome 5 showing the largest contiguous region (Figure 6). Most of the SNP-poor regions detected in all six inbreds were 200 kb or shorter, except two regions, which were 500 kb or longer. Even though SNP-poor regions longer than 200 kb with extremely low SNP frequency were observed in individual inbred lines or groups of lines, higher SNP rates were observed in one or other of the remaining individuals. The SNP-poor region in chromosome 5 was the longest (4.3 Mb), followed by that in chromosome 1 (0.5 Mb).
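Contiguous SNP-poor stretches such as those described above can be located by scanning per-window SNP counts for runs of windows at or below a threshold. The sketch below is only an illustration: the text does not state how SNP-poor regions were delimited, so the threshold, the function name and the toy counts are all assumptions.

```python
def low_density_runs(counts, threshold=0, window=100_000):
    """Return (start_bp, end_bp) for each run of consecutive windows
    whose SNP count is <= threshold. Coordinates are 0-based, end-exclusive."""
    runs = []
    start = None
    for i, count in enumerate(counts):
        if count <= threshold:
            if start is None:
                start = i              # a new run begins here
        elif start is not None:
            runs.append((start * window, i * window))
            start = None
    if start is not None:              # run extends to the chromosome end
        runs.append((start * window, len(counts) * window))
    return runs


# Toy counts for five 100-kb windows, not real data.
toy_counts = [5, 0, 0, 7, 0]
toy_runs = low_density_runs(toy_counts)
print(toy_runs)
```

With a threshold of zero, a run of two empty windows corresponds to a 200-kb interval without any SNP; raising the threshold would instead report regions of low but nonzero SNP frequency.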

Figure 3.

 Distribution of single nucleotide polymorphisms (SNPs) between six elite indica rice inbreds and Nipponbare in the 12 rice chromosomes. The x-axis represents the physical distance along each chromosome, split into 100-kb windows. The total size of each chromosome is shown in brackets. The y-axis indicates the number of SNPs. The total number of SNPs in each chromosome is shown in the parenthesis.

Figure 4.

 Distribution of insertions between six elite indica rice inbreds and Nipponbare in the 12 rice chromosomes. The x-axis represents the physical distance along each chromosome, split into 100-kb windows. The total size of each chromosome is shown in brackets. The y-axis indicates the number of insertions. The total number of insertions in each chromosome is shown in the parenthesis.

Figure 5.

 Distribution of deletions between six elite indica rice inbreds and Nipponbare in the 12 rice chromosomes. The x-axis represents the physical distance along each chromosome, split into 100-kb windows. The total size of each chromosome is shown in brackets. The y-axis indicates the number of deletions. The total number of deletions in each chromosome is shown in the parenthesis.

Figure 6.

 Single nucleotide polymorphism (SNP) variations across the genome in the six elite indica rice inbred lines (a) 9001 A; (b) 9001 R; (c) 9002 A; (d) 9002 R; (e) 9003 A; (f) 9003 R. The x-axis represents the physical distance along the chromosomes, in which each tick-mark is a megabase. The regions with extremely low SNP frequencies are shown as dark red blocks, and the regions with higher than expected frequencies are noted as light green blocks for each of the chromosomes.

Analysis of SNPs and InDels

The SNPs detected in each of the six inbreds were classified as transitions (C/T and G/A) or transversions (C/G, T/A, A/C and G/T) based on nucleotide substitutions (Table 3). The proportions of transitions (Ts) were significantly higher than the proportion of transversions (Tv) in all the six inbreds. Among the transitions, the number of C/T transitions was slightly higher than G/A transitions, while T/A transversions were relatively higher compared with other transversions namely A/C, G/T and C/G. The ratio between transitions and transversions (Ts/Tv) was 2.0 in all the six inbred lines.

Table 3.   Classification of nucleotide substitutions in the single nucleotide polymorphisms detected in the six elite indica rice inbreds

Substitutions | 9001 A | 9002 A | 9003 A | 9001 R | 9002 R | 9003 R
Transitions (Ts)
 C/T | 279 448 | 266 588 | 278 478 | 266 459 | 321 706 | 358 866
 G/A | 277 818 | 265 753 | 277 839 | 265 420 | 320 948 | 358 314
Transversions (Tv)
 C/G | 50 820 | 48 204 | 49 774 | 48 579 | 58 502 | 65 294
 T/A | 81 657 | 78 835 | 79 255 | 78 145 | 95 310 | 102 162
 A/C | 72 973 | 69 607 | 71 584 | 69 708 | 84 152 | 92 259
 G/T | 72 925 | 69 804 | 71 721 | 69 467 | 83 914 | 91 889
Ts/Tv ratio | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0
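The classification of substitutions and the Ts/Tv ratio in Table 3 can be illustrated as follows. The sketch classifies a substitution as a transition when both bases are purines or both are pyrimidines, recomputes the ratio for inbred 9001 A from the counts in Table 3, and derives the ratio expected if all substitutions were equally likely. Function and variable names are illustrative.

```python
from itertools import permutations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_class(base_a, base_b):
    """Classify a substitution between two different bases."""
    if base_a == base_b:
        raise ValueError("bases must differ")
    same_ring_type = ({base_a, base_b} <= PURINES
                      or {base_a, base_b} <= PYRIMIDINES)
    return "transition" if same_ring_type else "transversion"


def ts_tv_ratio(counts):
    """counts maps 'X/Y' substitution labels to numbers of SNPs."""
    ts = tv = 0
    for label, n in counts.items():
        base_a, base_b = label.split("/")
        if substitution_class(base_a, base_b) == "transition":
            ts += n
        else:
            tv += n
    return ts / tv


# Counts for inbred 9001 A, copied from Table 3.
counts_9001a = {
    "C/T": 279_448, "G/A": 277_818,
    "C/G": 50_820, "T/A": 81_657, "A/C": 72_973, "G/T": 72_925,
}
observed = ts_tv_ratio(counts_9001a)

# With equal rates, each base has one transition and two transversion partners.
all_pairs = list(permutations("ACGT", 2))
n_ts = sum(substitution_class(a, b) == "transition" for a, b in all_pairs)
expected = n_ts / (len(all_pairs) - n_ts)

print(f"observed Ts/Tv = {observed:.2f}, expected under equal rates = {expected}")
```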

Of the 160 478 insertions and 163 556 deletions detected, the insertions ranged from 1 to 7 bp in length, while the deletions were up to 8 bp long (Figure 7). The majority of the InDels (65.2%) were mononucleotide (105 721 insertions and 105 439 deletions), 31.2% were 2- to 4-bp InDels (49 968 insertions and 51 068 deletions) and a meagre 3.7% were ≥5 bp. The InDels detected in each of the six inbreds individually were also biased towards mononucleotide insertions and deletions, which ranged from 42 115 (9001 R) to 116 273 (9003 A) and from 39 985 (9001 R) to 116 735 (9003 A), respectively (Figure S5).
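The length-class percentages quoted above follow directly from the counts in the text. The short sketch below recomputes them; the share of InDels of 5 bp or longer is obtained by subtraction, and the variable names are illustrative.

```python
# Counts quoted in the text.
total_insertions = 160_478
total_deletions = 163_556
mono = 105_721 + 105_439       # 1-bp insertions + deletions
two_to_four = 49_968 + 51_068  # 2- to 4-bp insertions + deletions

total_indels = total_insertions + total_deletions
five_or_more = total_indels - mono - two_to_four  # remainder, by subtraction


def share(count):
    """Percentage of all InDels falling in a length class."""
    return 100.0 * count / total_indels


print(f"1 bp:   {share(mono):.1f}%")
print(f"2-4 bp: {share(two_to_four):.1f}%")
print(f">=5 bp: {share(five_or_more):.1f}%")
```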

Figure 7.

 Distribution of insertions–deletions (InDel) polymorphisms in six elite indica rice inbreds based on their length. The x-axis shows the number of nucleotides of deletions (yellowish orange) and insertions (olive green). The y-axis shows the number of InDels at each length.

Annotation of SNPs and InDels

The use of the annotated reference sequence of Nipponbare enabled us to annotate the SNPs and InDels detected between these elite inbreds and Nipponbare. Accordingly, a total of 2 495 052 SNPs, 160 478 insertions and 163 556 deletions were detected in the nonrepeat regions of the genome, out of which 497 250 SNPs (19.9%), 35 871 insertions (22.4%) and 36 731 deletions (22.4%) were located in 25 591, 10 269 and 10 321 genes, respectively. Among these, 146 604 SNPs (3.2% of the total), 1733 insertions (0.9% of the total) and 1887 deletions (0.9% of the total) were found in coding sequences. Altogether, 83 262 nonsynonymous SNPs (1.8% of the total SNPs) were detected in all the six elite inbred lines that were located onto 16 379 genes (Figure 8). The annotation of SNPs in each of the six individuals revealed that the nonsynonymous SNPs ranged from 21 501 (9002 A) to 30 659 (9003 R), while that of the insertions and deletions in coding sequences ranged from 337 (9001 R) to 1417 (9003 A) and 528 (9002 A) to 1729 (9003 A), respectively (Figure S6). Annotations of insertions and deletions showed that 72 601 InDels were observed in 17 852 genes in the inbred lines.
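Whether a coding-sequence SNP is synonymous or nonsynonymous depends on whether the altered codon still encodes the same amino acid. The sketch below illustrates that test with the standard genetic code; it is not the annotation procedure used in the study, and the function name and arguments are hypothetical. Changes to or from a stop codon are counted as nonsynonymous here.

```python
# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_coding_snp(ref_codon, offset, alt_base):
    """Classify a SNP at 0-based offset within a reference codon."""
    ref_codon = ref_codon.upper()
    alt_codon = ref_codon[:offset] + alt_base.upper() + ref_codon[offset + 1:]
    if alt_codon == ref_codon:
        raise ValueError("alternative base equals the reference base")
    unchanged = CODON_TABLE[ref_codon] == CODON_TABLE[alt_codon]
    return "synonymous" if unchanged else "nonsynonymous"


print(classify_coding_snp("GAA", 2, "G"))  # GAA -> GAG, both glutamate
print(classify_coding_snp("GAA", 1, "T"))  # GAA -> GTA, glutamate to valine
```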

Figure 8.

 Annotation of single nucleotide polymorphisms (SNPs), insertions and deletions identified between the six elite indica rice inbreds and Nipponbare. SNPs, insertions and deletions detected between the inbreds and Nipponbare were classified based on the annotations of Nipponbare reference genome.

The number of nonsynonymous SNPs per kb of each gene had a wide distribution, ranging from 0.04 to 35.71, with a mean of 1.56. The five-number summary of the box and whisker plot was used to calculate the outlier values, and 1347 genes were classified as outliers with more than 3.9 nonsynonymous SNPs per kb (Figure 9). Individually, the genotypes had 740 (9002 A) to 934 (9003 R) genes classified as outliers based on the higher number of nonsynonymous SNPs per kb detected in the respective individuals (Figure S7).
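Outlier detection from a box and whisker plot is commonly done with an upper fence placed above the third quartile. The sketch below assumes the conventional Tukey rule (third quartile plus 1.5 times the interquartile range); the text does not state the exact rule used, and the per-gene values are toy numbers rather than data from the study.

```python
import statistics


def upper_outliers(values, k=1.5):
    """Return values above Q3 + k * IQR (Tukey's upper fence, assumed here)."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    upper_fence = q3 + k * (q3 - q1)
    return [v for v in values if v > upper_fence]


# Toy nonsynonymous-SNP densities (per kb) for eight hypothetical genes.
toy_densities = [0.5, 1.0, 1.2, 1.5, 1.8, 2.0, 2.4, 12.0]
toy_outliers = upper_outliers(toy_densities)
print(toy_outliers)
```

Applied to the per-gene densities of the combined data set, a rule of this kind would yield a single cut-off analogous to the 3.9 nonsynonymous SNPs per kb reported above.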

Figure 9.

 The degree of distribution and skewness of the number of nonsynonymous single nucleotide polymorphisms (SNPs) per kb in 16 379 genes that were annotated by the 83 262 nonsynonymous SNPs detected in the six elite indica rice inbred lines. The outlier value calculated indicated that 1347 genes (light green bars) had a nonsynonymous SNP number of >3.9 per kb in the gene(s).

Discussion

Whole-genome resequencing for detecting genome-wide DNA polymorphisms

Genome-wide analysis based on sequencing of two rice cultivars, Nipponbare and 93-11, has led to the discovery of millions of SNPs across the rice genome (Feltus et al., 2004; Shen et al., 2004). In spite of rapid advances in next-generation sequencing technologies, only a few japonica rice cultivars, namely Koshihikari (Yamamoto et al., 2010), Omachi (Arai-Kichise et al., 2011), Eiko and Rikuu 132 (Nagasaki et al., 2010), have been sequenced by whole-genome resequencing. Therefore, efforts are underway to sequence a diverse set of additional rice genotypes through whole-genome sequencing to enlarge the pool of DNA polymorphisms (SNPs and InDels) occurring throughout the rice genome, which could be utilized for genetic analysis as well as for the improvement of rice (McCouch et al., 2010).

In the present study, the whole genome of six elite indica rice inbred lines was resequenced and mapped to Nipponbare as a reference genome for discovering genome-wide DNA variations. The uniquely mapped reads from these lines covered 85.4% of the reference genome, providing an average coverage of ×43.2 across the genome. About 7.5% (25.37 × 10^6 reads) of the total reads from the indica rice inbreds did not map to either the organellar or nuclear genome of Nipponbare. At least 6% of the genome has been shown to be unusually divergent between the japonica rice cultivar Nipponbare and the indica rice 93-11 (Tang et al., 2006), and the unmapped reads observed when mapping reads from indica inbreds to the japonica reference may be due to the inherent differences in the genomes accumulated through genetic differentiation between the indica and japonica subspecies.

A total of 2 495 052 SNPs and 324 034 InDels were detected between the six inbreds and Nipponbare, with an average density of 677.8 SNPs per 100 kb (Table 2), which is about four times the SNP density identified between the japonica cultivar Nipponbare and the indica variety 93-11 (Feltus et al., 2004). Tenaillon et al. (2001) suggested that greater SNP rates could be correlated with a higher level of diversity, and the higher SNP rate in the six elite indica inbreds than previously reported between Nipponbare and 93-11 is perhaps a result of the genetic diversity in the inbreds. The average polymorphism rate detected in our study, 6.78 SNPs/kb in the nonrepetitive region of the rice genome, is significantly higher than the previous estimates of 4.31 SNPs/kb (Nasu et al., 2002) and 1.70 SNPs/kb (Feltus et al., 2004), offering high-density coverage across the entire rice genome, which could be utilized for high-resolution scans and genome-wide linkage disequilibrium studies in indica rice.

Nonrandom distribution of DNA polymorphisms in rice genome

An analysis of the number of SNPs and InDels in the six elite indica rice inbreds showed that the observed number of these polymorphisms in each of the chromosomes was significantly different from the number expected based on their physical size. Further analysis of the distribution of SNPs in each chromosome of the six inbreds revealed SNP-rich and SNP-poor regions along the chromosomes. The occurrence of such regional variations in the genome has also been previously reported in rice (Feltus et al., 2004) as well as in Arabidopsis (Nordborg et al., 2005), wheat (Ravel et al., 2006) and human (Smith and Lercher, 2002). In general, the regions with higher SNP frequency were shorter than the regions with lower SNP frequency. Noteworthy among the SNP-poor regions was an interval of 4.3 Mb between 9.1- and 15.4-Mb region on chromosome 5 in all the six inbreds. Interestingly, this phenomenon, popularly termed ‘SNP deserts’, has also been reported between Nipponbare and 93-11 (Wang et al., 2009) and between Nipponbare and three other japonica rice cultivars, Koshihikari, Rikuu 132 and Eiko (Nagasaki et al., 2010). Although selective sweeps during the process of domestication in rice are a plausible explanation for the reduction in SNP frequencies across the genome (Caicedo et al., 2007), this phenomenon in chromosome 5, observed in both indica and japonica rice, needs elucidation through further in-depth study.

Bias in base substitutions

The ratio of transitions to transversions (Ts/Tv) in the six inbred lines was 2.0, an upward bias towards transitions from the expected ratio of 0.5. This phenomenon known as ‘transition bias’ has previously been reported in rice (Morton, 1995) and maize (Batley et al., 2003). This has been attributed to a higher frequency of transitional mutations over transversions because of conformational advantage in case of mispairing, and better tolerance of transitions during natural selection, because transitions are more likely to conserve the protein structure than transversions (Wakeley, 1996). Although the ratio observed in this study is significantly higher compared with earlier reports in rice, such a high bias has also been reported in SNPs from the chicken genome with Ts/Tv ratio ranging from 2.3 to 4.0 (Vignal et al., 2002). The higher ratio may be due to the inclusion of SNPs from throughout the rice genome (both genic and intergenic) in the present study.

Within transitions, the C/T transitions were higher in number, possibly due to the higher frequency of the C to T mutation after methylation (Coulondre et al., 1978). The relative abundance of C/T mutations has also been reported in other crops such as common bean (Ramírez et al., 2005), maize (Batley et al., 2003), grape (Lijavetzky et al., 2007) and citrus (Terol et al., 2008). Among the transversions, T/A transversions were more numerous than the other transversions, which has also been reported in citrus (Terol et al., 2008). Because of the higher stringency used in the mapping parameters, the maximum length of the InDels detected in the present study was restricted to only 8 bp, as compared to 36 bp in Omachi/Nipponbare (Arai-Kichise et al., 2011) and 82 bp in 93-11/Nipponbare (Shen et al., 2004).

Nonsynonymous polymorphisms in rice breeding

Annotation of the polymorphisms in the six rice inbreds shows that more than half of the SNPs occur in the nonrepeat regions, while 10.7% of the total SNPs have been found in 25 591 genes across the rice genome. Among the SNPs in genic regions, only 29.5% have been found in coding sequences, with 16.7% of the genic SNPs being nonsynonymous in nature. The proportion of nonsynonymous SNPs in coding sequences (about 56.8%) is higher than that of synonymous SNPs. Overall, 83 262 nonsynonymous SNPs spanning 16 379 genes and 3620 nonsynonymous InDels in 2625 genes have been identified from the six rice inbreds, which would provide valuable insights into the basis underlying the performance of the inbreds and the hybrids between these inbred combinations, as these polymorphisms are more likely to have phenotypic effects.

Conclusion

In the present study, whole-genome variations in indica rice were analysed by detecting SNPs, insertions and deletion polymorphisms among the six elite indica rice inbred lines. Even though individual SNPs are not very informative compared with SSR markers because of lower expected heterozygosity, constructing haplotypes using a set of SNPs enhances the capability of SNPs in genetic analysis enormously (Rafalski, 2002). Additionally, InDels have also been proved to be potential markers with applicability in genetic studies. SNP and InDel discovery by next-generation sequencing has become the main option for genetic marker discovery (Bundock et al., 2010; Imelfort et al., 2010).

Genome-wide association analysis with the performance of the hybrids generated from these elite inbreds using the SNPs discovered in our study will provide valuable insights into the molecular basis of heterosis. With further progress in techniques for analysing SNPs, the millions of SNPs and InDels identified from a diverse set of indica rice inbreds in the present study in addition to the already available SNP resources provide high-density coverage across the rice genome, which will be useful for the analysis of genetic diversity, QTL mapping, marker-assisted breeding, positional cloning, comparative mapping and association studies in rice.

Experimental procedures

Sample preparation and sequencing

DNA of the six elite indica rice inbreds was extracted from leaf tissue using a Qiagen DNeasy kit (Qiagen Pty Ltd., Melbourne, Victoria, Australia). Approximately 1 μg of total DNA was sheared, polished and prepared following the manufacturer’s instructions (Illumina sample preparation protocol for paired-end sequencing, kit FC-102-1002), with the following modifications. Briefly, DNA was sheared using the adaptive focused acoustics method on a Covaris S2 device with the following settings: duty cycle 10%; intensity 5; cycles per burst 200 for 180 s at 6 °C. Ligation products were purified by electrophoresis on Invitrogen E-gel SizeSelect 2% (Invitrogen Corporation, Carlsbad, CA). Fragments of predominantly 450 bp were selected from the gel. PCR products were further purified with a QIAquick PCR Purification Kit (Qiagen Pty Ltd.) and quantified using a DNA 1000 chip on an Agilent BioAnalyzer 2100 (Agilent Technologies, Palo Alto, CA). Approximately 4 pmol per individual was sequenced with 76 × 2 cycles on an Illumina Genome Analyser (GAIIx) using sequencing kit v4. Base calling was performed using Illumina software Pipeline 1.4 (Illumina, San Diego, CA).

Mapping of reads to the reference

Paired-end sequence reads were trimmed of low-quality data (quality score limit of 0.01) and adaptor sequence in CLC Genomics Workbench 4.0 (http://www.clcbio.com), and reads of <30 base pairs (bp) in length were discarded. Trimmed short-read sequences were first aligned to the published rice organellar genomes (chloroplast genome: GenBank accession AY522331.1; mitochondrial genome: GenBank accession DQ167400.1), and the unmapped reads were then assembled against the nuclear genome (GenBank accession AACV000000000.1). The reads were assembled to the Nipponbare reference with CLC Genomics Workbench using the following parameters: mismatch cost, 2; insertion cost, 3; deletion cost, 3; length fraction, 0.9; and similarity, 0.9. Reads that aligned to more than one position in the reference genome were filtered and used to determine the numbers of reads mapping to multiple positions in the reference and of unmapped reads. A combined assembly was performed using the reads from all six inbreds, and an individual assembly was also performed separately for each of the six inbreds.
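The trimming and length filter described above can be illustrated with a minimal sketch. It assumes that the limit of 0.01 is interpreted on the error-probability scale (equivalent to Phred 20) and keeps the stretch of a read that maximises the running sum of (limit minus error probability). This is a generic illustration, not the CLC Genomics Workbench implementation; the function names `error_probability`, `trim_read` and `trim_and_filter` are hypothetical.

```python
def error_probability(phred):
    """Convert a Phred quality score to a base-calling error probability."""
    return 10 ** (-phred / 10)


def trim_read(quals, limit=0.01):
    """Return (start, end) of the slice to keep.

    The kept slice maximises the running sum of (limit - p_error),
    so stretches of bases worse than the limit are cut away.
    Returns (0, 0) if no base is better than the limit.
    """
    running = 0.0
    best_sum = 0.0
    start = 0
    best = (0, 0)
    for i, q in enumerate(quals):
        running += limit - error_probability(q)
        if running <= 0:
            # Low-quality stretch: restart after this base.
            running = 0.0
            start = i + 1
        elif running > best_sum:
            best_sum = running
            best = (start, i + 1)
    return best


def trim_and_filter(seq, quals, limit=0.01, min_len=30):
    """Trim a read and discard it (return None) if shorter than min_len."""
    start, end = trim_read(quals, limit)
    if end - start < min_len:
        return None
    return seq[start:end]


# Hypothetical 45-base read: poor ends, good middle.
quals = [5] * 3 + [35] * 40 + [5] * 2
kept = trim_and_filter("A" * 45, quals)
```

In this example the three poor bases at the start and the two at the end are removed, and the remaining 40 bases pass the 30-bp length filter.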

Discovery of SNPs and InDel polymorphisms

The contigs from the combined assembly were screened for SNPs and InDels across all six inbreds mapped to Nipponbare using CLC Genomics Workbench 4.0, with the following criteria for variant detection: minimum coverage of 10× (average 43×) and minimum frequency of the least frequent allele of 10%. For SNP detection, a 21-bp window was additionally applied, with a central base quality score of 30 and surrounding base quality scores of 20 or more. SNPs and InDels for each of the six inbreds were also detected from the contigs of the individual assemblies, with a minimum coverage of 4× and other parameters as in the combined assembly. Additionally, a stringent minimum variant frequency was applied in individual inbreds: ≥90% for SNP and ≥30% for InDel detection. Assembling a resequenced sample of rice cultivar Nipponbare against the Nipponbare reference genome and detecting SNPs with identical parameters identified 1011 SNPs. These SNPs could be resequencing errors, errors in the reference Sanger sequence, or within-cultivar sequence differences that arise from drift. Further, the SNPs from the combined assembly were cross-validated by comparing them with the SNPs detected in the individual assemblies, identifying SNPs with only one variant and filtering for SNPs with a minimum frequency of ≥50%. Details are provided in Table S3.
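The detection criteria above can be sketched as two simple filters: a neighbourhood quality check on each read base (21-bp window, central quality of at least 30, surrounding quality of at least 20) and a per-site coverage and allele-frequency threshold. This is a minimal sketch under stated assumptions, not the CLC Genomics Workbench algorithm: the function names are hypothetical, windows are simply truncated at read ends, and the frequency threshold is interpreted as applying to each non-reference allele.

```python
from collections import Counter


def passes_neighbourhood_quality(quals, pos, window=21,
                                 min_central=30, min_surrounding=20):
    """Check the base at `pos` and its neighbours within the window."""
    if quals[pos] < min_central:
        return False
    half = window // 2
    lo = max(0, pos - half)
    hi = min(len(quals), pos + half + 1)
    return all(quals[i] >= min_surrounding
               for i in range(lo, hi) if i != pos)


def variant_alleles(bases, ref_base, min_coverage=10, min_freq=0.10):
    """Return non-reference alleles meeting coverage and frequency thresholds.

    `bases` holds the quality-passing base calls covering one site.
    """
    depth = len(bases)
    if depth < min_coverage:
        return []
    counts = Counter(bases)
    return sorted(allele for allele, n in counts.items()
                  if allele != ref_base and n / depth >= min_freq)


# Combined-assembly thresholds: 10x coverage, 10% allele frequency.
combined = variant_alleles(list("AAAAAAAAAG"), "A")

# Individual-inbred SNP thresholds: 4x coverage, 90% variant frequency.
individual = variant_alleles(list("GGGG"), "A",
                             min_coverage=4, min_freq=0.90)
```

The same `variant_alleles` function covers both the combined and the individual settings because only the thresholds differ between them.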

Analysis of genome-level variations

To calculate the variation in DNA polymorphisms across the genome, each chromosome was analysed with a sliding window of 100 kb to determine the rates of SNPs and InDels in each window. To determine the regions that deviated significantly from the expected SNP rate, a box-and-whisker plot was constructed, from which SNP-rich and SNP-poor regions were obtained. Additionally, regions with <10 SNPs per 100 kb were considered to be regions with extremely low SNP variation.
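The window-based analysis can be sketched as follows. The sketch assumes non-overlapping 100-kb windows (the step size is not stated in the text) and 1-based SNP positions that lie within the chromosome; the function names and the toy positions are hypothetical. Counts in a final window shorter than 100 kb are not length-normalised here.

```python
def snp_counts_per_window(positions, chrom_length, window=100_000):
    """Count SNPs in consecutive, non-overlapping windows."""
    n_windows = -(-chrom_length // window)  # ceiling division
    counts = [0] * n_windows
    for pos in positions:  # 1-based coordinates
        counts[(pos - 1) // window] += 1
    return counts


def low_snp_windows(counts, threshold=10):
    """Indices of windows with fewer than `threshold` SNPs."""
    return [i for i, c in enumerate(counts) if c < threshold]


def merge_runs(indices):
    """Merge consecutive window indices into (first, last) runs."""
    runs = []
    for i in indices:
        if runs and i == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], i)
        else:
            runs.append((i, i))
    return runs


# Toy example: five SNPs on a 300-kb chromosome.
counts = snp_counts_per_window(
    [1, 50_000, 100_000, 100_001, 250_000], chrom_length=300_000)
low = low_snp_windows(counts, threshold=2)
regions = merge_runs(low)
```

Merging adjacent low-density windows in this way is one simple route to delimiting a contiguous low-SNP region spanning many windows.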

The annotations of the SNPs and InDels were derived from the annotated sequence of the Nipponbare reference genome and used for further analysis. Genes carrying more nonsynonymous SNPs than expected were identified as outliers using the five-number summary of the box-and-whisker plot.
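Outlier detection from a five-number summary can be illustrated with a short sketch. It assumes the common box-plot convention that values above the third quartile plus 1.5 times the interquartile range are outliers; the multiplier and the quartile method are assumptions, as neither is specified in the text, and the per-gene counts below are invented for illustration.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


def upper_outliers(values, k=1.5):
    """Values above the upper fence Q3 + k * IQR (k = 1.5 is an assumption)."""
    _, q1, _, q3, _ = five_number_summary(values)
    fence = q3 + k * (q3 - q1)
    return [v for v in values if v > fence]


# Hypothetical nonsynonymous SNP counts for nine genes.
per_gene_counts = [1, 2, 3, 4, 5, 6, 7, 8, 40]
summary = five_number_summary(per_gene_counts)
outliers = upper_outliers(per_gene_counts)
```

For these invented counts the quartiles are 3 and 7, so the upper fence is 13 and only the gene with 40 nonsynonymous SNPs is flagged.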

Acknowledgements

Gopala Krishnan, S., acknowledges the Department of Science and Technology, Government of India, for financial support under the BOYSCAST Fellowship and the Indian Council for Agricultural Research, New Delhi, for granting permission to avail the fellowship. The authors acknowledge technical assistance with Illumina sequencing from Mark Edwards and Stirling Bowen of Southern Cross Plant Genomics, Southern Cross University, and thank Peter Bundock for helpful discussions.
