• Open Access

High-throughput and parallel SNP discovery in selected candidate genes in Eucalyptus camaldulensis using Illumina NGS platform


(fax +91 80 2839 4352;
email prasad.hendre@itc.in)


Next generation sequencing (NGS) technologies have revolutionized the pace and scale of genomics- and transcriptomics-based SNP discovery across different plant and animal species. Herein, 72-base paired-end Illumina sequencing was employed for high-throughput, parallel and large-scale SNP discovery in 41 growth-related candidate genes in Eucalyptus camaldulensis. Approximately 100 kb of genome from 96 individuals was amplified and sequenced using a hierarchical DNA/PCR pooling strategy and assembled over the corresponding E. grandis reference. A total of 1191 SNPs (minimum 5% other allele frequency) were identified with an average frequency of 1 SNP/83.9 bp, whereas in exons and introns, it was 1 SNP/108.4 bp and 1 SNP/65.6 bp, respectively. A total of 75 insertions and 89 deletions were detected, of which approximately 15% were exonic. Transitions (Tr) outnumbered transversions (Tv) overall (Tr/Tv: 1.89), and the excess was more pronounced in exons (Tr/Tv: 2.73). In exons, synonymous SNPs (Ks) prevailed over non-synonymous SNPs (Ka; average Ka/Ks ratio: 0.72, range: 0–3.00 across genes). Many of the exonic SNPs/indels had the potential to change the amino acid sequence of the respective genes. Transcription factors appeared more conserved, whereas enzyme coding genes appeared under relaxed control. Further, 541 SNPs were classified into 196 ‘equal frequency’ (EF) blocks with almost similar minor allele frequencies to facilitate selection of one tag-SNP/EF-block. A total of 241 SNPs (approximately 20%) had no other SNP in the surrounding ±60 bp window (‘zero-SNP’ windows). The data thus indicated enormous extant and unexplored diversity in E. camaldulensis in the studied genes, with potential applications for marker-trait associations.


Modern tools for large-scale, high-throughput and massively parallel DNA sequence discovery and genotyping have revolutionized the fields of genomics, transcriptomics and their applications for SNP marker discovery and trait mapping (Heard et al., 2010). Next (second/third) generation sequencing (NGS) platforms such as Illumina, 454-pyrosequencing (Roche), SOLID (Invitrogen), and Ion Torrent (Invitrogen) have the capability to produce massive sequence output, making high-throughput DNA marker discovery feasible and cost-effective (Harismendy et al., 2009; Paszkiewicz and Studholme, 2012). Among all these platforms, Illumina is preferred for de novo sequencing, re-sequencing, high-throughput SNP discovery and digital gene expression studies because of its very high sequencing depth and the quantitative nature of its data (Harismendy et al., 2009; Paszkiewicz and Studholme, 2012). 454-Roche sequencing is advantageous for the longer read lengths and qualitative data necessary to generate high quality contig assemblies, especially for de novo genome and transcriptome sequencing (Harismendy et al., 2009). These techniques have been used for large-scale SNP discovery in large representative plant populations to assess and discover extant candidate gene-based SNP diversity in rice (Kharabian-Masouleh et al., 2011), arabidopsis (Schneeberger and Weigel, 2011), pines (Gonzalez-Martinez et al., 2007), sugarcane (Bundock et al., 2009) and eucalyptus (Kulheim et al., 2009). High-density DNA pools from experimental cohorts can apparently facilitate discovery of the SNP landscape in a population with high confidence (Wei et al., 2011). Other NGS approaches such as reduced representation libraries (RRLs) and RNA sequencing can also be undertaken for SNP discovery (Grattapaglia et al., 2011; Novaes et al., 2008; Van Tassell et al., 2008).

The genus Eucalyptus consists of more than 700 species spread across diverse agro-climatic conditions (Bennett, 2010; Grattapaglia and Kirst, 2008). Eucalyptus camaldulensis or ‘river red gum’ is one of the fast-growing and hardy eucalypt species well adapted to tropical conditions in India (Bennett, 2010). For eucalyptus improvement, modern/improved molecular techniques such as genome-wide marker screens, trait-marker associations and subsequent marker-assisted selection potentially offer a great advantage over traditional tree breeding methods (Grattapaglia and Kirst, 2008; Thumma et al., 2010). Contemporary tools such as whole/partial genome/transcriptome-based single-nucleotide polymorphism (SNP) markers are still scarce but are being explored and studied in different Eucalyptus species (Grattapaglia and Kirst, 2008; Grattapaglia et al., 2011; Kulheim et al., 2009; Novaes et al., 2008; Sexton et al., 2010; Thumma et al., 2005, 2009). Initial NGS-based large-scale SNP discovery in E. grandis was undertaken by pooled RNA sequencing by Novaes et al. (2008), whereas Kulheim et al. (2009) performed it on a high-density DNA pool of 1764 samples for 23 genes from secondary metabolite pathways. Recently, one of the first efforts to convert SNPs into a high-throughput multiplex genotyping assay employed a 768-SNP golden-gate chip over five Eucalyptus species (Grattapaglia et al., 2011). Although these tools are still under development, there have been a few reports establishing strong associations between SNP markers and phenotypes in eucalypts. Thumma et al. (2005) successfully associated two SNP markers in the gene cinnamoyl coA-reductase (CCR) with micro-fibril angle in two Eucalyptus species. Similarly, Thumma et al. (2009) showed association of an SNP marker, EnCOBL4, with pulp yield in a population of E. nitens. Sexton et al. (2010) putatively associated several SNPs present in four genes (CCR, CAD, MYB1 and MYB2) with wood traits in E. pilularis. 
Until now, such efforts were primarily constrained by the unavailability of genomic sequence information but, with the recent release of the (almost) whole genome sequence of E. grandis (EUCAGEN consortium website, http://eucalyptusdb.bi.up.ac.za/gbrowse8x; presently hosted by phytozome, http://www.phytozome.net/cgi-bin/gbrowse/eucalyptus), many new possibilities have emerged for potential re-sequencing.

To identify SNPs in E. camaldulensis, we employed 72-base paired-end Illumina high-throughput sequencing (HTS) over DNA/PCR product pool to discover SNPs present in 41 candidate genes implicated in growth (Alonso-Blanco et al., 2005; Busov et al., 2008; Cronk, 2001; Kim and Cho, 2006; Piazza et al., 2005). The DNA/PCR product pool of 96 individuals (1/4th of the association mapping population) resulted in identification of 1191 informative SNPs with a frequency of 1 SNP every 83.9 bp. The SNPs were further studied for type of base change (transition versus transversion), coding potential (non-synonymous or radical versus synonymous or silent), nucleotide diversity indices and ‘zero-SNP’ blocks. Nearest neighbourhood (NN) frequencies also provided an indication for probable linkage disequilibrium (LD) blocks as equal frequency (EF) blocks.


Selection of candidate genes

Growth-related candidate genes selected for present study included those involved in DNA binding/transcription (transcription factors, TFs), molecular transport, hormonal pathways/synthesis, signalling pathway, protein binding and enzyme catalysis. Their TAIR gene IDs, reference sequence co-ordinates, functional classification, role in plant growth and covered reference regions are described in Table S1. All the genes were successfully annotated and classified into various genic regions (Table S2), except for TASIR-ARF, which is a non-translating trans-acting siRNA (Wang et al., 2010b). A pair of paralogous genes ABCTRII_1 and ABCTRII_2 from gene family ABC Transporter II was included in the study to assess the effect of paralogous regions on the process of SNP discovery.

Illumina NGS for indel and SNP discovery

Approximately 15 million reads (62.2% of all the raw >Q20 reads) were aligned onto ∼100 kb of the reference E. grandis genome of ∼106 kb size (94.2%). Average read depth for all the bases was 6124×. In the preliminary alignment/assembly, a total of around 12 000 putative SNPs were discovered but, after SNP definition filters (Materials and methods) were applied, the number reduced to 1191 (approximately 12.5% of the total). Approximately 100 bases at the beginning and end of each contig assembly were trimmed because of the very low depth of the 131 SNPs located in these regions (176×, 35 times less than the overall average depth). Most of the SNPs were biallelic, except for a meagre proportion of 0.67% triallelic SNPs. The SNPs had an overall average spread of 1 SNP/83.9 bp (or 11.9 SNPs/kbp; Table 1). A detailed distribution of SNPs in different genic compartments and their frequencies is given in Table S2 and depicted in Figure 1. Among different genic compartments across all the genes, introns had greater SNP frequency (1/65.6 bp) than exons (1/108.4 bp; 1.3 times more SNPs in introns than exons). There were a total of 779 transitions (Tr) and 412 transversions (Tv), which indicated an overall Tr/Tv ratio of 1.89. Within different compartments, exons had the highest Tr/Tv ratio (2.73), followed by the others with almost similar ratios (introns: 1.63, 3′UTR: 1.57, unclassified: 1.38, non-genic: 1.66). In exons, transitions were mostly synonymous (Ks; approximately 65%), whereas transversions were mostly non-synonymous (Ka; approximately 62%), with an average Ka/Ks of 0.72 (Table 1). Overall, around 42% of the SNPs were non-synonymous (Table 1).
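The summary ratios in this paragraph follow from simple counts. The short Python sketch below classifies a base substitution as a transition or a transversion and recomputes the overall Tr/Tv ratio and SNP density from the totals reported above. The function names are illustrative assumptions and are not part of the original analysis pipeline.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(ref: str, alt: str) -> str:
    """Classify a single-base change as 'transition' or 'transversion'."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt or {ref, alt} - (PURINES | PYRIMIDINES):
        raise ValueError("expected two different bases from A, C, G, T")
    # Transition: both bases are purines, or both are pyrimidines.
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def tr_tv_ratio(transitions: int, transversions: int) -> float:
    """Ratio of transition count to transversion count."""
    return transitions / transversions


def snps_per_kbp(bp_per_snp: float) -> float:
    """Convert a spacing of one SNP per N bp into SNPs per 1000 bp."""
    return 1000.0 / bp_per_snp


# Totals reported in the text: 779 transitions, 412 transversions, 1 SNP/83.9 bp.
overall_ratio = round(tr_tv_ratio(779, 412), 2)
density = round(snps_per_kbp(83.9), 1)
print(overall_ratio, density)
```

With the reported totals, the sketch reproduces the ratio of 1.89 and the density of 11.9 SNPs/kbp given in the text.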

Table 1.   Different attributes of SNPs discovered in 41 candidate genes in Eucalyptus camaldulensis
Columns, in order, for each gene ID.
Contig attributes: Tr + Tv; total SNPs; Tr/Tv; SNP frequency (bp/SNP); βT.
Exonic attributes: Tr + Tv; total SNPs; Tr/Tv; SNP frequency (bp/SNP); Ka; Ks; Ka/Ks; βN; βS.
Indels: insertions (exonic); deletions (exonic).
EF-blocks: total blocks; average length in bp (±SD); number of SNPs.
SNPs with ‘zero-SNP’ ±60 bp window.
*Other paralogue showing high similarity with AT1G59870. na, not applicable; T, total; A, average.

ABCTRII_1 29 + 13422.2391.33.31 × 10−513 + 7201.86121.910101.005.93 × 10−55.82 × 10−5039115.9 (±113.2)238
ABCTRII_2* 5 + 8130.6374.81.65 × 10−42 + 132.00159.0120.504.06 × 10−42.88 × 10−411277.5 (±14.9)53
ACS5 4 + 26100.67206.26.82 × 10−52 + 241.00277.8313.001.47 × 10−42.19 × 10−4132296.0 (±251.7)56
ANT 2 + 241.00685.06.42 × 10−51 + 011343.0101.52 × 10−47.47 × 10−41 (1)3 (1)1407.024
ARGOS 2 + 241.00243.31.86 × 10−42 + 241.00143.8313.002.21 × 10−43.08 × 10−4120na04
HB16 5 + 6110.83140.39.61 × 10−53 + 03279.7212.001.80 × 10−42.58 × 10−42 (2)2 (2)159.026
AUX1 30 + 17471.7664.64.45 × 10−54 + 154.00270.8230.671.33 × 10−41.18 × 10−456987.8 (±107.9)217
VP1 42 + 13553.2365.43.58 × 10−522 + 4265.5080.56200.306.07 × 10−58.05 × 10−563698.0 (94.6)1710
BRI1 31 + 11422.8288.43.48 × 10−530 + 10403.0087.216240.673.45 × 10−53.42 × 10−5126882.0 (±78.9)1710
BRX 30 + 9393.3356.26.29 × 10−56 + 0689.8150.204.29 × 10−42.37 × 10−432785.3 (±38.9)186
BXL1 54 + 17713.1838.34.95 × 10−539 + 10493.9035.027221.236.91 × 10−56.75 × 10−55 (5)13 (6)937.1 (±33.9)216
CRE1HK4 36 + 24601.50133.21.54 × 10−58 + 3112.67264.6560.834.99 × 10−54.50 × 10−544 (1)10295.4 (±392.6)2921
CUL2 59 + 31901.9038.93.75 × 10−513 + 8211.6359.61472.001.18 × 10−41.28 × 10−4121530.9 (±28.0)363
CYC3-1 5 + 165.00170.51.74 × 10−43 + 143.00165.5130.332.71 × 10−41.88 × 10−4211436.036
CYC3-3 13 + 6192.1778.11.07 × 10−411 + 2135.5079.8490.441.20 × 10−41.12 × 10−401 (1)324.3 (±12.6)72
DA1 12 + 6182.0089.79.35 × 10−56 + 392.0094.2360.501.96 × 10−41.67 × 10−42 (1)135.7 (±3.1)65
DWARF4 6 + 6121.00134.99.18 × 10−52 + 132.00292.0120.502.27 × 10−41.74 × 10−411221.5 (±13.4)66
EBP1 3 + 580.60166.51.33 × 10−40 + 330.00213.7120.503.70 × 10−42.74 × 10−402231.5 (±7.8)51
ent-KS 12 + 14260.8663.18.50 × 10−55 + 6110.8355.3741.752.33 × 10−42.56 × 10−401673.2 (±72.6)142
ERECTA 76 + 391151.9550.22.13 × 10−519 + 5243.8071.212121.008.06 × 10−56.57 × 10−5452046.8 (±59.4)4517
ETO1 49 + 18672.7250.83.87 × 10−538 + 14522.7152.916360.444.41 × 10−54.45 × 10−5011461.8 (±57.1)342
GA20OX 34 + 27611.2658.63.63 × 10−511 + 3143.6791.9861.331.41 × 10−41.63 × 10−410864.5 (±59.0)236
GA2OX 10 + 10201.00137.74.73 × 10−54 + 371.33145.3522.501.21 × 10−41.65 × 10−416425.5 (±10.2)84
GRF1 8 + 5131.6086.41.51 × 10−54 + 262.00111.7331.002.30 × 10−42.33 × 10−421286.0 (±9.9)74
GRF5 3 + 361.00339.78.17 × 10−51 + 01729.0102.78 × 10−41.37 × 10−3112642.0 (±591.1)56
HK1 11 + 11221.0094.46.65 × 10−56 + 392.0093.8270.291.84 × 10−41.46 × 10−40141003.0 (±57.2)97
HOG1 32 + 11432.9165.84.65 × 10−511 + 4152.7597.12130.151.07 × 10−47.43 × 10−543414.3 (±19.4)87
JAG 1 + 01na974.02.91 × 10−40 + 00nananana4.50 × 10−34.50 × 10−3100na01
KAT2-5 25 + 9342.7897.74.09 × 10−59 + 1109.00110.8280.251.51 × 10−41.13 × 10−421672.3 (±115.1)3412
KLU 33 + 19521.7466.33.78 × 10−59 + 5141.8086.8861.331.08 × 10−41.12 × 10−4428113.6 (±189.7)5010
LS 2 + 132.00378.01.61 × 10−42 + 132.00361.7030.009.18 × 10−51.76 × 10−400123.022
NAC1 41 + 26671.5861.33.10 × 10−59 + 2114.5071.3380.383.44 × 10−41.74 × 10−462955.8 (±81.1)229
OBP1 1 + 121.00491.52.09 × 10−40 + 00nananana1.67 × 10−31.67 × 10−3100na02
PAS2 15 + 6212.5062.71.21 × 10−41 + 01326.0010.003.08 × 10−37.30 × 10−4314127.3 (±83.6)104
PIN1-6 3 + 143.00372.31.22 × 10−42 + 132.00354.3030.009.41 × 10−41.25 × 10−41 (1)20na04
TASIR-ARF 5 + 272.50218.31.03 × 10−40 + 00nananananana131113.027
TB1/TCP20 7 + 187.00184.01.10 × 10−43 + 03286.7120.502.44 × 10−41.82 × 10−4022566.0 (±103.2)71
TORKINASE 26 + 14401.86131.42.37 × 10−53 + 03415.3030.007.92 × 10−41.20 × 10−4539329.9(±397.3)3018
TTL 3 + 03na225.03.28 × 10−40 + 00nananana5.00 × 10−35.00 × 10−32 (1)1 (1)0na00
UBP15 9 + 11200.8272.81.07 × 10−42 + 460.50102.5422.002.37 × 10−42.81 × 10−400320.7 (±16.6)62
WUS 5 + 05na331.21.03 × 10−40 + 00nananana1.43 × 10−31.43 × 10−301 (1)111.020
Total/Average779 + 412T1191T1.89A83.9A9.40 × 10−5A306 + 112T418T2.73A108.4A175A243A0.72A5.81 × 10−4A5.16 × 10−4A75 (11)T89 (13)T196T105.2 (±182.3)A541T241T
Figure 1.

 Distribution pattern of SNPs (transitions and transversions) across the studied candidate genes.

There were large variations in Tr/Tv ratios for individual genes, which ranged from 7.00 for TCP20 to a total absence of transversions in JAG, TTL and WUS (Table 1). Sixteen genes showed a purifying selection pattern with Ka/Ks ≤ 0.50 (LS, PAS2, PIN1-6, TORKINASE, HOG1, BRX, KAT2-5, HK1, VP1, CYC3-3, ABCTRII_2, ETO1, DA1, DWARF4, EBP1, TCP20), whereas seven genes showed a diversifying selection pattern (Ka/Ks ≥ 2.0; HB16, CUL2, UBP15, GA2OX, ACS5, ARGOS, GRFs). There were six SNPs that introduced a ‘STOP’ codon in the coding frame (considered as non-synonymous SNPs) in the genes VP1, BRI1, ERECTA, CUL2 (two SNPs) and GA2OX.
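The thresholds used above can be written as a small classifier. The Python sketch below assumes that the Ka/Ks values in Table 1 are simple ratios of non-synonymous to synonymous SNP counts, which is consistent with the totals (175/243 ≈ 0.72) but is not stated explicitly here. The label "intermediate" and the function names are assumptions introduced for illustration.

```python
from typing import Optional


def ka_ks(ka: int, ks: int) -> Optional[float]:
    """Ratio of non-synonymous to synonymous SNP counts; None when Ks is zero."""
    return None if ks == 0 else ka / ks


def selection_pattern(ratio: Optional[float]) -> str:
    """Apply the cut-offs from the text: <= 0.50 purifying, >= 2.0 diversifying."""
    if ratio is None:
        return "undefined"
    if ratio <= 0.50:
        return "purifying"
    if ratio >= 2.0:
        return "diversifying"
    return "intermediate"  # label assumed here; the text names no category


# (Ka, Ks) counts for three genes as listed in Table 1.
counts = {"VP1": (6, 20), "ACS5": (3, 1), "BRI1": (16, 24)}
patterns = {
    gene: selection_pattern(ka_ks(ka, ks)) for gene, (ka, ks) in counts.items()
}
print(patterns)
```

Genes without synonymous exonic SNPs have no defined ratio under this convention, which is why the sketch returns None rather than a number in that case.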

Pairwise sequence alignment between the paralogues ABCTRII_1 and ABCTRII_2 indicated twelve conserved regions. The identified conserved regions had 284 perfect matches with average similarity of 23.7 bases/region (range: 15–59 bases/region). Primer binding sites had few mismatches and indels (9/24 mismatches and three gaps for EgrABCTRII_2F and 3/24 mismatches and no gaps for EgrABCTRII_2R) when compared with ABCTRII_1 sequence. The PCR products for both the paralogues were visibly clean. Also, all the SNPs from aligned regions had a very high average mapping quality of 57.5 (n = 13, SD: ±2.37) for ABCTRII_1, whereas that for ABCTRII_2 was 49.5 (n = 13, SD: ±7.46), which signified a negligible mapping error rate of 10⁻⁵ to 10⁻⁶ (Li and Durbin, 2009).
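The link between the mapping quality scores and the quoted error rates can be checked with the standard Phred relation, in which a score Q corresponds to an error probability of 10^(−Q/10). The Python sketch below applies it to the two average scores reported above; the function name is an illustrative assumption.

```python
def phred_to_error_probability(quality: float) -> float:
    """Convert a Phred-scaled quality score to an error probability."""
    return 10 ** (-quality / 10)


# Average mapping qualities reported for the two paralogues.
p_abctrii_1 = phred_to_error_probability(57.5)
p_abctrii_2 = phred_to_error_probability(49.5)
print(f"{p_abctrii_1:.2e} {p_abctrii_2:.2e}")
```

Both values fall at the 10⁻⁵ to 10⁻⁶ order of magnitude quoted in the text (roughly 1.8 × 10⁻⁶ and 1.1 × 10⁻⁵).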

Modified nucleotide diversity estimates for the pooled amplicons of each gene are described in Table 1. Overall divergence (βT) ranged from 10⁻⁵ to 10⁻⁴ (mean: 9.40 × 10⁻⁵; SD: ±7.02 × 10⁻⁵), whereas exonic nucleotide divergence for non-synonymous (βN) as well as synonymous SNPs (βS) ranged from 10⁻⁴ to 10⁻³ (βN: mean, 5.81 × 10⁻⁴, SD: ±1.12 × 10⁻³; βS: mean, 5.16 × 10⁻⁴, SD: ±1.06 × 10⁻³). Exonic nucleotide diversity showed a highly significant positive correlation (0.93; P value = 0) between the βN and βS indices, whereas the correlations between βT and βN (0.65, P « 0) and between βT and βS (0.66, P « 0) were weaker but still positive and significant.

Apart from SNPs, 75 insertions and 89 deletions were also detected. Insertions ranged from 1 to 4 bases and deletions ranged from 1 to 7 bases. The proportion of reads that showed the variation ranged from 10% to 90% for insertions and from 10% to 100% for deletions. Two genes, LS and UBP15, had no indels, whereas eight contained only deletions, three only insertions and 28 genes both insertions and deletions. Table 1 shows the distribution of indels in different genes and their exonic regions. There were 13 deletions and 11 insertions in exons, spread over nine genes as listed in Table 1. Out of ten TFs included in the study, only two had indels. ANT had one insertion and one deletion in the last exon, towards the end of the encoded protein. Another TF, HB16, had two insertions with a frame shift and premature STOP codon and two deletions that caused a frame shift. Most of the other exonic indels caused a frame shift with a STOP codon in BXL1, CYC3-3, DA1 and TTL, or without a STOP codon in CRE1HK4. The trans-acting TASIR-ARF amplicon also had one deletion.

The correlation coefficients between Tr/Tv and Ka/Ks (0.44, P = 0.02) as well as between Tr/Tv and βT (0.54, P = 0.001) were significant but weak. The correlation coefficients between Tv/Tr (transversion/transition) and Ka/Ks (0.51, P « 0), βT and βN (0.65, P « 0), and βT and βS (0.66, P « 0) were also highly significant but weak. Other correlations were insignificant (data not shown).

Equal frequency (EF) blocks and ‘zero-SNP’ windows

There were 541 SNPs in 196 equal frequency (EF) blocks with almost similar nearest neighbouring SNP frequencies. Their size and distribution pattern are described in Table 1 and Figure 2. Most of the EF-blocks (74.5%) had a size of <100 bp (average: 105.2 bp; SD: ±182.3 bp), whereas 56.6% were <50 bp and 38.3% were <30 bp. Three EF-blocks were longer than 1000 bp (GRF5: 1060 bp; TORKINASE: 1216 bp and CRE1HK4: 1234 bp), and three had the smallest possible length of one bp (CUL2, ERECTA, NAC1). To compare the sizes of EF-blocks across different genomic regions, the block lengths were normalized (by dividing by the length of the amplicon) and averaged. This analysis indicated that the average size of EF-blocks in intronic and exonic regions was almost comparable (exons: n = 55, mean = 0.015, SD = ±0.015; introns: n = 65, mean = 0.014, SD = ±0.016), whereas those with intronic–exonic overlap (n = 34) had an average size of 0.028 (SD = ±0.023) and were significantly longer than those confined to exonic (P = 0.017) or intronic (P = 0.001) regions. The blocks that overlapped between 3′UTR and exon or non-genic regions (n = 3) had an average size of 0.024 (SD = ±0.018), but their comparison with exonic/intronic blocks showed an insignificant difference. Around 20% of the SNPs (241) had a ‘zero-SNP’ nucleotide composition in ±60 bp windows (Table 1) and 37.2% (443) had the same for ±30 bp windows (data not shown).
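The ‘zero-SNP’ criterion can be expressed as a scan over sorted SNP coordinates. The Python sketch below flags each SNP that has no other SNP within a given flank on either side. The coordinates are hypothetical, the window is treated as inclusive of its end points, and each position is assumed to occur once; none of these details are specified in the text.

```python
from bisect import bisect_left, bisect_right
from typing import List


def zero_snp_flags(positions: List[int], flank: int = 60) -> List[bool]:
    """For each SNP, True if no other SNP lies within +/- flank bp (inclusive)."""
    ordered = sorted(positions)
    flags = []
    for pos in positions:
        lo = bisect_left(ordered, pos - flank)
        hi = bisect_right(ordered, pos + flank)
        # Exactly one SNP in the window means only the SNP itself is present.
        flags.append(hi - lo == 1)
    return flags


example_positions = [100, 145, 400, 461, 900]  # hypothetical coordinates
flags_60 = zero_snp_flags(example_positions, flank=60)
flags_30 = zero_snp_flags(example_positions, flank=30)
print(flags_60, flags_30)
```

Narrowing the flank from 60 to 30 bp lets more SNPs pass, which mirrors the rise from about 20% to 37.2% reported above.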

Figure 2.

 Figure showing size distribution of equal frequency (EF) blocks.


Identification of candidate genes

Growth-related candidate genes used in the study were identified by a literature search for genes with sufficient experimental support and evidence (http://www.arabidopsis.org). These included various regulatory genes like TFs, one siRNA and a few other genes with temporal/spatial and transient expression, often at low levels (Chen et al., 2002; Wang et al., 2010a). Because of their low transcription levels and rare expression, they are unlikely to be sampled by other approaches such as RNA sequencing. Most of these genes were well conserved, including the trans-acting siRNA (TASIR-ARF; Wang et al., 2010b).

Illumina re-sequencing platform for SNP discovery

Illumina-based HTS is suitable for such studies because its very high read depth gives high confidence in the contig assembly over a reference sequence (Bentley et al., 2008), and this was corroborated in the present study. Every individual was read about 60 times on average (from an average read depth of 6124×) in the pooled sequencing, a depth more than sufficient to detect a variant allele often enough to distinguish it from sequencing errors, which are unlikely to reach the same threshold as a true variant allele (Bentley et al., 2008). High output from HTS technologies like Illumina and 454 also facilitates re-sequencing of high-density DNA pools with reasonable population representation (Kulheim et al., 2009; Novaes et al., 2008). In an effort to employ SNPs discovered by these and other approaches in Eucalyptus spp., Grattapaglia et al. (2011) found low validation and conversion rates for the assayed SNPs. The primary reason for this was attributed to undetected SNPs in the surrounding regions (Grattapaglia et al., 2011; Lepoittevin et al., 2010), perhaps because of limited representation of samples in the process of SNP discovery. In this context, the sample size as well as the depth of coverage become critical for the power of SNP discovery and variant calling (Wei et al., 2011). Thus, in the absence of any conclusive data about the optimal sample size for NGS-based SNP discovery in Eucalyptus spp., it can be hypothesized that a relatively large sample size may help in the discovery of high-quality as well as minor SNPs. However, the use of a single high-density DNA pool to achieve this may introduce amplification biases because of competitive advantages/disadvantages for a few samples/alleles. The possibility of representational biases in a mix of amplicons of different sizes during shearing for NGS library construction also cannot be ruled out. 
Thus, to have an optimal sample representation that will avoid representational biases, a hierarchical pooling strategy was employed as depicted in Figure 3 (see Materials and methods). Despite these efforts, there was uneven spread of read depths and unequal representations of different amplicons and regions (data not shown).
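The depth argument above can be made concrete with a little arithmetic. The Python sketch below divides the average read depth by the number of pooled individuals and estimates how many reads would be expected to carry an allele present at the 5% threshold. It assumes perfectly even representation of individuals and diploid genomes, which, as noted above, was not fully achieved in practice; the function names are illustrative.

```python
def depth_per_individual(mean_depth: float, individuals: int) -> float:
    """Average read depth attributable to each individual in an even pool."""
    return mean_depth / individuals


def expected_variant_reads(mean_depth: float, allele_frequency: float) -> float:
    """Expected reads carrying an allele at the given pooled frequency."""
    return mean_depth * allele_frequency


# Values reported in the text: 6124x average depth, 96 pooled individuals.
per_individual = depth_per_individual(6124, 96)
per_chromosome = per_individual / 2  # assumes diploid individuals
reads_at_5_percent = expected_variant_reads(6124, 0.05)
print(round(per_individual, 1), round(per_chromosome, 1), round(reads_at_5_percent, 1))
```

Under these assumptions each individual contributes about 64 reads per base (the "about 60 times" quoted above), and an allele at 5% frequency would be seen in roughly 300 reads, far more often than random sequencing errors at a single site would be expected to recur.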

Figure 3.

 Flow chart showing hierarchical pooling strategy followed for SNP discovery. Each level shows individual units formed at that level in outlined rectangles. In level 5, the number in closed brackets indicates number of amplicons pooled to form the respective secondary PCR pool (SPP).

SNP/indel distribution and their properties

Overall SNP frequency reported across all the genes varied significantly, and the average SNP frequency (1 SNP per 83.9 bp) was more akin to those reported in highly heterozygous species like pines (1 per 102.6 bp, Dantec et al., 2004), grapevine (1 every 64 bp, Lijavetzky et al., 2007), maize (1 every 60 bp, Ching et al., 2002) and rye (1 every 52 bp, Li et al., 2011). In comparison to other plant species like rice (1 SNP per 588 bp, Feltus et al., 2004; 1 per 232 bp, Kharabian-Masouleh et al., 2011), sorghum (1 SNP per 357 bp, Feltus et al., 2004), soybean (1 per 273 bp, Zhu et al., 2003), wild soybean Glycine soja (1 per 425 bp, Hyten et al., 2010) and bread wheat (1 per 335 bp, Ravel et al., 2006), the SNP frequency in E. camaldulensis was found to be very high. Kulheim et al. (2009) have already reported the SNP frequency in E. camaldulensis as one of the highest in plant species, at 1 per 16 bases. The difference between that estimate and ours can be attributed to the genes used for SNP discovery. Here, we have studied genes that are expected to be under stringent selection pressure. This tends to be supported by our observation that four genes were found to have no exonic SNPs, two had only one exonic SNP and 18 genes had Ka/Ks below 0.50. Thus, the selection of candidate genes, the genomic regions, the definition of a SNP, the density of the DNA pool and the biology of the system (cross-pollinated species have higher SNP frequency than self-pollinated ones; Grattapaglia et al., 2011; Lepoittevin et al., 2010) can affect these estimations. Another important reason for high diversity in forest tree species is their recent domestication, which has left them largely unaffected by the genetic erosion experienced by domesticated crop plants (Doebley et al., 2006). Thus, forest tree species in general and E. camaldulensis in particular can be considered as species with enormous extant but unexplored and unutilized allelic richness. 
Such a high SNP frequency also makes development of SNP assays difficult as observed by Grattapaglia et al. (2011) in various Eucalyptus species. Other studies which used tree species for SNP genotyping have also reported poor validation and conversion rates mainly because of undetected SNPs in the surrounding regions (Grattapaglia et al., 2011; Lepoittevin et al., 2010). The present data were studied in this context to find out that only around 20% SNPs passed the filter of ±60 bp ‘zero-SNP’ windows on both the sides. Thus, this enormous allelic richness appears as a major limiting factor to design good quality SNP genotyping assays.

The SNPs were more frequent in intronic/UTR/non-genic regions than in exons, as exons are subject to stronger evolutionary selection. The Tr/Tv ratio found here was similar to that reported for rice (1.89, Feltus et al., 2004), pine (1.40, Dantec et al., 2004), grapevine (1.46, Lijavetzky et al., 2007) and potato (1.5, Simko et al., 2006) but higher than that for soybean (0.92, Zhu et al., 2003). The SNPs also exhibited a differential prevalence of transitions and transversions within exons vis-à-vis other regions. In exonic regions, the Tr/Tv ratio was found to be high (2.73), as also seen in human exons (3.02, Freudenberg-Hua et al., 2003; Rosenberg et al., 2003). The hyper-mutability of CpG islands is considered a primary reason for the higher proportion of transitions than transversions in exons of plant as well as animal genomes (Duncan and Miller, 1980; Jiang et al., 2008). Further, it was also observed that the transitions resulted in more synonymous mutations, whereas the transversions resulted in more non-synonymous mutations, as expected (Dagan et al., 2002; see Table S2). The prevalence of indels was notable but most of them were in non-coding regions, whereas those present in exonic regions had low potential to change the context of the encoded amino acid chain because of their location towards the end.

The average Ka/Ks ratio was 0.72 (range: 0–3.00) and was comparable with other such studies (rice: 0.11–2.40, Kharabian-Masouleh et al., 2011; white spruce: 1.5, Pavy et al., 2006; Arabidopsis: 2.0, Schmid et al., 2003; soybean: 1.28, Zhu et al., 2003; grapevine: 1.0, Lijavetzky et al., 2007) but higher than that found in bread wheat (0.50, Ravel et al., 2006) and eucalyptus (0.04–0.95, average 0.30, Kulheim et al., 2009). The RNA sequencing in E. grandis found an average Ka/Ks of 0.20 (Novaes et al., 2008), which was far lower than reported here; however, the estimate varied significantly from gene to gene. Within the gene categories, most of the TFs (JAG, LS, NAC1, OBP1, TCP20 and WUS), half of the genes involved in hormonal biosynthesis/signalling (KAT2-5, TTL, BRX, DWARF4, ETO1) and half of the transporters (PIN1-6, ABCTRII_2) were under purifying selection (Ka/Ks ≤ 0.50). In contrast, one TF (HB16) and two of the hormonal biosynthesis/signalling genes (ACS5, GA2OX) were under diversifying selection (Ka/Ks ≥ 2.00). The other genes did not show any specific pattern but on average displayed a purifying selection pattern. A large number of TFs and signalling pathway genes are expected to be under purifying selection as they are part of regulatory networks and are thus expected to be conserved (Kiełbasa and Vingron, 2008).

Co-amplification of non-targeted and duplicated regions such as pseudogenes and paralogues along with genuine regions has the potential to report ‘false’ SNPs in the sequence assembly. These false SNPs may modify annotations of the assembled contigs to different amino acids or STOP codons. This possibility can be substantially reduced by stringent primer design and annealing conditions in PCR as well as by proper alignment cut-offs for sequence assembly such as size of sliding window, match/mismatch allowance, mapping quality score, etc. The conditions used to amplify and analyse the ABCTRII_1 and ABCTRII_2 paralogues substantially met these requirements because the PCR amplification as well as the sequence assembly of the two were found to be mutually exclusive. They largely produced unambiguous and acceptable PCR products as well as high quality SNPs. However, ambiguous SNP discovery in extremely conserved genes/gene families cannot be completely ruled out in this approach. This may be one of the reasons for the identification of multiple SNPs that introduced STOP codons in HB16, which, being a TF, is expected to be a highly conserved gene (Kiełbasa and Vingron, 2008).

The nucleotide diversity indices (β) showed varying patterns across genes. These NGS-based indices represent overall nucleotide diversity in pooled samples, where the individual sample-specific haplotype information is lost while pooling (Novaes et al., 2008). Usually, a nucleotide diversity of 0.001 should roughly get translated to a frequency of 1 SNP per 1000 bases (Hyten et al., 2010). But in these cases, they are much lower than this generalized prediction as well as than those reported by Lepoittevin et al. (2010) and Singh et al. (2011) in eucalypts and other tree species. The main reason for under-estimation of these values is the high read depths (approximately 60 times more than the sample size of 96). Nevertheless, unequal contig coverage may also contribute to this anomaly (data not shown). With increased read depth, as it would happen in NGS platforms, the value of β would be under-estimated as they are inversely proportional. In contrast to this scenario, non-targeted paralogues or duplicated regions amplified along with target genes may inflate these values. Thus, these indices in the context of NGS may merely act as diversity indicators as discussed by Novaes et al. (2008) and not as index values.

Prioritizing SNPs for assay designing

It is advantageous to prioritize SNPs based on biological/technical criteria for efficient genotyping especially for a species like Eucalyptus where SNP diversity is enormous (Grattapaglia et al., 2011; Kulheim et al., 2009). Accordingly, EF-blocks were calculated to approximately represent the linkage disequilibrium (LD) blocks. Ideally, all the SNPs which belong to one LD block should inherit together and so the respective SNP frequencies should not differ significantly (Kim et al., 2007). This technique can be compared with ‘frequency matching’ described by Eberle et al. (2006). The sizes of these EF-blocks varied significantly (1 to >1000 bp), and the average size over all the genes (105.2 bp; SD: ±182.3 bp) was found closer to the LD block sizes reported in species like grapevine (100–200 bp, Lijavetzky et al., 2007), rye (0–380 bp, Li et al., 2011) and barley (300 bp, Morrell et al., 2005). Moreover, out-crossing tree species are generally known to undergo a fast LD decay which makes the LD block size small (Neale, 2007) comparable with smaller sizes of EF-blocks in the present case. Although the actual LD block size would depend upon sample size, population history, recombination, mutation, synonymous/non-synonymous SNPs, genic/non-genic regions, etc. (Eberle et al., 2006; Kim et al., 2007), these estimates may be used as an a priori surrogate to the actual LD block. Thus, in the present case where 196 EF-blocks (541 SNPs) were detected, one representative (tag-SNP) from each of them would be sufficient to cover all the EF-blocks. Selection at this level would reduce the number of SNPs that can be used to design assay (196/541 SNPs representing 196 EF-blocks; 846/1191 SNPs for potential assay design). This proposed strategy is similar to the ‘tag-SNP’ to represent an LD block as proposed by Hinds et al. (2005). 
However, this hypothesis needs further confirmation from real experimental data, wherein the range for allele frequency difference to realistically define an EF-block can be deciphered.
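The reduction from 1191 to 846 candidate SNPs follows from keeping one tag-SNP per EF-block together with every SNP that lies outside a block. A minimal Python sketch of that arithmetic is given below; the function name is an illustrative assumption.

```python
def assay_candidates(total_snps: int, snps_in_blocks: int, blocks: int) -> int:
    """SNPs left after keeping one tag-SNP per EF-block plus all other SNPs."""
    outside_blocks = total_snps - snps_in_blocks
    return outside_blocks + blocks


# Totals reported in the text: 1191 SNPs, 541 of them in 196 EF-blocks.
candidates = assay_candidates(1191, 541, 196)
removed = 1191 - candidates
print(candidates, removed)
```

The 650 SNPs outside EF-blocks plus 196 tag-SNPs give the 846 candidates quoted above, so 345 redundant SNPs are set aside.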

To further enhance the genotyping efficiency, identifying SNPs with ‘zero-SNP’ windows (60 bp windows, Grattapaglia et al., 2011; Lepoittevin et al., 2010) and filtering out SNPs with more than two alleles (Muchero et al., 2009) would reduce genotyping failures. In the present case, approximately 20% of the SNPs had a ‘zero-SNP’ window in the ±60 bp region, less than half the proportion (205/508) reported by Grattapaglia et al. (2011) in Eucalyptus spp., whereas the proportion of SNPs with a third allele was approximately 0.67%, almost 2.5–3.0 times that reported in grapevine (0.25%, Lijavetzky et al., 2007).

In addition, selection of SNPs with very high mapping quality scores, an indication of unambiguous/faithful alignment, would increase validation and conversion rates in the SNP assays.

These considerations can be combined to prioritize SNPs to design assays and to make genotyping successful and cost-effective.
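The combination of criteria described above can be sketched as a simple filter. This is an illustration only, not the pipeline used in the study: the `Snp` record, its field names and the mapping-quality cut-off of 30 are assumptions introduced for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Snp:
    position: int            # coordinate on the reference gene (bp)
    n_alleles: int           # number of alleles observed at the site
    mapq: float              # mean mapping quality of supporting reads
    ef_block: Optional[int]  # EF-block identifier, None if not in a block
    zero_snp: bool           # True if no other SNP lies within +/-60 bp


def prioritize(snps: List[Snp], min_mapq: float = 30.0) -> List[Snp]:
    """Keep biallelic, well-mapped SNPs and one tag-SNP per EF-block."""
    # Drop sites with a third allele or with weak alignment support.
    kept = [s for s in snps if s.n_alleles == 2 and s.mapq >= min_mapq]
    # Prefer SNPs with a clean flanking window, then higher mapping quality.
    kept.sort(key=lambda s: (not s.zero_snp, -s.mapq))
    chosen, seen_blocks = [], set()
    for s in kept:
        if s.ef_block is not None:
            if s.ef_block in seen_blocks:
                continue  # this EF-block already has its tag-SNP
            seen_blocks.add(s.ef_block)
        chosen.append(s)
    return sorted(chosen, key=lambda s: s.position)
```

Within an EF-block the sketch picks the SNP with a zero-SNP window first and breaks ties on mapping quality; other orderings (for example, preferring exonic SNPs) would be equally reasonable.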

Thus, Illumina-based re-sequencing was highly successful in discovering SNPs and indels in E. camaldulensis and also provided a general outlook on the extant nucleotide diversity. It has shed important light on the distribution and attributes of SNPs in the tested candidate genes in E. camaldulensis, which will serve as potential candidates for marker-trait associations.

Materials and methods

SNP discovery panel and DNA isolation

The SNP discovery panel comprised 96 E. camaldulensis individuals, randomly selected from approximately 400 individuals of an association population (AP) sampled from four natural provenances. Seed lots of four provenances: (i) Morehead River (CSIRO seed lots: 19010, 20655; E. camaldulensis ssp. simulata), (ii) Laura River (CSIRO seed lots: 20660, 19962, 18276, 20475; E. camaldulensis ssp. simulata), (iii) Kennedy River (CSIRO seed lots: 20654, 20907, 19963; E. camaldulensis ssp. simulata) and (iv) Petford Area (seed lot: 16720; E. camaldulensis var. obtusa; Butcher et al., 2002; Doran and Burgess, 1993) were obtained from the Australian Tree Seed Centre, CSIRO. The population is being grown and phenotyped across three diverse agro-climatic regions in India. DNA was isolated from leaves using a modified CTAB method (Bhat et al., 2002; Porebski et al., 1997; buffer composition: 100 mM Tris–Cl, pH 8.0; 20 mM EDTA, pH 8.0; 1.4 M NaCl; 2% CTAB; 2% PVP and 1% β-mercaptoethanol). DNAs from the 96 individuals were combined into 12 DNA pools (DPs) by mixing equimolar quantities of eight constituent DNAs each (concentration of 10 ng/μL). These 12 DPs were used independently for amplification of all the genes/gene segments used for SNP discovery (Figure 3).
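The pooling layout can be illustrated with a short sketch. The sample identifiers are hypothetical, and the frequency calculation assumes diploid individuals contributing equal amounts of DNA to a pool; it is intended only to show what a single allele copy would represent in a pool of eight or across the whole panel.

```python
def make_pools(sample_ids, pool_size=8):
    """Split an ordered list of sample IDs into consecutive, equal-sized pools."""
    if len(sample_ids) % pool_size:
        raise ValueError("sample count must be a multiple of the pool size")
    return [sample_ids[i:i + pool_size]
            for i in range(0, len(sample_ids), pool_size)]


def single_copy_frequency(n_individuals, ploidy=2):
    """Expected frequency of an allele present as one copy among pooled genomes.

    Assumes equal DNA contribution from every individual (equimolar pooling).
    """
    return 1.0 / (n_individuals * ploidy)


# Hypothetical identifiers for the 96 panel members.
samples = [f"EC{i:03d}" for i in range(1, 97)]
pools = make_pools(samples)           # 12 pools of 8 individuals
per_pool = single_copy_frequency(8)   # one copy among 16 chromosomes
per_panel = single_copy_frequency(96) # one copy among 192 chromosomes
```

Under these assumptions a single heterozygous individual contributes 1/16 of the alleles in its pool but only 1/192 across the panel, which is well below the 5% reporting threshold used for SNP calling.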

Selection of genes, primer designing and hierarchical pooling

The growth-related candidate genes were defined in the present context as the genes which influence total biomass production, plant stature, leaf/organ shape/size, and overall growth, as proven either by forward or by reverse genetics approaches in arabidopsis, rice or other plants/crops. Forty-one genes known to influence different growth traits were selected from a literature search (see the reviews by Alonso-Blanco et al., 2005; Busov et al., 2008; Cronk, 2001; Kim and Cho, 2006; Piazza et al., 2005). Their respective TAIR 9 gene IDs (http://www.arabidopsis.org; Table S1) were used to search and download the respective gene orthologues from the E. grandis genome database, EUCAGEN (4.5× or 8× coverage, http://eucalyptusdb.bi.up.ac.za/gbrowse8x; now hosted by Phytozome, http://www.phytozome.net/cgi-bin/gbrowse/eucalyptus; Table S1). The primers were designed to amplify approximately 0.7–8 kbp gene fragments using Primer Premier 6.1 (Premier Biosoft, Palo Alto, CA) with stringent conditions (predicted Ta: 55 ± 3 °C, absence of secondary structures, primer length: 18–30 bases). Table 2 describes details of the designed primers, the respective amplification temperature (Ta) and the time used for in-cycle extension for the 41 candidate genes. Each amplification reaction of 15 μL volume consisted of 10 ng of pooled DNA, 200 mM of each dNTP, 0.5 U of Paq (Pyrococcus) DNA polymerase (Agilent Technologies, Santa Clara, CA), 1× PCR buffer and 5 pmol of each forward and reverse primer. The DNA pools were amplified in a Veriti thermal cycler (ABI) as follows: initial denaturation at 94 °C for 3 min; 35 cycles of denaturation at 94 °C for 30 s, annealing at the respective Ta for 30 s and extension at 72 °C for the respective extension time (Table 2); and final extension at 72 °C for 10 min. The amplicons from each of the 12 DPs for each gene/gene segment were mixed to form a primary PCR product pool (PPP), precipitated using absolute ethanol and dissolved in 50 μL 0.1 TE, giving 41 PPPs. These 41 PPPs were grouped by amplicon size, quantified and mixed in equimolar quantities to form seven secondary PCR product pools (SPPs) according to amplicon size category (Table 2). The scheme of mixing and forming hierarchical pools is depicted in Figure 3.
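Equimolar mixing of amplicons of different lengths requires converting mass concentration to molar concentration. The sketch below shows one way to compute mixing volumes; the average mass of 650 g/mol per base pair of double-stranded DNA is a commonly used approximation assumed here, and the gene names and concentrations in the usage note are hypothetical.

```python
AVG_BP_MASS = 650.0  # approximate g/mol per base pair of dsDNA (assumption)


def pmol_per_ul(conc_ng_per_ul, length_bp):
    """Convert a dsDNA concentration in ng/uL to pmol/uL for a given length."""
    return conc_ng_per_ul * 1000.0 / (length_bp * AVG_BP_MASS)


def equimolar_volumes(amplicons, target_pmol):
    """Volume (uL) of each amplicon that supplies `target_pmol` molecules.

    amplicons: dict mapping name -> (concentration in ng/uL, length in bp).
    """
    return {
        name: target_pmol / pmol_per_ul(conc, length)
        for name, (conc, length) in amplicons.items()
    }
```

For example, `equimolar_volumes({"geneA": (65.0, 1000), "geneB": (65.0, 2000)}, 1.0)` asks for twice the volume of the 2 kb product, because at equal mass concentration it contains half as many molecules per microlitre.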

Table 2.   Details of primers designed to amplify 41 candidate genes
Primer ID | Forward primer* | Reverse primer* | Amplicon size (bp) | Ta (°C) | Extension time (min) | Secondary PCR pool (SPP) lot
*The co-ordinates are as per the downloaded sequences specified in Table S1.


The sequences of ABCTRII_1 and ABCTRII_2 were aligned using BioEdit ver 7 (Hall, 1999) to identify similarity using default parameters (gap initiation penalty of 3, gap extension penalty of 1, base match score 2 and base mismatch penalty 1).

Illumina NGS library preparation, sequencing and SNP discovery

The SPPs were sheared independently using an incremental shearing criterion based on their amplicon size. Amplicons below 1 kb could be sheared at 20% amplitude with two cycles of 10 s on and 10 s off, whereas those above 1 kb required four such cycles at 20% amplitude on a sonicator (SONICS, Vibra Cell). The sheared SPPs were mixed in equimolar quantities to prepare a single 72-base paired-end library (elution range: 150–200 bp) and sequenced on an Illumina GAIIx sequencer following the procedures recommended by the manufacturer (Illumina Part # 1005361 Rev. C Feb 2010). The quality-passed sequencing reads (proprietary QC tool by Genotypic Technologies Ltd., Bangalore, India) with a minimum quality value of Q20 were then assembled using bwa (http://bio-bwa.sourceforge.net; Li and Durbin, 2009) with the following parameters (seed length of 32, maximum number of gap openings 2, maximum number of gap extensions 10, maximum differences in seed 2) over reference gene sequences from E. grandis (Table S1). SNP calling was performed by running the pileup file through SAMtools (http://samtools.sourceforge.net) with default parameters. All the steps from shearing, library preparation, sequencing and analysis were performed by Genotypic Technologies Ltd., Bangalore, India (http://genotypic.co.in/). The SNPs present in approximately 100 bases at the beginning and end of the contigs were filtered out as they had very low coverage in comparison with the average. A polymorphism was then considered informative and declared an SNP if its minor allele(s) showed a minimum frequency of 5%.
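The two post-calling filters (exclusion of contig ends and the 5% frequency threshold) can be expressed compactly. The sketch below is not the vendor's pipeline; it assumes per-site base counts have already been extracted from the pileup, uses 1-based positions, and takes the minor allele frequency as the share of the second most common base, which is an assumption about how the frequency was computed.

```python
def minor_allele_frequency(counts):
    """Share of reads carrying the second most common base at a site.

    counts: dict mapping base -> number of reads, e.g. {"A": 900, "G": 100}.
    """
    depth = sum(counts.values())
    if depth == 0:
        return 0.0
    ranked = sorted(counts.values(), reverse=True)
    return ranked[1] / depth if len(ranked) > 1 else 0.0


def is_informative(pos, contig_len, counts, min_freq=0.05, end_trim=100):
    """Apply the end-trimming and minimum-frequency filters to one site."""
    # Sites within `end_trim` bases of either contig end are excluded
    # because of their low coverage.
    if pos <= end_trim or pos > contig_len - end_trim:
        return False
    return minor_allele_frequency(counts) >= min_freq
```

With these defaults, a site at position 500 of a 2000 bp contig with 900 reads of one base and 100 of another passes, whereas the same counts at position 50 do not.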

Gene annotations, SNP classification and properties of SNPs

The gene sequences were annotated and translated using the FgenesH gene prediction tool (http://linux1.softberry.com/), GeneMark (http://exon.gatech.edu/eukhmm.cgi) and GENSCAN (http://genes.mit.edu/GENSCAN.html) with Arabidopsis thaliana gene models. The predictions from FgenesH were considered the most appropriate, but in a very few cases a consistent consensus annotation from the other two prediction programs was also accepted. The SNPs were then classified into different genic regions such as 5′UTR, exons, introns, 3′UTR and non-genic regions; SNPs in any region which could not be placed into one of these categories were labelled ‘unclassified’. The SNP frequency was expressed as the length of gene/genomic region (in bp) in which one SNP occurs (length of covered region/total number of SNPs).
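As a small worked illustration of the ‘1 SNP per x bp’ measure, the sketch below divides the covered length by the SNP count for each region class. The region lengths and counts in the usage note are hypothetical, and returning `None` for regions without SNPs is a choice made for the example.

```python
def bp_per_snp(region_length_bp, n_snps):
    """Average number of base pairs per SNP (the x in '1 SNP/x bp')."""
    if n_snps <= 0:
        raise ValueError("frequency is undefined without at least one SNP")
    return region_length_bp / n_snps


def frequency_by_region(regions):
    """regions: dict mapping name -> (covered length in bp, SNP count).

    Returns bp per SNP for each region, or None where no SNP was found.
    """
    return {
        name: (bp_per_snp(length, n) if n > 0 else None)
        for name, (length, n) in regions.items()
    }
```

For instance, `frequency_by_region({"exon": (1200, 10), "intron": (800, 16)})` gives 120 bp per SNP in exons and 50 bp per SNP in introns, so a smaller value indicates a higher SNP density.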

The coding SNPs were classified as non-synonymous (Ka) and synonymous (Ks) by visual inspection using the web-based ExPASy translation tool (http://web.expasy.org/translate), and the Ka/Ks ratio was calculated. The relative nucleotide diversity for NGS-based pooled SNP discovery (β) was calculated for each gene as per Novaes et al. (2008) as follows:

$$\beta = \frac{S + 1}{L \sum_{i=1}^{D-1} \frac{1}{i}}$$

where S is the number of SNPs detected in the contig, L is the contig sequence length and D is the sequencing depth estimated by the average number of reads aligned to each nucleotide position in the contig assembly. The constant ‘1’ was added to the absolute number of detected SNPs in the numerator to enable comparisons with contigs with no SNPs (Novaes et al., 2008). The approximate harmonic summation was arithmetically calculated for each SNP point as ln(D − 1) + 0.57721 (Euler–Mascheroni constant). Three β parameters were calculated: βT for all the SNPs in a contig, βN for the non-synonymous exonic SNPs and βS for the synonymous exonic SNPs (Novaes et al., 2008).
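The calculation can be sketched as follows, assuming β takes the Watterson-like form (S + 1) / (L × Σ 1/i), with the sum running from i = 1 to D − 1, as implied by the definitions above. The function names are illustrative, and the exact harmonic sum is included only to show how close the logarithmic approximation is at typical sequencing depths.

```python
import math

EULER_MASCHERONI = 0.57721  # constant quoted in the text


def harmonic_exact(n):
    """Exact harmonic sum 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / i for i in range(1, n + 1))


def harmonic_approx(n):
    """Logarithmic approximation ln(n) + 0.57721 of the harmonic sum."""
    return math.log(n) + EULER_MASCHERONI


def relative_diversity(n_snps, contig_len, depth, exact=False):
    """beta = (S + 1) / (L * sum_{i=1}^{D-1} 1/i), an assumed form."""
    n = int(round(depth)) - 1
    if n < 1 or contig_len <= 0:
        raise ValueError("depth must be at least 2 and contig length positive")
    a = harmonic_exact(n) if exact else harmonic_approx(n)
    return (n_snps + 1) / (contig_len * a)
```

At a depth of 1000 the approximation differs from the exact sum by about 0.0005, so the choice between the two has a negligible effect on β; the same function yields βT, βN or βS depending on which SNP count is supplied.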

The SNPs were studied in the context of their nearest neighbourhood (NN) frequencies to estimate equal frequency (EF) blocks. The EF-blocks were defined as continuous stretches of genomic regions containing SNPs with similar minor allele frequencies (frequency difference of <0.02 to 0.03 between the nearest neighbouring SNPs). Further, the ‘zero-SNP’ blocks were defined as SNPs with no other SNP in the surrounding upstream and downstream windows of 60 bp. It should be noted that the terms ‘other allele frequency’ and ‘minor allele frequency’ are used interchangeably, as the majority of the SNPs lack a third allele.
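Both definitions lend themselves to a short procedural sketch. The version below is one possible reading, not the spreadsheet procedure used in the study: it assumes a single cut-off of 0.03 for the frequency difference, requires at least two SNPs per EF-block, and treats a window as empty when the nearest SNP is more than 60 bp away.

```python
def ef_blocks(snps, max_diff=0.03):
    """Group position-sorted SNPs into equal-frequency blocks.

    snps: list of (position, minor_allele_frequency), sorted by position.
    A block is extended while neighbouring SNPs differ by less than
    `max_diff`; runs containing a single SNP are not reported as blocks.
    """
    blocks, current = [], []
    for snp in snps:
        if current and abs(snp[1] - current[-1][1]) >= max_diff:
            if len(current) > 1:
                blocks.append(current)
            current = []
        current.append(snp)
    if len(current) > 1:
        blocks.append(current)
    return blocks


def zero_snp_positions(positions, window=60):
    """Return SNP positions with no other SNP within `window` bp on either side."""
    ordered = sorted(positions)
    result = []
    for i, pos in enumerate(ordered):
        left_clear = i == 0 or pos - ordered[i - 1] > window
        right_clear = i == len(ordered) - 1 or ordered[i + 1] - pos > window
        if left_clear and right_clear:
            result.append(pos)
    return result
```

A tag-SNP can then be taken from each list returned by `ef_blocks`, and SNPs reported by `zero_snp_positions` are the ones whose flanking sequence is free of neighbouring polymorphisms for assay design.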

All the routine mathematical/statistical calculations were performed using Excel™ spread sheets (MS Office 2007, Microsoft, Bangalore, Karnataka, India).

Authors’ contributions

PSH conceptualized, designed and executed the present experiment, analysed the data, and wrote and communicated the manuscript. MV conceptualized the MAS program in E. camaldulensis and designed the silvicultural aspects along with RK. RK was involved in management and implementation of replicated field trials. All the authors read and approved the manuscript.


The authors acknowledge Drs Navin Sharma, D. Gurumurthy, R. Rajkumar and other staff of ITC R&D Centre, Bangalore, India; Drs Simon Southerton and Bala Thumma, CSIRO, Plant Industry, Canberra, Australia; Dr Roby Mathew, Mr Arut Selvan and Mr P.G. Suraj, ITC R&D Centre, Bangalore for discussions and help offered during planning/laying/managing the field trial. The authors also acknowledge CSIRO, Australia for supplying seed material, Genotypic Technologies Ltd. for customizing the library construction and analysis pipeline, and anonymous reviewers for valuable suggestions and comments. The authors thank the EUCAGEN consortium for making the E. grandis genome sequence publicly available.